how to search for NON ASCII data so i can delete it

Thu Jan 17 10:12:07 PST 2019

This is how I do it in a url-encoding routine.
I take advantage of the fact that the "safe" characters all fall in a few
contiguous ranges of ascii values.
So, rather than testing for a bunch of string matches (probably
cpu-expensive when you're doing it sixty zillion times as you loop through
and examine each and every byte of data individually),
I just get the ascii value and test whether it's lt or gt certain values.
A numerical gt/lt test should be more efficient than a string comparison.

This example is pretty aggressive and checks 3 ranges to allow ONLY numbers
and letters through, and EVERYTHING else gets url-encoded.
You may want something a little different, and I'll show simple
modifications afterwards.

Lines 101-end are a gosub.
Line 50 is how I use it.

50   -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If:
       Then: x = raw_data ; gosub ure ; urlencoded_data = x

.......

101  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: '       position    inlen       inchr     inchr_dec   outchr  out
       Then: declare ur_p(8,.0), ur_l(8,.0), ur_ic(1), ur_d(3,.0), ur_oc, ur_o
102  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
ure      If: '--- URL-Encode ---
       Then: x = ""{x{"" ; ur_l = len(x) ; ur_p = "1" ; ur_o = ""
103  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: x eq ""
       Then: return
104  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
ure1     If:
       Then: ur_ic = mid(x,ur_p,"1") ;ur_oc = ur_ic ;ur_d = asc(ur_ic)
105  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: ur_d ge "97" and ur_d le "122"    ' a-z
       Then: goto ure_
106  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: ur_d ge "48" and ur_d le "57"     ' 0-9
       Then: goto ure_
107  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: ur_d ge "65" and ur_d le "90"     ' A-Z
       Then: goto ure_
108  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If:
       Then: ur_oc = "%" { base(ur_d,"10","16")
109  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
ure_     If:
       Then: ur_o = ur_o & ur_oc
110  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: ur_p lt ur_l
       Then: ur_p = ur_p + "1" ; goto ure1
111  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If:
       Then: x = ur_o ; return ' ure

Lines 105-107 are ordered so that the most likely condition is tested
first, so that most of the time the remaining tests are skipped and the
routine does as little work as possible.
I.e., the assumption is that most characters will be lowercase letters, so
most iterations of the loop skip to the next char after doing only a single
If: test.
The next most likely is probably numbers, so that's the next test, and the
least frequent is capital letters, so that's the last thing tested.
It needs to be as tight as possible like that because that block runs once
for every character in "x", and the entire gosub is run for every field you
need processed. You want to do as little work as possible in there.
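
For anyone who doesn't read filePro processing tables, here is roughly the
same loop as a Python sketch. The function name url_encode_strict is made up
for illustration. The sketch leaves out the trimming step on line 102, it
assumes every character is a single byte, and it zero-pads the hex value to
two digits, which is an assumption about what the encoded output should look
like and not something the filePro listing is shown doing.

```python
def url_encode_strict(data: str) -> str:
    """Pass a-z, 0-9 and A-Z through; percent-encode every other character."""
    out = []
    for ch in data:
        code = ord(ch)  # plays the role of asc() in the filePro version
        if 97 <= code <= 122:      # a-z, tested first
            out.append(ch)
        elif 48 <= code <= 57:     # 0-9
            out.append(ch)
        elif 65 <= code <= 90:     # A-Z
            out.append(ch)
        else:
            # everything else becomes % plus two hex digits (assumed padding)
            out.append("%{:02X}".format(code))
    return "".join(out)


print(url_encode_strict("Part #12"))  # Part%20%2312
```

The if/elif chain mirrors the ordering of lines 105-107: a character that
matches the first range never reaches the later tests.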

In this case, every byte that isn't in one of those 3 ranges of safe ascii
values is replaced with its url-encoded equivalent.
That may be more aggressive than you want, or it might be perfect. I can't
say because I don't know what you are doing with the data.
Some characters, like most punctuation and symbols, don't need to be
encoded if you're just displaying or printing them,
but they do need to be encoded if you're using them in a shell command or a
url.
Other characters, like everything below 32 and above 126, you may just want
to delete instead of encode.

Encoding is good in general because it makes your data safe yet preserves
the original information just in another form.

If you want to strip the unsafe bytes instead of encode them, you would
replace line 108 with
108: Then: ur_oc = ""

You will probably want to adjust lines 105-107 to allow some punctuation
characters through.
As it's written above, it's ONLY letting 0-9, a-z, and A-Z through, and
url-encoding everything else.
That includes all high ascii, control characters, and even all punctuation.
I can't say whether you want to delete things like "@", or urlencode them,
or let them through un-touched.
It depends on what you're doing with the data.

To allow more characters through without deleting or encoding them, one way
is to insert one new line between 107 and 108 that
tests for a bunch of punctuation that you want to allow through, like this:

If: "$&+,/:;=?@ <>#%{}|^~[]`'"{chr("92"){chr("34") co ur_ic
Then: goto ure_
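
The same allow-list idea as a Python sketch, for comparison. The names
ALLOWED_PUNCT and is_allowed are hypothetical, and the character list is
copied from the If: line above (including the backslash and double quote
that the filePro version builds with chr("92") and chr("34")). A set lookup
is cheap in Python; that says nothing about what the "co" test costs in
filePro.

```python
# Punctuation copied from the filePro If: line, plus backslash and double quote.
ALLOWED_PUNCT = frozenset("$&+,/:;=?@ <>#%{}|^~[]`'" + "\\" + '"')


def is_allowed(ch: str) -> bool:
    """True if ch is a letter, a digit, or one of the allowed punctuation marks."""
    code = ord(ch)
    # the three numeric range tests, same as lines 105-107
    if 97 <= code <= 122 or 48 <= code <= 57 or 65 <= code <= 90:
        return True
    # the extra test inserted between 107 and 108
    return ch in ALLOWED_PUNCT


print(is_allowed("@"), is_allowed("\x07"))  # True False
```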

I think that might be unnecessarily cpu-expensive.
Here's another simpler option.
If you want to simply allow all the "printable" characters, and strip
everything else, no encoding, just replace lines 105-108 with these 2 lines:

105  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If: ur_d ge "32" and ur_d le "126"    ' printable 7-bit ascii
       Then: goto ure_
106  -------   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
         If:
       Then: ur_oc = ""
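
As a cross-check of the logic, the strip-only variant comes down to a single
range test per character. This is a Python sketch, not filePro, and the
function name strip_non_printable is made up for illustration.

```python
def strip_non_printable(data: str) -> str:
    """Keep only printable 7-bit ascii (32 through 126); drop everything else."""
    return "".join(ch for ch in data if 32 <= ord(ch) <= 126)


# NUL, a high-ascii byte and DEL (127) are all dropped
print(strip_non_printable("WIDGET\x00 12\xe9\x7f"))  # WIDGET 12
```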

Google "ascii table" to look at any ascii chart to see what the ascii
numbers are for the different characters, to figure out how to handle any
special cases that these examples don't cover exactly the way you need.

-- 
bkw

On Thu, Jan 17, 2019 at 10:13 AM oldtony via Filepro-list <
filepro-list at lists.celestial.com> wrote:

> seeking coaching - a customer has some data in a stock file that is
> displaying non ASCII data. How do i search for non ASCII data so i can
> delete it? - partial screen shot below- thanks for the help - Old Tony
>
> --
> tony at ynotsoftware.com
> Tony Freehauf (Old Tony)
> YNOT Software & PC Support
> 815.467.2179
> YNOT sounds like "Why Not."
> YNOT let us help you.
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: nldachnbcbekeici.png
> Type: image/png
> Size: 6080 bytes
> Desc: not available
> URL: <
> http://mailman.celestial.com/pipermail/filepro-list/attachments/20190117/bbec35a7/attachment.png
> >
