extended characters & UTF-8

Bruce Easton bruce at stn.com
Tue Sep 11 13:31:29 PDT 2007


Thanks Brian, Ken and Bob for the replies.  I am obviously
missing some basics on how these encodings work. Thanks Ken
for the Cliff's Notes. :)

I'm not 100% sure what the requirement is yet, but I think the
desire is to convert to the English equivalent, and since
I can process the file prior to using it from filePro,
maybe I will try a conversion utility before anything else.

I see that only about 4% of the total data has this
problem, and it seems that most of that is these
Latin characters rather than Cyrillic and other alphabets, so
I could probably get the 4% way down if I coded a
conversion of the Latin vowels and the c with the dangly
thing (the cedilla, ç) to their obvious counterparts.
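If stripping accents down to plain ASCII does turn out to be the requirement, one common approach is Unicode decomposition. This is only a sketch (Python, run as a preprocessing step before the filePro import, not filePro code):

```python
import unicodedata

def strip_diacritics(text):
    # NFKD decomposition splits accented letters into a base letter
    # plus combining marks (e.g. "é" -> "e" + U+0301), after which
    # the combining marks can be dropped, leaving plain base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

This handles the Latin vowels and ç (which decomposes to "c" plus a combining cedilla); Cyrillic and other non-Latin scripts have no ASCII decomposition and would pass through unchanged, so they would still need separate handling.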

For reference, I'm gluing at the bottom Ken's explanation
from his reply about UTF-8.

Thanks again!

Bruce


Bruce Easton
STN, Inc.


Brian K. White wrote Tuesday, September 11, 2007 3:38 PM:

> Do you need to preserve this data as UTF-8? Or is it OK to transform
> it into the closest approximation your terminal's current font and
> encoding can manage?
>
> iconv is the most common utility for converting. There is also
> something called recode; both are open source, and there are probably
> others as well.
> iconv is probably already installed; source and an SCO binary for
> recode are available at http://www.aljex.com/bkw/sco/#recode, and
> Linux packages must be out there.
>
> Assuming input.txt is a text file where the text is in UTF-8 format.
>
> iconv -f utf8 <input.txt >output.txt
>
> That will use "the system's current locale" for the output encoding,
> which is probably best, or you can specify the output encoding:
>
> iconv -f utf8 -t cp437 <input.txt >output.txt
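What those iconv invocations do can be sketched in a few lines of Python (Python is used here purely for illustration; cp437 matches the `-t cp437` example above):

```python
# Step-by-step equivalent of: iconv -f utf8 -t cp437
data = b"caf\xc3\xa9"          # UTF-8 bytes as read from the file ("café")
text = data.decode("utf-8")    # bytes -> characters; C3 A9 becomes é
cp437 = text.encode("cp437")   # é is a single byte, 0x82, in cp437
assert cp437 == b"caf\x82"
```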
>
> If you are not free to just use system() to convert the whole file in
> one shot and use the converted file in filePro, but need to convert
> many small random strings, it's probably possible to use iconv as a
> user() command. The only difficulty is that, as ever with user(), it
> may be tricky and prone to getting out of step and locking up.
> The command would just be "user iconv = iconv -f utf8" unless you
> wanted to write a wrapper shell script around it to try to ensure the
> script and fp stay in sync.
>
> Finally, since this is XML, I believe it's possible for individual
> fields to have their own encoding unrelated to that of the XML file
> itself (the tags, definitions of the tags, field metadata, everything
> in the file that isn't actual field content/data), so converting the
> whole file with one transformation may not be a technically correct
> handling of the data. You may really need an XML parser that handles
> each field correctly/individually.
>
> Brian K. White    brian at aljex.com    http://www.myspace.com/KEYofR
>
> ----- Original Message -----
> From: "Bruce Easton"
>
> >I recently imported a file (via the readline command in clerk)
> > from an XML file that states at the end of the top line:
> > encoding="UTF-8".  I then write out the line as-is to
> > a record.
> >
> > In filePro (SCO 5.14), the line is stored with funny
> > characters here and there.  (The data has names in several
> > different languages, including eastern & western European
> > and Middle Eastern as well.)
> >
[..]
> > my errorbox is telling me [via my
> > code:  asc(mid(xx,currpos,"1"))] for two characters in
> > a row that filePro is storing this as decimal 195 followed
> > by decimal 169.
> >
> > I don't see how these decimal numbers correlate to any
> > common character set.  Am I missing something obvious?
> >
[..]


Ken Brody wrote Tuesday, September 11, 2007 2:57 PM:

> You're missing that the file was UTF-8 encoded, meaning that the
> "e with an acute accent" (Unicode 233 decimal, 0000E9 hex) is
> stored as two bytes.
>
> 0000E9 hex == 0000 0000 0000 0000 1110 1001 binary
>
> To encode to UTF-8 for values from 000080 through 0007FF, you split
> the low-order 11 bits into 5/6 and prepend 110/10 to the bytes:
>
>      000011101001 --> 00011/101001 --> [110]00011/[10]101001 --> C3/A9
>
> C3/A9 hex are 195/169 decimal.
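Ken's bit arithmetic can be checked directly; here is a small verification (Python, chosen just for demonstration):

```python
ch = "\u00e9"  # "e with an acute accent", Unicode 233 decimal

encoded = ch.encode("utf-8")
# Two bytes, exactly as derived above: C3 hex / A9 hex ...
assert encoded == b"\xc3\xa9"
# ... which are 195 / 169 decimal, the values Bruce saw via asc().
assert list(encoded) == [195, 169]
```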





More information about the Filepro-list mailing list