UTF-8 (was Re: extended characters)

Tue Sep 11 11:57:22 PDT 2007

Quoting Bruce Easton (Tue, 11 Sep 2007 14:17:58 -0400):

> I recently imported a file (via readline command in clerk)
> from an XML file that states at the end of the top line:
> "encoding="UTF-8".  I then write out the line as-is to
> a record.
>
> In filePro (SCO 5.14), the line is stored with the funny
> characters here and there.  (The data has names in several
> different languages including eastern & western european
> and middle eastern as well.)
>
> I coded an errorbox to come up on @key that will scan
> each line character by character so that I can see the
> decimal values of each character that is above 127.
>
> When I am on, for instance, the third character in the
> expression "President du commandement" which I'm thinking
> must be French, and therefore an e with an acute accent
> (avec accent aigu), my errorbox is telling me [via my
> code:  asc(mid(xx,currpos,"1"))] for two characters in
> a row that filePro is storing this as decimal 195 followed
> by decimal 169.
>
> I don't see how these decimal numbers correlate to any
> common character set.  Am I missing something obvious?
[...]

You're missing that the file was UTF-8 encoded, meaning that the
"e with an acute accent" (Unicode U+00E9: 233 decimal, 0000E9 hex)
is stored as two bytes.

0000E9 hex == 0000 0000 0000 0000 1110 1001 binary

To encode to UTF-8 for values from 000080 through 0007ff, you split
the low-order 11 bits into 5/6 and prepend 110/10 to the bytes:

     00011101001 --> 00011/101001 --> [110]00011/[10]101001 --> C3/A9

C3/A9 hex are 195/169 decimal, which are exactly the two values your
errorbox reported.
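
If you want to check the arithmetic outside of filePro, here is a
minimal Python sketch of the same 5/6 split, plus the reverse
operation. The function names encode_two_byte and decode_two_byte are
made up for this illustration, and the code only handles the two-byte
range (000080 through 0007ff) described above. It is not a full UTF-8
codec and does not check for every malformed sequence.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range 0x80..0x7FF is handled")
    high5 = code_point >> 6      # upper 5 of the low-order 11 bits
    low6 = code_point & 0x3F     # lower 6 bits
    # Prepend the 110 marker to the first byte and 10 to the second.
    return bytes([0xC0 | high5, 0x80 | low6])


def decode_two_byte(data: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    first, second = data
    if first & 0xE0 != 0xC0 or second & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Strip the 110/10 markers and rejoin the 5 + 6 payload bits.
    return ((first & 0x1F) << 6) | (second & 0x3F)


encoded = encode_two_byte(0xE9)
print(list(encoded))                         # [195, 169]
print(encoded == "\u00e9".encode("utf-8"))   # True
print(decode_two_byte(bytes([195, 169])))    # 233
```

The comparison against Python's own encoder is only there as a sanity
check that the hand-rolled split agrees with a real implementation.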

-- 
KenBrody at BestWeb dot net        spamtrap: <g8ymh8uf001 at sneakemail.com>
http://www.hvcomputer.com
http://www.fileProPlus.com