extended characters
Bob Rasmussen
ras at anzio.com
Tue Sep 11 11:35:39 PDT 2007
On Tue, 11 Sep 2007, Bruce Easton wrote:
> I recently imported a file (via readline command in clerk)
> from an XML file that states at the end of the top line:
> "encoding="UTF-8". I then write out the line as-is to
> a record.
>
> In filePro (SCO 5.14), the line is stored with the funny
> characters here and there. (The data has names in several
> different languages including eastern & western european
> and middle eastern as well.)
>
> I coded an errorbox to come up on @key that will scan
> each line character by character so that I can see the
> decimal values of each character that is above 127.
>
> When I am on, for instance, the third character in the
> expression "President du commandement" which I'm thinking
> must be French, and therefore an e with an acute accent
> (avec accent aigu), my errorbox is telling me [via my
> code: asc(mid(xx,currpos,"1"))] for two characters in
> a row that filePro is storing this as decimal 195 followed
> by decimal 169.
>
> I don't see how these decimal numbers correlate to any
> common character set. Am I missing something obvious?
This is a baby step into deep waters. The essential info site is
www.unicode.org. There are also hardcopy manuals available, published by
this org and also by others.
Here's an overview, using hexadecimal:
Unicode assigns every character a number (a "code point") in the range 0
to 10FFFF. Most characters in everyday use fit in 16 bits, so Unicode is
often stored as 16-bit entities (UTF-16), although some way-out-there
characters require two 16-bit words. Characters in the 00 to 7F range map
straight across from ASCII (note that this includes control characters).
There are NO printable characters mapped into 80 - 9F, unlike in Windows
codepages; Unicode puts only control codes there. Characters in A0 to FF
map straight across to ISO 8859-1, Latin-1. Characters above FF are
defined by Unicode. There are mappings for Greek, Hebrew, Chinese,
line-drawing characters like in DOS, smiley faces, and tens of thousands
of other things.
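To make the "straight across" part concrete, here is a small Python
sketch (Python is my choice for illustration, not something filePro
gives you). It checks that every Latin-1 byte value equals its Unicode
code point, and shows how byte 80 differs between Latin-1 and the
Windows 1252 codepage. The function name is made up for this example.

```python
import unicodedata

def latin1_matches_unicode():
    """Return True if every Latin-1 byte 00-FF decodes to the same code point."""
    for value in range(0x100):
        ch = bytes([value]).decode("latin-1")
        if ord(ch) != value:
            return False
    return True

print(latin1_matches_unicode())                     # True

# Byte 80 is a control code in Latin-1/Unicode ("Cc" = control category) ...
print(unicodedata.category(bytes([0x80]).decode("latin-1")))   # Cc

# ... but a printable character (the euro sign, U+20AC) in Windows-1252.
print(hex(ord(bytes([0x80]).decode("cp1252"))))     # 0x20ac

# Code points for e-acute (inside Latin-1) and Greek alpha (above FF).
print(hex(ord("\u00e9")), hex(ord("\u03b1")))       # 0xe9 0x3b1
```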
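UTF-8 is a "transformation". Each Unicode character is transformed, using
a fairly simple algorithm, into one to four 8-bit bytes. The slick part is
that 00 to 7F go straight across as single bytes. Every character above 7F
becomes a multi-byte sequence: a lead byte in the range C2 to F4, followed
by one to three continuation bytes in the range 80 to BF. That is what you
are seeing: decimal 195 and 169 are hex C3 and A9, the two-byte UTF-8
sequence for Unicode E9, e with an acute accent. Another simple algorithm
turns UTF-8 back into 16/32-bit Unicode. These algorithms are explained on
the website.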
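As a rough sketch of the UTF-8-to-Unicode direction, here is a small
Python decoder applied to the two bytes from the question (decimal 195
and 169). The name decode_utf8 is hypothetical, and the code is
simplified: it assumes reasonably well-formed input and does not reject
every illegal case (overlong forms, surrogates). In real Python code the
built-in bytes.decode("utf-8") does this job.

```python
def decode_utf8(data):
    """Turn a bytes object of UTF-8 into a list of Unicode code points."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: one byte, same as ASCII
            cp, extra = lead, 0
        elif 0xC2 <= lead <= 0xDF:      # 110xxxxx: one continuation byte
            cp, extra = lead & 0x1F, 1
        elif 0xE0 <= lead <= 0xEF:      # 1110xxxx: two continuation bytes
            cp, extra = lead & 0x0F, 2
        elif 0xF0 <= lead <= 0xF4:      # 11110xxx: three continuation bytes
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError("invalid lead byte %#x at offset %d" % (lead, i))
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in tail:
            if byte & 0xC0 != 0x80:     # continuation bytes look like 10xxxxxx
                raise ValueError("bad continuation byte %#x" % byte)
            cp = (cp << 6) | (byte & 0x3F)   # append the low six bits
        code_points.append(cp)
        i += 1 + extra
    return code_points

# The two bytes reported by the errorbox: 195, 169 (hex C3, A9).
print([hex(cp) for cp in decode_utf8(bytes([195, 169]))])   # ['0xe9']
```

Working it by hand: C3 is 11000011, so the lead byte contributes 00011;
A9 is 10101001, contributing 101001. Joined, 00011101001 is hex E9.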
Note that in UTF-8, one byte no longer occupies one cell on the screen. A
1-to-4 byte sequence represents a character, and a character can be
single-wide or double-wide; most Far East characters are double-wide.
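To illustrate the difference between bytes, characters, and screen
cells, here is a short Python sketch using the standard unicodedata
module. The helper names cell_width and describe are made up for this
example, and the width rule is an approximation: it treats East Asian
"wide" and "fullwidth" characters as two cells and everything else as
one, ignoring combining marks and control characters.

```python
import unicodedata

def cell_width(ch):
    """Approximate screen cells for one character (1 or 2)."""
    # "W" = wide, "F" = fullwidth in the Unicode East Asian Width property.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def describe(text):
    """Return (character, UTF-8 byte count, screen cells) for each character."""
    return [(ch, len(ch.encode("utf-8")), cell_width(ch)) for ch in text]

# Plain A, e-acute, and the Chinese character U+4E2D.
for ch, nbytes, cells in describe("A\u00e9\u4e2d"):
    print("U+%04X  bytes=%d  cells=%d" % (ord(ch), nbytes, cells))
# U+0041  bytes=1  cells=1
# U+00E9  bytes=2  cells=1
# U+4E2D  bytes=3  cells=2
```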
Hope that helps to get you started.
Regards,
....Bob Rasmussen, President, Rasmussen Software, Inc.
personal e-mail: ras at anzio.com
company e-mail: rsi at anzio.com
voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
fax: (US) 503-624-0760
web: http://www.anzio.com