extended characters

Bob Rasmussen ras at anzio.com
Tue Sep 11 11:35:39 PDT 2007


On Tue, 11 Sep 2007, Bruce Easton wrote:

> I recently imported a file (via readline command in clerk)
> from an XML file that states at the end of the top line:
> encoding="UTF-8".  I then write out the line as-is to 
> a record.
> 
> In filePro (SCO 5.14), the line is stored with the funny 
> characters here and there.  (The data has names in several
> different languages including eastern & western european 
> and middle eastern as well.)
> 
> I coded an errorbox to come up on @key that will scan 
> each line character by character so that I can see the 
> decimal values of each character that is above 127.
> 
> When I am on, for instance, the third character in the 
> expression "President du commandement" which I'm thinking 
> must be French, and therefore an e with an acute accent 
> (avec accent aigu), my errorbox is telling me [via my 
> code:  asc(mid(xx,currpos,"1"))] for two characters in 
> a row that filePro is storing this as decimal 195 followed 
> by decimal 169.
> 
> I don't see how these decimal numbers correlate to any 
> common character set.  Am I missing something obvious?

This is a baby step into deep waters. The essential info site is 
www.unicode.org. There are also hardcopy manuals available, published by 
this org and also by others.

Here's an overview, using hexadecimal:

Unicode assigns every character a number (a "code point") from 0 to 10FFFF. 
Most characters in everyday use fit in 16 bits; the way-out-there ones 
above FFFF take two 16-bit words when stored as UTF-16. Characters in the 
00 to 7F range map straight across from ASCII (note that this includes 
control characters). There are NO printable characters mapped into 80 - 9F 
(only rarely used control codes), unlike in Windows codepages. Characters 
in A0 to FF map straight across to ISO 8859-1, Latin-1. Characters above 
FF are defined by Unicode. There are mappings for Greek, Hebrew, Chinese, 
line-drawing characters like in DOS, smiley faces, and tens of thousands 
of other things.
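
To make the numbering above concrete, here is a small Python sketch (Python 
is my choice for illustration, it is not part of filePro). It prints the 
code point of a few sample characters in hex, and checks that a character 
in the A0 to FF range has the same value as its Latin-1 byte. The helper 
name code_point_hex is made up for this example.

```python
import unicodedata

# Sample characters: ASCII, Latin-1, Greek, Hebrew, Chinese, and a smiley.
samples = ["A", "\u00e9", "\u03a9", "\u05d0", "\u4e2d", "\u263a"]

def code_point_hex(ch):
    """Return the Unicode code point of one character as 4+ hex digits."""
    return format(ord(ch), "04X")

for ch in samples:
    # Print the official character name rather than the character itself,
    # so the output is plain ASCII on any terminal.
    print(code_point_hex(ch), unicodedata.name(ch))

# A character in the A0-FF range has the same value as its Latin-1 byte.
latin1_byte = "\u00e9".encode("latin-1")[0]
```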

UTF-8 is a "transformation". Each Unicode character value is transformed, 
using a fairly simple algorithm, into one to four 8-bit bytes. The slick 
part is that 00 to 7F go straight across as single bytes. Bytes C2 to F4 
start a 2-to-4 byte sequence, and bytes 80 to BF appear only as the 
second or later byte of such a sequence. Your decimal 195 followed by 169 
is hex C3 A9, which is the UTF-8 form of Unicode 00E9, the e with an 
acute accent you expected. Another simple algorithm turns UTF-8 back into 
Unicode values. These algorithms are explained on the website.
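
For what it's worth, the two algorithms can be sketched in a few lines of 
Python. This is a bare-bones illustration, not production code: the 
function names utf8_encode and utf8_decode_one are made up here, and the 
sketch skips the validity checks a real decoder needs (overlong forms, 
surrogates, values above 10FFFF). In practice you would use your 
language's built-in conversion instead.

```python
def utf8_encode(cp):
    """Encode one Unicode code point (an int) as a list of UTF-8 byte values."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18),         # 4 bytes: 11110xxx + three 10xxxxxx
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

def utf8_decode_one(byte_values):
    """Decode the first character of a UTF-8 byte list.

    Returns (code point, number of bytes consumed).
    """
    first = byte_values[0]
    if first < 0x80:
        return first, 1
    if first >> 5 == 0b110:
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    if len(byte_values) < length:
        raise ValueError("truncated UTF-8 sequence")
    for b in byte_values[1:length]:
        if b >> 6 != 0b10:             # continuation bytes are 80-BF
            raise ValueError("bad UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, length

# The two bytes from the question: decimal 195, 169.
code_point, used = utf8_decode_one([195, 169])
print(format(code_point, "04X"), used)
```

Fed the bytes 195 and 169, the decoder gives back 00E9 from a 2-byte 
sequence, which is the accented e in "President".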

Note that in UTF-8, one byte no longer occupies one cell on the screen. A 
1-to-4 byte sequence represents a character, and a character can be 
single-wide or double-wide; most Far East characters are double-wide.
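
The difference between bytes, characters, and screen cells can be shown 
with a rough Python sketch. It assumes the terminal follows the usual 
convention that characters the Unicode database marks as East Asian Wide 
or Fullwidth take two cells and combining marks take none; real terminals 
vary, so treat display_width (a name made up for this example) as an 
approximation.

```python
import unicodedata

def display_width(text):
    """Estimate how many screen cells a string occupies."""
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                   # combining marks share the previous cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cells += 2                 # Wide / Fullwidth: double-wide
        else:
            cells += 1
    return cells

for label, text in [("ASCII", "abc"),
                    ("accented", "\u00e9"),
                    ("Chinese", "\u4e2d\u6587")]:
    byte_count = len(text.encode("utf-8"))
    print(label, "bytes:", byte_count,
          "characters:", len(text),
          "cells:", display_width(text))
```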

Hope that helps to get you started.

Regards,
....Bob Rasmussen,   President,   Rasmussen Software, Inc.

personal e-mail: ras at anzio.com
 company e-mail: rsi at anzio.com
          voice: (US) 503-624-0360 (9:00-6:00 Pacific Time)
            fax: (US) 503-624-0760
            web: http://www.anzio.com

