Subject: Re: Strings, Was: profile results for new UT_* implementations?
From: Dom Lachowicz (dominicl@seas.upenn.edu)
Date: Tue Jun 19 2001 - 22:55:05 CDT
> Actually, I find UT_Bytebuf useful for strings.  I use them in the text
> importer and exporter so I can have one set of functions regardless of
> whether I'm handling 8-bit or 16-bit text.  And it'll work if and when
> we have to handle 32-bit text.
Regardless of their apparent usefulness, UT_Bytebufs are not strings. They
don't look or behave like strings - they represent a block of memory and
nothing more. So is this useful for importing text? Yes. Optimal? Probably not.
And (AFAIK) the UCS2 string class can properly handle appending a 'char' or a
UCSChar to its buffer, if that has any bearing on the discussion.
 
> This must have been discussed at some point, but I'll bring it up since
> I've not seen it here yet.  I read all of the Unicode mailing lists
> and newsgroups I can and it seems everybody *hates* UCS-2.  Except
> maybe Microsoft (:  The rest of the world are coming to grips with
> using UTF-8 for interchange, and UTF-32 (UCS-4) internally.  If you
> know anything about surrogates you'll understand why.  Many people
> believe that using UCS-2, a character can always fit into one UCS-2
> char.  Some believe that if they pretend surrogates don't exist
> they can keep using UCS-2.  But this is not true.  Many characters
> take more than one codepoint even in UTF-32.  The major concern with
> UTF-32 is that it doubles the amount of memory needed over UCS-2 ):
> 
> What's our position?  We're going to have to look into it sooner or
> later and it won't be fun.
Abi has always used UCS-2 internally to represent strings, and as you note,
we're beginning to run into problems with that. Dealing with UTF-8 is no more
pleasant than dealing with UCS-2 in my experience, but perhaps it is (much)
more common in the programming community as a whole. As you have noted, we do
store data as UTF-8 in our file formats.
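For anyone who wants to see the trade-off in numbers, here is a rough sketch
of how many storage units the same text needs in each encoding form, and how a
code point above U+FFFF splits into a surrogate pair. It is Python and purely
illustrative, not Abi code; the function names code_units and surrogate_pair
are made up for this example.

```python
def code_units(text: str) -> dict:
    """Count the storage units `text` needs in each Unicode encoding form."""
    return {
        "code_points": len(text),                           # Python 3 str length counts code points
        "utf8_bytes": len(text.encode("utf-8")),            # 1 to 4 bytes per code point
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 1 or 2 16-bit units per code point
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # always 1 32-bit unit per code point
    }


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    samples = {
        "ASCII 'A'": "A",
        "U+20AC euro sign": "\u20ac",
        "U+1D11E musical G clef": "\U0001d11e",
        "'e' + combining acute": "e\u0301",
    }
    for label, text in samples.items():
        print(label, code_units(text))
    print([hex(unit) for unit in surrogate_pair(0x1D11E)])
```

The G clef is the case that breaks the "one character fits in one 16-bit
unit" assumption, since it needs two. The last sample is the other half of the
point: one visible character can still be two code points, so even UTF-32
does not give one unit per character.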
So I don't know what position to take. They all look like they suck a lot. My
vote was for the "everyone use English" solution, but that didn't go over too
well ;-) So those more knowledgeable than I on the subject are encouraged to
step up to the mike.
And, yes, converting Abi to use anything but UCS2 will be a PITA.
Dom