
Need Unicode strings


2006-12-11 05:02:13 AM
delphi197
Hello,
First, I must say I am aware that CodeGear is working to bring Unicode to
Delphi. However, it is not clear what Unicode support will be in the next
version. The bad thing is that current Delphi strings (I have 2005, but this
applies to 2006 too) can't handle all Unicode characters. Here's the explanation:
Delphi uses WideString to handle UCS-2 characters, those in the range 0 to
65535, because a UCS-2 char (WideChar) always occupies two bytes (or a word, if
you like). Unfortunately, the range of Unicode characters is 0 to $10FFFF
(1114111). UCS-2 was the native "Unicode" format in Win9x/NT. From Windows 2000
on, the native Unicode format changed to UTF-16, but UCS-2 remains for backward
compatibility. We can compare UCS to UTF as fixed-length-of-char versus
varying-length-of-char: UTF-16 uses a pair of words (a surrogate pair) to
encode characters beyond UCS-2.
All I need is a new string type (or types) in addition to
short/long/wide strings. For example, Windows's native varying-length
UTF16Char/UTF16String, or fixed-length UCS4Char/UCS4String. UCS4 is similar
to UCS2, but each char is fixed at 4 bytes, so it can handle all current and
future Unicode characters.
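To make the surrogate-pair mechanism concrete, here is a small sketch (in Python, purely for illustration; the arithmetic is the same in any language):

```python
def utf16_surrogate_pair(cp):
    """Encode a code point beyond the BMP as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                      # 20 bits remain after the offset
    high = 0xD800 | (cp >> 10)         # high (lead) surrogate: top 10 bits
    low = 0xDC00 | (cp & 0x3FF)        # low (trail) surrogate: bottom 10 bits
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside UCS-2's 0..65535 range:
print([hex(w) for w in utf16_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
```

A real UTF-16 encoder produces exactly these two words, which is why such a character needs two WideChars, not one.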
Thank you
 
 

Re:Need Unicode strings

Quote
All I need is having new string type(s) in addition to
I vote UTF8 :-)
best regards
Thomas
 

Re:Need Unicode strings

IMO the dream solution would be to get not one, but three new types: UTF8, UTF16
and UTF32, with automatic casting between them, but *without* automatic casting
to the non-Unicode string types (AnsiString, WideString).
Silent AnsiString <-> WideString conversions cause more pain than they ease.
And no array accessor of character type on the UTF string types, only
Byte/Word/Cardinal. If you need to browse the characters, you would have to use
an iterator of some kind.
Eric
 

Re:Need Unicode strings

Quote
IMO the dream solution would be to get not one, but three new types: UTF8,
UTF16 and UTF32, with automatic casting between them, but *without*
automatic casting to the non-Unicode string types (AnsiString,
WideString).
Yes, casting between true Unicode types would be a piece of cake. However,
while casting from WideString to UTFxx will always work, casting in the
opposite direction should raise an exception when the converted character is
outside the BMP.
Quote
Silent AnsiString <->WideString conversions cause more pain than they
ease.
What about converting to AnsiString using the UTF/WideString encoding, so that
AnsiString := UTF8String;
would be equivalent to
AnsiString := Utf8Encode(UTF8String) ?
Quote
And no array accessor of type character on the UTF strings types, only
Byte/Word/Cardinal. If you need to browse the characters, you would have
to use an iterator of some kind.
I expect to have a fully supported string type, including Copy, Delete and
Pos, with an array accessor.
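The asymmetry described above can be sketched like this (Python for illustration; `to_ucs2` is a hypothetical helper standing in for a UTFxx -> WideString cast):

```python
s = "G clef: \U0001D11E"               # contains a code point outside the BMP

# Casting between true Unicode encodings is always lossless:
assert s.encode("utf-8").decode("utf-8") == s
assert s.encode("utf-16-le").decode("utf-16-le") == s

# But a strict UCS-2 target (exactly one 16-bit unit per character)
# must reject anything outside the BMP:
def to_ucs2(text):
    units = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("U+%06X is outside the BMP" % ord(ch))
        units.append(ord(ch))
    return units

to_ucs2("abc")                         # fine
try:
    to_ucs2(s)                         # raises: U+01D11E is outside the BMP
except ValueError as e:
    print(e)
```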
 

Re:Need Unicode strings

"Eric Grange" <XXXX@XXXXX.COM>writes
Quote
IMO the dream solution would be to get not one, but three new types: UTF8,
UTF16 and UTF32, with automatic casting between them, but *without*
automatic casting to the non-Unicode string types (AnsiString,
WideString).
Silent AnsiString <-> WideString conversions cause more pain than they
ease.

And no array accessor of character type on the UTF string types, only
Byte/Word/Cardinal. If you need to browse the characters, you would have
to use an iterator of some kind.
Basically a CodePoint iterator?
However, in some cases a full glyph (character) is composed of more than
one code point.
For example, e with trema (ë) might be two code points.
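A quick illustration of that point (Python; its strings iterate per code point, not per glyph, so the lengths differ):

```python
import unicodedata

composed = "\u00EB"        # "ë" as a single precomposed code point
decomposed = "e\u0308"     # "e" followed by U+0308 COMBINING DIAERESIS

# Two different code point sequences...
assert len(composed) == 1
assert len(decomposed) == 2
# ...but the same user-perceived character after NFC normalization:
assert unicodedata.normalize("NFC", decomposed) == composed
```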
 

Re:Need Unicode strings

Quote
Basically a CodePoint iterator?
However, in some cases a full glyph (character) is composed of more than
one code point.
For example, e with trema (ë) might be two code points.
There is nothing preventing having a GlyphIterator too :)
Ideally, you could have filtering iterators too, e.g. an iterator that
returns UpperCase versions of the glyphs, one that skips particular
characters, etc.
The whole rationale would be to move away from the old "a string is an
array of characters" concept, to something more like "a string is a
stream of characters", with random access being made hard rather than
simple.
String libraries should ideally also be modernized, probably based on
iterators, streamers and filters rather than on string functions/methods.
Eric
 

Re:Need Unicode strings

"Eric Grange" <XXXX@XXXXX.COM>writes
Quote
>Basically a CodePoint iterator?
>However, in some cases a full glyph (character) is composed of more
>than one code point.
>For example, e with trema (ë) might be two code points.

There is nothing preventing having a GlyphIterator too :)
Ideally, you could have filtering iterators too, e.g. an iterator that
returns UpperCase versions of the glyphs, one that skips particular
characters, etc.

The whole rationale would be to move away from the old "a string is an
array of characters" concept, to something more like "a string is a stream
of characters", with random access being made hard rather than simple.

String libraries should ideally also be modernized, probably based on
iterators, streamers and filters rather than on string functions/methods.
Yes.
However, this will rub the backwards-compatibility crowd the wrong way.
Although I don't really see how you can do this backwards-compatibly at all
anyway, so maybe this is a good thing.
As a concession you could add implicit utf8 <-> CP_ACP conversions when
assigning a normal string, or utf8 <-> ucs2 (or even better, utf16)
conversions for WideStrings.
There is even a case for using ucs4, to ensure nobody passes a PChar
containing utf8 as a CP_ACP string.
greets, Bart
 

Re:Need Unicode strings

"Piotr Szturmaj" <XXXX@XXXXX.COM>writes:
Quote
Hello,
Hi, Piotr.
Do you want to know how Borl, er, CodeGear is going to
handle Unicode?
I suggest you DON'T WAIT for CodeGear's Unicode support;
just be practical.
I think that is going to TAKE SOME TIME,
whether for CodeGear management reasons or technical ones...
There are several Unicode support libraries and controls, some
of them open source, some of them paidware; but the investment
in those paidware tools is worth the price...
Quote
varying-length-of-char. UTF-16 uses pair of words to encode
characters beyond UCS-2.
I have tried to make my custom controls Unicode-aware, and my
experience and suggestion to you is to use the
FIXED-LENGTH-OF-CHAR Unicode versions (UCS-2, UCS-4),
also known as the
"all characters use the same number of bytes" approach:
they're easier to handle and FAST
in your applications...
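The speed argument is essentially about random access; a sketch (Python for illustration, using UTF-32 as the byte-level equivalent of UCS-4):

```python
word = "žluťoučký"                     # 9 characters, several non-ASCII

# Variable-width UTF-8: character count != byte count, so finding the
# Nth character requires scanning from the start of the string:
assert len(word.encode("utf-8")) > len(word)

# Fixed-width UCS-4/UTF-32: every character occupies exactly 4 bytes,
# so the Nth character sits at byte offset 4*N -- O(1) random access:
ucs4 = word.encode("utf-32-le")
n = 3
unit = ucs4[4 * n : 4 * n + 4]
assert int.from_bytes(unit, "little") == ord(word[n])   # "ť"
```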
Quote
future Unicode characters.
Whether you choose UCS2 or UCS4, I suggest you "wrap" them
with classes, records or pointers, and when
CodeGear gets its Unicode support ready, you can
migrate to CodeGear's types.
My approach was to use "null-terminated Unicode strings":
--------------
...
type
  TUCS2Char = Word;
  PUCS2Char = ^TUCS2Char;
  TUCS2String = array [0..65535] of TUCS2Char;
  PUCS2String = ^TUCS2String;

function UCS2_Assign(const Value: string): PUCS2String;
// returns a NEW zero-character-terminated string pointer
// allocated from dynamic memory

procedure UCS2_Release(var Value: PUCS2String);
// releases an EXISTING zero-character-terminated string
// pointer allocated from dynamic memory
--------------
And later, you can easily migrate your existing "Unicode"
libraries to CodeGear's Unicode support.
Quote
Thank you
Just my 2 cents...
maramirezc
 

Re:Need Unicode strings

Quote
However this will rub the backwards compatible crowd the wrong way.
AnsiString and WideString could remain the way they are, just as legacy
types, flaggable as "deprecated" via a compiler option.
Quote
Although I don't really see how you can do this backwards-compatibly at all
anyway, so maybe this is a good thing.
New functionality should ideally be limited to the new string types. Old code
will keep working, and you would have conversion functions to interface
old code with the new types, and vice versa.
Quote
As a concession you could add implicit utf8 <-> CP_ACP conversions when
assigning a normal string, or utf8 <-> ucs2 (or even better, utf16)
conversions for WideStrings.
Implicit conversions are error nests in general, and in this case, IMO,
having implicit conversions with the legacy types would just allow the
issues that are nowadays plaguing AnsiString and WideString to survive:
namely, old code that garbles characters because it mixes up System,
User and String/Charset locales for AnsiStrings (i.e. most of the
AnsiString code out there), or code that confuses UCS-2 with UTF-16 (i.e.
most of the WideString code out there).
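The UCS-2/UTF-16 confusion mentioned above is easy to demonstrate (Python for illustration):

```python
s = "\U0001F600"                       # one code point, outside the BMP

# Its UTF-16 encoding is a surrogate pair: two 16-bit units.
utf16 = s.encode("utf-16-le")
assert len(utf16) == 4

# Code written with UCS-2 assumptions counts 16-bit units and sees
# two "characters" where there is really only one:
naive_ucs2_length = len(utf16) // 2
assert naive_ucs2_length == 2

# A UTF-16-aware length counts the surrogate pair as a single character:
assert len(s) == 1
```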
Eric
 

Re:Need Unicode strings

Eric Grange writes:
Quote
IMO the dream solution would be to get not one, but three new types:
UTF8, UTF16 and UTF32, with automatic casting between them, but
*without* automatic casting to the non-Unicode string types (AnsiString,
WideString).
Silent AnsiString <-> WideString conversions cause more pain than they
ease.

And no array accessor of character type on the UTF string types, only
Byte/Word/Cardinal. If you need to browse the characters, you would have
to use an iterator of some kind.
I like this proposal quite a bit. Is it in QC? I'd certainly
vote for it if it were.
--
Brian Moelk
Brain Endeavor LLC
XXXX@XXXXX.COM
 

Re:Need Unicode strings

Quote
I like this proposal quite a bit. Is it in QC? I'd certainly
vote for it if it were.
I've just found one:
qc.borland.com/wc/qcmain.aspx
 

Re:Need Unicode strings

Piotr Szturmaj writes:
Quote
I've just found one:
qc.borland.com/wc/qcmain.aspx
That one is good in terms of fixing the current UTF8String
implementation, but it is a bit different from what Eric is proposing.
Nonetheless, it is worth voting for, IMO.
--
Brian Moelk
Brain Endeavor LLC
XXXX@XXXXX.COM