unicode question


2004-03-03 09:03:37 PM
cppbuilder102
Hi !
I need to load a Unicode file into a WideString, but this is the first time I
am doing this. I wrote a small piece of code that works perfectly with
AnsiString, but as soon as I try to make it work with WideString, it
fails.
The problem is that after the routine runs, the WideString does not contain
the data I loaded.
Could somebody help me ?
Big thank you in advance,
The code :
//---------------------------------------------------------------------------
void __fastcall Tmain_form::Button1Click(TObject *Sender)
{
    //---------------------
    TMemoryStream * file_buffer = new TMemoryStream();
    TStringStream * buffer = new TStringStream("");
    WideString data = "";
    //---------------------
    //-----------------
    try
    {
        file_buffer->Clear();
        file_buffer->Position = 0;
        buffer->Position = 0;
        if (FileExists("C:\\rules.reg"))
        {
            file_buffer->LoadFromFile("C:\\rules.reg");
            file_buffer->Position = 0;
            buffer->CopyFrom(file_buffer,file_buffer->Size);
            buffer->Position = 0;
            data = buffer->DataString.c_str();
        }
    }
    __finally
    {
        delete buffer;
        delete file_buffer;
    }
    //-----------------
    //---------------------
}
//---------------------------------------------------------------------------
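
A likely cause, assuming C:\rules.reg is a UTF-16 little-endian file with a byte-order mark: TStringStream keeps the bytes in an AnsiString, and c_str() hands the WideString a char* that is converted as ANSI text and stops at the first zero byte, which in UTF-16LE comes right after the first ASCII-range character. The sketch below shows the byte-level alternative in portable C++ rather than VCL; decode_utf16le and load_utf16le_file are hypothetical names, not part of any library. In C++Builder the same idea would mean copying the stream's bytes into the wide string directly instead of going through AnsiString.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: turn raw UTF-16LE bytes into 16-bit code units,
// dropping a leading FF FE byte-order mark if one is present.
std::u16string decode_utf16le(const std::vector<unsigned char>& bytes)
{
    std::size_t pos = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        pos = 2;  // skip the BOM

    std::u16string text;
    text.reserve((bytes.size() - pos) / 2);
    for (; pos + 1 < bytes.size(); pos += 2) {
        // Little-endian: low byte first, high byte second.
        text.push_back(static_cast<char16_t>(bytes[pos] | (bytes[pos + 1] << 8)));
    }
    return text;  // a trailing odd byte, if any, is ignored
}

// Hypothetical helper: read a whole file in binary mode and decode it.
// Returns an empty string if the file cannot be opened.
std::u16string load_utf16le_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());
    return decode_utf16le(bytes);
}
```

Zero bytes are ordinary data here, so nothing is cut off at the first one.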
 
 

Re:unicode question

Sorry if this is the wrong forum, but I have a question that doesn't really
seem to fit in any of the forums. My problem is more of a linguistic nature
than a technical one:
Given a string in Unicode, in any foreign language, is there an easy way
to find out if it is predominantly a left-to-right or a right-to-left
string? I need this info to align the string properly.
--
Arthur Hoornweg
(In order to reply by e-mail, please just remove the ".net"
from my e-mail address. Leave the rest of the address intact
including the "antispam" part. I had to take this measure to
counteract unsolicited mail.)
 

Re:unicode question

Arthur Hoornweg wrote:
Quote
Given a string in Unicode, in any foreign language, is there an easy
way to find out if it is predominantly a left-to-right or a
right-to-left string? I need this info to align the string properly.
Have a look at the IMultiLanguage stuff - specifically things like
IMultiLanguage3.DetectOutboundCodePage
--
Colin
 

Re:unicode question

On 19/05/2005 10:44:16, Arthur Hoornweg wrote:
Quote
Sorry if this is the wrong forum, but I have a question that
doesn't really seem to fit in any of the forums.
Actually, the right forum would be
borland.public.delphi.internationalization.general. I've set
follow-ups to that group.
Quote
Given a string in Unicode, in any foreign language, is there an
easy way to find out if it is predominantly a left-to-right or a
right-to-left string? I need this info to align the string properly.
Well, there are several languages that are read from right to left
(Arabic, Hebrew, and several others), so you could scan the
text for characters from these languages and assume right-to-left
alignment under certain conditions. The problem is that you can't
really get a deterministic model for this.
All right-to-left scripts I'm familiar with are actually "complex
scripts" - while the alignment is RTL, the reading order can actually
change (for example, numbers are usually read from left to right). A
block of text can include RTL blocks, LTR blocks, neutral blocks, and
context-dependent characters (for example, punctuation marks are
usually reversed in a right-to-left block).
The larger your text block is, the better your chance is to infer the
proper alignment. If your text contains mostly Hebrew characters, for
example, it's probably safe to align it to the right. With small
blocks of text, though, it becomes much harder.
In my experience, aligning text based on content only works when the
algorithm is strongly biased towards one direction. If you can assume
most of your text is in English, for example, then always align the
text to the left unless you find enough Hebrew/Arabic/other RTL
language characters. It would also make sense to check the first word
of the sentence - it's usually a good indicator of the language.
If you can't assume a default direction, my advice would be to leave
it to the user.
--
Yorai Aminov (TeamB)
(TeamB cannot answer questions received via email.)
Shorter Path - www.shorterpath.com
Yorai's Page - www.yoraispage.com
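
The biased heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the character ranges are a coarse approximation rather than the full Unicode bidirectional character data, and the names is_rtl_letter, is_ltr_letter, should_align_right and the bias parameter are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Rough test for right-to-left characters. Covers the main Hebrew and
// Arabic blocks and their neighbours; an approximation, not the real
// Unicode Bidi_Class property.
bool is_rtl_letter(char32_t c)
{
    return (c >= 0x0590 && c <= 0x08FF)    // Hebrew, Arabic, Syriac, Thaana, ...
        || (c >= 0xFB1D && c <= 0xFDFF)    // Hebrew/Arabic presentation forms A
        || (c >= 0xFE70 && c <= 0xFEFC);   // Arabic presentation forms B
}

// Rough test for left-to-right letters: ASCII letters plus the Latin
// extensions, Greek and Cyrillic blocks (again an approximation).
bool is_ltr_letter(char32_t c)
{
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z')
        || (c >= 0x00C0 && c <= 0x052F);
}

// Align right only when RTL letters clearly outnumber LTR letters.
// bias > 1.0 favours left alignment, bias < 1.0 favours right alignment.
bool should_align_right(const std::u32string& text, double bias = 1.0)
{
    std::size_t rtl = 0;
    std::size_t ltr = 0;
    for (char32_t c : text) {
        if (is_rtl_letter(c))
            ++rtl;
        else if (is_ltr_letter(c))
            ++ltr;
        // ASCII digits, spaces and punctuation are treated as neutral.
    }
    return rtl > 0 && static_cast<double>(rtl) > bias * static_cast<double>(ltr);
}
```

With a mostly English corpus, a bias well above 1.0 keeps short mixed strings left-aligned unless RTL letters dominate.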
 

Re:unicode question

When choosing a unicode encoding, what factors would cause you to pick
UTF-8 vs UTF-16? or any of the other encoding schemes for that matter?
Which one is most popular (utf8?)
Also, if you have an input stream, is it possible to figure out what
encoding scheme was used or do you need to know this ahead of time?
--
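
On the second question: the encoding cannot be determined with certainty from the bytes alone, but many files begin with a byte-order mark (BOM) that gives a strong hint. The sketch below checks only for a BOM; detect_bom is a hypothetical helper, and a stream without a BOM still needs outside knowledge or guessing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess the encoding from a byte-order mark.
// Returns "unknown" when no BOM is present.
std::string detect_bom(const unsigned char* data, std::size_t size)
{
    // UTF-32 must be tested before UTF-16, because the UTF-32LE mark
    // FF FE 00 00 begins with the UTF-16LE mark FF FE.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "unknown";
}
```

One known ambiguity: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE. That is rare in practice but worth knowing.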
 

Re:unicode question

Bob wrote:
Quote
When choosing a unicode encoding, what factors would cause you to pick
UTF-8 vs UTF-16? or any of the other encoding schemes for that matter?
Which one is most popular (utf8?)
If you and the entities you communicate with mainly use the core ASCII
characters, UTF-8 is by far the most compact, because those code points
take only 1 byte each. It only "bumps up" to 2, 3, or 4 bytes if it
absolutely has to. If you use lots of characters other than the first 128,
then UTF-16 will take up about as much space as UTF-8. UTF-32 is most
efficient in one sense, in that everything takes 4 bytes. That makes it a
lot easier to scan for the n'th character.
Most popular? Well, .Net consolidated around UTF-16. Delphi 2008 will do
the same. Unless you have a very specific reason to go otherwise, I
suggest you follow suit.
Loren sZendre
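
The size trade-off above can be made concrete by counting bytes per code point. A minimal sketch, with hypothetical names (utf8_bytes, utf16_bytes, encoded_sizes), that assumes the input is already a sequence of valid code points:

```cpp
#include <cstddef>
#include <string>

// Bytes needed to encode one Unicode code point in UTF-8.
std::size_t utf8_bytes(char32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Latin supplements, Greek, Cyrillic, Hebrew, Arabic
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Bytes needed to encode one code point in UTF-16.
std::size_t utf16_bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  // one code unit or a surrogate pair
}

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Total encoded size of a text in each of the three encodings.
EncodedSizes encoded_sizes(const std::u32string& text)
{
    EncodedSizes sizes{0, 0, 0};
    for (char32_t cp : text) {
        sizes.utf8 += utf8_bytes(cp);
        sizes.utf16 += utf16_bytes(cp);
        sizes.utf32 += 4;  // always four bytes per code point
    }
    return sizes;
}
```

ASCII-heavy text comes out at half the UTF-16 size in UTF-8, while most CJK characters take 3 bytes in UTF-8 against 2 in UTF-16.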
 

Re:unicode question

Bob wrote:
Quote
When choosing a unicode encoding, what factors would cause you to pick
UTF-8 vs UTF-16? or any of the other encoding schemes for that matter?
Which one is most popular (utf8?)

Also, if you have an input stream, is it possible to figure out what
encoding scheme was used or do you need to know this ahead of time?


My experience.
If I save text to a file, I use UTF-8. Most XML is
in UTF-8, and UTF-8 is more compact than UTF-16 or UTF-32.
If I process text in memory, I use UTF-16 when possible. It's easy
to seek to the Nth character.
--
Denomo: memory leak detection tool for Delphi, free open source
CodeHook: Win32 code/API hook for Delphi & C++, free open source
www.kbasm.com/
 

Re:unicode question

Qi wrote:
Quote
My experience.

If I save text to a file, I use UTF-8. Most XML is
in UTF-8, and UTF-8 is more compact than UTF-16 or UTF-32.

If I process text in memory, I use UTF-16 when possible. It's easy
to seek to the Nth character.
Then it's not UTF-16 but UCS-2.
- Florent
 

Re:unicode question

Qi wrote:
Quote
Just looked up "WideString" in Delphi 7 document.
You are right, it's UCS-2, not UTF-16. :-)
Tiburon's UnicodeString type will be UTF-16, so seeking will not be as easy
as in UCS-2 strings.
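
The difference shows up when locating the n-th character. In UCS-2 it is simply index n; in UTF-16 a character outside the Basic Multilingual Plane occupies two code units (a surrogate pair), so the string has to be walked. A minimal sketch in standard C++, with the hypothetical name code_point_offset:

```cpp
#include <cstddef>
#include <string>

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Index of the code unit where the n-th code point (0-based) starts,
// or std::u16string::npos if the string has fewer code points.
std::size_t code_point_offset(const std::u16string& s, std::size_t n)
{
    std::size_t i = 0;
    while (i < s.size()) {
        if (n == 0)
            return i;
        // A well-formed surrogate pair occupies two code units;
        // anything else (including a stray surrogate) counts as one.
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1]))
            i += 2;
        else
            i += 1;
        --n;
    }
    return std::u16string::npos;
}
```

The walk is linear in the length of the string, which is the cost of supporting surrogate pairs.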
 

Re:unicode question

Uffe Kousgaard wrote:
Quote
UTF-8 is the choice for XML and the most compact for western languages,
while UTF-16 is what the Windows API uses and what the next Delphi will use.

Look for more details here: unicode.org/faq/utf_bom.html


A lot of databases, such as PostgreSQL and MySQL, also use UTF-8.
 

Re:unicode question

"Adem" <XXXX@XXXXX.COM> wrote in message
Quote
I don't see why CodeGear selected UTF-16, and not UCS-4 strings, which
would be the natural path.
There will be functions provided for selecting and iterating through the
characters of a UTF-16 string. I've used similar functions for working
with UTF-8, and it is not so bad. But yes, string processing will never be
quite the same. I believe that UTF-16 was the best choice, since it is the
Windows standard - for example, when we read/write a TEdit.Text property we
are actually getting/setting text by a thin layer of code mapping straight
to Windows API calls. So we simply must work in the same way that Windows
does, or we will drown in conversions.
In the early days of Unicode, UCS-2 was intended to hold the entire range of
characters - in the same way that UCS-4 does now. Therefore, Windows NT4
had UCS-2. Windows upgraded that to UTF-16 for Win2000, which was a sensible
decision, since applications built for NT4 continued to work on Win2000
and later. So we use UTF-16 for historical reasons.
Quote
I do hope Tiburon will *also* have UCS-4 *reference* *counted* strings
so that *we* can choose what to use.
It is only a function call to convert between unicode formats, so you can
process in whatever format you choose, although I doubt you will have a
UCS-4 reference counted string type - you will have to use a class or
dynamic array etc. Whatever you do, you will end up converting formats,
since Windows displays in UTF-16, while a number of formats are common for
text files.
Fortunately, we mostly move strings around, rather than process the
individual characters. For string processing, I imagine we will come to
rely upon some optimised library functions. Life goes on.
Roger Lascelles
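
As an illustration of how small such a conversion is, here is a sketch of UTF-16 to UTF-32 decoding in standard C++. The function name utf16_to_utf32 is hypothetical, and replacing unpaired surrogates with U+FFFD is only one possible policy, assumed here for simplicity.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into UTF-32 code points.
// Unpaired surrogates are replaced with U+FFFD (an assumed policy).
std::u32string utf16_to_utf32(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        bool high = unit >= 0xD800 && unit <= 0xDBFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the two 10-bit halves of a surrogate pair.
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(0xFFFD);  // stray surrogate
        } else {
            out.push_back(unit);    // BMP character, copied as-is
        }
    }
    return out;
}
```

After conversion every element is one code point, so indexing by character position is direct, at the cost of 4 bytes per character.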
 

Re:unicode question

On 2008-02-02, Bob <XXXX@XXXXX.COM> wrote:
Quote
When choosing a unicode encoding, what factors would cause you to pick
UTF-8 vs UTF-16?
- Communicating with anything Unix.
- Size matters more than speed.
- A significant portion of the text is in Latin-based character sets.
 

Re:unicode question

Eric Grange wrote:
Quote
Funny how the least practical encoding ended up being the standard...
:/
Precisely.
UTF-8 for static storage and transport, UTF-32 (a.k.a UCS-4) for
anything else...
I cannot help questioning CodeGear's parrot-fashion loyalty to MS's
mistakes.
 

Re:unicode question

Eric Grange wrote:
Quote
UTF-16 (where characters are 2, 4 or 6 bytes long)
There are no 6-byte characters in UTF-16; a code point takes either 2 or 4
bytes (one or two 16-bit code units).
Unless you count BOM as part of the first character in a string.
But you shouldn't do that, IMHO.
--
Regards,
Aleksander Oven
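
The 2-or-4-byte limit follows directly from how UTF-16 encodes a code point. A minimal sketch (encode_utf16 is a hypothetical name; the input is assumed to be a valid code point, so values above U+10FFFF and surrogate values are not checked):

```cpp
#include <string>

// Encode one code point as UTF-16: a single code unit (2 bytes) for the
// Basic Multilingual Plane, a surrogate pair (4 bytes) for U+10000..U+10FFFF.
// No validation is done; the caller is assumed to pass a valid code point.
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

Even the highest code point, U+10FFFF, fits in one surrogate pair, so no character ever needs a third code unit.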
 

Re:unicode question

Adem wrote :
Quote
Eric Grange wrote:

>Funny how the least practical encoding ended up being the standard...
>:/
I cannot help questioning CodeGear's parrot-fashion loyalty to MS's
mistakes.
Well, it would be a waste of resources to move away from the Windows API
default encoding, polluting every call to the API with automatic
translation between encodings, when they can avoid it altogether. If
it is such a big problem for you, I would bet that you will be able to
hold the strings in any encoding you want and just translate them when
needed.
Regards
Yannis.
--
You talk a great deal about building a better world for your children,
but when you are young you can no more envision a world inherited by
your children than you can conceive of dying. The society you mold, you
mold for yourself.
----Russell Baker-------