Topic: Support UTF-8 in String? (->GNU/gettext)

christoph · « **on:** June 05, 2008, 07:11:28 pm »

Hi

Would it be possible to add UTF-8 support to SFML's String class, so that it can be used easily with the gettext library for producing multilingual software?

I can offer you a conversion function that goes from a std::basic_string<char> holding UTF-8 data to a std::basic_string<uint32_t> holding the decoded code points (which should be adaptable to a wchar_t implementation), although this piece of code is kind of obscure.
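For illustration, a UTF-8 to UTF-32 conversion of the kind described above could look roughly like the sketch below. This is not christoph's code (which is not shown in the thread). The function name `decodeUtf8`, the use of `std::u32string` (C++11) instead of `std::basic_string<uint32_t>`, and the choice to throw on malformed input are all assumptions made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string into UTF-32 code points.
// Malformed input is rejected with an exception (an assumed policy).
std::u32string decodeUtf8(const std::string& utf8)
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3 continuation bytes.
    static const char32_t minimum[] = {0x0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < minimum[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

On a platform where wchar_t is 32 bits wide, each resulting code point could be copied into a std::wstring directly. With a 16-bit wchar_t, code points above U+FFFF would still have to be split into surrogate pairs.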

SirJulio · « **Reply #1 on:** June 05, 2008, 07:31:23 pm »

Hi Christoph,

I think the String class already handles Unicode text (ctor, wstring GetText(), SetText(wstring)), no?

christoph · « **Reply #2 on:** June 05, 2008, 07:36:49 pm »

Yes, but gettext gives you a char* containing UTF-8 encoded data, and that does not work currently. You can write a wrapper that transforms this UTF-8 char* into a std::basic_string<wchar_t>, but I guess this would be better placed in SFML than in the application.

And I think the term multibyte for char* data is really confusing, as it does not handle strings with multibyte characters as far as I can tell.

Laurent · « **Reply #3 on:** June 05, 2008, 08:46:58 pm »

SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings...

SirJulio · « **Reply #4 on:** June 05, 2008, 08:50:52 pm »

Ok christoph, I understand.

If you have a solution, maybe you could create an entry in the wiki. =)

christoph · « **Reply #5 on:** June 05, 2008, 09:09:37 pm »

Quote from: "Laurent"

SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings...

OK, we'll have to live with this then. But looking at the string/font sources, the encoding looks more like UCS-2 than UTF-16 to me, though you should know this better than me anyway.

Laurent · « **Reply #6 on:** June 06, 2008, 09:56:57 am »

Quote

But looking at string/font sources the encoding looks more like UCS-2 and less like UTF-16

I'm still not an expert in Unicode encodings, so I can make mistakes sometimes

Why do you say so ?

christoph · « **Reply #7 on:** June 06, 2008, 04:28:18 pm »

I thought the encoding could not be UTF-16 because you assume that every character fits into a single wchar_t. But UTF-16 has some characters that need 2*16 bits (the Unicode code space is 21 bits wide, up to U+10FFFF), and wchar_t is most probably 8, 16 or 32 bits in size. So it will not be UTF-16 but, depending on the size of wchar_t, UCS-2 (16 bit) or UCS-4/UTF-32 (32 bit).
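To make the "2*16 bit" case concrete, here is a minimal sketch of how one code point is written in UTF-16. The function name `encodeUtf16` is made up for this example, and the sketch assumes the input is a valid Unicode scalar value (at most U+10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not in 0xD800..0xDFFF).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Inside the BMP: a single 16-bit unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: 20 bits are split across a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

A container that stores exactly one 16-bit unit per character can only represent the first branch, which is why it behaves like UCS-2 rather than UTF-16.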

Laurent · « **Reply #8 on:** June 07, 2008, 01:13:59 am »

Ok I see. You're probably right

Miriam Ruiz · « **Reply #9 on:** June 07, 2008, 06:40:39 pm »

I think that gettext() and UTF-8 are so widespread in the POSIX world (and also supported in the Windows world) that it would be a mistake not to support them in some way if the idea is to make a multi-platform library.

Miriam Ruiz · « **Reply #10 on:** June 09, 2008, 04:19:38 pm »

After thinking a bit about it, there would be 2 satisfying alternative solutions:

1) add a void String::SetText(const std::string& Text, const std::locale& Locale) method
2) make std::wstring myText protected instead of private, so that the problem can be solved through inheritance

What about that? Would that be acceptable for you? I just don't want to copy and paste sf::String into my app just to be able to add that functionality.

Any other ideas?

Greetings,
Miry
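As a rough idea of what the first proposal might do internally, the sketch below widens a narrow string according to a given std::locale using the standard std::codecvt facet. The free function `widenWithLocale` is hypothetical and is not part of SFML. Whether UTF-8 input is decoded correctly depends on the locale passed in; locale names such as "en_US.UTF-8" are platform dependent and are only an assumption here.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow multibyte string to a wide string
// using the codecvt facet of the given locale.
std::wstring widenWithLocale(const std::string& text, const std::locale& locale)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Converter;
    const Converter& converter = std::use_facet<Converter>(locale);

    if (text.empty())
        return std::wstring();

    // A wide character comes from at least one input byte,
    // so text.size() wide characters are always enough room.
    std::wstring result(text.size(), L'\0');

    std::mbstate_t state = std::mbstate_t();
    const char* fromNext = 0;
    wchar_t* toNext = 0;
    const std::codecvt_base::result status = converter.in(
        state,
        text.data(), text.data() + text.size(), fromNext,
        &result[0], &result[0] + result.size(), toNext);

    if (status != std::codecvt_base::ok)
        throw std::runtime_error("could not convert text with the given locale");

    // Shrink to the number of wide characters actually written.
    result.resize(static_cast<std::size_t>(toNext - &result[0]));
    return result;
}
```

A SetText(const std::string&, const std::locale&) overload could then simply forward the result to the existing wide-string SetText. With the classic "C" locale only plain ASCII is guaranteed to convert.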

workmad3 · « **Reply #11 on:** June 09, 2008, 05:55:55 pm »

UTF-16 is UCS-2 with support for surrogate pairs (which are 32 bits wide) in order to extend the Unicode support of the encoding from the BMP (basic multilingual plane) to the entire Unicode set of code points.

T.T.H. · « **Reply #12 on:** June 10, 2008, 03:16:23 pm »

Unicode, UTF-8, UTF-16, BOM, etc. - trying to understand all those things reasonably well took me a serious amount of time.

Anyway, four things I want to throw in here:

1) a short but good article about "wchar_t: unsafe at any size"

2) in UTF-16, all characters of the so-called "basic multilingual plane" are encoded in 2 bytes (= 16 bits). Those include pretty much all the characters you need to write every "living" language on this planet, plus a lot of other symbol-like characters, pretty much "Wingdings on steroids". Beyond that are some more characters, a lot of fancy stuff like old Germanic rune symbols, ancient Greek numbers and the musical clef. Those characters need 4 bytes to be encoded. So please do not be fooled into thinking that "UTF-16" means "2 byte things only" or that "half the number of bytes of a UTF-16 string == number of characters", which is simply wrong.

3) UTF-32 instead is "every single defined Unicode character there is in 4 bytes, not more, not less".

4) decodeunicode - a lifesaver...
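Point 2 in the list above can be checked with a small sketch that counts code points in a UTF-16 string instead of code units. The function name `countCodePoints` is made up for this example, and the input is assumed to be well-formed UTF-16 (every high surrogate is followed by a low surrogate).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16.
std::size_t countCodePoints(const std::u16string& utf16)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        const char16_t unit = utf16[i];
        // A low surrogate (0xDC00..0xDFFF) is the second half of a pair,
        // so it does not start a new code point.
        if (unit < 0xDC00 || unit > 0xDFFF)
            ++count;
    }
    return count;
}
```

For a string holding only the musical G clef (U+1D11E, stored as the units 0xD834 0xDD1E), size() reports two units while this function reports one code point.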

workmad3 · « **Reply #13 on:** June 10, 2008, 03:42:52 pm »

As I haven't posted it here yet, there's also:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
which is nicely described as 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)'

That was a very useful document for me when I had to deal with UTF-16 to UTF-8 stuff at work.

Laurent · « **Reply #14 on:** June 10, 2008, 04:01:28 pm »

Thanks for the links, I'm going to read them all and try to think about a more robust and flexible implementation for Unicode in SFML