SFML community forums

General => Feature requests => Topic started by: christoph on June 05, 2008, 07:11:28 pm

Title: Support UTF-8 in String? (->GNU/gettext)
Post by: christoph on June 05, 2008, 07:11:28 pm
Hi

Would it be possible to add UTF-8 support to SFML's String class, so that one can use it easily with the gettext library for producing multilingual software?

I can offer you a conversion routine that goes from a std::basic_string<char> holding UTF-8 data to a std::basic_string<uint32_t> holding the widened data (it should be adaptable to a wchar_t implementation), although this piece of code is kind of obscure.
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: SirJulio on June 05, 2008, 07:31:23 pm
Hi Christoph,

I think the String class already handles Unicode text (ctor, wstring GetText(), SetText(wstring)), no?
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: christoph on June 05, 2008, 07:36:49 pm
Yes, but with gettext you get a char* containing UTF-8 encoded data, and this does not work currently. You can write a wrapper that transforms this char* / UTF-8 data into a std::basic_string<wchar_t>, but I guess it would be better to have this in SFML and not in the application.

And I think the term "multibyte" for char* data is really confusing, since as far as I can tell it does not handle strings containing multibyte characters.
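
For readers who want to see what such a wrapper involves, here is a minimal sketch of a UTF-8 decoder. It is not christoph's code and not part of SFML: the function name Utf8ToUtf32 is hypothetical, it assumes a C++11 compiler, and it deliberately skips checks for overlong encodings and encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an SFML function): decode UTF-8 bytes into
// UTF-32 code points. Rejects bad lead/continuation bytes and truncated
// input, but does NOT reject overlong forms or encoded surrogates.
std::u32string Utf8ToUtf32(const std::string& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F); // each continuation byte adds 6 bits
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

On a platform where wchar_t is 32 bits wide, the resulting code points could be copied one by one into a std::wstring; with a 16-bit wchar_t, code points above U+FFFF would not fit into a single element.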
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: Laurent on June 05, 2008, 08:46:58 pm
SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings... ;)
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: SirJulio on June 05, 2008, 08:50:52 pm
Ok christoph, I understand.

If you have a solution, maybe you could create an entry in the wiki. =)
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: christoph on June 05, 2008, 09:09:37 pm
Quote from: "Laurent"
SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings... ;)


OK, we'll have to live with this then. But looking at string/font sources the encoding looks more like UCS-2 and less like UTF-16 to me, though you should know this better than me anyway ;)
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: Laurent on June 06, 2008, 09:56:57 am
Quote
But looking at string/font sources the encoding looks more like UCS-2 and less like UTF-16

I'm still not an expert in Unicode encodings, so I can make mistakes sometimes ;)
Why do you say so?
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: christoph on June 06, 2008, 04:28:18 pm
I was thinking that the encoding is not UTF-16 because you assume that every character fits into a single wchar_t. But UTF-16 has some characters that need 2*16 bit (the Unicode code space is 21 bits wide, and wchar_t is most probably 8, 16 or 32 bits in size), so it will not be UTF-16 but, depending on the size of wchar_t, UCS-2 (16 bit) or UCS-4/UTF-32 (32 bit).
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: Laurent on June 07, 2008, 01:13:59 am
Ok I see. You're probably right :D
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: Miriam Ruiz on June 07, 2008, 06:40:39 pm
I think that gettext() and UTF-8 are so widespread in the POSIX world (and are also supported in the Windows world) that it would be a mistake not to support them in some way if the idea is to build a multi-platform library.
Title: Alternatives
Post by: Miriam Ruiz on June 09, 2008, 04:19:38 pm
After thinking a bit about it, I see 2 alternative solutions that would be satisfying:

1) add a void String::SetText(const std::string& Text, const std::locale& Locale) method
2) make std::wstring myText protected instead of private, so that the problem can be solved through inheritance

What about that? Would that be admissible for you? I just don't want to copy and paste sf::String into my app just to be able to add that functionality.

Any other ideas?

Greetings,
Miry
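
To make option 1 more concrete, here is a minimal sketch of a locale-based widening step such a SetText overload might perform. The free function Widen is hypothetical and is not how SFML implements anything; it only relies on the standard std::ctype<wchar_t> facet.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widen a narrow string one char at a time using the
// ctype facet of the given locale. Because std::ctype::widen works on
// single chars, it cannot decode multibyte encodings such as UTF-8.
std::wstring Widen(const std::string& text, const std::locale& loc)
{
    std::wstring out(text.size(), L'\0');
    if (!text.empty())
    {
        const std::ctype<wchar_t>& facet =
            std::use_facet<std::ctype<wchar_t> >(loc);
        facet.widen(text.data(), text.data() + text.size(), &out[0]);
    }
    return out;
}
```

The comment in the code points at the catch: a per-char widening like this does not solve the gettext case by itself, since UTF-8 text from gettext would still need a real multibyte decoder.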
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: workmad3 on June 09, 2008, 05:55:55 pm
UTF-16 is UCS-2 plus support for surrogate pairs (which are 32 bits wide), which extend the encoding's coverage from the BMP (Basic Multilingual Plane) to the entire set of Unicode code points.
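
As an illustration of how a surrogate pair is formed, here is a minimal sketch in C++11. The function name AppendUtf16 is hypothetical, and the code assumes its input is a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Assumes cp <= 0x10FFFF and cp is not in the surrogate range D800..DFFF.
void AppendUtf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        // BMP code point: one 16-bit unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Supplementary code point: split the 20-bit offset into two
        // 10-bit halves carried by a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

For example, U+1D11E (the musical G clef) becomes the two units 0xD834 0xDD1E, while any BMP character stays a single unit.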
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: T.T.H. on June 10, 2008, 03:16:23 pm
Unicode, UTF-8, UTF-16, BOM, etc. - trying to understand all those things somewhat decently took me a serious amount of time.

Anyway, four things I want to throw in here:

1) a short but good article about "wchar_t: unsafe at any size (http://www.losingfight.com/blog/2006/07/28/wchar_t-unsafe-at-any-size/)"

2) in UTF-16 all characters of the so-called "basic multilingual plane" are encoded in 2 bytes (= 16 bit). Those include pretty much all the characters you need to write every "living" language on this planet, plus a lot of other symbol-like characters, pretty much "Wingdings on steroids". Beyond that are some more characters, a lot of fancy stuff like the old Gothic alphabet, ancient Greek numbers and the music clef. Those characters need 4 bytes to be encoded. So please do not be fooled into thinking that "UTF-16" means "2 byte things only" or that "half the number of bytes of a UTF-16 string == number of characters", which is simply wrong.

3) UTF-32 instead is "every single defined Unicode character there is in 4 bytes, not more, not less".

4) decodeunicode (http://decodeunicode.org/) - a life saver...
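
Point 2 can be made concrete with a small sketch that counts code points rather than 16-bit units. The function name CountCodePoints is hypothetical, and the code assumes well-formed UTF-16 (every low surrogate follows a high surrogate).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-16 string.
// Every code point contributes exactly one unit that is NOT a low
// (trailing) surrogate, so counting those units gives the answer.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const char16_t unit = s[i];
        const bool isLowSurrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!isLowSurrogate)
            ++count;
    }
    return count;
}
```

A string holding "a" followed by U+1D11E has three 16-bit units (six bytes) but only two code points, so neither size() nor the byte count divided by two gives the number of characters. Note that even the code point count is not the number of visible characters once combining marks are involved.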
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: workmad3 on June 10, 2008, 03:42:52 pm
As I haven't posted it here yet, there's also:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
which is nicely described as 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)'

That was a very useful document for me when I had to deal with UTF-16 to UTF-8 stuff at work.
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: Laurent on June 10, 2008, 04:01:28 pm
Thanks for the links, I'm going to read them all and try to think about a more robust and flexible implementation for Unicode in SFML :)
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: christoph on June 10, 2008, 08:44:06 pm
Quote from: "Wavesonics"
This may be out of the scope of SFML, but man oh man it would be sweet,
SFML should provide a way to do toUpper() and toLower() type functions for at least English language strings in the sf::String class.

Just because new programmers (hell, even older programmers like myself) find it frustrating to do this in C++


Boost has a decent implementation of this for std::string in their stringutils, if you are interested
Title: Support UTF-8 in String? (->GNU/gettext)
Post by: T.T.H. on June 11, 2008, 09:33:25 am
...and I totally forgot point 5:

5) if you intend to seriously work with Unicode and UTF-8 and UTF-16 in your next C++ or Java application then probably your best choice is to use the ICU library from IBM ( http://www.icu-project.org/ ) which is open source, feature rich, mature and -probably most important- considered to be working correctly. Yes, it's a huge monster of a library, but if you want to create a truly international application you need such a monster of a huge library.

(ever asked yourself what an invisible, zero width, text direction changing character will do to your text render engine? ever asked yourself how selecting text with the mouse will work in a text editor that is able to seamlessly mix left-to-right and right-to-left text? ever asked yourself how to compare two visibly totally identical strings which contain several different control characters? No? Oh, what a pity, but welcome to the wonderful world of Unicode...)


P.S.: have fun selecting the text below (it's from an Arabic news site, probably something about soccer):

English left to right text here ديفيد فيا يسطع في سماء يورو 2008 وصدمة لليونان في بداية رحلة الدفاع عن again some left to right text