
Topic: Support UTF-8 in String? (->GNU/gettext)


christoph

« on: June 05, 2008, 07:11:28 pm »
Hi

Would it be possible to add UTF-8 support to SFML's String class, so that one can easily use it with the gettext library to produce multilingual software?

I can offer you a conversion method to get from a std::basic_string<char> holding UTF-8 data to a std::basic_string<uint32_t> holding widened data (which should be adaptable to a wchar_t implementation), although this piece of code is kind of obscure.
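
For illustration, such a conversion could look roughly like the sketch below. This is not christoph's code and not part of SFML: the name utf8_to_utf32 is hypothetical, and std::u32string (C++11) stands in for std::basic_string<uint32_t>. It decodes each UTF-8 sequence into one code point and rejects malformed input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an SFML function): decode UTF-8 bytes into
// UTF-32 code points, throwing std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and values past U+10FFFF.
        static const char32_t min_for_len[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In an application, the char* returned by gettext() would be passed to this function, assuming the message catalog is configured to deliver UTF-8.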

SirJulio

« Reply #1 on: June 05, 2008, 07:31:23 pm »
Hi Christoph,

I think the String class already handles Unicode text (constructor, wstring GetText(), SetText(wstring)), doesn't it?

christoph

« Reply #2 on: June 05, 2008, 07:36:49 pm »
Yes, but with gettext you get a char* containing UTF-8 encoded data, and this currently does not work. You can write a wrapper that transforms this char* / UTF-8 data into a std::basic_string<wchar_t>, but I guess it would be better to have this in SFML rather than in the application.

Also, I find the term "multibyte" for char* data really confusing, since as far as I can tell it does not actually handle strings containing multibyte characters.

Laurent

« Reply #3 on: June 05, 2008, 08:46:58 pm »
SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings... ;)
Laurent Gomila - SFML developer

SirJulio

« Reply #4 on: June 05, 2008, 08:50:52 pm »
Ok christoph, I understand.

If you have a solution, maybe you could create an entry in the wiki. =)

christoph

« Reply #5 on: June 05, 2008, 09:09:37 pm »
Quote from: "Laurent"
SFML uses UTF-16 for Unicode (in the String class and in the TextEntered event). If you have any other kind of input encoding, it's up to you to write a conversion function. I won't put functions into SFML to convert between all kinds of character encodings... ;)


OK, we'll have to live with this then. But looking at the String/Font sources, the encoding looks more like UCS-2 than UTF-16 to me. You should know this better than me anyway ;)

Laurent

« Reply #6 on: June 06, 2008, 09:56:57 am »
Quote
But looking at string/font sources the encoding looks more like UCS-2 and less like UTF-16

I'm still not an expert in Unicode encodings, so I can make mistakes sometimes ;)
Why do you say so?
Laurent Gomila - SFML developer

christoph

« Reply #7 on: June 06, 2008, 04:28:18 pm »
I don't think the encoding is UTF-16, because you assume that every character fits into a single wchar_t. UTF-16 has some characters that need 2 × 16 bits (the Unicode code space runs up to U+10FFFF, i.e. 21 bits), and wchar_t is most probably 8, 16 or 32 bits in size. So it is not UTF-16 but, depending on the size of wchar_t, UCS-2 (16 bits) or UCS-4/UTF-32 (32 bits).
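
To make the distinction concrete, here is a rough sketch of what storing code points in a std::wstring would have to do to be real UTF-16 on platforms with a 16-bit wchar_t. The helper names append_code_point and to_wide are made up for this example and do not exist in SFML; the sketch also assumes wchar_t is at least 16 bits wide.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: store one Unicode code point in a std::wstring,
// using the encoding that fits the platform's wchar_t.
void append_code_point(std::wstring& out, char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
        // Fits in one unit: UTF-32 on a 32-bit wchar_t, or a BMP character.
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        // 16-bit wchar_t and a character outside the BMP:
        // emit a UTF-16 surrogate pair (two 16-bit units).
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}

// Hypothetical helper: convert a whole UTF-32 string.
std::wstring to_wide(const std::u32string& in)
{
    std::wstring out;
    for (char32_t cp : in)
        append_code_point(out, cp);
    return out;
}
```

Code that simply casts each character to wchar_t corresponds to the first branch only, which is why it behaves like UCS-2 (16-bit wchar_t) or UTF-32 (32-bit wchar_t) rather than UTF-16.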

Laurent

« Reply #8 on: June 07, 2008, 01:13:59 am »
Ok I see. You're probably right :D
Laurent Gomila - SFML developer

Miriam Ruiz

« Reply #9 on: June 07, 2008, 06:40:39 pm »
I think that gettext() and UTF-8 are so widespread in the POSIX world (and also supported in the Windows world) that it would be a mistake not to support them in some way if the idea is to build a multi-platform library.

Miriam Ruiz

Alternatives
« Reply #10 on: June 09, 2008, 04:19:38 pm »
After thinking a bit about it, I see two satisfactory alternatives:

1) add a void String::SetText(const std::string& Text, const std::locale& Locale) method
2) make std::wstring myText protected instead of private, so that this can be solved through inheritance

What about that? Would that be acceptable to you? I just don't want to copy and paste sf::String into my app just to be able to add that functionality.

Any other ideas?

Greetings,
Miry
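
As a rough sketch of what option 1 might do internally, the free function below converts a narrow string to a std::wstring using the codecvt facet of a given std::locale. The name widen is hypothetical and this is not SFML code; whether UTF-8 input is decoded correctly depends on the locale passed in (for example, a UTF-8 locale installed on the system), which is an assumption the caller has to satisfy.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a narrow (possibly multibyte) string to a
// wide string according to the character encoding of the given locale.
std::wstring widen(const std::string& text, const std::locale& loc)
{
    if (text.empty())
        return std::wstring();

    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    // A multibyte string never produces more wide characters than bytes.
    std::wstring out(text.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    const std::codecvt_base::result r = facet.in(
        state,
        text.data(), text.data() + text.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed for this locale");

    // Shrink to the number of wide characters actually written.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

A SetText overload along the lines of option 1 could then forward widen(Text, Locale) to the existing wide-string setter.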

workmad3

« Reply #11 on: June 09, 2008, 05:55:55 pm »
UTF-16 is UCS-2 plus support for surrogate pairs (two 16-bit code units, 32 bits in total), which extend the encoding from the BMP (Basic Multilingual Plane) to the entire set of Unicode code points.

T.T.H.

« Reply #12 on: June 10, 2008, 03:16:23 pm »
Unicode, UTF-8, UTF-16, BOM, etc.: trying to understand all those things reasonably well took me a serious amount of time.

Anyway, four things I want to throw in here:

1) a short but good article about "wchar_t: unsafe at any size"

2) In UTF-16, all characters of the so-called "Basic Multilingual Plane" are encoded in 2 bytes (= 16 bits). These include pretty much all the characters you need to write every "living" language on this planet, plus a lot of symbol-like characters ("Wingdings on steroids"). Beyond that are some more characters, a lot of fancy stuff like historic scripts (Gothic, for example), ancient Greek numbers and the musical clef. Those characters need 4 bytes to be encoded. So please do not be fooled into thinking that "UTF-16" means "2-byte things only", or that "half the number of bytes of a UTF-16 string == number of characters", which is simply wrong.

3) UTF-32 instead is "every single defined Unicode character there is in 4 bytes, not more, not less".

4) decodeunicode - a lifesaver...
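
Point 2 can be checked with a few lines of code. The sketch below counts characters (code points) in a UTF-16 string by treating each surrogate pair as one character; count_code_points is a made-up name for this example, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Count Unicode characters (code points) in a UTF-16 string.
// Assumes well-formed input; an unpaired surrogate counts as one character.
std::size_t count_code_points(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        const bool next_is_low = i + 1 < s.size() &&
                                 s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && next_is_low)
            ++i;  // skip the low surrogate that completes the pair
        ++count;
    }
    return count;
}
```

For the string u"A\u20AC\U0001D11E" (a letter, the euro sign and the musical G clef), the data occupies 8 bytes, so "bytes / 2" gives 4, while the function reports 3 characters.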

workmad3

« Reply #13 on: June 10, 2008, 03:42:52 pm »
As I haven't posted it here yet, there's also:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
which is nicely described as 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)'

That was a very useful document for me when I had to deal with UTF-16 to UTF-8 stuff at work.

Laurent

« Reply #14 on: June 10, 2008, 04:01:28 pm »
Thanks for the links, I'm going to read them all and try to think about a more robust and flexible implementation for Unicode in SFML :)
Laurent Gomila - SFML developer

 
