Author Topic: basic_string<Unsigned Char> to basic_string<char> for UTF-8  (Read 10903 times)

Ant

  • Newbie
  • *
  • Posts: 28
basic_string<Unsigned Char> to basic_string<char> for UTF-8
« on: January 07, 2017, 07:13:58 am »
Ever since Microsoft announced support for Unicode string literals in Visual Studio 2015, I've been slowly replacing my wide strings with UTF-8 basic strings.
https://msdn.microsoft.com/en-us/library/hh409293.aspx

I ran into an issue with std::basic_string<unsigned char>.
With std::basic_string<unsigned char> (instead of std::basic_string<char>), I lose some interoperability with the standard library.  For example, I can't do this:
std::basic_string<unsigned char> something;
std::cout << something << std::endl; //error here
 

The sf::String::toUtf8 function returns std::basic_string<Uint8>, which resolves to std::basic_string<unsigned char>.

Here's an example of an issue I'm running into:
void LogIt (const std::string& log)
{
        std::cout << log;
        OutputDebugString(log.c_str());
}

void LogIt (const std::basic_string<unsigned char>& log)
{
        //Need to convert basic_string<sf::Uint8> to basic_string<char>
}

void DoStringTest ()
{
        //Value of string copied from:  http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
        std::string utf8String = u8"z\u00df\u6c34\U0001d10b \n";
        std::string normalString = "z\u00df\u6c34\U0001d10b \n"; //invalid chars and compiler warnings
        std::basic_string<char> utf8Basic = u8"z\u00df\u6c34\U0001d10b \n";
        std::basic_string<char> normalBasic = "z\u00df\u6c34\U0001d10b \n"; //invalid chars and compiler warnings

        sf::String sfString = utf8String;
        //std::string something = sfString.toUtf8(); //error
        std::basic_string<sf::Uint8> fromSF = sfString.toUtf8();

        LogIt(utf8String); //produces "zÃ...." The remaining characters cannot be displayed in this forum post.
        LogIt(utf8Basic); //produces "zÃ...." The remaining characters cannot be displayed in this forum post.
        LogIt(sfString); //produces "zÃ...." The remaining characters cannot be displayed in this forum post.
        LogIt(fromSF); //Goes to overloaded function with param std::basic_string<unsigned char>
}
 
Compiled in Microsoft Visual Studio 2015 Update 1
Using SFML version 2.4.0.

I'm not sure whether I should have created a feature request to have that function return a std::basic_string<char> instead, or whether I should look at this problem from a different angle.

Thanks for your help.
« Last Edit: January 07, 2017, 07:16:59 am by Ant »

Laurent

  • Administrator
  • Hero Member
  • *****
  • Posts: 32498
Re: basic_string<Unsigned Char> to basic_string<char> for UTF-8
« Reply #1 on: January 07, 2017, 09:36:58 am »
If you want to convert a sf::String to a UTF-8 std::string, you have two options. Either convert the std::basic_string<unsigned char> to a std::string (with std::transform or a simple loop; it's just a type conversion), or reimplement sf::String::toUtf8 yourself. Internally it's just a single call to Utf32::toUtf8, and that function accepts any type of output character.
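
As a rough sketch of the first option (the loop-based type conversion), something like the following should work. The name toStdString is made up for illustration and is not part of SFML. It takes an iterator pair, so it can be fed the begin() and end() of the std::basic_string<sf::Uint8> returned by sf::String::toUtf8.

```cpp
#include <string>

// Hypothetical helper (not an SFML function): copies a range of UTF-8 code
// units stored as unsigned char into a std::string. The bytes themselves are
// unchanged; only the element type goes from unsigned char to char.
template <typename InputIt>
std::string toStdString(InputIt first, InputIt last)
{
    std::string out;
    for (; first != last; ++first)
        out.push_back(static_cast<char>(*first));
    return out;
}
```

With the code from the first post, the second LogIt overload could then forward to the first one with LogIt(toStdString(log.begin(), log.end())).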

Note that your UTF-8 to sf::String conversion is broken: you should use sf::String::fromUtf8. Otherwise, sf::String assumes that the input std::string is encoded using the current locale (which, by default, is certainly not UTF-8 on Windows).
« Last Edit: January 07, 2017, 09:39:01 am by Laurent »
Laurent Gomila - SFML developer

Ant

  • Newbie
  • *
  • Posts: 28
Re: basic_string<Unsigned Char> to basic_string<char> for UTF-8
« Reply #2 on: January 23, 2017, 08:27:35 am »
Sorry about the late reply.  I came to realize that I hadn't done enough research to continue with that task.

Over the following week, I did some reading.  Here are a few things I got wrong in the first post.

  • Firstly, I was reading the strings incorrectly.  I've learned that there's a neat trick in Visual Studio's Watch window to interpret the byte array differently.  For example, when you append ",s8" to the variable name in the Watch window, it'll display it as a UTF-8 string.  Other format specifiers can be found here:  https://msdn.microsoft.com/en-us/library/75w45ekt.aspx
  • I found out that the standard library doesn't fully support Unicode strings.  For example, if a UTF-8 string contains a character that is encoded in more than 1 byte, the standard library treats that character as 2 or more separate elements.

Over the following weeks, I implemented a bidirectional string iterator that jumps across one or more bytes based on the size of the current character.  I've used the utf8 library for validation and conversions.  The useful library can be found here:  http://utfcpp.sourceforge.net/
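
To illustrate the forward direction, here is a minimal sketch of how such an iterator can decide how far to jump.  It is not my actual iterator and not code from the utf8 library; the names sequenceLength and countCharacters are made up for this post, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by this lead byte.
// Returns 1 for a continuation byte or an invalid lead byte so that a
// caller stepping through the string always makes progress.
std::size_t sequenceLength(char leadByte)
{
    const unsigned char lead = static_cast<unsigned char>(leadByte);
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Counts encoded characters (code points) by jumping from lead byte to
// lead byte, instead of counting char elements the way size() does.
std::size_t countCharacters(const std::string& utf8)
{
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos < utf8.size())
    {
        pos += sequenceLength(utf8[pos]);
        ++count;
    }
    return count;
}
```

For the test string from the first post without the trailing space and newline (z, U+00DF, U+6C34, U+1D10B), size() reports 10 while countCharacters reports 4.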

Although this was a pleasant learning experience, rewriting many of the string libraries and their associated unit tests was not so pleasant.  In the end, I'm glad I took on this task!

Sorry for bumping this topic, but I wanted to bring closure to it and share my experience with future readers.


Laurent

  • Administrator
  • Hero Member
  • *****
  • Posts: 32498
Re: basic_string<Unsigned Char> to basic_string<char> for UTF-8
« Reply #3 on: January 23, 2017, 08:53:34 am »
Quote
I've used the utf8 library for validation and conversions.  The useful library can be found here:  http://utfcpp.sourceforge.net/
Why not sf::Utf8?
Laurent Gomila - SFML developer

Ant

  • Newbie
  • *
  • Posts: 28
Re: basic_string<Unsigned Char> to basic_string<char> for UTF-8
« Reply #4 on: January 24, 2017, 07:50:12 am »
There's functionality in the utf8 library that's missing from SFML.

The find_invalid function locates the first byte that is not valid UTF-8, which tells you whether a string of bytes is well-formed.  This is helpful for unit testing.  It is also helpful when checking whether a file is encoded in UTF-8 or not.
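
To give an idea of what that kind of check involves, here is a rough hand-written sketch.  It is not the utf8 library's implementation; the name findInvalidUtf8 and the index-based return value are my own choices for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the first byte sequence that is
// not well-formed UTF-8, or s.size() if the whole string is valid.
std::size_t findInvalidUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t length = 0;
        unsigned long codePoint = 0;
        unsigned long minimum = 0; // smallest code point allowed for this length

        if (lead < 0x80)                { length = 1; codePoint = lead;        minimum = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; codePoint = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; codePoint = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; codePoint = lead & 0x07; minimum = 0x10000; }
        else return i; // stray continuation byte or invalid lead byte

        if (s.size() - i < length) return i; // sequence is cut off

        for (std::size_t k = 1; k < length; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i; // expected a continuation byte
            codePoint = (codePoint << 6) | (c & 0x3F);
        }

        if (codePoint < minimum) return i;                        // overlong encoding
        if (codePoint > 0x10FFFF) return i;                       // beyond Unicode range
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return i; // surrogate
        i += length;
    }
    return s.size();
}
```

In a unit test, comparing the result against the string's size() is enough to assert that a buffer is valid UTF-8.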

I also liked the ability to go backwards in the iterator (since some functions may be searching from right to left).  This required a few utilities, such as determining whether the current byte starts a new character or not.
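
A minimal sketch of that backward step, assuming the string is valid UTF-8.  The helper names isContinuationByte and previousCharStart are made up for this post and are not taken from the utf8 library or SFML.

```cpp
#include <cstddef>
#include <string>

// True if the byte has the bit pattern 10xxxxxx, meaning it continues a
// multi-byte character rather than starting a new one.
bool isContinuationByte(char byte)
{
    return (static_cast<unsigned char>(byte) & 0xC0) == 0x80;
}

// Hypothetical helper: given a byte index pos that sits on a character
// boundary (or at s.size()), returns the index where the previous character
// starts. Returns 0 if pos is already 0.
std::size_t previousCharStart(const std::string& s, std::size_t pos)
{
    while (pos > 0)
    {
        --pos;
        if (!isContinuationByte(s[pos]))
            break; // found a lead byte or an ASCII byte
    }
    return pos;
}
```

Searching from right to left then amounts to calling previousCharStart repeatedly, starting from s.size().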