
Author Topic: Request: Problem with accents  (Read 5877 times)


game_maker

  • Jr. Member
  • Posts: 59
Request: Problem with accents
« on: September 23, 2013, 04:07:53 am »
Good night, mates!

After multiple tests, I am sure the problem is in request.setBody().

It accepts a std::string parameter. My program takes the text from the clipboard and passes it to request.setBody().

Ex:

std::string Blablabla(std::string str)
{
    sf::Http http("http://translate.google.com.br/");
    sf::Http::Request request("/translate_a/t", sf::Http::Request::Post);
    str = "client=t&sl=en&sc=2&prev=btn&ssel=0&tsel=0&q=" + str + "&tl=" + Idioms[option];
    request.setBody(str);
    ...
    std::string resposta(response.getBody());
    return resposta;
}

If I pass "água" (water) as the parameter, the request is sent without the 'special' chars ('á', 'ã', 'ó').

Where is the problem?

Thanks in advance,

Pedro Henrique.
« Last Edit: September 23, 2013, 04:10:28 am by game_maker »

Laurent

  • Administrator
  • Hero Member
  • Posts: 32504
Re: Request: Problem with accents
« Reply #1 on: September 23, 2013, 07:52:48 am »
There are at least 3 encodings involved:

1. your clipboard encoding
2. your std::string encoding
3. your HTTP request encoding

Chances are that 1. and 2. are UTF-8 if you're on Linux, so no problem there. So I'd say the problem is that your request doesn't expect the encoding used in 2. By default I think it expects a URL-encoded string, and you have to use the Content-Type field if you need to declare the encoding more explicitly.
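
For example, a minimal sketch using the request object from your code (untested; whether Google accepts this exact charset declaration is an assumption on my part):

request.setField("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");

setField just adds an arbitrary header field to the request, so you can declare whatever encoding you actually put in the body.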
Laurent Gomila - SFML developer

game_maker

Re: Request: Problem with accents
« Reply #2 on: September 23, 2013, 11:37:36 pm »
(I am using Windows 7.)
Thanks, Laurent.

I translate the text in the clipboard.

I did this process:

1 - Open Notepad.
2 - Type "%c3%a1gua".
3 - Copy it.
4 - Translate it (using the program).
5 - The result is: water.

So, I know I must percent-encode the text as its UTF-8 bytes in hex ('á' is U+00E1, which is the byte pair C3 A1 in UTF-8, hence "%c3%a1").

Does anyone know how to convert it (obviously in C++)?

I did a lot of research, but I did not find a solution.

Thanks in advance.

Ixrec

  • Hero Member
  • Posts: 1241
Re: Request: Problem with accents
« Reply #3 on: September 24, 2013, 12:09:34 am »
I can't figure out exactly what it is you need, since you left out things like what you need to convert from and why you need to do any conversion in the first place (see below), but hopefully some of this helps.

If you have C++11, there are a lot of options.  http://stackoverflow.com/questions/7232710/convert-between-string-u16string-u32string looks like a good starting point for this.

On Linux, like Laurent said, everything is UTF-8, so there's no problem. On Windows, you need to be using std::wstring (or one of the new C++11 string classes) instead of std::string, which gets you a UTF-16 string, before doing any conversion yourself.

Personally the only conversion I've ever had to do within my program is a UTF-16 std::wstring to a UTF-8 std::string and back (I can link the code for this if you want).  I've never had to go from a non-UTF-8 std::string to a UTF-8 std::string.  And I'm not using C++11 yet.

Since I don't know where you're getting your input from, I should point out that if you use a non-terrible text editor (like Notepad++; plain old Notepad is quite abysmal these days) then you can just pick the encoding in the menus at the top effortlessly.
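
If you don't want to hand-roll the conversion, the WinAPI can also do wstring <-> UTF-8 for you. A minimal sketch of one direction (untested here, but these are standard WinAPI calls):

#include <windows.h>
#include <string>

// UTF-16 std::wstring -> UTF-8 std::string
std::string wstring_to_utf8(const std::wstring& w)
{
    if (w.empty()) return std::string();
    // first call with a null buffer asks for the required size
    int len = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), static_cast<int>(w.size()),
                                  NULL, 0, NULL, NULL);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), static_cast<int>(w.size()),
                        &out[0], len, NULL, NULL);
    return out;
}

(MultiByteToWideChar does the opposite direction the same way.)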
« Last Edit: September 24, 2013, 12:12:53 am by Ixrec »

game_maker

Re: Request: Problem with accents
« Reply #4 on: September 24, 2013, 12:25:28 am »
I wanna convert from std::string to UTF-8 hexadecimal.

I do not use C++11.

I created a tool for translating texts anywhere.

sf::Http http("http://translate.google.com.br/");
sf::Http::Request request("/translate_a/t", sf::Http::Request::Post);

std::string str_base = "ie=UTF-8&oe=UTF-8&hl=pt-BR&client=t&sc=2&sl=";

string Idiomas[] = { "Portuguese", "Spanish", "English", "Estonian", "Arabic", "Russian", "Japanese", "French", "German", "Italian" };

string Idioms[] = { "pt", "es", "en", "et", "ar", "ru", "ja", "fr", "de", "it"};

std::string PegarClipboard()
{
    if (!OpenClipboard(NULL)) return " ";
    HANDLE hData = GetClipboardData(CF_TEXT);
    if (hData == NULL) { CloseClipboard(); return " "; } // don't leave the clipboard open
    char* pszText = static_cast<char*>(GlobalLock(hData));
    if (pszText == NULL) { CloseClipboard(); return " "; }
    std::string text(pszText);
    GlobalUnlock(hData);
    CloseClipboard();
    return text;
}

void SetClipboard(std::string txt)
{
    if (!OpenClipboard(NULL)) return;
    EmptyClipboard();
    HGLOBAL hg = GlobalAlloc(GMEM_MOVEABLE, txt.size() + 1);
    if (!hg) {
        CloseClipboard();
        return;
    }
    memcpy(GlobalLock(hg), txt.c_str(), txt.size() + 1);
    GlobalUnlock(hg);
    // after a successful SetClipboardData the system owns the memory,
    // so only free it ourselves if the call failed
    if (SetClipboardData(CF_TEXT, hg) == NULL)
        GlobalFree(hg);
    CloseClipboard();
}

std::string Traduzir(std::string str, unsigned short language1, unsigned short language2)
{
    str = str_base + Idioms[language1] + "&tl=" + Idioms[language2] + "&q=" + str;
    request.setField("Content-Language", Idioms[language2]);
    request.setBody(str);

    sf::Http::Response response = http.sendRequest(request);

    if (response.getStatus() == sf::Http::Response::Ok)
    {
        std::string resposta(response.getBody());
        for (unsigned short i = 6; i < resposta.size(); i += 1)
        {
            if (resposta.at(i) == '"')
            {
                std::string sub = resposta.substr(4, i - 4);
                std::cout << sub;
                return sub;
            }
        }
    }
    else
        std::cout << response.getStatus();
    return "Translation error.";
}
 

This is my almost complete code.
« Last Edit: September 24, 2013, 03:15:50 am by game_maker »

Ixrec

Re: Request: Problem with accents
« Reply #5 on: September 24, 2013, 01:48:05 am »
You're still not specifying the initial encoding.  std::string just means a string of chars.  The string can be encoded in UTF-8 or ANSI or anything else.  That's the whole reason you need to be using wchar_t and std::wstring on Windows.

I also have no idea what you mean by "UTF-8 hexadecimal".  There are no decimal, octal or hexadecimal versions of character encodings.  That just doesn't make any sense.  If you don't know what "UTF-8" or "hexadecimal" actually mean you should probably go google and read about them right away.
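
For instance, on Windows the clipboard will hand you UTF-16 directly if you ask for CF_UNICODETEXT instead of CF_TEXT. Rough sketch (untested):

#include <windows.h>
#include <string>

// read the clipboard as UTF-16 instead of the ANSI CF_TEXT
std::wstring GetClipboardUtf16()
{
    std::wstring text;
    if (!OpenClipboard(NULL)) return text;
    HANDLE hData = GetClipboardData(CF_UNICODETEXT);
    if (hData != NULL) {
        const wchar_t* p = static_cast<const wchar_t*>(GlobalLock(hData));
        if (p != NULL) {
            text = p;
            GlobalUnlock(hData);
        }
    }
    CloseClipboard();
    return text;
}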
« Last Edit: September 24, 2013, 01:50:50 am by Ixrec »

game_maker

Re: Request: Problem with accents
« Reply #6 on: September 24, 2013, 02:29:27 am »
Ixrec, let's say I wanna convert "água" to "%c3%a1gua".

I found a javascript algorithm. I am trying to adapt it.

Can you help me?

#include <windows.h>
#include <iostream>

std::string gethex(char decimal) {
    std::string hexchars = "0123456789ABCDEFabcdef";
    return std::string("%") + hexchars.at(decimal >> 4) + hexchars.at(decimal & 0xF);
}

std::string encode(std::string str)
{
    std::string unreserved = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~";
    std::string r = "";

    for (int i = 0; i < str.size(); i++)
    {
        char charcode = str.at(i);

        if (unreserved.find(charcode) != std::string::npos) {
            r += charcode; // unreserved characters pass through unencoded
        } else {
            if (charcode < 128) {
                r += gethex(charcode);
            }
            else if (charcode > 127 && charcode < 2048) {
                r += gethex((charcode >> 6) | 0xC0);
                r += gethex((charcode & 0x3F) | 0x80);
            }
            else if (charcode > 2047 && charcode < 65536) {
                r += gethex((charcode >> 12) | 0xE0);
                r += gethex(((charcode >> 6) & 0x3F) | 0x80);
                r += gethex((charcode & 0x3F) | 0x80);
            }
            else if (charcode > 65535) {
                r += gethex((charcode >> 18) | 0xF0);
                r += gethex(((charcode >> 12) & 0x3F) | 0x80);
                r += gethex(((charcode >> 6) & 0x3F) | 0x80);
                r += gethex((charcode & 0x3F) | 0x80);
            }
        }
    }
    return r;
}

Ixrec

Re: Request: Problem with accents
« Reply #7 on: September 24, 2013, 03:00:57 am »
Ooooooh that's what you meant.  I had no idea you'd have to do that kind of conversion.

Well, converting algorithms from one language to another is usually really easy if the languages share similar syntax (unless you simply don't understand the algorithm or one of the languages involved).  To be honest what you posted already looks so much like C++ to me that I have no way of knowing what parts of it you'd need help with.  The gethex function maybe? (I assume you'd use either snprintf with %x or a stringstream and std::hex to emulate that)
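
i.e. something along these lines (a sketch, untested):

#include <sstream>
#include <iomanip>
#include <string>

// "%" followed by exactly two hex digits, e.g. byte 0xE1 -> "%e1"
// (taking unsigned char keeps accented characters from going negative)
std::string gethex(unsigned char c)
{
    std::ostringstream ss;
    ss << '%' << std::hex << std::setw(2) << std::setfill('0')
       << static_cast<int>(c);
    return ss.str();
}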
« Last Edit: September 24, 2013, 03:05:53 am by Ixrec »

game_maker

Re: Request: Problem with accents
« Reply #8 on: September 24, 2013, 03:13:52 am »
I translated the algorithm to C++ (it compiles fine).

Now I have a problem... out of range:

std::string gethex(char decimal) {
    std::string hexchars = "0123456789ABCDEFabcdef";
    return std::string("%") + hexchars.at(decimal >> 4) + hexchars.at(decimal & 0xF);
}

Quote
I have no way of knowing what parts of it you'd need help with.
I don't have experience with bitwise operators, so there you go. :D

Thanks.

Ixrec

Re: Request: Problem with accents
« Reply #9 on: September 24, 2013, 03:19:27 am »
Gonna have to tell me more than just "out of range"; I have no idea what part of that code caused it, or why, or what the input was at the time. You really need to work on asking complete questions that we can actually answer. If you're not using a debugger, start using one, because it gives you all of this info and more.

The bitwise operators appear to be mostly the same in javascript as in C++ so I don't think that should be a problem.

Edit: Actually, you probably shouldn't be using that function at all.  The standard library has at least two ways to convert numbers to hexadecimal strings (which I mentioned in my last post), and it's always better to use the standard functions for something when possible.  Maybe you'd have to swap a "0x" prefix for a "%" prefix, but that's probably easier than writing/debugging the actual conversion function yourself.
« Last Edit: September 24, 2013, 03:23:39 am by Ixrec »

game_maker

Re: Request: Problem with accents
« Reply #10 on: September 24, 2013, 03:26:15 am »
Trying to use your solution.

#include <windows.h>
#include <iostream>
#include <sstream>

std::string gethex(int decimal) {
    std::stringstream stream;
    stream << "%" << std::hex << decimal;

    return stream.str();
}

std::string encode(std::string str)
{
    std::string unreserved = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~";
    std::string r = "";

    for (int i = 0; i < str.length(); i++)
    {
        char charcode = str.at(i);

        if (unreserved.find(charcode) != std::string::npos) {
            r += charcode;
        } else {
            if (charcode < 128) {
                r += gethex(charcode);
            }
            else if (charcode > 127 && charcode < 2048) {
                r += gethex((charcode >> 6) | 0xC0);
                r += gethex((charcode & 0x3F) | 0x80);
            }
            else if (charcode > 2047 && charcode < 65536) {
                r += gethex((charcode >> 12) | 0xE0);
                r += gethex(((charcode >> 6) & 0x3F) | 0x80);
                r += gethex((charcode & 0x3F) | 0x80);
            }
            else if (charcode > 65535) {
                r += gethex((charcode >> 18) | 0xF0);
                r += gethex(((charcode >> 12) & 0x3F) | 0x80);
                r += gethex(((charcode >> 6) & 0x3F) | 0x80);
                r += gethex((charcode & 0x3F) | 0x80);
            }
        }
    }

    return r;
}
 

Now the problem is that the accented characters become UTF-16 encoded.

Let's say 'á' is 'ff%ff%e1', but I need it to be '%c3%a1'.

I need to convert UTF-16 (hex) to UTF-8 (hex), as above.

Pedro Henrique.
« Last Edit: September 24, 2013, 04:12:03 am by game_maker »

Ixrec

Re: Request: Problem with accents
« Reply #11 on: September 27, 2013, 02:38:24 am »
Then I guess what you have there is a function to output UTF-16.  Though UTF-16 to UTF-8 is something I've actually done (and the code below shows how), it seems pointless to do two logically separate conversions.  Unfortunately since I still have no idea what the encoding of the original std::string is, I don't know exactly how to go from that to UTF-8 directly.

At this point I think I should just paste the functions I've been using to convert between UTF-8 and UTF-16 and let you figure out the details yourself.
//this function was written by http://www.cplusplus.com/user/Disch/
//taken from his post http://www.cplusplus.com/forum/general/31270/#msg169285
std::wstring utf8_to_wstring(const char* str) {
        const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
    static const wchar_t badchar = '?';
    std::wstring ret;
    unsigned i = 0;
    while(s[i]) {
        try {
                        // 00-7F: 1 byte codepoint
            if(s[i] < 0x80)                     { ret += s[i]; ++i; }
                        // 80-BF: invalid for midstream
            else if(s[i] < 0xC0)    throw 0;
                        // C0-DF: 2 byte codepoint
            else if(s[i] < 0xE0) {
                if((s[i+1] & 0xC0) != 0x80)             throw 1;
                ret += ((s[i  ] & 0x1F) << 6) |
                                           ((s[i+1] & 0x3F));
                i += 2;
            }
                        // E0-EF: 3 byte codepoint
            else if(s[i] < 0xF0) {
                if((s[i+1] & 0xC0) != 0x80)             throw 1;
                if((s[i+2] & 0xC0) != 0x80)             throw 2;
                wchar_t ch =
                        ((s[i  ] & 0x0F) << 12) |
                        ((s[i+1] & 0x3F) <<  6) |
                        ((s[i+2] & 0x3F));
                i += 3;
                // make sure it isn't a surrogate pair
                if((ch & 0xF800) == 0xD800)
                    ch = badchar;
                ret += ch;
            }
                        // F0-F7: 4 byte codepoint
            else if(s[i] < 0xF8) {
                if((s[i+1] & 0xC0) != 0x80)             throw 1;
                if((s[i+2] & 0xC0) != 0x80)             throw 2;
                if((s[i+3] & 0xC0) != 0x80)             throw 3;
                unsigned long ch =
                        ((s[i  ] & 0x07) << 18) |
                        ((s[i+1] & 0x3F) << 12) |
                        ((s[i+2] & 0x3F) <<  6) |
                        ((s[i+3] & 0x3F));
                i += 4;

                // make sure it isn't a surrogate pair
                if((ch & 0xFFF800) == 0xD800)
                    ch = badchar;
                if(ch < 0x10000)        // overlong encoding -- but technically possible
                    ret += static_cast<wchar_t>(ch);
                else if(std::numeric_limits<wchar_t>::max() < 0x110000)
                {
                    // wchar_t is too small for 4 byte code point
                    //  encode as UTF-16 surrogate pair
                    ch -= 0x10000;
                    ret += static_cast<wchar_t>( (ch >> 10   ) | 0xD800 );
                    ret += static_cast<wchar_t>( (ch & 0x03FF) | 0xDC00 );
                }
                else
                    ret += static_cast<wchar_t>(ch);
            }
                        // F8-FF: invalid
            else throw 0;
        }
        catch(int skip) {
            if(!skip) {
                do { ++i; }
                                while((s[i] & 0xC0) == 0x80);
            }
            else
                i += skip;
        }
    }
    return ret;
}
 
//this function I made myself based on Disch's function above
std::string wstring_to_utf8(std::wstring line) {
        const wchar_t* s = line.c_str();
    static const wchar_t badchar = '?';
    std::string ret;
    unsigned i = 0;
    while(s[i]) {
       
                        //notes:
                        //the first few bits tell what kind of byte it is
                        //0_______ means a one-byte character
                        //10______ means part of a multi-byte character, but not the start, hence this is "invalid for midstream"
                        //110_____ means the start of a 2-byte character
                        //1110____ means the start of a 3-byte character
                        //1111____ means the start of a 4-byte character

                        // 00-7F: 1 byte character converts to a 7 digit binary codepoint, so max is 0x7F
            if(s[i] < 0x80)                     { ret += static_cast<char>(s[i]); }
                       
                        // C0-DF: 2 byte character converts to an 11 digit binary codepoint, so max is 0x7FF
                        //xxxxx xxxxxx -> 110xxxxx 10xxxxxx
                        else if(s[i] < 0x800) {
                                ret += (s[i] >> 6) | 0xC0;
                                ret += (s[i] & 0x3F) | 0x80;
                        }
                       
                        //UTF-16 surrogate pair used because wchar_t wasn't big enough for a 4-byte code point
                        //11011xxxxxxxxxxx 110111xxxxxxxxxx -> 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        else if( ((s[i] & 0xF800) == 0xD800) && ((s[i+1] & 0xFC00) == 0xDC00) ) {
                                unsigned long ch = ((s[i  ] & 0x07FF) << 10) |
                                                   ((s[i+1] & 0x03FF) << 0);
                               
                                ch += 0x10000; //a surrogate pair stores (code point - 0x10000), so we add it back here
                               
                                //now it's the same as any other 4-byte character
                                ret += static_cast<char>((ch >> 18) | 0xF0);
                                ret += static_cast<char>(((ch >> 12) & 0x3F) | 0x80);
                                ret += static_cast<char>(((ch >> 6)  & 0x3F) | 0x80);
                                ret += static_cast<char>(((ch >> 0)  & 0x3F) | 0x80);                          
                               
                                i++; //this is the only reason we'd ever go through more than one wchar_t at a time
                        }
                       
                        // E0-EF: 3 byte character converts to a 16 digit binary codepoint, so max is 0xFFFF
                        // xxxx xxxxxx xxxxxx -> 1110xxxx 10xxxxxx 10xxxxxx
                        else if(s[i] < 0x10000) {
                                if((s[i] & 0xF800) == 0xD800) {
                                        // make sure it isn't a surrogate pair
                                        ret += badchar;
                                } else {
                                        ret += (s[i] >> 12) | 0xE0;
                                        ret += ((s[i] >> 6) & 0x3F) | 0x80;
                                        ret += ((s[i] >> 0) & 0x3F) | 0x80;
                                }
                        }
                       
                        // F0-F7: 4 byte character converts to a 22 digit binary codepoint, so max is 0x3FFFFF
                        // xxxx xxxxxx xxxxxx xxxxxx -> 1111xxxx 10xxxxxx 10xxxxxx 10xxxxxx
                        //  100 001111 111111 111111 is apparently *the* maximum value atm
                        else if(s[i] < 0x400000) {
                                ret += (s[i] >> 18) | 0xF0; //ignore the warning here; if wchar_t's too small this block of code will never be used anyway
                                ret += ((s[i] >> 12) & 0x3F) | 0x80;
                                ret += ((s[i] >> 6)  & 0x3F) | 0x80;
                                ret += ((s[i] >> 0)  & 0x3F) | 0x80;
                        }
                       
                        else throw(L"Failure to convert internal UTF-16 string '" + line + L"' back to UTF-8.");
               
                i++;
    }
    return ret;
}
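
Usage would then be something like this (sketch; PegarClipboardW is a hypothetical CF_UNICODETEXT version of your clipboard function):

std::wstring wide = PegarClipboardW();     // UTF-16 text from the clipboard
std::string utf8 = wstring_to_utf8(wide);  // UTF-8 bytes, ready to be percent-encoded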
 

Laurent

Re: Request: Problem with accents
« Reply #12 on: September 27, 2013, 07:38:09 am »
Don't forget the sf::Utf classes for UTF conversions.

The original (clipboard) encoding is probably the system encoding, so using the system locale should be enough. Again, SFML has built-in conversion functions for that. Have a little look at the API documentation (sf::String, sf::Utf).
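
For example (a quick untested sketch, assuming SFML 2.x):

#include <SFML/System/String.hpp>
#include <SFML/System/Utf.hpp>
#include <iterator>
#include <string>

// system-locale std::string -> UTF-8 std::string
std::string toUtf8(const std::string& ansi)
{
    sf::String s(ansi); // decoded using the current global locale
    std::string utf8;
    sf::Utf32::toUtf8(s.begin(), s.end(), std::back_inserter(utf8));
    return utf8;
}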
Laurent Gomila - SFML developer

 
