Unicode, UTF-8, UTF-16, BOM, etc. - understanding all of these reasonably well took me a serious amount of time.
Anyway, four things I want to throw in here:
1) a short but good article: "wchar_t: unsafe at any size"
2) In UTF-16, all characters of the so-called "Basic Multilingual Plane" (BMP) are encoded in 2 bytes (= 16 bits, one code unit). The BMP covers pretty much every character you need to write nearly every "living" language on this planet, plus a lot of symbol-like characters, basically "Wingdings on steroids". Beyond the BMP there are more characters, much of it fancy stuff like Gothic letters, ancient Greek numbers and the musical clef symbols. (Germanic runes, by the way, still sit inside the BMP.) Characters outside the BMP need 4 bytes, encoded as a surrogate pair of two 16-bit code units. So please do not be fooled into thinking that "UTF-16" means "2-byte things only" or that "half the number of bytes of a UTF-16 string == number of characters". Both are simply wrong.
3) UTF-32, by contrast, encodes every single defined Unicode character in exactly 4 bytes, no more, no less.
4) decodeunicode - a lifesaver...
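
To make points 2) and 3) concrete, here is a small Python sketch. The sample string and the helper names (utf16_code_units, utf32_bytes) are my own picks for illustration, not taken from anything linked above. It compares the number of code points with the UTF-16 and UTF-32 sizes for a string that contains one character outside the BMP.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units in the UTF-16 encoding of text."""
    # "utf-16-le" is used so that no BOM is prepended to the output.
    return len(text.encode("utf-16-le")) // 2


def utf32_bytes(text: str) -> int:
    """Count the bytes in the UTF-32 encoding of text (no BOM)."""
    return len(text.encode("utf-32-le"))


# Hypothetical sample: 'a' (U+0061), euro sign (U+20AC),
# and the musical G clef (U+1D11E), which lies outside the BMP.
text = "a\u20ac\U0001D11E"

code_points = len(text)              # Python's len() counts code points
units = utf16_code_units(text)       # the clef takes two code units
size_utf32 = utf32_bytes(text)       # always 4 bytes per code point

print(code_points, units, size_utf32)
```

The string has 3 code points but 4 UTF-16 code units (8 bytes), so "bytes / 2" overcounts as soon as a surrogate pair shows up. In UTF-32 the same string is exactly 3 * 4 = 12 bytes.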