Topic: Unicode-files and Normalization Forms

Badestrand · « **on:** November 30, 2008, 12:55:05 am »

Hi,

I was just checking whether a new version of SFML had been released and saw the Unicode class in the SVN code. Very beautiful, the interface as well as the implementation.

I'd like to ask: are you planning to implement loading and saving of Unicode files (and if so, please use a BOM for UTF-8) and an (at least rudimentary) implementation of the Unicode normalization forms? The file part should be pretty straightforward, while correct NFs might be a lot of work.
However, at least a very basic implementation (maybe just a few European combinations such as ä => a¨) should be manageable. If you need help, you know where to find the community.
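To make the file request concrete, here is a minimal sketch of saving and loading UTF-8 with a BOM. The names `saveUtf8WithBom` and `loadUtf8` are hypothetical and not part of SFML; the sketch assumes the text is already UTF-8 encoded and only handles the three BOM bytes EF BB BF.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper, not part of SFML: writes already-encoded UTF-8 text
// to a file, preceded by the three-byte UTF-8 byte order mark (EF BB BF).
bool saveUtf8WithBom(const std::string& path, const std::string& utf8Text)
{
    // Binary mode keeps the runtime from translating line endings.
    std::ofstream file(path, std::ios::binary);
    if (!file)
        return false;

    static const char bom[] = { '\xEF', '\xBB', '\xBF' };
    file.write(bom, sizeof(bom));
    file.write(utf8Text.data(), static_cast<std::streamsize>(utf8Text.size()));
    return file.good();
}

// Hypothetical counterpart: reads the whole file and strips a leading BOM if present.
bool loadUtf8(const std::string& path, std::string& utf8Text)
{
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return false;

    utf8Text.assign(std::istreambuf_iterator<char>(file),
                    std::istreambuf_iterator<char>());

    // Compare the first three bytes against the BOM and drop them on a match.
    if (utf8Text.compare(0, 3, "\xEF\xBB\xBF") == 0)
        utf8Text.erase(0, 3);
    return true;
}
```

The loader accepts files with or without a BOM, so files written by other tools should still load.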

Laurent · « **Reply #1 on:** November 30, 2008, 12:48:17 pm »

Hi

Quote

Are you planning to implement the loading and saving of unicode-files

It's more or less planned, but with a very low priority.

Quote

implementation of the unicode normalization forms

I don't know much about normalization forms. I understand the concept, but what are they used for?

T.T.H. · « **Reply #2 on:** November 30, 2008, 02:32:39 pm »

One warning: implementing Unicode correctly is a really, really, really big task and is definitely out of scope for SFML. As far as I know, the only library that does Unicode completely right is ICU, an open source library backed by IBM.

Badestrand · « **Reply #3 on:** November 30, 2008, 02:37:11 pm »

Quote from: "Laurent"

I don't know much about normalization forms. I understand the concept, but what are they used for?

When combining characters are stored in their composed form, it is, for example, easier to determine a string's length (currently only the number of code points is counted, not the number of characters). Additionally, many programs can't handle combining characters correctly, and comparison gets messy, too (OK, Unicode string comparisons are messy either way).
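A small sketch of the length and comparison problem, using plain `std::u32string` rather than any SFML type. The helper names are hypothetical, and as a simplifying assumption only the block U+0300..U+036F (Combining Diacritical Marks) is treated as combining; real Unicode has many more combining characters.

```cpp
#include <cstddef>
#include <string>

// Assumption for this sketch: only U+0300..U+036F counts as "combining".
bool isCombiningMark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Counts the code points that start a new visible character,
// i.e. every code point that is not a combining mark.
std::size_t countBaseCharacters(const std::u32string& text)
{
    std::size_t count = 0;
    for (char32_t c : text)
    {
        if (!isCombiningMark(c))
            ++count;
    }
    return count;
}
```

For `U"\u00E4"` and `U"a\u0308"`, `size()` reports 1 and 2 respectively, `countBaseCharacters` reports 1 for both, and `operator==` says the two strings differ even though both display as "ä".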

edit: Maybe NFs are too complicated, but a plain helper function that combines the most basic European combining characters could be achievable. That way we wouldn't claim to support NFs while supporting only parts of them.

edit2: One could provide combination support for the languages covered by Latin-1, which are quite a few. This would also make a clean and simple implementation possible, since there would be only ~80 characters to combine.
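A rough sketch of what such a table-driven helper might look like. Everything here is hypothetical (`Composition`, `compositions`, `composeLatin1`), the table is a deliberately incomplete excerpt, and this is not NFC: it ignores canonical ordering and sequences of several marks.

```cpp
#include <string>

// One table entry: base character + combining mark -> precomposed character.
struct Composition
{
    char32_t base;
    char32_t mark;
    char32_t composed;
};

// Deliberately incomplete excerpt; a full Latin-1 table would continue the pattern.
static const Composition compositions[] = {
    { U'a', 0x0300, 0x00E0 }, // à (grave)
    { U'a', 0x0301, 0x00E1 }, // á (acute)
    { U'a', 0x0302, 0x00E2 }, // â (circumflex)
    { U'a', 0x0308, 0x00E4 }, // ä (diaeresis)
    { U'e', 0x0301, 0x00E9 }, // é (acute)
    { U'n', 0x0303, 0x00F1 }, // ñ (tilde)
    { U'o', 0x0308, 0x00F6 }, // ö (diaeresis)
    { U'u', 0x0308, 0x00FC }, // ü (diaeresis)
    { U'A', 0x0308, 0x00C4 }, // Ä (diaeresis)
};

// Replaces each "base + mark" pair found in the table by its precomposed form.
// Pairs that are not in the table are copied through unchanged.
std::u32string composeLatin1(const std::u32string& text)
{
    std::u32string result;
    for (char32_t c : text)
    {
        bool merged = false;
        if (!result.empty())
        {
            for (const Composition& entry : compositions)
            {
                if (entry.base == result.back() && entry.mark == c)
                {
                    result.back() = entry.composed; // overwrite the base in place
                    merged = true;
                    break;
                }
            }
        }
        if (!merged)
            result.push_back(c);
    }
    return result;
}
```

A linear search is enough for a table of roughly 80 entries; a sorted table or a map would only matter for much larger tables.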

Badestrand · « **Reply #4 on:** November 30, 2008, 02:40:06 pm »

Quote from: "T.T.H."

One warning: implementing Unicode correctly is a really, really, really big task and is definitely out of scope for SFML. As far as I know, the only library that does Unicode completely right is ICU, an open source library backed by IBM.

You are right, there is no way to provide a fully correct Unicode implementation in SFML. Nevertheless, SFML can provide some basic functionality that covers at least the most common European Unicode operations.

Badestrand · « **Reply #5 on:** November 30, 2008, 06:44:24 pm »

By the way, the Windows filesystem doesn't combine characters. The following creates two separate files, even though both names display as "ä":

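Code:

#include <fstream>

int main()
{
	// Passing a wide-character path to std::ofstream is an MSVC extension.
	std::ofstream(L"\u00e4");        // "ä" as one precomposed code point (U+00E4)
	std::ofstream(L"\u0061\u0308");  // "a" (U+0061) followed by combining diaeresis (U+0308)
	// Result: two separate files whose names both display as "ä".
}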

Laurent · « **Reply #6 on:** November 30, 2008, 07:19:38 pm »

Quote from: "T.T.H."

One warning: implementing Unicode correctly is a really, really, really big task and is definitely out of scope for SFML. As far as I know, the only library that does Unicode completely right is ICU, an open source library backed by IBM.

You're perfectly right; my aim is just to provide conversion functions that make it easy to manipulate the most common Unicode encodings. This is necessary because SFML provides some features related to Unicode.
But for more advanced Unicode usage, I'd encourage people to use ICU.
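For readers wondering what such a conversion function involves, here is a minimal sketch of a UTF-32 to UTF-8 encoder. The name `utf32ToUtf8` is hypothetical and this is not SFML's actual API; as an assumption of the sketch, invalid code points (surrogates and values above U+10FFFF) are replaced by U+FFFD.

```cpp
#include <string>

// Hypothetical helper, not SFML's API: encodes UTF-32 code points as UTF-8.
std::string utf32ToUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t c : text)
    {
        // Assumption: replace surrogates and out-of-range values by U+FFFD.
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            c = 0xFFFD;

        if (c < 0x80)
        {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(c);
        }
        else if (c < 0x800)
        {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else
        {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

This only re-encodes code points. It does not normalize, so the precomposed and decomposed forms of "ä" still produce different byte sequences.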