SFML community forums

General => Feature requests => Topic started by: Haze on November 16, 2011, 01:58:04 am

Title: Extended operations for sf::String
Post by: Haze on November 16, 2011, 01:58:04 am: SFML 2.0 introduces sf::String, which is really convenient for handling strings.
Since this class already has string manipulation methods (Erase, Insert, Find and iterators), it would be very useful if sf::String provided more of them.

I am thinking about basic operations such as:

- Removing leading and trailing whitespace:
sf::String Trim() const;
And why not LTrim (leading only) and RTrim (trailing only)

- Extracting a sub-string:
sf::String Substr(size_t Index, size_t Length) const;

- Replacing occurrences:
sf::String Replace(const sf::String& LookFor, const sf::String& ReplaceBy) const;

- Converting to lowercase:
sf::String ToLowerCase() const;

- Converting to uppercase:
sf::String ToUpperCase() const;
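
For illustration, here is a minimal sketch of what Replace and Substr could look like as free functions. It is not SFML code: std::u32string is used as a stand-in for sf::String's UTF-32 storage, and the names replaceAll, subString and usageExample are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: std::u32string stands in for sf::String's UTF-32 storage.
using Utf32 = std::u32string;

// Returns a copy of `text` with every non-overlapping occurrence of
// `lookFor` replaced by `replaceBy`. An empty `lookFor` changes nothing.
Utf32 replaceAll(const Utf32& text, const Utf32& lookFor, const Utf32& replaceBy)
{
    if (lookFor.empty())
        return text;

    Utf32 result;
    std::size_t pos = 0;
    for (;;)
    {
        const std::size_t hit = text.find(lookFor, pos);
        if (hit == Utf32::npos)
            break;
        result.append(text, pos, hit - pos); // copy the part before the match
        result += replaceBy;
        pos = hit + lookFor.size();          // continue after the match
    }
    result.append(text, pos, Utf32::npos);   // copy the remaining tail
    return result;
}

// Returns at most `length` code points starting at `index`.
// An index past the end yields an empty string instead of throwing.
Utf32 subString(const Utf32& text, std::size_t index, std::size_t length)
{
    if (index >= text.size())
        return Utf32();
    return text.substr(index, length);
}

// Small usage check.
void usageExample()
{
    assert(replaceAll(U"a-b-c", U"-", U"+") == U"a+b+c");
    assert(subString(U"hello", 1, 3) == U"ell");
}
```

Both functions only compare and copy code points; neither needs to know what a code point means.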
Title: Extended operations for sf::String
Post by: Laurent on November 16, 2011, 07:42:28 am: First, I have to say that I hate sf::String, and I don't think that I'll make it a super powerful string class.

Quote
sf::String Trim() const;
And why not LTrim (leading only) and RTrim (trailing only)

What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

Quote
- Extracting a sub-string:
sf::String Substr(size_t Index, size_t Length) const;

It is already pending in the task tracker.

Quote
- Replacing occurrences:
sf::String Replace(const sf::String& LookFor, const sf::String& ReplaceBy) const;

Why not.

Quote
- Converting to lowercase:
sf::String ToLowerCase() const;

- Converting to uppercase:
sf::String ToUpperCase() const;

Definitely not. These operations are super complicated when dealing with the full Unicode range.
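
A small sketch of why this is hard, assuming std::u32string as a stand-in for sf::String (asciiToUpper and usageExample are hypothetical names). A per-code-point conversion cannot change the string length, yet the full Unicode uppercase form of U+00DF (ß) is the two letters "SS".

```cpp
#include <cassert>
#include <string>

// Naive upper-casing that only understands ASCII letters.
// Assumption: one code point in, one code point out.
std::u32string asciiToUpper(std::u32string text)
{
    for (char32_t& c : text)
    {
        if (c >= U'a' && c <= U'z')
            c -= U'a' - U'A'; // distance between the lower and upper ASCII ranges
    }
    return text;
}

// Small usage check.
void usageExample()
{
    // ASCII letters are converted, U+00DF is left untouched.
    // A full Unicode conversion would produce "STRASSE" (7 code points, not 6).
    assert(asciiToUpper(U"stra\u00DFe") == U"STRA\u00DFE");
}
```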

In conclusion, I may add more functions in the future, but none that require interpreting the characters.

You can find the task that already exists in the issue tracker and add your suggestions to it (only Replace please, so that I won't have to duplicate my answer there :D).

Thanks for your feedback.
Title: Extended operations for sf::String
Post by: Haikarainen on November 16, 2011, 09:11:29 am: Quote from: "Laurent"
What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

When I made this for libMy, I simply copied the default arguments of PHP's trim function, found here: http://se2.php.net/manual/en/function.trim.php

Also, I made it a single function by replacing LTrim and RTrim with an extra optional argument that can be my::Both, my::Left or my::Right (defaulting to my::Both).
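
A rough sketch of that single-function design, not the actual libMy code. The names TrimSide, trim and usageExample are hypothetical, std::u32string stands in for the string class, and the default character list is the one PHP's trim documents (space, tab, line feed, carriage return, NUL, vertical tab).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class TrimSide { Both, Left, Right }; // hypothetical names

// Strips the code points listed in `chars` from the selected side(s).
// The default list has 6 entries and contains an embedded NUL, hence the
// explicit length passed to the std::u32string constructor.
std::u32string trim(const std::u32string& text,
                    TrimSide side = TrimSide::Both,
                    const std::u32string& chars = std::u32string(U" \t\n\r\0\x0B", 6))
{
    std::size_t first = 0;
    std::size_t last = text.size(); // one past the last kept code point

    if (side != TrimSide::Right)
    {
        while (first < last && chars.find(text[first]) != std::u32string::npos)
            ++first;
    }
    if (side != TrimSide::Left)
    {
        while (last > first && chars.find(text[last - 1]) != std::u32string::npos)
            --last;
    }
    return text.substr(first, last - first);
}

// Small usage check.
void usageExample()
{
    assert(trim(U"  hi \n") == U"hi");
    assert(trim(U"  hi \n", TrimSide::Left) == U"hi \n");
    assert(trim(U"  hi \n", TrimSide::Right) == U"  hi");
}
```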

Quote
Why not.

If you do this, you might also want to add a "recursive" bool option to decide whether to replace only once or to keep replacing until the needle isn't found anymore. This has proven very useful for me when I wrote and used my own functions for this.
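
One possible reading of the "recursive" option, sketched on std::u32string: a single pass over the text by default, or repeated passes until nothing changes. The names replaceText, maxPasses and usageExample are hypothetical. The pass limit is an added safety assumption, because some needle/replacement pairs would otherwise never converge.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces `needle` by `replacement`. With `recursive` set, the whole text
// is rescanned until no occurrence is left or `maxPasses` is reached.
std::u32string replaceText(std::u32string text,
                           const std::u32string& needle,
                           const std::u32string& replacement,
                           bool recursive = false,
                           std::size_t maxPasses = 16)
{
    if (needle.empty())
        return text;

    const std::size_t passes = recursive ? maxPasses : 1;
    for (std::size_t pass = 0; pass < passes; ++pass)
    {
        bool changed = false;
        std::size_t pos = 0;
        while ((pos = text.find(needle, pos)) != std::u32string::npos)
        {
            text.replace(pos, needle.size(), replacement);
            pos += replacement.size(); // skip what was just inserted
            changed = true;
        }
        if (!changed)
            break; // nothing left to replace
    }
    return text;
}

// Small usage check: collapsing runs of spaces.
void usageExample()
{
    assert(replaceText(U"a    b", U"  ", U" ") == U"a  b");      // one pass
    assert(replaceText(U"a    b", U"  ", U" ", true) == U"a b"); // until stable
}
```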
Title: Extended operations for sf::String
Post by: Laurent on November 16, 2011, 09:20:39 am: The definition of "whitespace" is not clear at all, for example the '\n' character may or may not be included in the definition.
http://en.wikipedia.org/wiki/Whitespace_character

If we extend it to the full Unicode range, it gets a lot more complicated.
http://en.wikipedia.org/wiki/Space_(punctuation)#Spaces_in_Unicode

(sorry about the last link, I can't get it right with BBCode)
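
To give an idea of the size of the set, here is a sketch of a predicate for the code points that carry the Unicode White_Space property. The list reflects recent Unicode versions as far as I know (U+180E was part of it before Unicode 6.3), so check it against PropList.txt before relying on it. isUnicodeWhiteSpace and usageExample are hypothetical names.

```cpp
#include <cassert>

// True for code points with the Unicode White_Space property
// (assumption: list taken from recent versions of the standard).
bool isUnicodeWhiteSpace(char32_t c)
{
    return (c >= 0x0009 && c <= 0x000D) // tab, line feed, vertical tab, form feed, carriage return
        || c == 0x0020                  // space
        || c == 0x0085                  // next line
        || c == 0x00A0                  // no-break space
        || c == 0x1680                  // ogham space mark
        || (c >= 0x2000 && c <= 0x200A) // en quad .. hair space
        || c == 0x2028                  // line separator
        || c == 0x2029                  // paragraph separator
        || c == 0x202F                  // narrow no-break space
        || c == 0x205F                  // medium mathematical space
        || c == 0x3000;                 // ideographic space
}

// Small usage check.
void usageExample()
{
    assert(isUnicodeWhiteSpace(U' '));
    assert(isUnicodeWhiteSpace(0x3000));
    assert(!isUnicodeWhiteSpace(0x200B)); // zero width space is not White_Space
}
```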
Title: Extended operations for sf::String
Post by: Haze on November 17, 2011, 01:54:55 am: Quote from: "Laurent"
First, I have to say that I hate sf::String, and I don't think that I'll make it a super powerful string class.

Sure, I bet you won't turn it into something like QString.
But std::string objects are so limited...

Quote from: "Laurent"
What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

I understand your concern when dealing with Unicode; that's why I suggest keeping it fast & simple (:wink:) and just using the good old standard isspace function:
int isspace ( int c ); (http://www.cplusplus.com/reference/clibrary/cctype/isspace/)

Let's face it, that would cover most of the cases people care about.
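
A minimal sketch of that idea on UTF-32 data, with std::u32string standing in for sf::String (isAsciiSpace, trimSpaces and usageExample are hypothetical names). One caveat: std::isspace is only defined for values representable as unsigned char (or EOF), so code points outside ASCII have to be filtered out before the call.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <iterator>
#include <string>

// std::isspace restricted to the ASCII range; every other code point is
// treated as non-whitespace (the simplification proposed above).
bool isAsciiSpace(char32_t c)
{
    return c < 0x80 && std::isspace(static_cast<int>(c)) != 0;
}

// Returns a copy without leading and trailing ASCII whitespace.
std::u32string trimSpaces(const std::u32string& text)
{
    auto notSpace = [](char32_t c) { return !isAsciiSpace(c); };

    // First kept code point, searching from the front.
    auto first = std::find_if(text.begin(), text.end(), notSpace);
    // One past the last kept code point, searching from the back.
    auto last = std::find_if(text.crbegin(),
                             std::u32string::const_reverse_iterator(first),
                             notSpace).base();
    return std::u32string(first, last);
}

// Small usage check.
void usageExample()
{
    assert(trimSpaces(U" \t hi there \n") == U"hi there");
    // U+00A0 (no-break space) is outside ASCII and therefore kept.
    assert(trimSpaces(U"\u00A0x\u00A0") == U"\u00A0x\u00A0");
}
```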

Quote from: "Laurent"
You can find the task that already exists in the issue tracker and add your suggestions to it (only Replace please, so that I won't have to duplicate my answer there :D).

Done. Sorry for not noticing the GitHub issue (https://github.com/SFML/SFML/issues/21) first!
Title: Extended operations for sf::String
Post by: Laurent on November 17, 2011, 07:48:16 am: Quote
Sure, I bet you won't turn it into something like QString.
But std::string objects are so limited...

My main concern in SFML was initially to allow easy conversions between encodings and character types. That's all I needed. But then the only solution that came up was to rewrite a string class, and thus I had to duplicate a lot of functionality -- that's why I don't like sf::String.
I'd be a lot happier if people could just use their preferred string class and use SFML only to convert from/to the desired encoding. So no, I won't make something better than std::string.

Quote
just use the good old standard isspace function

A trim function would already be a step in the direction of QString. If I implement this function, why not implement truncate, chop, left, mid, right, justified, ...? :P
Title: Extended operations for sf::String
Post by: Haze on November 18, 2011, 12:40:55 am: I get your point; I guess it's all about finding the right balance.

Thanks for your time by the way, I always enjoy reading those brainstorming discussions.
Title: Extended operations for sf::String
Post by: Nexus on November 19, 2011, 12:16:45 pm: Quote from: "Haze"
But std::string objects are so limited...

std::string's interface is already heavily bloated. It is even a classic example of a monolith class, see here (http://www.gotw.ca/gotw/084.htm). I don't think it would be clever to add even more member functions.

If the functionality were added in the form of free, generic functions operating on iterator ranges, it could also be applied to containers other than std::string. You should take a look at the Boost.StringAlgorithm (http://www.boost.org/doc/libs/1_48_0/doc/html/string_algo.html) library. If you make sf::String accessible via Boost.Range, you can apply any Boost algorithm to it, just like to std::string. That's generic programming :)
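
As a rough illustration of the free-function approach, using only the standard library rather than Boost: the template below trims any bidirectional iterator range, whatever the container. trimmedRange and usageExample are hypothetical names.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Returns the sub-range of [first, last) that remains after skipping the
// leading and trailing elements for which `pred` is true.
template <typename BidirIt, typename Pred>
std::pair<BidirIt, BidirIt> trimmedRange(BidirIt first, BidirIt last, Pred pred)
{
    while (first != last && pred(*first))
        ++first;
    while (first != last && pred(*std::prev(last)))
        --last;
    return std::make_pair(first, last);
}

// Small usage check: the same function works on different containers.
void usageExample()
{
    const std::string narrow = "  hi ";
    auto a = trimmedRange(narrow.begin(), narrow.end(),
                          [](char c) { return c == ' '; });
    assert(std::string(a.first, a.second) == "hi");

    const std::u32string wide = U"  hi ";
    auto b = trimmedRange(wide.begin(), wide.end(),
                          [](char32_t c) { return c == U' '; });
    assert(std::u32string(b.first, b.second) == U"hi");
}
```

The algorithm never mentions a string type, which is the point being made here: the whitespace policy lives in the predicate and the storage lives in the container.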
Title: Extended operations for sf::String
Post by: Klaim on November 20, 2011, 02:54:30 pm: Just a suggestion: wouldn't it be simpler to keep using std::string and add a copy of the Boost string algorithms in an SFML namespace? That way you don't lose people who rely on std::string, you get algorithms for free by extracting them from Boost, and you don't have to bother with this whole string thing anymore.

I don't know if it's a good solution, but it feels like it might be a good compromise.
Title: Extended operations for sf::String
Post by: Nexus on November 20, 2011, 04:33:39 pm: What's the point of moving Boost functions to the SFML namespace?

As I mentioned, Boost.StringAlgorithms are generic. Therefore, they can also work with sf::String, as long as you define the interface for Boost.Range (e.g. free begin() and end() functions). So, there's no sense in copying the algorithms.
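
A sketch of the general idea in standard C++ only. Boost.Range has its own documented extension mechanism, which is not reproduced here. demo::Text is a hypothetical string class without begin()/end() members, and the free functions next to it are enough for a range-based for loop or a generic algorithm to iterate over it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

namespace demo
{
    // Hypothetical string class that exposes only data() and size().
    class Text
    {
    public:
        explicit Text(std::u32string data) : m_data(std::move(data)) {}
        const char32_t* data() const { return m_data.data(); }
        std::size_t size() const { return m_data.size(); }

    private:
        std::u32string m_data;
    };

    // Free functions in the class's namespace, found by argument-dependent lookup.
    inline const char32_t* begin(const Text& text) { return text.data(); }
    inline const char32_t* end(const Text& text) { return text.data() + text.size(); }
}

// Counts U+0020 characters with a range-based for loop.
std::size_t countSpaces(const demo::Text& text)
{
    std::size_t count = 0;
    for (char32_t c : text) // uses demo::begin / demo::end
    {
        if (c == U' ')
            ++count;
    }
    return count;
}

// Small usage check.
void usageExample()
{
    const demo::Text text(U"a b c");
    assert(countSpaces(text) == 2);
    assert(std::count(begin(text), end(text), U'b') == 1);
}
```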

Apart from that, std::string is not suitable for SFML because it lacks Unicode support.
Title: Extended operations for sf::String
Post by: Klaim on November 20, 2011, 05:55:48 pm: Simply to avoid adding Boost as a dependency. It wouldn't be the first project that takes the code of a Boost library and adds it to its own (Ogre and CppCms are two good examples).

About Unicode, does sf::String manage anything Unicode-related, or does it just ask the user to make sure the data is Unicode-encoded? AFAIK it's the latter, so I don't see what Unicode support would be required. Assuming that every string entering SFML is UTF-8 is good enough and keeps things simple.
Title: Extended operations for sf::String
Post by: Laurent on November 20, 2011, 06:21:27 pm: SFML works with UTF-32, so std::string is not a good candidate. And SFML does convert automatically between encodings, that's exactly why sf::String exists.
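
To show what such a conversion involves, here is a simplified UTF-8 to UTF-32 decoder. It is a sketch, not SFML's implementation: utf8ToUtf32 and usageExample are hypothetical names, malformed input is replaced by U+FFFD, and overlong forms and surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points (simplified validation).
std::u32string utf8ToUtf32(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0; // number of continuation bytes expected
        char32_t cp = 0;

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else
        {
            out += char32_t(0xFFFD); // stray continuation byte or invalid lead
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            // Every continuation byte must look like 10xxxxxx.
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out += ok ? cp : char32_t(0xFFFD);
    }
    return out;
}

// Small usage check.
void usageExample()
{
    assert(utf8ToUtf32("abc") == U"abc");
    assert(utf8ToUtf32("\xC3\xA9") == U"\u00E9");     // 2 bytes, 1 code point
    assert(utf8ToUtf32("\xE2\x82\xAC") == U"\u20AC"); // 3 bytes, 1 code point
}
```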
Title: Extended operations for sf::String
Post by: Klaim on November 20, 2011, 10:40:33 pm: Oh, OK, I didn't know that.

These days experts tend to say that, in the end, using UTF-8 makes everything simpler than the other Unicode encodings, for various reasons, so I thought it was the same with SFML.

I tend to use UTF8-CPP (http://utfcpp.sourceforge.net/) to manage conversions between encodings, as it works with std::basic_string and I know I'll have only UTF-N text anyway in my projects.
Title: Extended operations for sf::String
Post by: Laurent on November 21, 2011, 07:52:23 am: Using std::basic_string for UTF-8 or UTF-16 is really a bad idea in my opinion. Many functions of its public API don't make sense and cannot be used, because it considers that each item of the container represents one character -- which is false with these encodings.

A real UTF-8 string class must be able to hide the fact that some characters may be made of several bytes.

Quote
These days experts tend to say that, in the end, using UTF-8 makes everything simpler than the other Unicode encodings, for various reasons

I think it's true when you consider size only. English strings, for example, only use the ASCII range, so a UTF-8 English string is equivalent to an ASCII string -- no space wasted.
But for other languages, UTF-8 makes things more complicated to handle because one character can be made up of multiple bytes, and you have to hide this from the end user. So it makes your code more complicated and more expensive for the CPU.
UTF-32 is very easy to handle (one uint32 is always one character), it just costs a little more memory but this is not a problem at all for SFML's target platforms and applications.
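
A short sketch of the difference, under the assumption that the UTF-8 input is well formed (utf8Length and usageExample are hypothetical names). With UTF-8 in a std::string, size() counts bytes, and the number of code points has to be computed by walking the data. With UTF-32, size() is already that number.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by skipping continuation
// bytes (those of the form 10xxxxxx).
std::size_t utf8Length(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char b : bytes)
    {
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Small usage check with "naive" spelled with U+00EF (i with diaeresis).
void usageExample()
{
    const std::string utf8 = "na\xC3\xAFve";     // U+00EF takes two bytes
    const std::u32string utf32 = U"na\u00EFve";

    assert(utf8.size() == 6);       // bytes, not characters
    assert(utf8Length(utf8) == 5);  // linear scan needed
    assert(utf32.size() == 5);      // direct
}
```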
Title: Extended operations for sf::String
Post by: Klaim on November 21, 2011, 11:11:34 am: Quote from: "Laurent"
Using std::basic_string for UTF-8 or UTF-16 is really a bad idea in my opinion. Many functions of its public API don't make sense and cannot be used, because it considers that each item of the container represents one character -- which is false with these encodings.

A real UTF-8 string class must be able to hide the fact that some characters may be made of several bytes.

I agree; there were a lot of discussions about this problem on the Boost mailing list at the beginning of this year. Some solutions have been proposed, but I guess until someone proposes a working one the problem will remain "to be solved" for a long time :/

Quote

Quote
These days experts tend to say that, in the end, using UTF-8 makes everything simpler than the other Unicode encodings, for various reasons

I think it's true when you consider size only. English strings, for example, only use the ASCII range, so a UTF-8 English string is equivalent to an ASCII string -- no space wasted.
But for other languages, UTF-8 makes things more complicated to handle because one character can be made up of multiple bytes, and you have to hide this from the end user. So it makes your code more complicated and more expensive for the CPU.
UTF-32 is very easy to handle (one uint32 is always one character), it just costs a little more memory but this is not a problem at all for SFML's target platforms and applications.

Well, most Boost devs have other arguments about that and suggest that, in the end, even for Asian languages UTF-8 is still the best compromise.

That being said, I'm not expert enough to have a clear opinion about it. One thing they said, if I remember correctly, is that the algorithms to... err... "read"? UTF-8 should be simpler to implement than for other encodings. Or something like that.

I'm using UTF-8 mainly because it's a simple solution for me, but I've never had a problem with the sf::String class, so I guess it doesn't matter since sf::String will do the conversion automatically, right?
Title: Extended operations for sf::String
Post by: Laurent on November 21, 2011, 11:25:14 am: Could you give me a link to the related thread(s) on the boost mailing list? I'd like to read that.
Title: Extended operations for sf::String
Post by: Klaim on November 21, 2011, 01:56:15 pm: Here is the most recent discussion on the subject; it's a review of a Google Summer of Code project: http://boost.2283326.n4.nabble.com/gsoc-Request-Feedback-for-Boost-Ustr-Unicode-String-Adapter-td3725600.html

They get into the same arguments again but don't talk much about the benefits of UTF-8 in this one.

I tried to find all the discussions that spawned at the beginning of the year:

- I think this one was the root from which other discussions spawned : http://boost.2283326.n4.nabble.com/general-What-will-string-handling-in-C-look-like-in-the-future-was-Always-treat-tt3224967.html
- http://boost.2283326.n4.nabble.com/string-proposal-tt3229406.html?bcsi_scan_0FB23122DF0CD2C6=0&bcsi_scan_filename=string-proposal-tt3229406.html
- http://boost.2283326.n4.nabble.com/string-gt-text-tt3243373.html
- http://boost.2283326.n4.nabble.com/string-Realistic-API-proposal-tt3244173.html
- http://boost.2283326.n4.nabble.com/UTF-String-UTF-String-library-1-5-ready-for-perusal-tt3297381.html
- http://boost.2283326.n4.nabble.com/string-Yet-another-Unicode-string-class-tt3300220.html
- http://boost.2283326.n4.nabble.com/UTF-String-Feedback-on-UTF-String-library-please-tt3301346.html

It's in chronological order of the first posting of each thread.

That being said, the boost::locale review involved a lot of thinking on the Unicode subject, because Artyom (the Locale author) says he's a specialist in the localization domain and that in the end UTF-8 is the only long-term solution (he tends to say things in a harsh way...)

I think a lot of the Boost.Locale documentation (http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/index.html) gives information about the subject, like http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/index.html

Personally, I just wish they'd provide a std::text or something that is encoding-aware and lets me forget about this whole subject :x

(but it's fascinating to see a lot of big brains having headache-inducing discussions about such a subject)
Title: Extended operations for sf::String
Post by: Laurent on November 21, 2011, 02:17:27 pm: Nice, thank you :)
Title: Extended operations for sf::String
Post by: Klaim on November 21, 2011, 02:26:08 pm: No problem, looks like I'm not finished :D

Here is the root of the one I said was the root discussion :

http://boost.2283326.n4.nabble.com/General-Treat-narrow-strings-as-UTF-8-compilation-flag-tt3646453.html#a3686141

It refers to yet another root discussion... I'll let you find it XD

Also, some other discussions related to this (certainly referenced in the previous links):

- http://boost.2283326.n4.nabble.com/UTF-String-UTF-String-library-1-5-ready-for-perusal-tp3297381p3297381.html
- http://boost.2283326.n4.nabble.com/GSoC-Proposal-Preparation-For-Encoding-Awared-String-tp3381149p3381149.html << first post about the summer of code project
- http://boost.2283326.n4.nabble.com/GSoC-Proposal-Preparation-For-Encoding-Awared-String-tp3381149p3381149.html << abandoned implementation