Author Topic: Extended operations for sf::String (Read 9438 times)

Haze · « **on:** November 16, 2011, 01:58:04 am »

SFML 2.0 introduces sf::String, which is really convenient for handling strings.
Since this class already has string manipulation methods (Erase, Insert, Find and iterators), it would be very useful if sf::String provided more of them.

I am thinking about basic operations such as:

- Removing leading and trailing whitespace:
sf::String Trim() const;
And why not LTrim (leading only) and RTrim (trailing only)?

- Extracting a sub-string:
sf::String Substr(size_t Index, size_t Length) const;

- Replacing occurrences:
sf::String Replace(const sf::String& LookFor, const sf::String& ReplaceBy) const;

- Converting to lowercase:
sf::String ToLowerCase() const;

- Converting to uppercase:
sf::String ToUpperCase() const;

Laurent · « **Reply #1 on:** November 16, 2011, 07:42:28 am »

First, I have to say that I hate sf::String, and I don't think that I'll make it a super powerful string class.

Quote

sf::String Trim() const;
And why not LTrim (leading only) and RTrim (trailing only)

What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

Quote

- Extracting a sub-string:
sf::String Substr(size_t Index, size_t Length) const;

It is already pending in the task tracker.

Quote

- Replacing occurrences:
sf::String Replace(const sf::String& LookFor, const sf::String& ReplaceBy) const;

Why not.

Quote

- Converting to lowercase:
sf::String ToLowerCase() const;

- Converting to uppercase:
sf::String ToUpperCase() const;

Definitely not. These operations are super complicated when dealing with the full Unicode range.

In conclusion, I may add more functions in the future, but none that require interpreting the characters.

You can search for the task that already exists in the issue tracker and add your suggestions to it (only Replace please, so that I won't have to duplicate my answer there).

Thanks for your feedback.

Haikarainen · « **Reply #2 on:** November 16, 2011, 09:11:29 am »

Quote from: "Laurent"

What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

When I made this for libMy, I simply copied the default arguments of PHP's trim function, found here: http://se2.php.net/manual/en/function.trim.php

Also, I made it a single function, by replacing LTrim and RTrim with an extra optional argument that could be either my::Both, my::Left or my::Right (default to both).
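A minimal sketch of that single-function design, for illustration only. It works on std::u32string as a stand-in for a UTF-32 string, and the names TrimSide, defaultTrimChars and trim are hypothetical (they come from neither SFML nor libMy). The default character set is assumed to match PHP's documented trim() defaults: space, tab, line feed, carriage return, NUL and vertical tab.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names, for illustration only (not SFML or libMy API).
enum class TrimSide { Both, Left, Right };

// Assumed default set, mirroring PHP's trim(): " \t\n\r\0\v" (6 characters).
const std::u32string& defaultTrimChars()
{
    static const std::u32string chars(U" \t\n\r\0\v", 6);
    return chars;
}

std::u32string trim(const std::u32string& text,
                    TrimSide side = TrimSide::Both,
                    const std::u32string& chars = defaultTrimChars())
{
    std::size_t first = 0;
    std::size_t last = text.size();   // one past the last kept character

    if (side != TrimSide::Right) {
        first = text.find_first_not_of(chars);
        if (first == std::u32string::npos)
            return std::u32string();  // only trimmable characters
    }
    if (side != TrimSide::Left) {
        const std::size_t pos = text.find_last_not_of(chars);
        if (pos == std::u32string::npos)
            return std::u32string();  // only trimmable characters
        last = pos + 1;
    }
    return text.substr(first, last - first);
}

// Usage example: one function covers Trim, LTrim and RTrim.
void trimExample()
{
    assert(trim(U"  hi \n") == U"hi");
    assert(trim(U"  hi ", TrimSide::Left) == U"hi ");
    assert(trim(U"  hi ", TrimSide::Right) == U"  hi");
}
```

Passing the character set as an argument sidesteps the question of what counts as whitespace: the caller decides.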

Quote

Why not.

If you do this, you might also want to add a "recursive" bool option to decide whether to replace only once, or to keep replacing until the needle isn't found anymore. This has proven to be very useful for me when I wrote and used my own functions for this.
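A rough sketch of what such a "recursive" option could look like, again on std::u32string as a stand-in. The function name and signature are hypothetical. One assumption is added here that the post does not mention: repeated passes are only performed when the replacement is shorter than the needle, because in that case every replacement shrinks the string and the loop is guaranteed to terminate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of SFML.
std::u32string replace(std::u32string text,
                       const std::u32string& lookFor,
                       const std::u32string& replaceBy,
                       bool recursive = false)
{
    if (lookFor.empty())
        return text;  // an empty needle would match at every position

    // Assumption of this sketch: only repeat passes when each replacement
    // shrinks the string, which guarantees termination.
    if (replaceBy.size() >= lookFor.size())
        recursive = false;

    bool changed = false;
    do {
        changed = false;
        std::size_t pos = 0;
        while ((pos = text.find(lookFor, pos)) != std::u32string::npos) {
            text.replace(pos, lookFor.size(), replaceBy);
            pos += replaceBy.size();  // continue after the inserted text
            changed = true;
        }
    } while (recursive && changed);

    return text;
}

// Usage example: collapsing runs of spaces.
void replaceExample()
{
    // Single pass: four spaces become two.
    assert(replace(U"a    b", U"  ", U" ") == U"a  b");
    // Recursive: repeated until no double space is left.
    assert(replace(U"a    b", U"  ", U" ", true) == U"a b");
}
```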

Laurent · « **Reply #3 on:** November 16, 2011, 09:20:39 am »

The definition of "whitespace" is not clear at all, for example the '\n' character may or may not be included in the definition.
http://en.wikipedia.org/wiki/Whitespace_character

If we extend it to the full Unicode range, it gets a lot more complicated.
http://en.wikipedia.org/wiki/Space_(punctuation)#Spaces_in_Unicode

(sorry about the last link, I can't get it right with BBCode)

Haze · « **Reply #4 on:** November 17, 2011, 01:54:55 am »

Quote from: "Laurent"

First, I have to say that I hate sf::String, and I don't think that I'll make it a super powerful string class.

Sure, I bet you won't turn it into something like QString.
But std::string objects are so limited...

Quote from: "Laurent"

What can be considered a whitespace? In the Unicode world, there are probably more than the usual {tab, space, line feed} set.

I understand your concern when dealing with Unicode, which is why I suggest keeping it fast & simple (:wink:) and just using the good old standard isspace function:
int isspace ( int c );

Let's face it, that would cover most of the cases people care about.
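As a sketch of that idea on UTF-32 data (std::u32string stands in for the string class; isAsciiSpace, ltrim, rtrim and trim are hypothetical names): std::isspace is only defined for values representable as unsigned char (or EOF), so the sketch assumes that every code point above 0x7F is simply treated as non-whitespace. In the default "C" locale this covers space, \t, \n, \v, \f and \r.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Guard before calling std::isspace: its argument must fit in unsigned char.
// Assumption of this sketch: non-ASCII code points are never whitespace.
bool isAsciiSpace(char32_t c)
{
    return c <= 0x7F && std::isspace(static_cast<int>(c)) != 0;
}

std::u32string ltrim(const std::u32string& text)
{
    auto first = std::find_if_not(text.begin(), text.end(), isAsciiSpace);
    return std::u32string(first, text.end());
}

std::u32string rtrim(const std::u32string& text)
{
    // base() converts the reverse iterator to "one past the last kept char".
    auto last = std::find_if_not(text.rbegin(), text.rend(), isAsciiSpace).base();
    return std::u32string(text.begin(), last);
}

std::u32string trim(const std::u32string& text)
{
    return ltrim(rtrim(text));
}

// Usage example.
void isspaceTrimExample()
{
    assert(trim(U"\t hi \n") == U"hi");
    // U+00A0 (no-break space) is left alone by this ASCII-only definition.
    assert(trim(U"\u00A0hi").size() == 3);
}
```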

Quote from: "Laurent"

You can search for the task that already exists in the issue tracker and add your suggestions to it (only Replace please, so that I won't have to duplicate my answer there).

Done. Sorry for not checking the GitHub issue first!

Laurent · « **Reply #5 on:** November 17, 2011, 07:48:16 am »

Quote

Sure, I bet you won't turn it into something like QString.
But std::string objects are so limited...

My main concern in SFML was initially to allow easy conversions between encodings and character types. That's all I needed. But then the only solution that came up was to rewrite a string class, and thus I had to duplicate a lot of functionality -- that's why I don't like sf::String.
I'd be a lot happier if people could just use their preferred string class and use SFML only to convert from/to the desired encoding. So no, I won't make something better than std::string.

Quote

just use the good old standard isspace function

A trim function would already be a step in the direction of QString. If I implement this function, why not implement truncate, chop, left, mid, right, justified, ...?

Haze · « **Reply #6 on:** November 18, 2011, 12:40:55 am »

I get your point; I guess it's all about finding the right balance.

Thanks for your time by the way, I always enjoy reading those brainstorming discussions.

Nexus · « **Reply #7 on:** November 19, 2011, 12:16:45 pm »

Quote from: "Haze"

But std::string objects are so limited...

std::string's interface is already heavily bloated. It is even a classic example of a monolith class, see here. I don't think it would be clever to add even more member functions.

If the functionality were added in the form of free, generic functions operating on iterator ranges, they could also be applied to containers other than std::string. You should take a look at the Boost.StringAlgorithm library. If you make sf::String accessible via Boost.Range, you can apply any Boost algorithm to it, just like to std::string. That's generic programming.

Klaim · « **Reply #8 on:** November 20, 2011, 02:54:30 pm »

Just a suggestion: wouldn't it be simpler to keep using std::string and add a copy of the Boost string algorithms in an SFML namespace? That way you don't lose people who rely on std::string, you get algorithms for free by extracting the Boost algorithm functions, and you don't have to bother anymore with this whole string thing.

I don't know if it's a good solution, but it feels like it might be a good compromise.

Nexus · « **Reply #9 on:** November 20, 2011, 04:33:39 pm »

What's the point of moving Boost functions to the SFML namespace?

As I mentioned, Boost.StringAlgorithms are generic. Therefore, they can also work with sf::String, as long as you define the interface for Boost.Range (e.g. free begin() and end() functions). So, there's no sense in copying the algorithms.
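To illustrate the generic approach without pulling in Boost, here is a hand-written analogue (not Boost's actual API; trimRange is a hypothetical name). Because it only needs a pair of bidirectional iterators and a predicate, the same function works on std::string, std::u32string, std::vector, or any string class that exposes begin() and end().

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Returns the sub-range of [first, last) that remains after skipping leading
// and trailing elements for which pred is true.
template <typename BidirIt, typename Pred>
std::pair<BidirIt, BidirIt> trimRange(BidirIt first, BidirIt last, Pred pred)
{
    first = std::find_if_not(first, last, pred);
    // Search backwards, but never go past the new beginning of the range.
    last = std::find_if_not(std::reverse_iterator<BidirIt>(last),
                            std::reverse_iterator<BidirIt>(first),
                            pred).base();
    return std::make_pair(first, last);
}

// Usage example: the same algorithm on a string and on a vector of ints.
void trimRangeExample()
{
    std::string s = "  abc ";
    auto rs = trimRange(s.begin(), s.end(), [](char c) { return c == ' '; });
    assert(std::string(rs.first, rs.second) == "abc");

    std::vector<int> v = {0, 0, 5, 0, 7, 0};
    auto rv = trimRange(v.begin(), v.end(), [](int x) { return x == 0; });
    assert(std::distance(rv.first, rv.second) == 3);
}
```

The caller supplies the predicate, so the algorithm itself never has to decide what "whitespace" means.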

Apart from that, std::string is not suitable for SFML because it lacks Unicode support.

Klaim · « **Reply #10 on:** November 20, 2011, 05:55:48 pm »

Simply to avoid adding Boost as a dependency. It wouldn't be the first project that takes the code of a Boost library and adds it to its own (Ogre and CppCms are two good examples).

About Unicode, does sf::String manage anything Unicode-related, or does it just ask the user to make sure the data is Unicode-encoded? AFAIK it's the second, so I don't see what Unicode support would be required. Assuming that every string entering SFML is UTF-8 is good enough and keeps things simple.

Laurent · « **Reply #11 on:** November 20, 2011, 06:21:27 pm »

SFML works with UTF-32, so std::string is not a good candidate. And SFML does convert automatically between encodings, that's exactly why sf::String exists.

Klaim · « **Reply #12 on:** November 20, 2011, 10:40:33 pm »

Oh, OK, I didn't know that.

These days experts tend to say that, in the end, using UTF-8 makes everything simpler than other Unicode encodings, for various reasons, so I thought it was the same with SFML.

I tend to use UTF8-CPP (http://utfcpp.sourceforge.net/) to manage conversions between encodings, as it works with std::basic_string and I know I'll have only UTF-N text anyway in my projects.

Laurent · « **Reply #13 on:** November 21, 2011, 07:52:23 am »

Using std::basic_string for UTF-8 or UTF-16 is really a bad idea in my opinion. Many functions of its public API don't make sense and cannot be used, because it considers that each item of the container represents one character -- which is false with these encodings.

A real UTF-8 string class must be able to hide the fact that some characters may be made of several bytes.

Quote

These days experts tend to say that, in the end, using UTF-8 makes everything simpler than other Unicode encodings, for various reasons

I think it's true when you consider size only. English strings, for example, only use the ASCII range and so a UTF-8 English string will be equivalent to an ASCII string -- no space wasted.
But for other languages, UTF-8 makes things more complicated to handle because one character can be composed of multiple bytes, and you have to hide this from the end user. So it makes your code more complicated and more expensive for the CPU.
UTF-32 is very easy to handle (one uint32 is always one character), it just costs a little more memory, but this is not a problem at all for SFML's target platforms and applications.
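A small sketch of the difference, using only the standard library (countCodePointsUtf8 is a hypothetical helper and assumes its input is valid UTF-8). With std::string holding UTF-8, size() reports bytes rather than characters, so counting requires a walk over the data; with std::u32string, size() and operator[] work per code point directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to contain valid UTF-8: every byte
// that is not a continuation byte (bit pattern 10xxxxxx) starts a code point.
std::size_t countCodePointsUtf8(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Usage example: the word "héllo" in both encodings.
void encodingExample()
{
    // U+00E9 (é) takes two bytes in UTF-8: 0xC3 0xA9.
    const std::string utf8 = "h" "\xC3\xA9" "llo";
    assert(utf8.size() == 6);                // bytes, not characters
    assert(countCodePointsUtf8(utf8) == 5);  // needs a linear scan

    const std::u32string utf32 = U"h\u00E9llo";
    assert(utf32.size() == 5);               // one element per code point
    assert(utf32[1] == U'\u00E9');           // direct indexing works
}
```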

Klaim · « **Reply #14 on:** November 21, 2011, 11:11:34 am »

Quote from: "Laurent"

Using std::basic_string for UTF-8 or UTF-16 is really a bad idea in my opinion. Many functions of its public API don't make sense and cannot be used, because it considers that each item of the container represents one character -- which is false with these encodings.

A real UTF-8 string class must be able to hide the fact that some characters may be made of several bytes.

I agree. There were a lot of discussions about this problem on the Boost mailing list at the beginning of this year. Some solutions have been proposed, but I guess that until someone proposes a working solution, the problem will remain "to be solved" for a long time :/

Quote

Quote
These days experts tend to say that, in the end, using UTF-8 makes everything simpler than other Unicode encodings, for various reasons

I think it's true when you consider size only. English strings, for example, only use the ASCII range and so a UTF-8 English string will be equivalent to an ASCII string -- no space wasted.
But for other languages, UTF-8 makes things more complicated to handle because one character can be composed of multiple bytes, and you have to hide this from the end user. So it makes your code more complicated and more expensive for the CPU.
UTF-32 is very easy to handle (one uint32 is always one character), it just costs a little more memory, but this is not a problem at all for SFML's target platforms and applications.

Well, most Boost devs have other arguments about that and suggest that, in the end, UTF-8 is still the best compromise even for Asian languages.

That being said, I'm not expert enough to have a firm opinion about it. One thing they said, if I remember correctly, is that the algorithms to ... err... "read"? UTF-8 should be simpler to implement than for other encodings. Or something like that.

I'm using UTF-8 mainly because it's a simple solution for me, but I never had a problem with the sf::String class, so I guess it doesn't matter since sf::String will do the conversion automatically, right?
