flipcode - Advanced String Techniques in C++

Advanced String Techniques in C++ - Part II: A Complete String Class
by Fredrik Andersson (08 September 2000)

Introduction

This second and final part of the string tutorial focuses on the theory and implementation of string classes. You'll learn how to go beyond strlen() and its fellow functions and handle strings in a way that will probably feel a lot more intuitive to most C++ programmers.

I'll be using both ASCII and Unicode related code in this issue, and understanding my previous tutorial will be very helpful.

Strings Of The Old School

Strings in C and C++ are not intrinsic types like int or float, but rather arrays of characters terminated by a null character ('\0'). String manipulation is conducted by manipulating the character sequence directly; for instance, strlen() counts the characters in an array until it finds the terminating null. Following that same principle, a function like strcat() locates the end of a sequence of characters and copies another sequence of characters onto the end of the first sequence, thus concatenating (merging) two strings.
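To make the mechanics concrete, here is a minimal sketch of how such functions could be written. MyStrLen() and MyStrCat() are hypothetical names of my own choosing; they are not the library implementations, which are usually far more optimized.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for strlen(): count the characters up to,
// but not including, the terminating null character.
std::size_t MyStrLen(const char* s)
{
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}

// Hypothetical stand-in for strcat(): locate the end of dest, then copy
// src (including its terminator) onto it. The caller must make sure
// that dest has room for both strings.
char* MyStrCat(char* dest, const char* src)
{
    assert(dest != nullptr && src != nullptr);
    char* end = dest + MyStrLen(dest);
    while ((*end++ = *src++) != '\0')
        ;
    return dest;
}
```

Note that neither function knows how large the destination array is, which is exactly the kind of bookkeeping a string class can take off your hands.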

Such character manipulation is efficient and fairly straightforward, but it extends poorly into the world of object-oriented programming. In a world where almost everything is represented by classes, dealing directly with the nuts and bolts making up such fundamental entities as strings just feels wrong. As an analogy, consider a vector class. Almost everyone using C++ for 3D graphics has access to a vector class, enabling them to write, for instance, the following to compute the sum of two vectors:


vector sum = vec1 + vec2;

instead of:


vector sum;
vecAdd(vec1, vec2, sum);

As strings are just as fundamental as 3D vectors in today's engines, wouldn't it be nice to be able to apply the same techniques to them? Isn't string concatenation operation #1 more intuitive and easier to follow than #2?


#1
string str = "123";
str += "456";

#2
char str[256];
strcpy(str, "123");
strcat(str, "456");

They both perform the same task, that is, setting a string to be equal to "123", and then appending "456" to the end of it, thus making it contain the string "123456". Operation #1 is performed using string classes, whereas #2 is performed using regular C string functions.

If you prefer technique #2 (or if you're a C programmer), you can stop reading now, as the rest of this tutorial will describe how to implement #1.

String Classes

The type of string class that I'll describe here is a C++ class that encapsulates the data and functions necessary to represent a character string and some common operations that can be applied to it.

Class libraries from different vendors often come with string classes as well as a multitude of other useful classes. As an example, the C++ standard library (often loosely called the STL) contains a very capable string class, std::string. But since we're game programmers, we like to code things ourselves as it gives us full control and understanding of the code, am I right?

My string class is built around a regular C string, but since the C string itself is declared as protected, we're practically never allowed to mess with it from the outside. Instead, we accomplish what we want by using the class' functions and operators in the true sense of C++.

The class works with both Unicode and ASCII strings. By #defining _UNICODE before including the class' header file, the class is set to operate on Unicode strings. Otherwise, it uses ASCII strings. Remember that you must also include tchar.h before the class' header file, as the class relies on the _tcs function set (described in the previous tutorial) to transparently handle both Unicode and ASCII strings.
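The switching mechanism can be sketched in a portable way as follows. Everything here (MY_UNICODE, tchar, MY_TEXT() and tstrlen()) is a hypothetical, simplified stand-in that I made up for illustration; the real class relies on _UNICODE and the names that tchar.h provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for what tchar.h provides on Windows.
// Define MY_UNICODE before this point to switch to wide characters.
#ifdef MY_UNICODE
typedef wchar_t tchar;
#define MY_TEXT(x) L##x
inline std::size_t tstrlen(const tchar* s)
{
    assert(s != nullptr);
    return std::wcslen(s);
}
#else
typedef char tchar;
#define MY_TEXT(x) x
inline std::size_t tstrlen(const tchar* s)
{
    assert(s != nullptr);
    return std::strlen(s);
}
#endif

// Code written against tchar, MY_TEXT() and tstrlen() compiles
// unchanged for both character sets.
inline std::size_t TitleLength()
{
    const tchar* title = MY_TEXT("flipcode");
    return tstrlen(title);
}
```

The point of the sketch is that the choice of character set is made once, at compile time, and the code using the generic names never has to care.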

The class is downloadable via the link below. It might be a good idea to have it available as I'll briefly describe its member functions. The actual implementation and customization of the class, or any other type of string class that better fits your specific needs, is left as an exercise to you.

A Look At The Inside

Three member variables are defined in the class. The first one, Text, is used to hold the actual characters of the string. It's dynamically reallocated (using new and delete, or malloc() and free() if you prefer) by the two protected functions AllocStr() and FreeStr(). All memory allocations taking place inside the string class use these functions, making it easy for you to alter the way memory is handled if you're for instance using some custom memory manager. The other two member variables are integers holding the size of the memory block currently allocated for the string (Size), and the number of characters in the string (Len).
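As a rough sketch of this layout, consider the skeleton below. MiniString, Assign(), GetLength() and GetSize() are hypothetical names, the sketch is char-only, and I assume that Size is counted in characters; the downloadable class differs in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical skeleton showing the three members and the two
// allocation hooks described above.
class MiniString
{
public:
    MiniString() : Text(nullptr), Size(0), Len(0) { Assign(""); }
    explicit MiniString(const char* s) : Text(nullptr), Size(0), Len(0) { Assign(s); }
    ~MiniString() { FreeStr(); }

    // Copying is disabled only to keep the sketch short.
    MiniString(const MiniString&) = delete;
    MiniString& operator=(const MiniString&) = delete;

    void Assign(const char* s)
    {
        assert(s != nullptr);
        const std::size_t n = std::strlen(s);
        char* fresh = AllocStr(n + 1);   // room for the terminator
        std::memcpy(fresh, s, n + 1);
        FreeStr();                       // release the old block last, so that
        Text = fresh;                    // s may point into our own buffer
        Size = static_cast<int>(n + 1);
        Len = static_cast<int>(n);
    }

    const char* GetString() const { return Text; }
    int GetLength() const { return Len; }
    int GetSize() const { return Size; }

protected:
    // Every allocation goes through these two functions, so a custom
    // memory manager only has to be plugged in here.
    char* AllocStr(std::size_t count) { return new char[count]; }
    void FreeStr()
    {
        delete[] Text;
        Text = nullptr;
        Size = 0;
        Len = 0;
    }

    char* Text; // the characters, null-terminated
    int Size;   // number of characters allocated
    int Len;    // number of characters in use, excluding the terminator
};
```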

Quite a few constructors are defined: a regular constructor that empties the string, a copy constructor, and constructors that take regular C strings in both ASCII and Unicode format. It is for instance possible (with the Unicode version of the string class) to fill a class instance with the Unicode equivalent of an ASCII sequence of characters, and (of course) vice versa. There's also a set of assignment operators that match the constructors, all according to what I believe is good C++ practice.

It is possible to get a pointer to the actual character string through the accessor GetString() or the * operator, should it prove necessary.

Some interesting functions are Compare(), which compares the string to another string and returns -1, 0 or 1 to indicate the result (the same sign convention as strcmp(), which only guarantees a negative, zero or positive value); Find(), which locates a substring inside a string and returns its position; the two versions of Insert(), one of which inserts a character at a given location and one of which inserts a string; Delete(), which removes a substring from the string; and GetSubString(), which returns a part of the string.

VarArg() is used to emulate sprintf()-ish behavior. It's for instance possible to write:


CString str, str2;
str2 = T("A string");
str.VarArg(T("%s abc %i def %f"), str2.GetString(), 10, 20.9);

Notice the use of GetString() to get the address of the string for the VarArg function above. You can't just pass the string object, as C's variable argument system isn't capable of deriving a character string from it. You must explicitly send the string using the GetString() function. Also note the T() macro, which does the exact same thing as the _TEXT macro discussed in the first part of this tutorial, but saves some typing.
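For the curious, a function of this kind can be built on top of vsnprintf(). The sketch below is a char-only free function returning a std::string; FormatString() is a hypothetical name, and this is not necessarily how VarArg() itself is implemented.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper in the spirit of VarArg(): format like sprintf()
// and return the result as a std::string.
std::string FormatString(const char* format, ...)
{
    assert(format != nullptr);

    va_list args;
    va_start(args, format);
    va_list argsCopy;
    va_copy(argsCopy, args);   // the list is consumed once per pass

    // First pass: ask vsnprintf() how many characters are needed.
    const int needed = std::vsnprintf(nullptr, 0, format, args);
    va_end(args);

    std::string result;
    if (needed > 0)
    {
        // Second pass: format into a buffer of exactly the right size.
        std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
        std::vsnprintf(buffer.data(), buffer.size(), format, argsCopy);
        result.assign(buffer.data(), static_cast<std::size_t>(needed));
    }
    va_end(argsCopy);
    return result;
}
```

Just as with VarArg(), a string object has to be passed as a plain character pointer, for instance FormatString("%s abc %i", name.c_str(), 10).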

EatLeadingWhitespace() and EatTrailingWhitespace() are used to remove whitespace (spaces and tabs) from the beginning and end of a string.
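The idea behind these two functions can be sketched as free functions operating on std::string. This is a simplified illustration with a hypothetical IsBlank() helper, not the member functions of the class itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: only spaces and tabs count as whitespace here.
inline bool IsBlank(char c) { return c == ' ' || c == '\t'; }

// Remove spaces and tabs from the beginning of s.
void EatLeadingWhitespace(std::string& s)
{
    std::string::size_type i = 0;
    while (i < s.size() && IsBlank(s[i]))
        ++i;
    s.erase(0, i);
}

// Remove spaces and tabs from the end of s.
void EatTrailingWhitespace(std::string& s)
{
    std::string::size_type n = s.size();
    while (n > 0 && IsBlank(s[n - 1]))
        --n;
    s.erase(n);
}
```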

ToAnsi() and ToUnicode() are used to retrieve regular C character strings in one of the specific character sets. These are useful for instance when calling Win95/98 API functions from within a Unicode program - it's easier to call ToAnsi() on a string object to get a Windows-compatible character string than to use WideCharToMultiByte() each and every time you wish to do such a conversion.
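To show the shape of such conversions, here is a deliberately naive sketch that only handles 7-bit ASCII correctly. It uses std::string and std::wstring as stand-ins and is an assumption of mine rather than the class' actual code; a real implementation would go through a proper conversion routine such as WideCharToMultiByte().

```cpp
#include <cassert>
#include <string>

// Naive widening: correct for 7-bit ASCII only.
std::wstring ToUnicode(const std::string& ansi)
{
    std::wstring wide;
    wide.reserve(ansi.size());
    for (char c : ansi)
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    return wide;
}

// Naive narrowing: anything outside 7-bit ASCII becomes '?'.
std::string ToAnsi(const std::wstring& wide)
{
    std::string ansi;
    ansi.reserve(wide.size());
    for (wchar_t c : wide)
    {
        const bool isAscii = static_cast<unsigned long>(c) < 128UL;
        ansi.push_back(isAscii ? static_cast<char>(c) : '?');
    }
    return ansi;
}
```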

The [] operator returns a reference to the character at a specific index in the character array. Since it's a reference, the following code is perfectly legal:


CString str = T("flipcode");
str[0] = T('F');

... and will make str contain the string "Flipcode". IsValidIndex() can be used to determine if a character index is valid for a specific string.
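The reason this works is that the operator hands back a reference rather than a copy. A minimal sketch, using a hypothetical fixed-capacity TinyString class (char-only, and not the real CString), might look like this:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical fixed-capacity string, just large enough to
// demonstrate operator[] and IsValidIndex().
class TinyString
{
public:
    explicit TinyString(const char* s)
    {
        assert(s != nullptr && std::strlen(s) < sizeof(Text));
        std::strcpy(Text, s);
        Len = static_cast<int>(std::strlen(s));
    }

    // An index is valid if it refers to a character before the terminator.
    bool IsValidIndex(int index) const { return index >= 0 && index < Len; }

    // Returning a reference lets callers assign through the operator.
    char& operator[](int index)
    {
        assert(IsValidIndex(index));
        return Text[index];
    }

    // Read-only overload for const objects.
    const char& operator[](int index) const
    {
        assert(IsValidIndex(index));
        return Text[index];
    }

    const char* GetString() const { return Text; }

private:
    char Text[32];
    int Len;
};
```

Had operator[] returned a plain char, an assignment such as str[0] = 'F' would not even compile, since the result would be a temporary value rather than the stored character.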

The operators + and += are overloaded for you to be able to do concatenations quickly and easily. The + operator is defined as a friend function of the class, thus enabling you to write complex concatenation operations such as:


CString dest;
CString somestring = T(" and ");
dest = "1" + somestring + "2" + somestring + "3" + somestring + "4";
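Why a friend function rather than a member? A member operator+ would require the left operand to already be a string object, so an expression starting with a C string literal would not compile. The following sketch illustrates the idea with a hypothetical ConcatString class wrapping std::string (char-only; the real class manages its own buffer).

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper used only to demonstrate the operators.
class ConcatString
{
public:
    ConcatString() {}
    // Deliberately not explicit: C string literals convert implicitly.
    ConcatString(const char* s) : Text(s) {}

    ConcatString& operator+=(const ConcatString& rhs)
    {
        Text += rhs.Text;
        return *this;
    }

    // As a non-member friend, operator+ allows conversions on BOTH
    // operands, so a literal may appear on the left-hand side.
    friend ConcatString operator+(ConcatString lhs, const ConcatString& rhs)
    {
        lhs += rhs;
        return lhs;
    }

    const char* GetString() const { return Text.c_str(); }

private:
    std::string Text;
};
```

Implementing + in terms of += is a common idiom: the concatenation logic lives in one place, and the binary operator just works on a copy of its left operand.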

Finally, I've overloaded all of the comparison operators to call Compare() appropriately, to make it possible to compare strings using this syntax:


CString str = T("flipcode");
if (str == T("flipcode")) { ... }
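Funnelling every comparison operator through a single function keeps them consistent with each other. Here is a small sketch of that pattern with a hypothetical CmpString class (char-only, wrapping std::string); the real class may well differ in its details.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical class demonstrating comparison operators that all
// forward to one Compare() function.
class CmpString
{
public:
    CmpString(const char* s) : Text(s) {}

    // Returns -1, 0 or 1. strcmp() only guarantees the sign of its
    // result, so the value is clamped here.
    int Compare(const CmpString& other) const
    {
        const int r = std::strcmp(Text.c_str(), other.Text.c_str());
        return (r > 0) - (r < 0);
    }

    friend bool operator==(const CmpString& a, const CmpString& b) { return a.Compare(b) == 0; }
    friend bool operator!=(const CmpString& a, const CmpString& b) { return a.Compare(b) != 0; }
    friend bool operator< (const CmpString& a, const CmpString& b) { return a.Compare(b) <  0; }
    friend bool operator<=(const CmpString& a, const CmpString& b) { return a.Compare(b) <= 0; }
    friend bool operator> (const CmpString& a, const CmpString& b) { return a.Compare(b) >  0; }
    friend bool operator>=(const CmpString& a, const CmpString& b) { return a.Compare(b) >= 0; }

private:
    std::string Text;
};
```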

What About Templates?

For the uninitiated, templates are C++'s way of achieving data type independence. Many programmers would probably implement a string class as a template class and thus make the character type selectable between char and wchar_t for each string. This works perfectly well, but for a number of reasons my string class is not a template class:

A native string format (ASCII or Unicode) is used internally in every string object created from my class, depending on whether or not _UNICODE has been defined. However, the constructors and assignment operators support both ASCII and Unicode (through function overloading), and the class can therefore be used transparently for both string types without the need for templates.
Templates force the compiler to generate an awful lot of extra code to handle the different data types. Using internal conversions instead of templates is clearly faster if done correctly.
Using templates would force the programmer into making considerations about which string format to use every time he defines a string. The way I see it, there's no need for specialized, separate Unicode and ASCII string classes as the programmer will want to work with a generic class that uses the engine's native string format in all cases.

A Few Things to Keep In Mind

The string class is hardly as efficient as it could have been, were it not an educational piece of code. Many operations are performed per character, whereas memcpy() or memmove() operations could be faster. Another inefficiency lies in the fact that the character string is reallocated for every character inserted into or removed from it. A better approach would be to let AllocStr() allocate more memory than is actually needed. By leaving such memory vacant for future operations, future allocations can be avoided.
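One way to over-allocate is to double the capacity every time the buffer runs out of room. The sketch below uses a hypothetical append-only GrowBuffer class to show the effect; the doubling factor is my own assumption, and the realloc counter exists purely for demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical append-only buffer that grows geometrically instead of
// reallocating for every single character.
class GrowBuffer
{
public:
    GrowBuffer() : Text(new char[1]), Size(1), Len(0), Reallocs(0) { Text[0] = '\0'; }
    ~GrowBuffer() { delete[] Text; }

    // Copying is disabled only to keep the sketch short.
    GrowBuffer(const GrowBuffer&) = delete;
    GrowBuffer& operator=(const GrowBuffer&) = delete;

    void Append(char c)
    {
        if (Len + 2 > Size)      // need room for c plus the terminator
            Reserve(Size * 2);   // assumption: double the capacity
        Text[Len++] = c;
        Text[Len] = '\0';
    }

    const char* GetString() const { return Text; }
    std::size_t GetLength() const { return Len; }
    int GetReallocCount() const { return Reallocs; }

private:
    void Reserve(std::size_t newSize)
    {
        assert(newSize > Len);
        char* fresh = new char[newSize];
        std::memcpy(fresh, Text, Len + 1);   // copy including the terminator
        delete[] Text;
        Text = fresh;
        Size = newSize;
        ++Reallocs;
    }

    char* Text;
    std::size_t Size;   // characters allocated
    std::size_t Len;    // characters in use, excluding the terminator
    int Reallocs;       // demonstration only
};
```

With this scheme, appending 100 characters one at a time triggers 7 reallocations (capacities 2, 4, 8, 16, 32, 64 and 128) instead of 100.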

Another memory optimization comes to mind. malloc() and free() are not as fast as we'd like them to be; using a global pool-based allocator for strings would probably be faster.

But I won't spoil you with such luxury; all such optimizations are left as exercises for the reader.

Downloads

You may download the string class source code (CString.h) here:
article_advstrings_cstring.h

Closing

I bet you're fed up with strings now.

Fredrik Andersson (f01fan@efd.lth.se)
Lead Programmer, Herring Interactive

Article Series:

Advanced String Techniques in C++ - Part I: Unicode
Advanced String Techniques in C++ - Part II: A Complete String Class