Jump to content


String table for fast comparison of static strings


20 replies to this topic

#1 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 10 June 2010 - 02:00 PM

Hey all,

We all know that operating on strings can be prohibitively expensive. In my free time, I am working on my own script language, and there are several places where strings are used: during the compilation of scripts, building the AST, code generation, and assembly generation, to name a few. To help speed up compilation, I knocked together this small class, which basically replaces string comparisons with integer comparisons. The class itself is nothing more than a glorified hash-table, that allows you to insert, find, and retrieve strings from this table.


#include <vector>

#include <assert.h>


typedef size_t IndexedString;

const IndexedString InvalidString = 0;


//

// The string table class. Keeps a list of shared static strings, which 

// can be safely compared by using the IndexedString type.

// Invalid/empty string are zero

//

class StringTable

{

public:

	StringTable()

		: hash(0)

		, hashCount(8)

	{

		Rehash();


		// Reserve first slot

		Insert("__reserved__");

	}

	~StringTable()

	{

		if( hash != 0 )

			delete[] hash;


		for(std::vector<Entry>::const_iterator it = strings.begin(); it != strings.end(); ++it)

			delete[] it->string;

	}


	IndexedString Insert(const char *string)

	{

		if(string == 0 || strlen(string) == 0)

			return InvalidString;


		// See if it is already in the table

		size_t hashVal = HashFunction(string) & (hashCount-1);

		for( size_t i = hash[hashVal]; i != (size_t)-1; i = strings[i].hashNext )

		{

			if( strcmp(strings[i].string, string) == 0 )

				return i;

		}


		// Add a new entry

		Entry entry;

		size_t index = strings.size();

		entry.hashNext = hash[hashVal];

		entry.string = _strdup(string);

		hash[hashVal] = index;

		strings.push_back(entry);


		// Update hashtable if needed

		if( hashCount*2+8 < (index+1) )

		{

			hashCount *= 2;

			strings.reserve(hashCount);

			Rehash();

		}


		assert( index == Find(string) );

		return index;

	}


	IndexedString Find(const char *string)

	{

		if(string == 0 || strlen(string) == 0)

			return InvalidString;


		// See if it is already in the table

		size_t hashVal = HashFunction(string) & (hashCount-1);

		for( size_t i = hash[hashVal]; i != (size_t)-1; i = strings[i].hashNext )

		{

			if( strcmp(strings[i].string, string) == 0 )

				return i;

		}


		return InvalidString;

	}


	// Resets the table

	void Empty()

	{

		if( hash != 0 )

		{

			delete[] hash;

			hash = 0;

			hashCount = 8;

		}


		for(std::vector<Entry>::const_iterator it = strings.begin(); it != strings.end(); ++it)

			delete[] it->string;


		strings.clear();

	}


	std::string Get(IndexedString index) const

	{

		if(index >= strings.size() || index == InvalidString)

			return std::string();

		else

			return strings[index].string;

	}


protected:


	void Rehash()

	{

		assert(!(hashCount&(hashCount-1)));

		assert(hashCount>=8);


		size_t* newHash = new size_t[hashCount+1];

		memset(newHash, -1, sizeof(size_t)*hashCount);

		{

			for( size_t i=0; i<strings.size(); i++ )

			{

				Entry& entry		= strings[i];

				size_t hashVal		= HashFunction(entry.string) & (hashCount-1);

				entry.hashNext		= newHash[hashVal];

				newHash[hashVal]	= i;

			}

		}


		if( hash )

			delete[] hash;

		hash = newHash;

	}


	void Relax()

	{

		while( hashCount > (strings.size()*2+8) )

			hashCount >>= 1;

		Rehash();

	}


	size_t HashFunction(const char *s)

	{

		if(s == 0)

			return 0;


		size_t HashVal = 5381;


		while(*s != 0)

		{

			HashVal = ((HashVal << 5) + HashVal) ^ *s;

			s++;

		}


		return HashVal;

	}


protected:

	struct Entry

	{

		size_t hashNext;

		const char *string;

	};


	size_t *hash;

	size_t hashCount;

	std::vector<Entry> strings;

};


Using the above class, you can now do string-comparisons as such:

int main(int argc, char* argv[])

{

	StringTable st;


	char *sz1 = "test string nr. 1";

	char *sz2 = "test string nr. 2";

	char *sz3 = "test string nr. 3";

	char *sz4 = "test string nr. 4";


	// Same strings, different addresses

	char *sz1_dup = _strdup(sz1);

	char *sz2_dup = _strdup(sz2);

	char *sz3_dup = _strdup(sz3);

	char *sz4_dup = _strdup(sz4);


	// Insert strings into the table

	IndexedString index1 = st.Insert(sz1);

	IndexedString index2 = st.Insert(sz2);

	IndexedString index3 = st.Insert(sz3);

	IndexedString index4 = st.Insert(sz4);


	// Validate correct comparisons

	if(st.Find(sz1_dup) == index1) std::cout << "Nr. 1 is a match" << std::endl;

	if(st.Find(sz2_dup) == index2) std::cout << "Nr. 2 is a match" << std::endl;

	if(st.Find(sz3_dup) == index3) std::cout << "Nr. 3 is a match" << std::endl;

	if(st.Find(sz4_dup) == index4) std::cout << "Nr. 4 is a match" << std::endl;


	// Retrieve strings from table

	std::cout << "Index 1 = " << st.Get(index1).c_str() << std::endl;

	std::cout << "Index 2 = " << st.Get(index2).c_str() << std::endl;

	std::cout << "Index 3 = " << st.Get(index3).c_str() << std::endl;

	std::cout << "Index 4 = " << st.Get(index4).c_str() << std::endl;

	std::cout << "Invalid index 1 = " << st.Get(0).c_str() << std::endl;

	std::cout << "Invalid index 2 = " << st.Get(0x80000000).c_str() << std::endl;


	// Free allocated data

	delete [] sz1_dup;

	delete [] sz2_dup;

	delete [] sz3_dup;

	delete [] sz4_dup;


	return 0;

}



#2 mrjones

    New Member

  • Members
  • Pip
  • 4 posts

Posted 11 June 2010 - 02:04 PM

Its nice and it's especially nice to see people testing their code ;).But how would st.Find(sz1_dup) == index1 speed up the comparison when st.Find uses HashFunction(string) that calculates the hash over all string? Unless you always calculate indices for all strings and then use these instead of strings themselves. I am just asking the question based on the example code.

Also wouldn't it be easier to use stl::set, since you are using stl anyway already? It might be slower of course. What I mean is something like that:

#include <set>

typedef std::set< std::string > StringTable;
typedef StringTable::iterator IndexedString;

int main(int argc, char* argv[])
{
	StringTable st;

	std::string sz1 = "test string nr. 1";
	std::string sz2 = "test string nr. 2";
	std::string sz3 = "test string nr. 3";
	std::string sz4 = "test string nr. 4";

	// Same strings, different addresses
	std::string sz1_dup = sz1;
	std::string sz2_dup = sz2;
	std::string sz3_dup = sz3;
	std::string sz4_dup = sz4;

	// Insert strings into the table
	IndexedString index1 = st.insert(sz1).first;
	IndexedString index2 = st.insert(sz2).first;
	IndexedString index3 = st.insert(sz3).first;
	IndexedString index4 = st.insert(sz4).first;

	// Validate correct comparisons
	if(st.find(sz1_dup) == index1) std::cout << "Nr. 1 is a match" << std::endl;
	if(st.find(sz2_dup) == index2) std::cout << "Nr. 2 is a match" << std::endl;
	if(st.find(sz3_dup) == index3) std::cout << "Nr. 3 is a match" << std::endl;
	if(st.find(sz4_dup) == index4) std::cout << "Nr. 4 is a match" << std::endl;

	// Retrieve strings from table
	std::cout << "Index 1 = " << index1->c_str() << std::endl;
	std::cout << "Index 2 = " << index2->c_str() << std::endl;
	std::cout << "Index 3 = " << index3->c_str() << std::endl;
	std::cout << "Index 4 = " << index4->c_str() << std::endl;

	return 0;
}


#3 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 11 June 2010 - 03:30 PM

mrjones said:

Its nice and it's especially nice to see people testing their code :).But how would st.Find(sz1_dup) == index1 speed up the comparison when st.Find uses HashFunction(string) that calculates the hash over all string? Unless you always calculate indices for all strings and then use these instead of strings themselves. I am just asking the question based on the example code.
I suppose the sample code could have been a bit clearer :)
The point was indeed to use the Find/Insert functions only once, when a string is encountered, and then use the indices in the strings place. The way I am using it, is during the parsing of the script files all named primitives are inserted into the table, and then throughout the rest of the AST validation and codegeneration, I only use the indices to compare string. I don't touch the actual strings, unless there is a compilation error, and I need to dump some human-readable info on the error.

mrjones said:

Also wouldn't it be easier to use stl::set, since you are using stl anyway already? It might be slower of course.
I actually hadn't thought about using std::set for this. The only issue I can see with using it, is that it sorts all its entries, which is not a requirement for my specific case. This would incur some unneeded overhead, but then again, since it's only to be called once, it might be negligible. Whatever floats your boat :)
"Stupid bug! You go squish now!!" - Homer Simpson

#4 Reedbeta

    DevMaster Staff

  • Administrators
  • 5340 posts
  • LocationSanta Clara, CA

Posted 11 June 2010 - 04:51 PM

stdext::hash_set. (Or __gnu_cxx::hash_set, depending on your compiler flavor.) ;)

EDIT: There's also std::tr1::unordered_set.
reedbeta.com - developer blog, OpenGL demos, and other projects

#5 JarkkoL

    Senior Member

  • Members
  • PipPipPipPip
  • 477 posts

Posted 11 June 2010 - 05:54 PM

How about using char* instead of int and just compare those. It's nicer to debug and you can access the string without access to the table too.

Cheers, Jarkko

#6 Reedbeta

    DevMaster Staff

  • Administrators
  • 5340 posts
  • LocationSanta Clara, CA

Posted 11 June 2010 - 06:25 PM

Using ints does have the advantage that it's easy to write the table out to a file together with things referencing it, e.g. for debug info for bytecode-compiled script, or something.
reedbeta.com - developer blog, OpenGL demos, and other projects

#7 JarkkoL

    Senior Member

  • Members
  • PipPipPipPip
  • 477 posts

Posted 11 June 2010 - 06:37 PM

Yes, I know and there are other advantages and disadvantages as well. Just saying that using char* might be a better choice depending on the case.

#8 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 12 June 2010 - 05:35 PM

Reedbeta said:

stdext::hash_set. (Or __gnu_cxx::hash_set, depending on your compiler flavor.) :)

EDIT: There's also std::tr1::unordered_set.
Only thing left for me to complain about, is the size of the iterators, which clock in at ~12 bytes :p
"Stupid bug! You go squish now!!" - Homer Simpson

#9 .oisyn

    DevMaster Staff

  • Moderators
  • 1842 posts

Posted 14 June 2010 - 08:18 AM

Where can I store my wchar_t strings?

.edit: And why does Get() return a std::string, rather than a const char *?
C++ addict
-
Currently working on: the 3D engine for Tomb Raider.

#10 Nick

    Senior Member

  • Members
  • PipPipPipPip
  • 1227 posts
  • LocationOttawa, Ontario, Canada

Posted 14 June 2010 - 12:24 PM

This is a great optimization for compilers and similar projects, but it can be a nightmare for debugging. Not long ago I had to track down an obscure bug in a C preprocessor that had a string (token) table like this, and working with numbers instead of readable strings made the job a lot harder.

So I'd recommend to either use char* as JarkkoL already suggested, with additional methods to get the index of a string when you need a more compact representation (like when writing out to a file), or to just make IndexedString a struct with an additional string pointer in debug mode.

#11 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 15 June 2010 - 05:07 AM

.oisyn said:

Where can I store my wchar_t strings?
I actually wanted to use wchar_t strings, but I'm currently using flex/bison to generate the lexer and scanner, and both of them choke on non-ansi symbols, so I sticked with char* for the time being. If I were able to use wchar_t though, modifying the class to support it would be trivial, I expect.

.oisyn said:

.edit: And why does Get() return a std::string, rather than a const char *?
A lazyness/convenience combo.I mentioned ealier that I only touch the actual strings when an error occured, and I needed to dump some info. I return a std::string, so I can start concatenating the compiler error immediately to the string, and that's the only reason :)
"Stupid bug! You go squish now!!" - Homer Simpson

#12 hammer

    New Member

  • Members
  • Pip
  • 7 posts

Posted 10 January 2011 - 12:19 AM


	size_t HashFunction(const char *s)

	{

		if(s == 0)

			return 0;

		size_t HashVal = 5381;

		while(*s != 0)

		{

			HashVal = ((HashVal << 5) + HashVal) ^ *s;

			s++;

		}

		return HashVal;

	}



Is this a popular hash function? Because I've seen it before, from PHP Zhash routines.

There is an optimized routine for it:



/*

 * DJBX33A (Daniel J. Bernstein, Times 33 with Addition)

 *

 * This is Daniel J. Bernstein's popular `times 33' hash function as

 * posted by him years ago on comp.lang.c. It basically uses a function

 * like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best

 * known hash functions for strings. Because it is both computed very

 * fast and distributes very well.

 *

 * The magic of number 33, i.e. why it works better than many other

 * constants, prime or not, has never been adequately explained by

 * anyone. So I try an explanation: if one experimentally tests all

 * multipliers between 1 and 256 (as RSE did now) one detects that even

 * numbers are not useable at all. The remaining 128 odd numbers

 * (except for the number 1) work more or less all equally well. They

 * all distribute in an acceptable way and this way fill a hash table

 * with an average percent of approx. 86%.

 *

 * If one compares the Chi^2 values of the variants, the number 33 not

 * even has the best value. But the number 33 and a few other equally

 * good numbers like 17, 31, 63, 127 and 129 have nevertheless a great

 * advantage to the remaining numbers in the large set of possible

 * multipliers: their multiply operation can be replaced by a faster

 * operation based on just one shift plus either a single addition

 * or subtraction operation. And because a hash function has to both

 * distribute good _and_ has to be very fast to compute, those few

 * numbers should be preferred and seems to be the reason why Daniel J.

 * Bernstein also preferred it.

 *

 *

 *                  -- Ralf S. Engelschall <rse@engelschall.com>

 */


//

// NOTE: This is the optimized unrolled version for:

//       while (c = *str++)

//          hash = ((hash << 5) + hash) + c;

//

static inline ulong zend_inline_hash_func(char *arKey, uint nKeyLength)

{

        register ulong hash = 5381;


        /* variant with the hash unrolled eight times */

        for (; nKeyLength >= 8; nKeyLength -= 8) {

                hash = ((hash << 5) + hash) + *arKey++;		// (hash * 33) + c

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

                hash = ((hash << 5) + hash) + *arKey++;

        }

        switch (nKeyLength) {

                case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */

                case 1: hash = ((hash << 5) + hash) + *arKey++; break;

                case 0: break;

EMPTY_SWITCH_DEFAULT_CASE()

        }

        return hash;

}



#13 Kenneth Gorking

    Senior Member

  • Members
  • PipPipPipPip
  • 939 posts

Posted 10 January 2011 - 07:13 AM

[quote name='hammerIs this a popular hash function? Because I've seen it before, from PHP Zhash routines.[/QUOTE']
I actually don't know, it's just a hash function that I have had in my code for ever. I can't remember where I got it from though...
"Stupid bug! You go squish now!!" - Homer Simpson

#14 Mattias Gustavsson

    Senior Member

  • Members
  • PipPipPipPip
  • 413 posts

Posted 10 January 2011 - 07:33 AM

I made a similar thing a while back: http://www.mattiasgu...tringsasIDs.php

I spent a fair bit of time optimising it to get good performance - both in terms of speed and memory use. Feel free to grab bits of it if you have a use for it, it's all public domain.
  • www.mattiasgustavsson.com - My blog and current projects
  • www.rivtind.com - My Fantasy world and isometric RPG engine
  • www.pixieuniversity.com - My Software 2D Game Engine

#15 hammer

    New Member

  • Members
  • Pip
  • 7 posts

Posted 11 January 2011 - 12:05 AM

Mattias Gustavsson said:

I made a similar thing a while back: http://www.mattiasgu...tringsasIDs.php

I spent a fair bit of time optimising it to get good performance - both in terms of speed and memory use. Feel free to grab bits of it if you have a use for it, it's all public domain.

PHP Zhash routines has some hints to make it run faster:
- Use the optimized unrolled hash function instead.
- Store precomputed string length - use it as a preliminary comparison check.
- If you store string length, you can use memcmp() instead of strcmp(), for final comparison.

#16 Mattias Gustavsson

    Senior Member

  • Members
  • PipPipPipPip
  • 413 posts

Posted 11 January 2011 - 07:38 AM

hammer said:

PHP Zhash routines has some hints to make it run faster:
- Use the optimized unrolled hash function instead.
- Store precomputed string length - use it as a preliminary comparison check.
- If you store string length, you can use memcmp() instead of strcmp(), for final comparison.
Hey, those are good suggestions - I think I'll incorporate them in my code :yes:

I can't really do the memcmp though, as I need case-insensitive comparison for my use case.:sad:
  • www.mattiasgustavsson.com - My blog and current projects
  • www.rivtind.com - My Fantasy world and isometric RPG engine
  • www.pixieuniversity.com - My Software 2D Game Engine

#17 hammer

    New Member

  • Members
  • Pip
  • 7 posts

Posted 11 January 2011 - 09:39 AM

Mattias Gustavsson said:

Hey, those are good suggestions - I think I'll incorporate them in my code :yes:

I can't really do the memcmp though, as I need case-insensitive comparison for my use case.:sad:

Do you need to store the string as case-sensitive?

Because, if all strings are hashed as upper case, I suppose you could just convert the search string
to upper case before searching for it.

I guess that would be another improvement tip if you can use it. Hash all strings as upper (or lower) case.
Some people prefer upper case over lower, because they think the upper case conversion functions are slightly faster.
I haven't really search for proof of it though.

You can use this tip when all of your strings are filenames or filepaths, since you don't have to care about
case sensitivity. I've seen MMO games like World of Warcraft and Warhammer store all their game asset filepaths
as upper case, to improve hash lookups.

#18 Mattias Gustavsson

    Senior Member

  • Members
  • PipPipPipPip
  • 413 posts

Posted 11 January 2011 - 01:30 PM

Well, I need to save out the strings in some cases, and I need them to preserve case (well, at least preserve the case of the first string passed to it). For some cases, doing it as all upper or all lower would be fine though... I guess I could store a mixed-case AND an all upper version of each string... more memory, but would allow the memcmp when doing lookups...
  • www.mattiasgustavsson.com - My blog and current projects
  • www.rivtind.com - My Fantasy world and isometric RPG engine
  • www.pixieuniversity.com - My Software 2D Game Engine

#19 hammer

    New Member

  • Members
  • Pip
  • 7 posts

Posted 11 January 2011 - 05:26 PM

So, your strings are not filepaths?

I wouldn't store two strings just for that. I don't think it's worth it.

The memcmp() check is only done after the hash index is checked, so it's only useful when you have collisions.
Collisions should be avoided as much as possible, so I don't think it's worth the effort for something
that shouldn't happen most of time.

If it was possible to have a collision free hash routine, then you could get rid of all the code after the
hash index lookup, and just return the string. All the code here is just meant to speed up collision checking.

#20 Mattias Gustavsson

    Senior Member

  • Members
  • PipPipPipPip
  • 413 posts

Posted 11 January 2011 - 07:29 PM

Mostly no, they are mostly arbitrary names used as IDs for all sort of things, and a lot of them are read from (and written to) xml files. Using strings for IDs makes it very easy to work with when doing rapid prototyping (like when I tried making games in 2-3 hours for TigJamUK).
  • www.mattiasgustavsson.com - My blog and current projects
  • www.rivtind.com - My Fantasy world and isometric RPG engine
  • www.pixieuniversity.com - My Software 2D Game Engine





1 user(s) are reading this topic

0 members, 1 guests, 0 anonymous users