String table for fast comparison of static strings

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jun 10, 2010 at 14:00

Hey all,

We all know that operating on strings can be prohibitively expensive. In my free time, I am working on my own script language, and there are several places where strings are used: during the compilation of scripts, building the AST, code generation, and assembly generation, to name a few. To help speed up compilation, I knocked together this small class, which basically replaces string comparisons with integer comparisons. The class itself is nothing more than a glorified hash-table, that allows you to insert, find, and retrieve strings from this table.

#include <vector>
#include <assert.h>

typedef size_t IndexedString;
const IndexedString InvalidString = 0;

//
// The string table class. Keeps a list of shared static strings, which 
// can be safely compared by using the IndexedString type.
// Invalid/empty string are zero
//
class StringTable
{
public:
    StringTable()
        : hash(0)
        , hashCount(8)
    {
        Rehash();

        // Reserve first slot
        Insert("__reserved__");
    }
    ~StringTable()
    {
        if( hash != 0 )
            delete[] hash;

        for(std::vector<Entry>::const_iterator it = strings.begin(); it != strings.end(); ++it)
            delete[] it->string;
    }

    IndexedString Insert(const char *string)
    {
        if(string == 0 || strlen(string) == 0)
            return InvalidString;

        // See if it is already in the table
        size_t hashVal = HashFunction(string) & (hashCount-1);
        for( size_t i = hash[hashVal]; i != (size_t)-1; i = strings[i].hashNext )
        {
            if( strcmp(strings[i].string, string) == 0 )
                return i;
        }

        // Add a new entry
        Entry entry;
        size_t index = strings.size();
        entry.hashNext = hash[hashVal];
        entry.string = _strdup(string);
        hash[hashVal] = index;
        strings.push_back(entry);

        // Update hashtable if needed
        if( hashCount*2+8 < (index+1) )
        {
            hashCount *= 2;
            strings.reserve(hashCount);
            Rehash();
        }

        assert( index == Find(string) );
        return index;
    }

    IndexedString Find(const char *string)
    {
        if(string == 0 || strlen(string) == 0)
            return InvalidString;

        // See if it is already in the table
        size_t hashVal = HashFunction(string) & (hashCount-1);
        for( size_t i = hash[hashVal]; i != (size_t)-1; i = strings[i].hashNext )
        {
            if( strcmp(strings[i].string, string) == 0 )
                return i;
        }

        return InvalidString;
    }

    // Resets the table
    void Empty()
    {
        if( hash != 0 )
        {
            delete[] hash;
            hash = 0;
            hashCount = 8;
        }

        for(std::vector<Entry>::const_iterator it = strings.begin(); it != strings.end(); ++it)
            delete[] it->string;

        strings.clear();
    }

    std::string Get(IndexedString index) const
    {
        if(index >= strings.size() || index == InvalidString)
            return std::string();
        else
            return strings[index].string;
    }

protected:

    void Rehash()
    {
        assert(!(hashCount&(hashCount-1)));
        assert(hashCount>=8);

        size_t* newHash = new size_t[hashCount+1];
        memset(newHash, -1, sizeof(size_t)*hashCount);
        {
            for( size_t i=0; i<strings.size(); i++ )
            {
                Entry& entry        = strings[i];
                size_t hashVal      = HashFunction(entry.string) & (hashCount-1);
                entry.hashNext      = newHash[hashVal];
                newHash[hashVal]    = i;
            }
        }

        if( hash )
            delete[] hash;
        hash = newHash;
    }

    void Relax()
    {
        while( hashCount > (strings.size()*2+8) )
            hashCount >>= 1;
        Rehash();
    }

    size_t HashFunction(const char *s)
    {
        if(s == 0)
            return 0;

        size_t HashVal = 5381;

        while(*s != 0)
        {
            HashVal = ((HashVal << 5) + HashVal) ^ *s;
            s++;
        }

        return HashVal;
    }

protected:
    struct Entry
    {
        size_t hashNext;
        const char *string;
    };

    size_t *hash;
    size_t hashCount;
    std::vector<Entry> strings;
};

Using the above class, you can now do string-comparisons as such:

int main(int argc, char* argv[])
{
    StringTable st;

    char *sz1 = "test string nr. 1";
    char *sz2 = "test string nr. 2";
    char *sz3 = "test string nr. 3";
    char *sz4 = "test string nr. 4";

    // Same strings, different addresses
    char *sz1_dup = _strdup(sz1);
    char *sz2_dup = _strdup(sz2);
    char *sz3_dup = _strdup(sz3);
    char *sz4_dup = _strdup(sz4);

    // Insert strings into the table
    IndexedString index1 = st.Insert(sz1);
    IndexedString index2 = st.Insert(sz2);
    IndexedString index3 = st.Insert(sz3);
    IndexedString index4 = st.Insert(sz4);

    // Validate correct comparisons
    if(st.Find(sz1_dup) == index1) std::cout << "Nr. 1 is a match" << std::endl;
    if(st.Find(sz2_dup) == index2) std::cout << "Nr. 2 is a match" << std::endl;
    if(st.Find(sz3_dup) == index3) std::cout << "Nr. 3 is a match" << std::endl;
    if(st.Find(sz4_dup) == index4) std::cout << "Nr. 4 is a match" << std::endl;

    // Retrieve strings from table
    std::cout << "Index 1 = " << st.Get(index1).c_str() << std::endl;
    std::cout << "Index 2 = " << st.Get(index2).c_str() << std::endl;
    std::cout << "Index 3 = " << st.Get(index3).c_str() << std::endl;
    std::cout << "Index 4 = " << st.Get(index4).c_str() << std::endl;
    std::cout << "Invalid index 1 = " << st.Get(0).c_str() << std::endl;
    std::cout << "Invalid index 2 = " << st.Get(0x80000000).c_str() << std::endl;

    // Free allocated data
    delete [] sz1_dup;
    delete [] sz2_dup;
    delete [] sz3_dup;
    delete [] sz4_dup;

    return 0;
}

20 Replies

Please log in or register to post a reply.

55699257f01e693afdaa2ec8750d7125
0
mrjones 101 Jun 11, 2010 at 14:04

Its nice and it’s especially nice to see people testing their code ;).But how would st.Find(sz1_dup) == index1 speed up the comparison when st.Find uses HashFunction(string) that calculates the hash over all string? Unless you always calculate indices for all strings and then use these instead of strings themselves. I am just asking the question based on the example code.

Also wouldn’t it be easier to use stl::set, since you are using stl anyway already? It might be slower of course. What I mean is something like that:

#include <set>

typedef std::set< std::string > StringTable;
typedef StringTable::iterator IndexedString;

int main(int argc, char* argv[])
{
    StringTable st;

    std::string sz1 = "test string nr. 1";
    std::string sz2 = "test string nr. 2";
    std::string sz3 = "test string nr. 3";
    std::string sz4 = "test string nr. 4";

    // Same strings, different addresses
    std::string sz1_dup = sz1;
    std::string sz2_dup = sz2;
    std::string sz3_dup = sz3;
    std::string sz4_dup = sz4;

    // Insert strings into the table
    IndexedString index1 = st.insert(sz1).first;
    IndexedString index2 = st.insert(sz2).first;
    IndexedString index3 = st.insert(sz3).first;
    IndexedString index4 = st.insert(sz4).first;

    // Validate correct comparisons
    if(st.find(sz1_dup) == index1) std::cout << "Nr. 1 is a match" << std::endl;
    if(st.find(sz2_dup) == index2) std::cout << "Nr. 2 is a match" << std::endl;
    if(st.find(sz3_dup) == index3) std::cout << "Nr. 3 is a match" << std::endl;
    if(st.find(sz4_dup) == index4) std::cout << "Nr. 4 is a match" << std::endl;

    // Retrieve strings from table
    std::cout << "Index 1 = " << index1->c_str() << std::endl;
    std::cout << "Index 2 = " << index2->c_str() << std::endl;
    std::cout << "Index 3 = " << index3->c_str() << std::endl;
    std::cout << "Index 4 = " << index4->c_str() << std::endl;

    return 0;
}
46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jun 11, 2010 at 15:30

@mrjones

Its nice and it’s especially nice to see people testing their code :).But how would st.Find(sz1_dup) == index1 speed up the comparison when st.Find uses HashFunction(string) that calculates the hash over all string? Unless you always calculate indices for all strings and then use these instead of strings themselves. I am just asking the question based on the example code.

I suppose the sample code could have been a bit clearer :)
The point was indeed to use the Find/Insert functions only once, when a string is encountered, and then use the indices in the strings place. The way I am using it, is during the parsing of the script files all named primitives are inserted into the table, and then throughout the rest of the AST validation and codegeneration, I only use the indices to compare string. I don’t touch the actual strings, unless there is a compilation error, and I need to dump some human-readable info on the error.
@mrjones

Also wouldn’t it be easier to use stl::set, since you are using stl anyway already? It might be slower of course.

I actually hadn’t thought about using std::set for this. The only issue I can see with using it, is that it sorts all its entries, which is not a requirement for my specific case. This would incur some unneeded overhead, but then again, since it’s only to be called once, it might be negligible. Whatever floats your boat :)

A8433b04cb41dd57113740b779f61acb
0
Reedbeta 167 Jun 11, 2010 at 16:51

stdext::hash_set. (Or __gnu_cxx::hash_set, depending on your compiler flavor.) ;)

EDIT: There’s also std::tr1::unordered_set.

Fe8a5d0ee91f9db7f5b82b8fd4a4e1e6
0
JarkkoL 102 Jun 11, 2010 at 17:54

How about using char* instead of int and just compare those. It’s nicer to debug and you can access the string without access to the table too.

Cheers, Jarkko

A8433b04cb41dd57113740b779f61acb
0
Reedbeta 167 Jun 11, 2010 at 18:25

Using ints does have the advantage that it’s easy to write the table out to a file together with things referencing it, e.g. for debug info for bytecode-compiled script, or something.

Fe8a5d0ee91f9db7f5b82b8fd4a4e1e6
0
JarkkoL 102 Jun 11, 2010 at 18:37

Yes, I know and there are other advantages and disadvantages as well. Just saying that using char* might be a better choice depending on the case.

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jun 12, 2010 at 17:35

@Reedbeta

stdext::hash_set. (Or __gnu_cxx::hash_set, depending on your compiler flavor.) :) EDIT: There’s also std::tr1::unordered_set.

Only thing left for me to complain about, is the size of the iterators, which clock in at \~12 bytes :p

340bf64ac6abda6e40f7e860279823cb
0
_oisyn 101 Jun 14, 2010 at 08:18

Where can I store my wchar_t strings?

.edit: And why does Get() return a std::string, rather than a const char *?

99f6aeec9715bb034bba93ba2a7eb360
0
Nick 102 Jun 14, 2010 at 12:24

This is a great optimization for compilers and similar projects, but it can be a nightmare for debugging. Not long ago I had to track down an obscure bug in a C preprocessor that had a string (token) table like this, and working with numbers instead of readable strings made the job a lot harder.

So I’d recommend to either use char* as JarkkoL already suggested, with additional methods to get the index of a string when you need a more compact representation (like when writing out to a file), or to just make IndexedString a struct with an additional string pointer in debug mode.

46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jun 15, 2010 at 05:07

@.oisyn

Where can I store my wchar_t strings?

I actually wanted to use wchar_t strings, but I’m currently using flex/bison to generate the lexer and scanner, and both of them choke on non-ansi symbols, so I sticked with char* for the time being. If I were able to use wchar_t though, modifying the class to support it would be trivial, I expect.
@.oisyn

.edit: And why does Get() return a std::string, rather than a const char *?

A lazyness/convenience combo.I mentioned ealier that I only touch the actual strings when an error occured, and I needed to dump some info. I return a std::string, so I can start concatenating the compiler error immediately to the string, and that’s the only reason :)

48851e55a43deff5442b2e33dc35a0b4
0
hammer 101 Jan 10, 2011 at 00:19
    size_t HashFunction(const char *s)
    {
        if(s == 0)
            return 0;
        size_t HashVal = 5381;
        while(*s != 0)
        {
            HashVal = ((HashVal << 5) + HashVal) ^ *s;
            s++;
        }
        return HashVal;
    }

Is this a popular hash function? Because I’ve seen it before, from PHP Zhash routines.

There is an optimized routine for it:

/*
 * DJBX33A (Daniel J. Bernstein, Times 33 with Addition)
 *
 * This is Daniel J. Bernstein's popular `times 33' hash function as
 * posted by him years ago on comp.lang.c. It basically uses a function
 * like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
 * known hash functions for strings. Because it is both computed very
 * fast and distributes very well.
 *
 * The magic of number 33, i.e. why it works better than many other
 * constants, prime or not, has never been adequately explained by
 * anyone. So I try an explanation: if one experimentally tests all
 * multipliers between 1 and 256 (as RSE did now) one detects that even
 * numbers are not useable at all. The remaining 128 odd numbers
 * (except for the number 1) work more or less all equally well. They
 * all distribute in an acceptable way and this way fill a hash table
 * with an average percent of approx. 86%.
 *
 * If one compares the Chi^2 values of the variants, the number 33 not
 * even has the best value. But the number 33 and a few other equally
 * good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
 * advantage to the remaining numbers in the large set of possible
 * multipliers: their multiply operation can be replaced by a faster
 * operation based on just one shift plus either a single addition
 * or subtraction operation. And because a hash function has to both
 * distribute good _and_ has to be very fast to compute, those few
 * numbers should be preferred and seems to be the reason why Daniel J.
 * Bernstein also preferred it.
 *
 *
 *                  -- Ralf S. Engelschall <rse@engelschall.com>
 */

//
// NOTE: This is the optimized unrolled version for:
//       while (c = *str++)
//          hash = ((hash << 5) + hash) + c;
//
static inline ulong zend_inline_hash_func(char *arKey, uint nKeyLength)
{
        register ulong hash = 5381;

        /* variant with the hash unrolled eight times */
        for (; nKeyLength >= 8; nKeyLength -= 8) {
                hash = ((hash << 5) + hash) + *arKey++;       // (hash * 33) + c
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
                hash = ((hash << 5) + hash) + *arKey++;
        }
        switch (nKeyLength) {
                case 7: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 6: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 5: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 4: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 3: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 2: hash = ((hash << 5) + hash) + *arKey++; /* fallthrough... */
                case 1: hash = ((hash << 5) + hash) + *arKey++; break;
                case 0: break;
EMPTY_SWITCH_DEFAULT_CASE()
        }
        return hash;
}
46407cc1bdfbd2db4f6e8876d74f990a
0
Kenneth_Gorking 101 Jan 10, 2011 at 07:13

@hammerIs this a popular hash function? Because I

I actually don’t know, it’s just a hash function that I have had in my code for ever. I can’t remember where I got it from though…

8563f7b73aeb34bb8604f1dd8f546c88
0
Mattias_Gustavsson 101 Jan 10, 2011 at 07:33

I made a similar thing a while back: http://www.mattiasgustavsson.com/Blog/Entries/UsingStringsasIDs.php

I spent a fair bit of time optimising it to get good performance - both in terms of speed and memory use. Feel free to grab bits of it if you have a use for it, it’s all public domain.

48851e55a43deff5442b2e33dc35a0b4
0
hammer 101 Jan 11, 2011 at 00:05

@Mattias Gustavsson

I made a similar thing a while back: http://www.mattiasgustavsson.com/Blog/Entries/UsingStringsasIDs.php I spent a fair bit of time optimising it to get good performance - both in terms of speed and memory use. Feel free to grab bits of it if you have a use for it, it’s all public domain.

PHP Zhash routines has some hints to make it run faster:
- Use the optimized unrolled hash function instead.
- Store precomputed string length - use it as a preliminary comparison check.
- If you store string length, you can use memcmp() instead of strcmp(), for final comparison.

8563f7b73aeb34bb8604f1dd8f546c88
0
Mattias_Gustavsson 101 Jan 11, 2011 at 07:38

@hammer

PHP Zhash routines has some hints to make it run faster:
- Use the optimized unrolled hash function instead.
- Store precomputed string length - use it as a preliminary comparison check.
- If you store string length, you can use memcmp() instead of strcmp(), for final comparison.

Hey, those are good suggestions - I think I’ll incorporate them in my code :yes:

I can’t really do the memcmp though, as I need case-insensitive comparison for my use case.:sad:

48851e55a43deff5442b2e33dc35a0b4
0
hammer 101 Jan 11, 2011 at 09:39

@Mattias Gustavsson

Hey, those are good suggestions - I think I’ll incorporate them in my code :yes: I can’t really do the memcmp though, as I need case-insensitive comparison for my use case.:sad:

Do you need to store the string as case-sensitive?

Because, if all strings are hashed as upper case, I suppose you could just convert the search string
to upper case before searching for it.

I guess that would be another improvement tip if you can use it. Hash all strings as upper (or lower) case.
Some people prefer upper case over lower, because they think the upper case conversion functions are slightly faster.
I haven’t really search for proof of it though.

You can use this tip when all of your strings are filenames or filepaths, since you don’t have to care about
case sensitivity. I’ve seen MMO games like World of Warcraft and Warhammer store all their game asset filepaths
as upper case, to improve hash lookups.

8563f7b73aeb34bb8604f1dd8f546c88
0
Mattias_Gustavsson 101 Jan 11, 2011 at 13:30

Well, I need to save out the strings in some cases, and I need them to preserve case (well, at least preserve the case of the first string passed to it). For some cases, doing it as all upper or all lower would be fine though… I guess I could store a mixed-case AND an all upper version of each string… more memory, but would allow the memcmp when doing lookups…

48851e55a43deff5442b2e33dc35a0b4
0
hammer 101 Jan 11, 2011 at 17:26

So, your strings are not filepaths?

I wouldn’t store two strings just for that. I don’t think it’s worth it.

The memcmp() check is only done after the hash index is checked, so it’s only useful when you have collisions.
Collisions should be avoided as much as possible, so I don’t think it’s worth the effort for something
that shouldn’t happen most of time.

If it was possible to have a collision free hash routine, then you could get rid of all the code after the
hash index lookup, and just return the string. All the code here is just meant to speed up collision checking.

8563f7b73aeb34bb8604f1dd8f546c88
0
Mattias_Gustavsson 101 Jan 11, 2011 at 19:29

Mostly no, they are mostly arbitrary names used as IDs for all sort of things, and a lot of them are read from (and written to) xml files. Using strings for IDs makes it very easy to work with when doing rapid prototyping (like when I tried making games in 2-3 hours for TigJamUK).

48851e55a43deff5442b2e33dc35a0b4
0
hammer 101 Jan 12, 2011 at 00:58

Do these string IDs vary in length?

The string length check would only help speed up collisions for strings that vary in length, like filepaths.
If you have something like a database of 20-digit serial numbers, for example, that don’t vary in length,
then this check would do nothing to resolve collisions.

And memcmp() is only slightly faster than strcmp(), since it doesn’t need to check the ending NULL.