You could potentially save a lot of work by storing the string length once instead of recomputing it.

Anyway, hash functions seem to involve a lot of hit and miss and a lot of guesswork, so if the one above doesn't work out for someone, here is the hash function that we use in Rayne to hash strings:

Code:
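void UTF8String::RecalcuateHash()
	{
		_hash = 0;
		
		const uint8 *bytes = GetBytes();
		
		// _length is presumably the number of characters, not bytes,
		// since each iteration consumes one whole character
		for(size_t i = 0; i < _length; i ++)
		{
			// Hash the code point at the current position and fold it into _hash
			HashCombine(_hash, UTF8ToUnicode(bytes));
			// Skip the lead byte plus its trailing bytes to reach the next character
			bytes += (UTF8TrailingBytes[*bytes] + 1);
		}
	}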



And the HashCombine function looks like this:
Code:
template<class T>
	void HashCombine(size_t &seed, const T &value)
	{
		std::hash<T> hasher;
		seed ^= static_cast<size_t>(hasher(value)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
	}



The idea is to hash each Unicode character independently and then combine all of the hashes, to scramble the bits as much as possible. So you would also need a hash function for size_t, which for Rayne is std::hash<size_t>, which uses the CityHash64 function internally.

It can be simplified quite a bit, but I'll leave that as an exercise for the reader.
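For anyone who only needs the ASCII case, here is a rough sketch of what a simplified version could look like. It is not what Rayne does: HashASCII is a made-up name, and it assumes every byte is exactly one character, so no UTF-8 decoding is needed. It uses whatever std::hash your standard library ships for integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Same combine step as in the post above
template<class T>
void HashCombine(std::size_t &seed, const T &value)
{
	std::hash<T> hasher;
	seed ^= static_cast<std::size_t>(hasher(value)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical helper (not part of Rayne): hashes an ASCII string,
// treating every byte as one character
std::size_t HashASCII(const std::string &text)
{
	std::size_t hash = 0;

	for(unsigned char c : text)
		HashCombine(hash, static_cast<std::uint32_t>(c)); // for ASCII the "decode" is just a cast

	return hash;
}
```

Because the running seed is mixed into every step, the result depends on the order of the characters, so "ab" and "ba" end up with different hashes.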

Also, UTF8TrailingBytes is a lookup table, indexed by the lead byte, that gives the number of trailing bytes of the UTF-8 character (its byte length minus one). In Lite-C that would always be 0. UTF8ToUnicode simply converts the UTF-8 character to its Unicode code point, which is dead simple for ASCII characters (it's just a cast).
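In case the two helpers are unclear, here is a minimal sketch of how they could be written. TrailingBytes and DecodeCodePoint are made-up stand-ins for UTF8TrailingBytes and UTF8ToUnicode, not Rayne's actual code, and the sketch assumes well-formed UTF-8 without doing any validation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UTF8TrailingBytes: number of continuation bytes
// following the given lead byte (0 for ASCII). Assumes well-formed UTF-8.
inline std::size_t TrailingBytes(std::uint8_t lead)
{
	if(lead < 0x80) return 0;            // 0xxxxxxx: ASCII
	if((lead & 0xE0) == 0xC0) return 1;  // 110xxxxx
	if((lead & 0xF0) == 0xE0) return 2;  // 1110xxxx
	if((lead & 0xF8) == 0xF0) return 3;  // 11110xxx
	return 0;                            // invalid lead byte: treat it as a single byte
}

// Hypothetical stand-in for UTF8ToUnicode: decodes the code point starting at bytes
inline std::uint32_t DecodeCodePoint(const std::uint8_t *bytes)
{
	const std::size_t trailing = TrailingBytes(bytes[0]);
	if(trailing == 0)
		return bytes[0]; // ASCII: effectively just a cast

	// Keep only the payload bits of the lead byte,
	// then append 6 bits from each continuation byte
	std::uint32_t codePoint = bytes[0] & (0x3F >> trailing);
	for(std::size_t i = 1; i <= trailing; i ++)
		codePoint = (codePoint << 6) | (bytes[i] & 0x3F);

	return codePoint;
}
```

A real implementation would probably use a 256 entry table instead of the branches, which is what the array indexing in the hash loop suggests.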

If anyone wants to pick this up and use it for Unicode, I would suggest hashing the grapheme clusters instead of the Unicode code points.


Shitlord by trade and passion. Graphics programmer at Laminar Research.
I write blog posts at feresignum.com