You could potentially save a lot of work by storing the string length once instead of recomputing it.
Anyway, hash functions seem to involve a lot of hit-and-miss and guesswork, so if the one above doesn't work out for someone, here is the function that we use in Rayne to hash strings:
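To illustrate the point about the length, here is a minimal sketch. HashBytes is a hypothetical name and the multiply-by-31 mixing step is only a placeholder, not the function discussed above; the only thing it is meant to show is that strlen() is called once, before the loop, rather than in the loop condition.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical byte-wise hash; the mixing step is just a placeholder.
std::size_t HashBytes(const char *str)
{
    // Walk the string once to find its length, instead of once per iteration
    const std::size_t length = std::strlen(str);

    std::size_t hash = 0;
    for(std::size_t i = 0; i < length; i ++)
        hash = hash * 31 + static_cast<unsigned char>(str[i]);

    return hash;
}
```

If the length is already stored alongside the string (like _length in the Rayne code), even the single strlen() call goes away.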
void UTF8String::RecalcuateHash()
{
    _hash = 0;
    const uint8 *bytes = GetBytes();

    // _length counts characters, not bytes
    for(size_t i = 0; i < _length; i ++)
    {
        HashCombine(_hash, UTF8ToUnicode(bytes));

        // Skip the lead byte plus its trailing bytes
        bytes += (UTF8TrailingBytes[*bytes] + 1);
    }
}
And the HashCombine function looks like this:
template<class T>
void HashCombine(size_t &seed, const T &value)
{
    std::hash<T> hasher;
    seed ^= static_cast<size_t>(hasher(value)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
The idea is to hash each Unicode character independently and then combine all of the hashes, to scramble the bits as much as possible. So you also need a hash function for size_t; in Rayne that is std::hash<size_t>, which uses the cityhash64 function internally.
This can be simplified quite a bit, but I'll leave that as an exercise for the reader.
Also, UTF8TrailingBytes is a lookup table, indexed by the lead byte, that gives the number of bytes following it in the UTF-8 character (that is, the character's byte length minus one). In Lite-C that would always be 0.
UTF8ToUnicode simply converts the UTF-8 character to a Unicode code point, which is dead simple for ASCII characters (it's just a cast).
If anyone wants to pick this up and use it for Unicode, I would suggest hashing the grapheme clusters instead of the Unicode code points.
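For anyone who wants to try this outside of Rayne, here is a rough standalone sketch of the same scheme. TrailingBytes, DecodeCodePoint and HashUTF8 are names I made up for this example, standing in for UTF8TrailingBytes, UTF8ToUnicode and RecalcuateHash; they are not the Rayne implementations. It assumes well-formed UTF-8, does no validation, and walks until the end of the byte buffer instead of using a stored character count. The resulting values depend on your standard library's std::hash, so don't expect them to match Rayne's.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Same combine step as in the post
template<class T>
void HashCombine(std::size_t &seed, const T &value)
{
    std::hash<T> hasher;
    seed ^= static_cast<std::size_t>(hasher(value)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Number of bytes that follow a UTF-8 lead byte (0 for plain ASCII).
// Stand-in for a lookup table; assumes well-formed input.
std::size_t TrailingBytes(std::uint8_t lead)
{
    if(lead < 0x80) return 0;           // 0xxxxxxx: ASCII
    if((lead & 0xE0) == 0xC0) return 1; // 110xxxxx
    if((lead & 0xF0) == 0xE0) return 2; // 1110xxxx
    if((lead & 0xF8) == 0xF0) return 3; // 11110xxx
    return 0; // continuation or invalid byte: treat it as a single byte
}

// Decodes the code point starting at bytes; no validation is performed
std::uint32_t DecodeCodePoint(const std::uint8_t *bytes)
{
    const std::size_t trailing = TrailingBytes(*bytes);
    if(trailing == 0)
        return *bytes; // ASCII: just a cast

    // Keep the payload bits of the lead byte, then append 6 bits per trailing byte
    std::uint32_t codePoint = *bytes & (0x3F >> trailing);
    for(std::size_t i = 1; i <= trailing; i ++)
        codePoint = (codePoint << 6) | (bytes[i] & 0x3F);

    return codePoint;
}

// Hashes every code point on its own and combines the results
std::size_t HashUTF8(const std::string &text)
{
    const std::uint8_t *bytes = reinterpret_cast<const std::uint8_t *>(text.data());
    const std::uint8_t *end = bytes + text.size();

    std::size_t hash = 0;
    while(bytes < end)
    {
        const std::size_t step = TrailingBytes(*bytes) + 1;
        if(step > static_cast<std::size_t>(end - bytes))
            break; // truncated sequence at the end of the input

        HashCombine(hash, DecodeCodePoint(bytes));
        bytes += step;
    }

    return hash;
}
```

For pure ASCII input TrailingBytes always returns 0, so the loop degenerates to one HashCombine call per byte, which is the Lite-C case mentioned above.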