str_hash

Posted By: HeelX

str_hash - 11/17/15 21:41

I was in the need to have a hash function for strings. I copied some code from stackoverflow.com and it didn't work for strings "aa", "aaa", and so on, so I modified it a little bit. However, I don't know if it is stable - but it works for me.

Have fun:

Code:
long str_hash (STRING* str) {
	
	int c;
	long hash = 5381;
	
	char* cstr = _chr(str);
	
	int index;
	
	while (c = *cstr++, index++) {
		hash = (((hash << str_len(str)) + hash) + c + str_len(str)); 
	}

	return hash;
}

Posted By: WretchedSid

Re: str_hash - 11/17/15 23:47

You could potentially save a lot of work by saving the result of the string length.

Anyway, hash functions seem a lot of hit and miss and also a lot of guess work, so if the above one doesn't work out for someone, here is the hashing function that we use in Rayne to hash strings:

Code:
void UTF8String::RecalcuateHash()
	{
		_hash = 0;
		
		const uint8 *bytes = GetBytes();
		
		for(size_t i = 0; i < _length; i ++)
		{
			HashCombine(_hash, UTF8ToUnicode(bytes));
			bytes += (UTF8TrailingBytes[*bytes] + 1);
		}
	}



And the HashCombine function looks like this:
Code:
template<class T>
	void HashCombine(size_t &seed, const T &value)
	{
		std::hash<T> hasher;
		seed ^= static_cast<size_t>(hasher(value)) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
	}



The idea is to hash each unicode character independently and then combine all of the hashes, to scramble the bits as much as possible. So you would also need a hash function for size_t, which for Rayne is std::hash<size_t>, which uses the cityhash64 function internally.

Can be simplified quite a bit, but I'll leave that as an exercise for the reader laugh

Also, UTF8TrailingBytes is a function returning the byte length of the UTF8 character. In Lite-C that would always be 0. UTF8ToUnicode simply converts the UTF8 character to unicode, which is dead simple for ASCII characters (its just a cast).

If anyone wants to pick this up and use it for Unicode, I would suggest hashing the grapheme clusters instead of the Unicode code points.
© 2024 lite-C Forums