ID:154048
 
I'm using a program that converts each letter in a word into it's ascii equivilent value, then combines the results to create a number value for that word. But, the thing that I need to do is to have each word value be totally unique. Does anyone have a solution for how to achieve this?
Sorta like creating a unique token number for each word?

I think there is a C library that does that... something about adding the ascii values together then doing some Boolean math (AND, OR, XOR, etc.) to that total against the number of characters in the word.

I'll look around and see if I can recall where I saw that...
The solution to this problem is called a "perfect hashing equation." It doesn't really exist. You can perfectly hash (hashing=translating a text string into a number) data if you know enough about it to begin with (for example, if every word you need to represent begins with a different letter, you can just hash each word as the number representing that first letter). In general, though, it's difficult to hash arbitrary data. Typically what is needed is a system to resolve collisions (when two strings get hashed to the same number). There are a number of ways that this is done, but none of them are that simple. If you still want to try it, I can look up some common hashing functions (if BYOND were Java there would be one built in.. ) and explain some different collision resolution methods.

-AbyssDragon
In response to AbyssDragon
AbyssDragon wrote:
In general, though, it's difficult to hash arbitrary
data. Typically what is needed is a system to resolve
collisions (when two strings get hashed to the same
number). There are a number of ways that this is done,
but none of them are that simple.

One simple way (well, to me that is) that came to mind right away requires tracking, or storing the 'dictionary' of number/word combos. When creating a new hash, you check if it exists in the dictionary. If it does, just add another number to the hash (maybe a simple increment or random number), and check again. Repeat until you get a unique number, then store it in the dictionary and start on the next word.

Granted, this isn't the most elegant solution, but it is straightforward, and could be implemented easily with the hash algorithm that A.D. finds...
Foomer wrote:
I'm using a program that converts each letter in a word into it's ascii equivilent value, then combines the results to create a number value for that word. But, the thing that I need to do is to have each word value be totally unique. Does anyone have a solution for how to achieve this?

As long as words only use letters just think of the alphabet as being a base 26 number system and convert them to base 10.
In response to Theodis
I'm using a program that converts each letter in a word into it's ascii equivilent value, then combines the results to create a number value for that word. But, the thing that I need to do is to have each word value be totally unique. Does anyone have a solution for how to achieve this?

As long as words only use letters just think of the alphabet as being a base 26 number system and convert them to base 10.

Yeah, if each word's value has to be totally unique, then you pretty much have to do this, where each letter has a value between 0 and 25: letter1 + (letter2 * 26) + (letter3 * 26 ** 2) + (letter4 * 26 ** 3)...

I seem to recall that someone was working on a BigInteger library at one time. If you can find the person, that might be useful to you.


In response to Gughunter
Gughunter wrote:
Yeah, if each word's value has to be totally unique, then
you pretty much have to do this, where each letter has a
value between 0 and 25: letter1 + (letter2 * 26) +
(letter3 * 26 ** 2) + (letter4 * 26 ** 3)...

That should work, but he'll still get into trouble if a one word's total matches another. For a really simple example, if we are using: a-z = 0-25

"HAM" would be: 7+0+12 = 19
"CANE" would be: 2+0+13+4 = 19

This is the 'collision' that Abyss was talking about. You need to deal with these possibilities in order to have a unique word-set. That's why I mentioned either incrementing the total by one or adding a random number to the total - you just have to keep checking against a known set of already created values (the 'dictionary') to make sure you don't have another collision...

Gug's idea is definitely an improvement! Assuming you mean '**' to be the same as 'raise to the power of':

Using "HAM" in the sequence: 7+(0*26)+(12*26**2) = 8119
Using "CANE" in the sequence: 2+(0*26)+(13*26**2)+(4*26**3) = 79094

It would be much harder to find matches (unless you used a lot 'a' and 'b' letters) with this method, but still possible as the dictionary grows...

The only problem I see with this method is that the number could take up more space than the original word, especially in the upper 25% of the alphabet.
In response to digitalmouse
Let's make it simpler.

Let each letter be represented by a two number string from 01 to 26. String them all together, and append a decimal point to the front, and call it a number:

"Ham" = .080113
"Cane" = .03011405

Want integers? Append a 1 instead:

"Ham" = 1,080,113
"Cane" = 103,011,405

Any problems here? Aside from extraordinary length? :)
In response to Skysaw
Skysaw wrote:
Any problems here?

None really, except that the numbers are gonna get pretty big with bigger words:

catacomb = .0200190002141201 or 10,200,190,002,141,201

You will use up more space than the original word! Certainly would not make searching any faster...

A true Word->number hashing algorithm (like a one-pass encryption or digest algorithm) would be the way to go I think. Or a variant on Gughunter's exponent example, but you need to keep the length of the number down or it won't be worth storing such huge numbers - no advantage over using the original word...
In response to digitalmouse
digitalmouse wrote:
Skysaw wrote:
Any problems here?

None really, except that the numbers are gonna get pretty big with bigger words:

catacomb = .0200190002141201 or 10,200,190,002,141,201

You will use up more space than the original word! Certainly would not make searching any faster...

A true Word->number hashing algorithm (like a one-pass encryption or digest algorithm) would be the way to go I think. Or a variant on Gughunter's exponent example, but you need to keep the length of the number down or it won't be worth storing such huge numbers - no advantage over using the original word...

I guess I never really heard an explanation of what all this was for, really. Not knowing the application, I don't see a way for me to tell what is an "advantage." Was searching a requirement?
In response to Skysaw
I don't think the size of the number matters much (unless it hits 1.#INF, that might cause problems). Basically I just need a new number to stick in the rand_seed() so that it will always return the same numbers based on the word you entered. This is for my language system, so Theodis should be familiar with it.
In response to Skysaw
The first post in this thread explains it all I think...