Calculate Unique Value for any English Word

May 31, 2012

We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...

May 31, 2012

Google hashing.

May 31, 2012

(IMO) I can't see the number that represents the word being anything other than a code. For example you couldn't add two numbers together to produce a number which represents another meaningful word.

If just producing a code would be sufficient, then it is quite simple.

Let the first 2 digits (with value 01 - 52) represent the first character (use 1 - 26 for lowercase letters and 27 - 52 for capitals).

Repeat for subsequent characters using 2 digits per character.

Example:- School would be represented by 450308151512.
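The two-digit scheme described above can be sketched in a few lines of Python. The function names `encode_two_digit` and `decode_two_digit` are made up for this illustration; only the digit mapping comes from the post.

```python
import string


def encode_two_digit(word: str) -> str:
    """Two decimal digits per letter: a-z -> 01-26, A-Z -> 27-52."""
    digits = []
    for ch in word:
        if ch in string.ascii_lowercase:
            value = ord(ch) - ord("a") + 1
        elif ch in string.ascii_uppercase:
            value = ord(ch) - ord("A") + 27
        else:
            raise ValueError(f"unsupported character: {ch!r}")
        digits.append(f"{value:02d}")
    return "".join(digits)


def decode_two_digit(code: str) -> str:
    """Reverse the mapping by reading the code two digits at a time."""
    letters = []
    for i in range(0, len(code), 2):
        value = int(code[i:i + 2])
        if 1 <= value <= 26:
            letters.append(chr(ord("a") + value - 1))
        elif 27 <= value <= 52:
            letters.append(chr(ord("A") + value - 27))
        else:
            raise ValueError(f"invalid pair: {code[i:i + 2]!r}")
    return "".join(letters)


print(encode_two_digit("School"))        # 450308151512
print(decode_two_digit("450308151512"))  # School
```

Because every letter takes exactly two digits, the code can be split back into letters without separators.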

I think you may well be looking for something much deeper than this.

Edited May 31, 2012 by Joatmon

May 31, 2012

We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...

ASCII, encoding characters since 1960...

May 31, 2012

ASCII, encoding characters since 1960...

The same thought came to me when eating my dinner! If using decimal numbers and using the suggestion given in #3, then 3 digits would be needed per character, only 2 if hex was used. Of course using ASCII would enable punctuation (and even spaces if required).
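As a rough illustration of the digit counts mentioned here, this Python sketch writes each ASCII code as three decimal digits or two hex digits. The helper names `ascii_decimal` and `ascii_hex` are invented for the example.

```python
def ascii_decimal(text: str) -> str:
    """Each ASCII code (0-127) written as three decimal digits."""
    return "".join(f"{byte:03d}" for byte in text.encode("ascii"))


def ascii_hex(text: str) -> str:
    """Each ASCII code written as two hexadecimal digits."""
    return text.encode("ascii").hex()


print(ascii_decimal("School"))  # 083099104111111108
print(ascii_hex("School"))      # 5363686f6f6c
```

Spaces and punctuation work the same way, since they have ASCII codes too.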

Edited May 31, 2012 by Joatmon

May 31, 2012

Is there some reason why normal character encoding such as ASCII or UTF-8 is not appropriate? Why are you encoding the words in the first place? What are the requirements and all that?

May 31, 2012

Use base 27 or 54 (or other) for the most efficient use of numbers, with the first letter of the word being the least significant "digit". So, in their order, the letters encode as multiples of 27⁰, 27¹, 27², 27³, etc. That is, "a..." = 1, "b..." = 2, etc, "y..." = 25, and "z..." = 26; and ".a.." = 27, ".b.." = 54, etc; "..a." = 729, "..b." = 1458, etc.
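A minimal Python sketch of the base-27 idea, assuming lowercase a-z only, with a = 1 through z = 26 and the first letter as the least significant digit. The names `encode_base27` and `decode_base27` are hypothetical.

```python
BASE = 27  # digits 1..26 are the letters; digit 0 is never used


def encode_base27(word: str) -> int:
    """First letter is the least significant base-27 digit."""
    total = 0
    for position, ch in enumerate(word.lower()):
        digit = ord(ch) - ord("a") + 1
        if not 1 <= digit <= 26:
            raise ValueError(f"unsupported character: {ch!r}")
        total += digit * BASE ** position
    return total


def decode_base27(value: int) -> str:
    """Peel off base-27 digits, least significant (first letter) first."""
    letters = []
    while value > 0:
        value, digit = divmod(value, BASE)
        if digit == 0:
            raise ValueError("digit 0 does not correspond to a letter")
        letters.append(chr(ord("a") + digit - 1))
    return "".join(letters)


print(encode_base27("school"))   # 180459676
print(decode_base27(180459676))  # school
```

Under this scheme any word of up to six letters stays below 2^32, while a seven-letter word can exceed it, so longer words need a wider integer.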

May 31, 2012

Example:- School would be represented by 450308151512.

That's a bit wasteful. Depending on who's counting and what is being counted, there are somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.

The correct answer is Klaynos'. Google the term "hashing".
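For what it's worth, here is a small Python sketch of both ideas in this post. `hash32` uses the standard library's `zlib.crc32`, which returns an unsigned 32-bit value; a hash like this is compact but is not guaranteed to be unique, since two different words can collide. `build_index` is a hypothetical helper that simply numbers the entries of a known word list, which is unique by construction but only covers words already in the list.

```python
import zlib


def hash32(word: str) -> int:
    """32-bit CRC of the word. Fast and compact, but collisions are possible."""
    return zlib.crc32(word.encode("utf-8"))


def build_index(words):
    """Assign each distinct word in a known list its own small integer."""
    return {word: number for number, word in enumerate(sorted(set(words)))}


print(hash32("school") < 2**32)  # True: the value fits in a 32-bit word

index = build_index(["school", "skool", "academy"])
print(index["school"])  # 1
```

Which of the two fits depends on whether the set of words is known in advance.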

May 31, 2012

ASCII is the new kid on the block.

http://en.wikipedia.org/wiki/Morse_code

Since about 1840

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters
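Taken literally, the digit scheme suggested here (1 for a dot, 2 for a dash, 3 between letters) could look like this in Python. The function name `morse_digits` is invented for the example; the table is standard International Morse for the letters A to Z.

```python
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}


def morse_digits(word: str) -> str:
    """1 for a dot, 2 for a dash, 3 as the separator between letters."""
    letters = []
    for ch in word.lower():  # Morse has no upper/lower case distinction
        pattern = MORSE[ch]
        letters.append(pattern.replace(".", "1").replace("-", "2"))
    return "3".join(letters)


print(morse_digits("school"))  # 11132121311113222322231211
```

The separator digit is what keeps the code unambiguous, because Morse letters have different lengths.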

May 31, 2012

ASCII is the new kid on the block.

http://en.wikipedia....wiki/Morse_code

Since about 1840

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters

Lol.

And yeah, if all you're trying to do is hash some strings the real question would be what language are you programming in?

May 31, 2012

ASCII is the new kid on the block.

http://en.wikipedia.org/wiki/Morse_code

Since about 1840

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters

Yes, but... Morse does not distinguish between upper and lower case, so the person John and the john we use in the restroom are the same in Morse.

May 31, 2012

There are lots of ways to do this.

What is the actual goal?

Some methods might be better than others, depending on the purpose.

May 31, 2012

That's a bit wasteful. Depending on who's counting and what is being counted, there are somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.

The correct answer is Klaynos'. Google the term "hashing".

Completely agree regarding efficiency. I was just looking to suggest a simple algorithm that would work. Because the OP says he is looking for a "value", I'm not even sure whether he is looking for something deeper than a code (but what else the value of the unique number would be used for is beyond me).

Edited May 31, 2012 by Joatmon

June 1, 2012

Author

Thank you all for your responses. Our goal is to create a numeric value for any word, so that we can compare the calculated values of two words to get the nearness of those words. We are thinking of using something like bitwise operations to do the comparison. We will surely not go into technologies like phonetic or fuzzy word matching, due to their slowness.

June 1, 2012
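One possible reading of "bitwise comparison", sketched in Python purely as an assumption about what is meant: give each word a 26-bit mask with one bit per letter it contains, then compare masks with AND and OR. The names `letter_mask` and `mask_similarity` are hypothetical, and this is not presented as the poster's actual method.

```python
def letter_mask(word: str) -> int:
    """26-bit signature: bit 0 is set if the word contains 'a', bit 1 for 'b', ..."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def popcount(value: int) -> int:
    """Number of set bits."""
    return bin(value).count("1")


def mask_similarity(first: str, second: str) -> float:
    """Shared letters divided by all letters used (Jaccard index on the masks)."""
    a, b = letter_mask(first), letter_mask(second)
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


print(mask_similarity("school", "Skool"))  # 0.5
```

Such a mask is cheap to compare, but it ignores letter order and repeats, so anagrams score 1.0 and the value says nothing about how the words sound.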

Can you define "nearness"? As it stands, that doesn't really mean much.

June 2, 2012

Author

Nearness means something like:

The word "school" and the word "Skool" might be 70% near or alike.
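One way to make a figure like "70% near" concrete is a string similarity ratio. The sketch below uses Python's standard `difflib.SequenceMatcher`, which is a general-purpose matcher and only one of many possible definitions of nearness; the wrapper name `nearness` is made up for this example.

```python
from difflib import SequenceMatcher


def nearness(first: str, second: str) -> float:
    """Similarity in [0, 1]: twice the matched characters over the total length."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


# "school" and "skool" share "s" and "ool": 2 * 4 / (6 + 5) = 8/11
print(round(nearness("school", "Skool"), 2))  # 0.73
```

This compares spelling only; it knows nothing about pronunciation.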

June 2, 2012

Nearness means something like:

The word "school" and the word "Skool" might be 70% near or alike.

I would consider "skool" a misspelling, rather than a comparative word.

For many words, you must also consider usage. Using your example:

Main Entry: school

Part of Speech: noun

Definition: place, system for educating

Synonyms: academy, alma mater, blackboard, college, department, discipline, establishment, faculty, hall, halls of ivy, institute, institution, schoolhouse, seminary, university

Part of Speech: noun

Definition: body of philosophy on subject

Synonyms: belief, creed, faith, outlook, persuasion, school of thought, stamp, way, way of life

Part of Speech: verb

Definition: teach

Synonyms: advance, coach, control, cultivate, direct, discipline, drill, educate, guide, indoctrinate, inform, instruct, lead, manage, prepare, prime, show, train, tutor, verse

Source: Thesaurus.com

June 2, 2012

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded meta data into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So:

[math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math]

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.
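A rough Python sketch of the packed layout shown above, assuming five 8-bit fields in the order symbol, left, right, mouth shape, other, with the symbol in the most significant byte. The field widths, the byte order, and the names `pack_letter` and `unpack_letter` are assumptions for illustration; what the values in each field should mean is left open, as in the post.

```python
FIELDS = ("symbol", "left", "right", "mouth_shape", "other")


def pack_letter(symbol: int, left: int, right: int, mouth_shape: int, other: int) -> int:
    """Pack five 8-bit fields into one 40-bit integer, symbol first."""
    packed = 0
    for value in (symbol, left, right, mouth_shape, other):
        if not 0 <= value < 256:
            raise ValueError(f"field value out of range: {value}")
        packed = (packed << 8) | value
    return packed


def unpack_letter(packed: int) -> dict:
    """Split the 40-bit integer back into its named 8-bit fields."""
    values = [(packed >> shift) & 0xFF for shift in (32, 24, 16, 8, 0)]
    return dict(zip(FIELDS, values))


code = pack_letter(0b01010101, 0b01010101, 0b01010101, 0b01010101, 0b01010101)
print(hex(code))                    # 0x5555555555
print(unpack_letter(code)["left"])  # 85
```

With fixed-width fields, masks and shifts can compare one aspect (say, mouth shape) across letters without decoding the rest.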

Edited June 2, 2012 by Xittenn

June 2, 2012

Author

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded meta data into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So:

[math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math]

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.

Thank you very much Xittenn for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. Anyway, we need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification - example: CH is equivalent to K phonetically in some cultures while it is equivalent to SH in other cultures).

We have so far decided to use binary values (base 2) for storage of codes and calculations. Our tests are so far acceptable, but we still think we need more. Appreciate further brainstorm or ideas.

June 2, 2012

I'm not sure this is really possible.

Skool and School are similar to us as humans because we have been trained to know that sk and sch make a similar sound. You would need to encode that into a computer.

There are other problems, because then you'll come across words like which and witch: they are identical in pronunciation but very different in spelling. Compare read and read, which are identical in spelling but change sound depending on context.

You either need to encode pretty much EVERYTHING or you need to invent artificial intelligence and teach it English.
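To show what "encoding that into a computer" might look like on a very small scale, here is a Python sketch with a hand-written rule table that maps a few spellings to a common form. The three rules and the name `sound_key` are hypothetical and far from complete; real English has many exceptions that rules like these get wrong.

```python
# Ordered spelling -> sound rewrites. Illustrative only, not a real phonetic model.
RULES = [
    ("sch", "sk"),  # school -> skool
    ("tch", "ch"),  # witch  -> wich
    ("wh", "w"),    # which  -> wich
]


def sound_key(word: str) -> str:
    """Rewrite known spellings so that similar-sounding words share a key."""
    key = word.lower()
    for spelling, sound in RULES:
        key = key.replace(spelling, sound)
    return key


print(sound_key("School") == sound_key("skool"))  # True
print(sound_key("which") == sound_key("witch"))   # True
```

The read/read case still cannot be handled this way, because the spelling alone does not say which pronunciation is meant.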

June 2, 2012

If you wish to compare similar sounding words, you just need to use an online dictionary that has a full pronunciation guide.

which

(UK) enPR: hwĭch, IPA: /ʍɪʧ/, X-SAMPA: /WItS/

witch

enPR: wĭch, IPA: /wɪtʃ/, X-SAMPA: /wItS/

If you were able to get database access to the pronunciation portion of Wiktionary's definitions, you would notice (as Klaynos showed) that very differently spelled words can have very similar pronunciations. Even these two near homophones have slight differences: the initial sound of the w is slightly different; the "which" sound is slightly more breathy and soft compared to a harder, firmer "witch".

There are multiple notations that lexicographers and etymologists use to denote how words are spoken aloud. Three different forms are shown for each of the words above.

June 2, 2012

If you are wanting to match similar sounding words I wonder if the software associated with turning the spoken word into the written word could be adapted? I am thinking of the software that produces subtitles on live programs. It certainly often prints similar sounding words by mistake. (Often with hilarious results!)

Edited June 2, 2012 by Joatmon

June 2, 2012

Thank you very much Xittenn for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. Anyway, we need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification - example: CH is equivalent to K phonetically in some cultures while it is equivalent to SH in other cultures).

If you are just looking for homonyms, it sounds like you really don't need to encode the characters of a word but its phonemes. Words with multiple pronunciations will probably need to have multiple entries in your database, one for each pronunciation. You may need to encode both the characters and the phonemes, though, depending on what you are trying to achieve.
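A small Python sketch of such a database, with one entry per pronunciation. The table `PRONUNCIATIONS` is a hand-typed, illustrative sample in X-SAMPA-style notation, not data pulled from any real dictionary, and the function names are hypothetical.

```python
from collections import defaultdict

# Illustrative sample only: word -> list of pronunciations (X-SAMPA style).
PRONUNCIATIONS = {
    "which": ["WItS", "wItS"],
    "witch": ["wItS"],
    "read": ["ri:d", "rEd"],
    "reed": ["ri:d"],
    "red": ["rEd"],
}


def build_sound_index(pronunciations):
    """Invert the table: pronunciation -> set of words spoken that way."""
    index = defaultdict(set)
    for word, sounds in pronunciations.items():
        for sound in sounds:
            index[sound].add(word)
    return index


def homophones(word, pronunciations, index):
    """All other words sharing at least one pronunciation with `word`."""
    matches = set()
    for sound in pronunciations.get(word, []):
        matches |= index[sound]
    matches.discard(word)
    return matches


index = build_sound_index(PRONUNCIATIONS)
print(sorted(homophones("read", PRONUNCIATIONS, index)))  # ['red', 'reed']
```

Keeping a list per word is what lets "read" match both "reed" and "red" without guessing the context.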
