Posted (edited)

(IMO) I can't see the number that represents the word being anything other than a code. For example you couldn't add two numbers together to produce a number which represents another meaningful word.

If just producing a code would be sufficient, then it is quite simple.

Let the first 2 digits (with value 01 - 52) represent the first character (use 1 - 26 for lowercase letters and 27 - 52 for capitals).

Repeat for subsequent characters using 2 digits per character.

Example:- School would be represented by 450308151512.
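A minimal Python sketch of this two-digits-per-letter code, assuming only the letters a-z and A-Z occur in the word. The function names are made up for illustration.

```python
import string

# a-z map to 01..26 and A-Z map to 27..52, as described in the post.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase


def encode_two_digit(word: str) -> str:
    """Return the word as a digit string, two decimal digits per letter."""
    # str.index raises ValueError for anything that is not a plain letter.
    return "".join(f"{ALPHABET.index(ch) + 1:02d}" for ch in word)


def decode_two_digit(code: str) -> str:
    """Invert encode_two_digit by reading the digits back in pairs."""
    pairs = (code[i:i + 2] for i in range(0, len(code), 2))
    return "".join(ALPHABET[int(pair) - 1] for pair in pairs)


print(encode_two_digit("School"))        # 450308151512
print(decode_two_digit("450308151512"))  # School
```

The result is kept as a string so that a leading zero (for example "01" for "a") is not lost.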

I think you may well be looking for something much deeper than this.

Edited by Joatmon
Posted

We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...

ASCII, encoding characters since 1960...

Posted (edited)

ASCII, encoding characters since 1960...

The same thought came to me when eating my dinner! If using decimal numbers and using the suggestion given in #3, then 3 digits would be needed per character, only 2 if hex was used. Of course using ASCII would enable punctuation (and even spaces if required).
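As a rough illustration of the digit counts mentioned here, the following Python sketch writes each ASCII code as 3 decimal digits or as 2 hex digits. The function names are assumptions for the example.

```python
def encode_ascii_decimal(word: str) -> str:
    """Three decimal digits per character (ASCII codes run from 0 to 127)."""
    return "".join(f"{byte:03d}" for byte in word.encode("ascii"))


def encode_ascii_hex(word: str) -> str:
    """Two hexadecimal digits per character."""
    return word.encode("ascii").hex()


print(encode_ascii_decimal("School"))  # 083099104111111108
print(encode_ascii_hex("School"))      # 5363686f6f6c
```

Spaces and punctuation work unchanged, since they also have ASCII codes.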

Edited by Joatmon
Posted

Use base 27 or 54 (or other) for the most efficient use of numbers, with the first letter of the word being the least significant "digit". So, in their order, the letters encode as 27⁰, 27¹, 27², 27³, etc. That is, "a..." = 1, "b..." = 2, etc., "y..." = 25, and "z..." = 26; and ".a.." = 27, ".b.." = 54, etc.; "..a." = 729, "..b." = 1458, etc.
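A small Python sketch of the base-27 variant, assuming the word is folded to lowercase and contains only a-z. The names `encode_base27` and `decode_base27` are hypothetical.

```python
def encode_base27(word: str) -> int:
    """Encode a word as a base-27 number, first letter least significant."""
    value = 0
    for position, ch in enumerate(word.lower()):
        digit = ord(ch) - ord("a") + 1  # a=1 ... z=26; digit 0 is never used
        if not 1 <= digit <= 26:
            raise ValueError(f"unsupported character: {ch!r}")
        value += digit * 27 ** position
    return value


def decode_base27(value: int) -> str:
    """Recover the lowercase word from its base-27 value."""
    letters = []
    while value > 0:
        value, digit = divmod(value, 27)
        letters.append(chr(digit + ord("a") - 1))
    return "".join(letters)


print(encode_base27("a"), encode_base27("z"))  # 1 26
print(encode_base27("ba"))                     # 29, i.e. 2 + 1*27
print(decode_base27(encode_base27("school")))  # school
```

Because the digit 0 is never used, words of different lengths cannot collide, and a six-letter word such as "school" stays below 2^32.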

Posted
Example:- School would be represented by 450308151512.

That's a bit wasteful. There are, depending on who's counting and what is being counted, somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.

 

The correct answer is Klaynos'. Google the term "hashing".
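Two hedged Python sketches of what fitting into a 32-bit word could look like. The first uses `zlib.crc32` as a stand-in 32-bit hash; it is only one possible choice, and a hash can map two different words to the same value. The second simply numbers the entries of a word list, which is unique by construction but needs the list to be known in advance. The function names are assumptions.

```python
import zlib


def word_hash32(word: str) -> int:
    """32-bit hash of a word. Collisions are possible, so not strictly unique."""
    return zlib.crc32(word.encode("utf-8"))


def number_words(words):
    """Assign each distinct word its index in the sorted word list."""
    return {word: index for index, word in enumerate(sorted(set(words)))}


print(word_hash32("school") < 2**32)                  # True
print(number_words(["school", "skool", "school"]))    # {'school': 0, 'skool': 1}
```

Neither value says anything about how alike two words are: similar words generally get unrelated numbers.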

Posted

ASCII is the new kid on the block.

http://en.wikipedia.org/wiki/Morse_code

Since about 1840

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters
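A short Python sketch of that digit scheme (1 for a dot, 2 for a dash, 3 between letters), assuming only the 26 letters are needed. The function name is made up for illustration.

```python
# International Morse code for the 26 letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}

# 1 stands for a dot and 2 for a dash.
TO_DIGITS = str.maketrans({".": "1", "-": "2"})


def encode_morse_digits(word: str) -> str:
    """Encode a word as Morse digits, with 3 separating the letters."""
    letters = (MORSE[ch].translate(TO_DIGITS) for ch in word.lower())
    return "3".join(letters)


print(encode_morse_digits("sos"))  # 11132223111
```

Since the word is lowercased first, "John" and "john" produce the same digits.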

Yes, but... Morse does not distinguish between upper and lower case, so the person John and the john we use in the restroom are the same in Morse :D

Posted (edited)

That's a bit wasteful. There are, depending on who's counting and what is being counted, somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.

 

The correct answer is Klaynos'. Google the term "hashing".

Completely agree regarding efficiency. I was just looking to suggest a simple algorithm that would work. Because the OP says he is looking for a "value", I'm not even sure whether he is looking for something deeper than a code (but what else the unique number would be used for is beyond me).

Edited by Joatmon
Posted

Thank you all for your responses. Our goal is to create a numeric value for any word, so that we can compare the calculated values of two words to get the nearness of those words. We are thinking of using something like bitwise operations to do the comparison. We will certainly not go into techniques like phonetic or fuzzy word matching, because they are slow.
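One way to picture a bitwise comparison, purely as an assumption about what is meant here: give each letter a-z one bit in a 26-bit mask, then measure the overlap of two masks with AND, OR and a bit count. The names `letter_mask` and `nearness` are hypothetical.

```python
def letter_mask(word: str) -> int:
    """26-bit mask with bit k set if the k-th letter of the alphabet occurs."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def nearness(a: str, b: str) -> float:
    """Shared letters divided by all letters used (Jaccard index on the masks)."""
    mask_a, mask_b = letter_mask(a), letter_mask(b)
    union = mask_a | mask_b
    if union == 0:
        return 1.0  # two words without any letters are treated as identical
    return bin(mask_a & mask_b).count("1") / bin(union).count("1")


print(nearness("school", "Skool"))   # 0.5
print(nearness("listen", "silent"))  # 1.0
```

This is fast, but it ignores letter order and repeated letters, so anagrams score 1.0 and the score says nothing about how the words sound.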

Posted

Nearness means something like:

 

The word "school" and the word "Skool" might be 70% near or alike.

I would consider "skool" a misspelling, rather than a comparative word.

 

For many words, you must also consider usage. Using your example:

 

Main Entry: school

 

Part of Speech: noun

Definition: place, system for educating

Synonyms: academy, alma mater, blackboard, college, department, discipline, establishment, faculty, hall, halls of ivy, institute, institution, schoolhouse, seminary, university

 

Part of Speech: noun

Definition: body of philosophy on subject

Synonyms: belief, creed, faith, outlook, persuasion, school of thought, stamp, way, way of life

 

Part of Speech: verb

Definition: teach

Synonyms: advance, coach, control, cultivate, direct, discipline, drill, educate, guide, indoctrinate, inform, instruct, lead, manage, prepare, prime, show, train, tutor, verse

 

- Source:

Posted (edited)

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So:

 

[math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math]

 

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.
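A minimal Python sketch of the five-field layout above, assuming each field is one byte (8 bits) so that a letter record takes 40 bits. The field names follow the labels in the diagram; what the values in each field mean is left open here, and the sample numbers are arbitrary.

```python
# Field order from most to least significant byte, as in the diagram.
FIELDS = ("symbol", "left", "right", "mouth_shape", "other")


def pack_letter(symbol: int, left: int, right: int,
                mouth_shape: int, other: int) -> int:
    """Pack five 8-bit fields into one 40-bit integer."""
    value = 0
    for field in (symbol, left, right, mouth_shape, other):
        if not 0 <= field <= 0xFF:
            raise ValueError("each field must fit in 8 bits")
        value = (value << 8) | field
    return value


def unpack_letter(value: int) -> dict:
    """Split a 40-bit integer back into its five named fields."""
    fields = {}
    for name in reversed(FIELDS):
        fields[name] = value & 0xFF
        value >>= 8
    return fields


packed = pack_letter(ord("c"), ord("s"), ord("h"), 3, 0)
print(hex(packed))
print(unpack_letter(packed)["symbol"] == ord("c"))  # True
```

With this layout, individual fields can be compared with masks and shifts without decoding the whole record.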

Edited by Xittenn
Posted

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So:

 

[math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math]

 

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.

 

Thank you very much Xittenn for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures while it is equivalent to SH in others).

 

We have so far decided to use binary values (base 2) for storage of codes and calculations. Our test results so far are acceptable, but we still think we need more. We would appreciate further brainstorming or ideas.
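A rough Python sketch of culture-dependent phonetic grouping, assuming a small rewrite table per culture. Both tables below are invented examples and far from complete; the only point is that the same spelling can map to different phonetic keys depending on which table is chosen.

```python
# Hypothetical rule tables: spelling pattern -> phonetic family.
RULES_CH_AS_K = {"sch": "sk", "ch": "k"}
RULES_CH_AS_SH = {"ch": "sh"}


def phonetic_key(word: str, rules: dict) -> str:
    """Rewrite a word left to right, trying the longest patterns first."""
    word = word.lower()
    patterns = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for pattern in patterns:
            if word.startswith(pattern, i):
                out.append(rules[pattern])
                i += len(pattern)
                break
        else:
            # No pattern matched at this position: keep the letter as it is.
            out.append(word[i])
            i += 1
    return "".join(out)


print(phonetic_key("School", RULES_CH_AS_K))  # skool
print(phonetic_key("Skool", RULES_CH_AS_K))   # skool
print(phonetic_key("chef", RULES_CH_AS_SH))   # shef
```

Once two words share a key, any of the numeric encodings discussed earlier can be applied to the key instead of the raw spelling.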

Posted

I'm not sure this is really possible.

 

Skool and School are similar to us as humans because we have been trained to know that sk and sch make a similar sound. You would need to encode that into a computer.

 

There are other problems, because then you'll come across words like which and witch: they are identical in pronunciation but very different in spelling. Compare read and read, which are identical in spelling but change sound depending on context.

 

You either need to encode pretty much EVERYTHING or you need to invent artificial intelligence and teach it English.

Posted

If you wish to compare similar-sounding words, you just need to use an online dictionary that has a full pronunciation guide.

 

which

witch

If you were able to get database access to the pronunciation portion of Wiktionary's definitions, you would notice (as Klaynos showed) that very differently spelled words can have very similar pronunciations. Even these two near homophones have slight differences: the initial sound of the w is slightly different; the which sound is slightly more breathy and soft compared to a harder, firmer witch.

 

There are multiple methods that lexicographers and etymologists use to denote how words are spoken aloud. Three different forms are shown for each of the words above.

Posted (edited)

If you want to match similar-sounding words, I wonder if the software that turns the spoken word into the written word could be adapted? I am thinking of the software that produces subtitles on live programs. It certainly often prints similar-sounding words by mistake. (Often with hilarious results!)

Edited by Joatmon
Posted

Thank you very much Xittenn for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures while it is equivalent to SH in others).

If you are just looking for homonyms, it sounds like you really don't need to encode the characters of a word but its phonemes. Words with multiple pronunciations will probably need to have multiple entries in your database, one for each pronunciation. You may need to encode both, though, the characters and the phonemes, depending on what you are trying to achieve.
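A small Python sketch of such a database, with one entry per pronunciation. The transcriptions are hand-entered, simplified ARPAbet-style symbols assumed for illustration only (they treat which and witch as identical); a real system would load them from a pronunciation dictionary.

```python
from collections import defaultdict

# Hypothetical sample data: word -> list of pronunciations (phoneme tuples).
PRONUNCIATIONS = {
    "which": [("W", "IH", "CH")],
    "witch": [("W", "IH", "CH")],
    "read": [("R", "IY", "D"), ("R", "EH", "D")],
    "reed": [("R", "IY", "D")],
    "red": [("R", "EH", "D")],
}


def build_index(pronunciations):
    """Reverse index: phoneme tuple -> set of words pronounced that way."""
    index = defaultdict(set)
    for word, variants in pronunciations.items():
        for phonemes in variants:
            index[phonemes].add(word)
    return index


def sound_alikes(word, pronunciations, index):
    """Words that share at least one pronunciation with the given word."""
    matches = set()
    for phonemes in pronunciations[word]:
        matches |= index[phonemes]
    matches.discard(word)
    return matches


INDEX = build_index(PRONUNCIATIONS)
print(sorted(sound_alikes("read", PRONUNCIATIONS, INDEX)))   # ['red', 'reed']
print(sorted(sound_alikes("which", PRONUNCIATIONS, INDEX)))  # ['witch']
```

Keeping the spelling as the key and the phonemes as the values covers both cases mentioned: the same spelling with two sounds, and two spellings with the same sound.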
