Camille Thomas Barkho Posted May 31, 2012 We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...
Joatmon Posted May 31, 2012 (edited) (IMO) I can't see the number that represents the word being anything other than a code. For example, you couldn't add two numbers together to produce a number that represents another meaningful word. If just producing a code would be sufficient, then it is quite simple. Let the first 2 digits (with value 01 - 52) represent the first character (use 1 - 26 for lowercase letters and 27 - 52 for capitals). Repeat for subsequent characters, using 2 digits per character. Example: School would be represented by 450308151512. I think you may well be looking for something much deeper than this. Edited May 31, 2012 by Joatmon
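A minimal Python sketch of the two-digit scheme described in the post above. The names `encode_word` and `decode_word` are made up for illustration, and the sketch assumes the word contains only the letters a-z and A-Z. The code is kept as a string so that leading zeros (for example "01" for "a") are not lost.

```python
import string

# 1-26 for lowercase letters, 27-52 for capitals, as proposed in the post.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase


def encode_word(word: str) -> str:
    """Return the word as a string of two-digit codes, one per letter."""
    codes = []
    for ch in word:
        index = ALPHABET.find(ch)
        if index < 0:
            raise ValueError(f"unsupported character: {ch!r}")
        codes.append(f"{index + 1:02d}")
    return "".join(codes)


def decode_word(code: str) -> str:
    """Invert encode_word by reading the code two digits at a time."""
    if len(code) % 2:
        raise ValueError("code length must be even")
    letters = []
    for i in range(0, len(code), 2):
        value = int(code[i:i + 2])
        if not 1 <= value <= len(ALPHABET):
            raise ValueError(f"invalid code: {code[i:i + 2]!r}")
        letters.append(ALPHABET[value - 1])
    return "".join(letters)


print(encode_word("School"))  # 450308151512
```

The code grows by two digits per letter, so it is reversible but not compact.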
doG Posted May 31, 2012 We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc... ASCII, encoding characters since 1960...
Joatmon Posted May 31, 2012 (edited) ASCII, encoding characters since 1960... The same thought came to me when eating my dinner! If decimal numbers were used with the suggestion given in #3, then 3 digits would be needed per character, but only 2 if hex were used. Of course, using ASCII would also allow punctuation (and even spaces if required). Edited May 31, 2012 by Joatmon
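As a rough illustration of the digit counts mentioned above, here is a short Python sketch that writes a word as ASCII codes, once with three decimal digits per character and once with two hex digits. The function names are hypothetical, and the sketch assumes the input is pure ASCII.

```python
def ascii_decimal(word: str) -> str:
    """Three decimal digits per character (ASCII codes run from 0 to 127)."""
    return "".join(f"{byte:03d}" for byte in word.encode("ascii"))


def ascii_hex(word: str) -> str:
    """Two hexadecimal digits per character."""
    return word.encode("ascii").hex()


print(ascii_decimal("School"))  # 083099104111111108
print(ascii_hex("School"))      # 5363686f6f6c
```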
the asinine cretin Posted May 31, 2012 Is there some reason why normal character encoding such as ASCII or UTF-8 is not appropriate? Why are you encoding the words in the first place? What are the requirements and all that?
ewmon Posted May 31, 2012 Use base 27 or 54 (or other) for the most efficient use of numbers, with the first letter of the word being the least significant "digit". So, in their order, the letters are weighted by 27⁰, 27¹, 27², 27³, etc. That is, "a..." = 1, "b..." = 2, etc., "y..." = 25, and "z..." = 26; ".a.." = 27, ".b.." = 54, etc.; "..a." = 729, "..b." = 1458, etc.
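The base-27 idea above can be sketched in Python as follows. This assumes lowercase a-z only (the base-54 variant for mixed case is not shown), and the function names are invented for the example. Because the digit 0 is never used, two different words cannot produce the same number.

```python
def encode_base27(word: str) -> int:
    """Encode a lowercase word; the first letter is the least significant digit."""
    value = 0
    for position, ch in enumerate(word):
        if not "a" <= ch <= "z":
            raise ValueError(f"unsupported character: {ch!r}")
        digit = ord(ch) - ord("a") + 1  # a=1 ... z=26; 0 is never used
        value += digit * 27 ** position
    return value


def decode_base27(value: int) -> str:
    """Recover the word by peeling off base-27 digits, least significant first."""
    letters = []
    while value > 0:
        value, digit = divmod(value, 27)
        if digit == 0:
            raise ValueError("not a valid word code")
        letters.append(chr(digit + ord("a") - 1))
    return "".join(letters)


print(encode_base27("a"), encode_base27("z"), encode_base27("ab"))  # 1 26 55
```

Since 27⁶ = 387,420,489 is below 2³², any word of up to six letters fits in a 32-bit value under this scheme; longer words need more bits.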
D H Posted May 31, 2012 Example:- School would be represented by 450308151512. That's a bit wasteful. There are, depending on who's counting and what is being counted, somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word. The correct answer is Klaynos'. Google the term "hashing".
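Two hedged Python sketches of what a 32-bit value per word could look like. Neither is taken from the thread: `word_hash32` uses CRC-32 from the standard library as a stand-in for "hashing" (it is a checksum, and different words can collide, so it is not strictly unique), while `build_index` simply numbers a known word list, which is unique but needs the full list up front. Both function names are made up.

```python
import zlib


def word_hash32(word: str) -> int:
    """Map a word to an unsigned 32-bit value; collisions are possible."""
    return zlib.crc32(word.encode("utf-8"))


def build_index(words):
    """Give every distinct word a small sequential number (unique by construction)."""
    return {word: number for number, word in enumerate(sorted(set(words)))}


index = build_index(["school", "skool", "school"])
print(index)                       # {'school': 0, 'skool': 1}
print(word_hash32("school") < 2 ** 32)  # True
```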
John Cuthber Posted May 31, 2012 ASCII is the new kid on the block. http://en.wikipedia.org/wiki/Morse_code Since about 1840. You need to account for the times, so it needs to be a bit more complicated. For example: 1 for a dot; 2 for a dash; 3 between letters.
the asinine cretin Posted May 31, 2012 ASCII is the new kid on the block. http://en.wikipedia....wiki/Morse_code Since about 1840. You need to account for the times, so it needs to be a bit more complicated. For example: 1 for a dot; 2 for a dash; 3 between letters. Lol. And yeah, if all you're trying to do is hash some strings, the real question would be: what language are you programming in?
doG Posted May 31, 2012 ASCII is the new kid on the block. http://en.wikipedia.org/wiki/Morse_code Since about 1840. You need to account for the times, so it needs to be a bit more complicated. For example: 1 for a dot; 2 for a dash; 3 between letters. Yes, but... Morse does not distinguish between upper and lower case, so the person John and the john we use in the restroom are the same in Morse.
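For what it is worth, the tongue-in-cheek Morse scheme (1 for a dot, 2 for a dash, 3 between letters) can be sketched in Python like this. The table covers only the 26 letters, `morse_number` is an invented name, and the result is returned as a string of digits. As noted above, case is lost.

```python
# International Morse code for the 26 letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}

# 1 for a dot, 2 for a dash.
DIGITS = str.maketrans({".": "1", "-": "2"})


def morse_number(word: str) -> str:
    """Encode a word letter by letter, with 3 as the separator between letters."""
    letters = [MORSE[ch].translate(DIGITS) for ch in word.lower()]
    return "3".join(letters)


print(morse_number("sos"))  # 11132223111
print(morse_number("John") == morse_number("john"))  # True
```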
John Cuthber Posted May 31, 2012 There are lots of ways to do this. What is the actual goal? Some methods might be better than others, depending on the purpose.
Joatmon Posted May 31, 2012 (edited) That's a bit wasteful. There are, depending on who's counting and what is being counted, somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word. The correct answer is Klaynos'. Google the term "hashing". Completely agree regarding efficiency. I was just looking to suggest a simple algorithm that would work. Because the OP says he is looking for a "value", I'm not even sure whether he is looking for something deeper than a code (but what else the value of the unique number would be used for is beyond me). Edited May 31, 2012 by Joatmon
Camille Thomas Barkho Posted June 1, 2012 Author Thank you all for your responses. Our goal is to create a numeric value for any word, so that we can compare the calculated values of two words to get their nearness. We are thinking of using something like bitwise operations to do the comparison. We will certainly not use techniques like phonetic or fuzzy word matching, because they are slow.
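One way a bitwise comparison might look, as a Python sketch under stated assumptions: each word is reduced to a 26-bit mask with one bit per letter present, and two masks are compared with AND and OR (a Jaccard-style ratio). This is only a guess at what "bitwise operations" could mean here; the names `letter_mask` and `mask_similarity` are invented, and the measure ignores letter order and repeats, so anagrams score as identical.

```python
def letter_mask(word: str) -> int:
    """Set bit 0 for 'a', bit 1 for 'b', and so on, for every letter present."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def mask_similarity(first: str, second: str) -> float:
    """Shared letters divided by all letters used, computed with AND and OR."""
    a, b = letter_mask(first), letter_mask(second)
    union = a | b
    if union == 0:
        return 1.0  # neither word contains a letter
    return bin(a & b).count("1") / bin(union).count("1")


print(mask_similarity("school", "Skool"))   # 0.5
print(mask_similarity("listen", "silent"))  # 1.0
```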
Klaynos Posted June 1, 2012 Can you define "nearness"? As it stands, that doesn't really mean much.
Camille Thomas Barkho Posted June 2, 2012 Author Nearness means something like: The word "school" and the word "Skool" might be 70% near or alike.
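To make a figure like "70% near" concrete, here is a Python sketch of one common definition: one minus the edit (Levenshtein) distance divided by the length of the longer word. This is the kind of fuzzy measure the original poster says they want to avoid for speed reasons, so treat it only as a reference point; the function names and the choice to ignore case are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def nearness(a: str, b: str) -> float:
    """1.0 for identical words (ignoring case), lower as more edits are needed."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest


print(edit_distance("school", "skool"))        # 2
print(round(nearness("school", "Skool"), 2))   # 0.67
```

With this definition "school" and "Skool" come out at about 67%, since two edits (c to k, drop the h) are needed over six letters.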
Pantaz Posted June 2, 2012 Nearness means something like: The word "school" and the word "Skool" might be 70% near or alike. I would consider "skool" a misspelling, rather than a comparative word. For many words, you must also consider usage. Using your example:
Main Entry: school
Part of Speech: noun
Definition: place, system for educating
Synonyms: academy, alma mater, blackboard, college, department, discipline, establishment, faculty, hall, halls of ivy, institute, institution, schoolhouse, seminary, university
Part of Speech: noun
Definition: body of philosophy on subject
Synonyms: belief, creed, faith, outlook, persuasion, school of thought, stamp, way, way of life
Part of Speech: verb
Definition: teach
Synonyms: advance, coach, control, cultivate, direct, discipline, drill, educate, guide, indoctrinate, inform, instruct, lead, manage, prepare, prime, show, train, tutor, verse
- Source: Thesaurus.com
Xittenn Posted June 2, 2012 (edited) Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So: [math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math] I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions. Edited June 2, 2012 by Xittenn
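The five-byte layout in the post above could be packed into a single integer roughly as follows. This Python sketch assumes each field is exactly 8 bits and that "symbol" is the most significant byte; what the values in each field mean is left open in the post, so the example values are placeholders, and `pack` and `unpack` are hypothetical names.

```python
# Field order, most significant byte first, following the layout in the post.
FIELDS = ("symbol", "left", "right", "mouth_shape", "other")


def pack(**values: int) -> int:
    """Pack up to five 8-bit fields into one 40-bit integer (missing fields are 0)."""
    unknown = set(values) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    packed = 0
    for name in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
        packed = (packed << 8) | value
    return packed


def unpack(packed: int) -> dict:
    """Split a 40-bit integer back into its five named fields."""
    result = {}
    for name in reversed(FIELDS):
        result[name] = packed & 0xFF
        packed >>= 8
    return result


code = pack(symbol=ord("s"), right=ord("c"), mouth_shape=3)  # placeholder values
print(unpack(code)["symbol"] == ord("s"))  # True
```

Individual fields can then be compared with masks and shifts without decoding the whole value.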
Camille Thomas Barkho Posted June 2, 2012 Author Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So: [math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math] I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions. Thank you very much, Xittenn, for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We need to come up with an algorithm that combines characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures while it is equivalent to SH in others). We have so far decided to use binary values (base 2) for storage of codes and calculations. Our test results are acceptable so far, but we still think we need more. We would appreciate further brainstorming or ideas.
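A very small Python sketch of the "phonetic families" idea with culture-dependent rules. The rule sets here contain a single hand-written substitution each and are purely illustrative; real classification would need far more rules. The names `RULES` and `phonetic_key` are invented for the example.

```python
# Each rule set maps letter groups to a representative of their phonetic family.
RULES = {
    "ch_as_k": [("ch", "k")],    # cultures where CH sounds like K
    "ch_as_sh": [("ch", "sh")],  # cultures where CH sounds like SH
}


def phonetic_key(word: str, rule_set: str) -> str:
    """Rewrite a word using one rule set so similar-sounding spellings collide."""
    key = word.lower()
    for pattern, replacement in RULES[rule_set]:
        key = key.replace(pattern, replacement)
    return key


print(phonetic_key("school", "ch_as_k"))   # skool
print(phonetic_key("school", "ch_as_sh"))  # sshool
```

Under the first rule set "school" and "Skool" reduce to the same key, while under the second they do not, which shows why the rule set has to be chosen per culture before any numeric code is assigned.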
Klaynos Posted June 2, 2012 I'm not sure this is really possible. Skool and School are similar to us as humans because we have been trained to know that sk and sch make a similar sound. You would need to encode that into a computer. There are other problems, because then you'll come across words like which and witch, which are identical in pronunciation but very different in spelling, as compared to read and read, which are identical in spelling but change sound depending on context. You either need to encode pretty much EVERYTHING or you need to invent artificial intelligence and teach it English.
imatfaal Posted June 2, 2012 If you wish to compare similar-sounding words, you just need to use an online dictionary that has a full pronunciation guide:
which (UK) enPR: hwĭch, IPA: /ʍɪʧ/, X-SAMPA: /WItS/
witch enPR: wĭch, IPA: /wɪtʃ/, X-SAMPA: /wItS/
If you were able to get database access to the pronunciation portion of Wiktionary's definitions, you would notice (as Klaynos showed) that very differently spelled words have very similar pronunciations. Even these two near homophones have slight differences: the initial sound of the w is slightly different; the which sound is slightly more breathy and soft compared to a harder, firmer witch. There are multiple methods that lexicographers and etymologists use to denote methods of speaking aloud. Three different forms are shown for each of the words above.
Joatmon Posted June 2, 2012 (edited) If you are wanting to match similar sounding words I wonder if the software associated with turning the spoken word into the written word could be adapted? I am thinking of the software that produces subtitles on live programs. It certainly often prints similar sounding words by mistake. (Often with hilarious results!) Edited June 2, 2012 by Joatmon
doG Posted June 2, 2012 Thank you very much, Xittenn, for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We need to come up with an algorithm that combines characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures while it is equivalent to SH in others). If you are just looking for homonyms, it sounds like you really don't need to encode the characters of a word but its phonemes. Words with multiple pronunciations will probably need to have multiple entries in your database, one for each pronunciation. You may need to encode both the characters and the phonemes, though, depending on what you are trying to achieve.
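A sketch of the "one entry per pronunciation" idea in Python. The small table below is typed by hand in X-SAMPA-like notation purely for illustration and is not taken from any dictionary database; the second entry for "which" assumes an accent in which it sounds the same as "witch". The names `PRONUNCIATIONS`, `sound_alike` and `homophones` are made up.

```python
# Illustrative data only: word -> list of pronunciations, one entry each.
PRONUNCIATIONS = {
    "which": ["WItS", "wItS"],
    "witch": ["wItS"],
    "read": ["ri:d", "rEd"],  # same spelling, two pronunciations
    "reed": ["ri:d"],
    "red": ["rEd"],
}


def sound_alike(first: str, second: str) -> bool:
    """True if the two words share at least one pronunciation."""
    return bool(set(PRONUNCIATIONS[first]) & set(PRONUNCIATIONS[second]))


def homophones(word: str) -> list:
    """All other words in the table that share a pronunciation with this one."""
    return sorted(other for other in PRONUNCIATIONS
                  if other != word and sound_alike(word, other))


print(homophones("read"))             # ['red', 'reed']
print(sound_alike("which", "witch"))  # True
```

Comparing phoneme strings rather than spellings handles both the which/witch case and the read/read case, at the cost of needing pronunciation data for every word.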