fredreload Posted December 18, 2017

Alright, so here is something useful. After playing with n-grams for text analysis, I came up with an idea for AI. How does an n-gram work? For instance, I can type: "I drink a Coca Cola in my room today." Run a 2-gram over it: "I drink" has no meaning, "drink a" has no meaning, "a Coca" has no meaning, but "Coca Cola" has meaning. And it goes on with all possible in-order combinations: "I Coca", "I Cola", "I in", etc. Collecting all web pages and texts, you'll probably sum up the frequency for "Coca Cola" to something like 100, meaning "Coca Cola" carries meaning, along with all the words associated with it. So when you type "I drink a Coca Cola", you would get a list back from the database saying "in the room", "in the zoo", "is cold", "is hot", ranked by frequency. And the ones with the highest frequencies usually make sense. And there you have a huge database at your command.

I used the algorithm provided here, and ran it on the text dump from Wikipedia: pure text, parsed into sentences, named "AI.txt". It generated a "B.txt" file, which I tried to shove into the database. Then I found out that most Wikipedia information only shows up once, which is not great for frequency learning. So I've decided to scrape all web pages on the net by looping over IP addresses. If you know how to do that, do help me out. Btw, the output generated from a 3MB text file is 3.9 GB, and that's only on 3 words. So.........be prepared for a huge database.

If you have no idea what this is, just ignore it. Otherwise I'd like to hear your feedback and what you've got, thanks.

P.S. No, this is not a homework assignment, just something I made in my free time.

AI2.py STEXT.py
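To make the counting idea concrete, here is a minimal sketch (my own illustration for this post, not the attached AI2.py): it tallies every in-order 2-word combination within each sentence, as in the "I Coca", "I Cola" example.

from itertools import combinations
from collections import Counter
import re

def pair_frequencies(text):
    counts = Counter()
    # Naive sentence split on ., !, ?; real text needs a proper tokenizer.
    for sentence in re.split(r"[.!?]", text):
        words = sentence.split()
        # Every 2-word combination within the sentence, keeping word order.
        counts.update(combinations(words, 2))
    return counts

freqs = pair_frequencies("I drink a Coca Cola in my room today.")
print(freqs[("Coca", "Cola")])  # 1 here; summed over a whole corpus it grows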
Strange Posted December 19, 2017

21 hours ago, fredreload said:
Btw, the output generated from a 3MB text file is 3.9 GB, and that's only on 3 words

3.9 GB sounds extraordinarily large. I assume by "3 words" you mean 3-grams. How many of these are there in the 3MB file?
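For scale: if every in-order combination within a sentence is kept, as described in the opening post, rather than just contiguous trigrams, the count grows roughly cubically with sentence length, which goes some way toward explaining the blow-up. A quick back-of-envelope check:

# Number of 3-word combinations per sentence if all in-order
# subsets are kept, not just contiguous trigrams (needs Python 3.8+).
from math import comb

for n in (10, 20, 40):
    print(n, "words ->", comb(n, 3), "combinations")
# 10 words ->  120 combinations
# 20 words -> 1140 combinations
# 40 words -> 9880 combinations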
Sensei Posted December 19, 2017 (edited)

Fred, you should check your implementation of the algorithm with literally 2 or 3 words, to be sure no errors were introduced while writing it. Then on 4, then on 5, then further. Manually check the output in a text editor. Your implementation looks slightly different from the one in post #2 of the link that you provided.

ps. You should not hardcode the path to the file; take it from the command line as an argument instead. If it's not specified, fall back to a hardcoded path (if you have to).

Edited December 19, 2017 by Sensei
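For example, a minimal version of that command-line fallback in Python:

# Take the input path from the command line, falling back to a
# hardcoded default only when no argument is given.
import sys

DEFAULT_PATH = "AI.txt"  # fallback, as in the original script

path = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_PATH
with open(path, encoding="utf-8") as f:
    text = f.read()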
fredreload (Author) Posted December 19, 2017

5 hours ago, Strange said:
3.9 GB sounds extraordinarily large. I assume by "3 words" you mean 3-grams. How many of these are there in the 3MB file?

Ya, I am attempting to analyze all sentences based on their frequencies. But what would that get me? For instance, if I search "machine", it would return "a tool" with frequency 100, "smart apparatus" with frequency 99, etc. And that only works if "machine" is actually in the sentence, not "it", not "this", or "that". After thinking about this, I've sort of given up on the whole idea. We might need an n-gram over the entire paragraph or article about "machine". That might work: an n-gram over entire articles of the same class, categorized only by titles. But I don't think any computer can run a 100-gram or more.

So how do you program an AI? Well, you have to train it. Each time it answers something correctly, give it a +1, sort of like a fitness score on words. And with words you can mess around with its thought pattern (binary tree? neural network?). Anyway, this is as far as I got. You'll have to ask Watson from IBM on this one.
Strange Posted December 19, 2017

You might be better off looking at neural networks. Simple chatbots use n-grams, and they are impressively unintelligent!
fredreload (Author) Posted December 19, 2017 (edited)

7 minutes ago, Strange said:
You might be better off looking at neural networks. Simple chatbots use n-grams, and they are impressively unintelligent!

Agreed, but now I am wondering how to generate a neural network based on text mining and create thought patterns: https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6

Forgive me, this is a new field for me, and I'm tired today = =

Edited December 19, 2017 by fredreload
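A bare-bones sketch of the bag-of-words network idea from that article (my own simplified toy version, not the article's code; the sentences, labels, and layer sizes are made up for illustration):

# Sentences -> word-count vectors -> one hidden layer -> class score.
import numpy as np

sentences = ["pizza is good", "pizza is delicious",
             "the machine is a tool", "a machine is an apparatus"]
labels    = [0, 0, 1, 1]  # 0 = food, 1 = machine (toy classes)

vocab = sorted({w for s in sentences for w in s.split()})

def bow(sentence):
    # Bag-of-words vector: count of each vocab word in the sentence.
    words = sentence.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.array([bow(s) for s in sentences])
y = np.array(labels, dtype=float).reshape(-1, 1)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(len(vocab), 8))  # input -> hidden
W2 = rng.normal(scale=0.5, size=(8, 1))           # hidden -> output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):                 # plain full-batch gradient descent
    h = sigmoid(X @ W1)               # hidden layer activations
    out = sigmoid(h @ W2)             # predicted class probability
    grad_out = (out - y) * out * (1 - out)
    grad_h = grad_out @ W2.T * h * (1 - h)
    W2 -= 0.5 * h.T @ grad_out
    W1 -= 0.5 * X.T @ grad_h

print(sigmoid(sigmoid(bow("pizza is good") @ W1) @ W2))  # ~0 => food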
EdEarl Posted December 19, 2017 (edited)

Language is not static; it changes as it is used. New words are invented and new n-grams occur, so your database must be updated as you process language. Some of your n-grams will also consist of grammar errors, for example "There big" instead of "They're big". Wikipedia is a large information base, but its style is limited; your n-gram database would be richer if you included other kinds of text, such as literature and poetry. As Strange said, n-gram systems are unintelligent, and neural networks such as those from OpenAI.com are pretty good. I believe OpenAI publishes their software so anyone may use it, although neural nets are notorious for being computationally intensive.

Edited December 19, 2017 by EdEarl (clarity)
fredreload (Author) Posted December 21, 2017 (edited)

On 2017/12/19 at 9:26 PM, Strange said:
You might be better off looking at neural networks. Simple chatbots use n-grams, and they are impressively unintelligent!

Hey Strange, I've decided that I'll start with a dictionary. Do you happen to know a dictionary with a list of words in Excel, text, a database, or anything else that I can download? Thanks.

P.S. It should specify all forms: noun, verb, adjective, etc.

P.S. And it shouldn't loop back, like defining "I" as a synonym for "myself". I want a description, an entity which suggests the subject is yourself, or something like that.

Edited December 21, 2017 by fredreload
Sensei Posted December 21, 2017 (edited)

8 minutes ago, fredreload said:
I've decided that I'll start with a dictionary. Do you happen to know a dictionary with a list of words in Excel, text, a database, or anything else that I can download?

Google for "scrabble english dictionary download" or similar.

Edited December 21, 2017 by Sensei
Strange Posted December 21, 2017

There is this: http://wordnet.princeton.edu There may be others...
fredreload (Author) Posted December 21, 2017 (edited)

1 hour ago, Strange said:
There is this: http://wordnet.princeton.edu There may be others...

Looks good, exactly what I want, but I dunno how to use its database. Is there a version in plain text? I might have to import it into my Oracle database later using Python. Each entry goes word: play, type: verb, meaning: to do something fun. Keep in mind word and type cannot be null, so there is always a word and always one type that goes with it.

P.S. I would map every word to a number based on its type and give it an index. For instance, "type (noun) = 1", "type (verb) = 2". And its explanation is a solid rule that incorporates other words, also as numbers. "type (noun): explains a variety for an object" would be a rule set by the dictionary, and I would further look into "explain (verb) = 3", "a (compound word? I dunno) = 4" and map them to their subsequent dictionary rules, e.g. "explain (verb): tries to come up with some meaning for the object". These numbers could then be used to analyze a paragraph.

Example: "I (try to come up with some meaning for the object) (on a variety of objects)." Machine interpretation: "I explain type."

In reverse, you could issue a command: "Machine, please explain type." Machine interpretation: "I (try to come up with some meaning for the object) (on a variety of objects)," if you train the machine with the article.

Edited December 21, 2017 by fredreload
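To make the numbering scheme concrete, a rough sketch (all the entries and glosses here are made-up placeholders, not real dictionary data): each (word, type) pair gets an integer ID, and a definition is stored as a list of word/type pairs so it can be expanded recursively.

entries = {}      # (word, part_of_speech) -> id
definitions = {}  # id -> list of (word, part_of_speech) in the gloss

def add_entry(word, pos, gloss):
    entry_id = len(entries) + 1
    entries[(word, pos)] = entry_id
    definitions[entry_id] = gloss
    return entry_id

add_entry("type", "noun",
          [("a", "det"), ("variety", "noun"), ("of", "prep"), ("object", "noun")])
add_entry("explain", "verb",
          [("come", "verb"), ("up", "prep"), ("with", "prep"), ("meaning", "noun")])

def expand(word, pos, depth=1):
    # Replace a word with its stored gloss, 'depth' levels deep.
    entry_id = entries.get((word, pos))
    if entry_id is None or depth == 0:
        return [(word, pos)]
    expanded = []
    for w, p in definitions[entry_id]:
        expanded.extend(expand(w, p, depth - 1))
    return expanded

print(expand("type", "noun"))  # -> the gloss, one level deep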
Strange Posted December 21, 2017

26 minutes ago, fredreload said:
but I dunno how to use its database. Is there a version in plain text?

Dunno. But they have various formats and tools available on the download pages. Maybe Google will help you find some info on using it.
Strange Posted December 21, 2017

A good starting point, and test, for coding would be to generate a syntax graph from input sentences. For example:
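(A text-only stand-in for such a graph; a minimal sketch assuming spaCy and its small English model, en_core_web_sm, are installed.)

# Print a dependency ("syntax") graph for a sentence.
# Setup assumed: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I drink a Coca Cola in my room today.")

for token in doc:
    # word <--relation-- head word
    print(f"{token.text:<6} <--{token.dep_}-- {token.head.text}")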
Sensei Posted December 21, 2017 (edited)

Quote
Is there a version in plain text?

If you take my advice above and google for "scrabble english dictionary download" + "raw text", you get this: https://raw.githubusercontent.com/jonbcard/scrabble-bot/master/src/dictionary.txt

Then you could run it in a loop in a script and open a website (using a dozen proxy servers) like http://www.wordreference.com/definition/[word] or http://www.dictionary.com/browse/[word], where [word] is a single row from dictionary.txt. Parse the received page to find whether the word is a noun, verb, etc.

Edited December 21, 2017 by Sensei
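A rough sketch of that loop (the part-of-speech check is a naive placeholder; the real page markup would need inspecting with a proper HTML parser, and the sites' terms of use and rate limits apply):

import time
import urllib.request

with open("dictionary.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

for word in words[:10]:                       # try a small sample first
    url = f"http://www.dictionary.com/browse/{word.lower()}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        continue                              # skip unreachable pages
    # Naive part-of-speech sniffing; a real parser (e.g. BeautifulSoup)
    # over the actual page structure would replace this.
    pos = [p for p in ("noun", "verb", "adjective") if p in html]
    print(word, pos)
    time.sleep(1)                             # be polite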
Strange Posted December 21, 2017 (edited)

26 minutes ago, Sensei said:
If you take my advice above and google for "scrabble english dictionary download" + "raw text", you get this:

Is that better than the standard "words" file that comes with Linux, etc.? Just checked: that file seems to have about 179,000 words. /usr/share/dict/words on my machine has about 236,000, and I think there are versions with more than 400,000.

Edited December 21, 2017 by Strange
fredreload (Author) Posted December 21, 2017

3 hours ago, Strange said:
A good starting point, and test, for coding would be to generate a syntax graph from input sentences. For example:

Hi Strange, precisely what I am looking for; that is why I am looking into spaCy for Python. I'm also studying English grammar with regard to pronouns. Now, I want to generate my own dictionary simply from reading Wikipedia dumps. To begin with, Wikipedia would give me a dictionary keyed by article titles, not by words. For instance, if the article is about "social anarchy", then the entire article is the key; but I want a dictionary entry just for the word "social" or "anarchy". So I would filter the meaning of a word based on pronouns and verbs such as (am, is, are).

For example: "I like to eat pizza. It is good." "It" is a pronoun for pizza, and based on "is", the pizza should be good by definition; then filter the results by frequency. While "pizza is good" is quite a broad and general definition, frequency-wise it should stand out: "Pizza is good", "Pizza is made of cheese", "Pizza is delicious", etc. Let me know what you think, Strange. If this works, I would use (am, is, are) to build my own dictionary with frequencies from Wikipedia dumps.
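A sketch of that (am, is, are) frequency idea on a toy input (the regex is deliberately crude; real text would need proper sentence parsing):

import re
from collections import Counter, defaultdict

# "X is Y" (or am/are), capturing the word and the definition phrase.
PATTERN = re.compile(r"\b(\w+)\s+(?:am|is|are)\s+([\w\s]+?)[.,;]")

definitions = defaultdict(Counter)

text = ("Pizza is good. Pizza is made of cheese. Pizza is delicious. "
        "Pizza is good, most people say.")

for word, gloss in PATTERN.findall(text):
    definitions[word.lower()][gloss.strip().lower()] += 1

# Highest-frequency gloss first, as suggested above.
for gloss, freq in definitions["pizza"].most_common():
    print(f"pizza is {gloss}  (frequency {freq})")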
Strange Posted December 21, 2017

30 minutes ago, fredreload said:
that is why I am looking into spaCy for Python

That looks really impressive. It should manage a lot of the language-processing things that I thought would be a problem for your project (parsing text and identifying parts of speech, etc.).

31 minutes ago, fredreload said:
For example: "I like to eat pizza. It is good." "It" is a pronoun for pizza

While it may be obvious to you what "it" is referring to, I think it is going to be trickier for a program. That would be a good test: write a program to identify the antecedent of each pronoun in the text. If you can do that with reasonable accuracy, I will be impressed. (Unless spaCy just does it for you.)
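As a naive baseline for that test (assuming spaCy and its small English model): guess the most recent preceding noun chunk for each pronoun. Real coreference resolution is much harder; this is just a starting point to beat.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like to eat pizza. It is good.")

# (end-token index, text) for every noun chunk in the document.
chunk_ends = [(chunk.end, chunk.text) for chunk in doc.noun_chunks]

for token in doc:
    if token.pos_ == "PRON" and token.text.lower() in ("it", "this", "that"):
        # Most recent noun chunk ending at or before this pronoun.
        candidates = [text for end, text in chunk_ends if end <= token.i]
        guess = candidates[-1] if candidates else "?"
        print(f"{token.text!r} -> {guess!r}")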
EdEarl Posted December 21, 2017 (edited)

Noam Chomsky, who developed the theory of transformational grammar, is considered "the father of modern linguistics." You may find help with your project in his papers. You may also run into difficulties such as "Are Estonia is a small borough in Are Parish, Estonia", which are linguistic parsing challenges.

Edited December 21, 2017 by EdEarl (it may be that "Are" is a noun)
Strange Posted December 21, 2017

1 hour ago, EdEarl said:
Noam Chomsky, who developed the theory of transformational grammar, is considered "the father of modern linguistics." You may find help with your project in his papers.

Except, of course, natural language doesn't actually work like that!

1 hour ago, EdEarl said:
linguistic parsing challenges

Time flies like an arrow, fruit flies like a banana.
EdEarl Posted December 21, 2017 (edited)

Quote (Wikipedia)
Transformational grammar (TG) or transformational-generative grammar (TGG) is, in the study of linguistics, part of the theory of generative grammar, especially of naturally evolved languages, that considers grammar to be a system of rules that generate exactly those combinations of words which form grammatical sentences in a given language.

The link on "naturally evolved languages" refers to natural languages, for example English. When Chomsky started in linguistics, circa 1960, scientists grossly underestimated the difficulty of processing natural languages.

Edited December 21, 2017 by EdEarl
Strange Posted December 21, 2017

Generative grammar is the "epicycles" of linguistics.
EdEarl Posted December 21, 2017

2 minutes ago, Strange said:
Generative grammar is the "epicycles" of linguistics.

I looked up epicycles, which are defined for astronomy and mathematics. Please explain.
Strange Posted December 21, 2017

5 minutes ago, EdEarl said:
I looked up epicycles, which are defined for astronomy and mathematics. Please explain.

Epicycles were additional cycles added to try to model the movements of the planets, based on the assumptions that they orbited the Earth and only moved in circles. As more accurate observations were made, more and more epicycles were added. Generative theory has become increasingly complex in the same way, as new special cases (kludges) are added every time a language or feature is found that it can't explain.
EdEarl Posted December 21, 2017

Whereas epicycles have been demonstrated to be incorrect in astronomy, Chomsky's Universal Grammar has not been demonstrated incorrect; some argue that it is not universal, but it cannot be falsified. I think we should consider your position as possibly correct, but there is no way to know.
Strange Posted December 21, 2017

18 minutes ago, EdEarl said:
but it cannot be falsified

That would mean it wasn't science! I am fairly sure that other models of language will turn out to be more successful and that, in a generation or so, Chomsky's work will be of only historical interest. Unfortunately, I won't be around to find out!