Posted

Alright, so here is something useful. After playing with n-grams for text analysis, I came up with an idea for AI.

How does an n-gram work? For instance, I can type:

"I drink a Coca Cola in my room today."

You run an n-gram with n = 2. "I drink" has no meaning. "drink a" has no meaning. "a Coca" has no meaning. "Coca Cola" has meaning. Then it continues with all the other possible combinations in ascending order: "I Coca", "I Cola", "I in", etc.

Collecting all web pages and texts, you would probably find the frequency of "Coca Cola" adds up to something like 100, meaning that "Coca Cola" has a meaning, and so do the words associated with it. Therefore, when you type "I drink a Coca Cola", you would get a list from the database saying "in the room", "in the zoo", "is cold", "is hot", ranked by frequency. The ones with the highest frequencies usually make sense. And there you have a huge database at your command.
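
For anyone following along, here is a minimal sketch of the frequency-counting idea, assuming the usual definition of an n-gram as a run of n adjacent words. Under that definition "I Coca" or "I Cola" would not be 2-grams, since the words are not next to each other (non-adjacent pairs are usually called skip-grams). The helper names `tokenize` and `ngrams` and the three-sentence corpus are made up for illustration; this is not the attached script.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mini corpus; a real run would read many documents.
corpus = [
    "I drink a Coca Cola in my room today.",
    "She bought a Coca Cola at the zoo.",
    "Coca Cola is cold.",
]

counts = Counter()
for sentence in corpus:
    counts.update(ngrams(tokenize(sentence), 2))

print(counts[("coca", "cola")])  # → 3
print(counts.most_common(3))
```

With enough text, pairs that really belong together should float to the top of `most_common`, which is roughly the effect described above.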

I used the algorithm provided here, and ran it on the text dump from Wikipedia. Pure text, parsed into sentences, saved as "AI.txt". It generated a "B.txt" file which I tried to shove into the database. Then I found out that most Wikipedia information only shows up once, so it is not the best source for frequency learning. So I've decided to scrape all web pages on the net, with a loop over IP addresses. If you know how to do that, do help me out. Btw the size after generated from a 3MB text file is 3.9GB, and only on 3 words. So... be prepared to have a huge database. If you have no idea what this is, just ignore it. Otherwise I'd like to hear your feedback and what you got, thanks.

P.S. No, this is not a homework assignment, just something I made in my free time.

AI2.py

STEXT.py

snake.png

Posted
21 hours ago, fredreload said:

Btw the size after generated from a 3MB text file is 3.9GB, and only on 3 words

3.9 GB sounds extraordinarily large. I assume by "3 words", you mean 3-grams. How many of these are there in the 3MB file?
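
One way to get at that number is simply to count. The sketch below is a rough illustration, not a diagnosis of the attached script: it counts contiguous 3-grams in a sample sentence and compares that with the number of ways to pick any 3 words in order of appearance. The function name `trigram_counts` is made up here. If a program emitted every such combination rather than only adjacent words, that might help explain a blow-up in output size, but that is only a guess.

```python
import math
import re
from collections import Counter

def trigram_counts(sentences):
    """Count contiguous 3-word sequences across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

sentences = ["I drink a Coca Cola in my room today"]
counts = trigram_counts(sentences)
k = 9  # number of words in the sample sentence

print(sum(counts.values()))  # contiguous 3-grams: k - 2 → 7
print(math.comb(k, 3))       # all 3-word combinations → 84
```

Contiguous 3-grams grow linearly with sentence length, while combinations grow roughly with its cube, so long sentences make a very large difference.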

Posted (edited)

Fred, you should check your implementation of the algorithm with literally 2 or 3 words, to be sure no errors were introduced while writing it. Then with 4, then with 5, and so on.

Manually check the output in a text editor.

Your implementation looks slightly different from post #2 in the link that you provided.

P.S. You should not hardcode the path to the file; instead, take it from the command line as an argument. If it's not specified, fall back to the hardcoded path (if you have to).
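
A small sketch of that suggestion using Python's standard `argparse` module. The default file name "AI.txt" is taken from the first post; the function name `parse_args` and the example file name "wiki_dump.txt" are made up for illustration.

```python
import argparse

DEFAULT_PATH = "AI.txt"  # hardcoded fallback; the name comes from the first post

def parse_args(argv=None):
    """Read the input path from the command line, falling back to DEFAULT_PATH."""
    parser = argparse.ArgumentParser(description="Build n-gram counts from a text file.")
    parser.add_argument("path", nargs="?", default=DEFAULT_PATH,
                        help="input text file (default: %(default)s)")
    return parser.parse_args(argv)

# An explicit list is passed here so the sketch runs anywhere;
# a real script would call parse_args() to read sys.argv.
print(parse_args([]).path)                 # → AI.txt
print(parse_args(["wiki_dump.txt"]).path)  # → wiki_dump.txt
```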

 

Edited by Sensei
Posted
5 hours ago, Strange said:

3.9 GB sounds extraordinarily large. I assume by "3 words", you mean 3-grams. How many of these are there in the 3MB file?

Ya, I am attempting to analyze all sentences based on their frequencies. But what would that get me? For instance, if I search "machine" it would return "a tool" with frequency "100", "smart apparatus" with frequency "99", etc. And that only works if "machine" is in the sentence, not "it", "this", or "that". After thinking about this, I just sort of gave up on the whole idea. We might need an n-gram on the entire paragraph or article about "machine". That might work: an n-gram for an entire article of the same class. But I don't think any computer can run a 100-gram or more, categorized only by titles.

So how do you program an AI? Well, you have to train it. Each time it answers something correctly, give it a +1, sort of like a fitness score on words. And with words you can mess around with its thought pattern (binary tree? neural network?). Anyway, this is as far as I got. You'll have to ask Watson from IBM on this one.

Posted (edited)
7 minutes ago, Strange said:

You might be better looking at neural networks. Simple chatbots use n-grams and they are impressively unintelligent!

Agreed, but now I am wondering how to generate a neural network based on text mining and create thought patterns.

https://machinelearnings.co/text-classification-using-neural-networks-f5cd7b8765c6

Forgive me, this is a new field for me, and I'm tired today = =

Edited by fredreload
Posted (edited)

Language is not static; in other words, languages change as they are used. New words are invented and new n-grams occur, and this growth means your database must be modified as you process language. Some of your n-grams will consist of grammar errors, for example "There big" instead of "They're big". Wikipedia is a large information base, but its style is limited. Your n-gram database would be larger if you included other kinds of text, such as literature and poetry.

As Strange said, n-gram systems are unintelligent, and neural networks such as those from OpenAI (OpenAI.com) are pretty good. I believe OpenAI publishes their software so anyone may use it, although neural nets are notorious for being computationally intensive.

Edited by EdEarl (reason: clarity)
Posted (edited)
On 2017/12/19 at 9:26 PM, Strange said:

You might be better looking at neural networks. Simple chatbots use n-grams and they are impressively unintelligent!

Hey Strange, I've decided that I'll start with a dictionary. Do you happen to know a dictionary with a list of words in Excel or text or database or anything that I can download? Thanks.

P.S. By the way, it should specify all forms: noun, verb, adjective, etc.

P.P.S. And it should not loop back on itself, like "I" being a synonym for "myself". I want a description, something like "an entity which suggests the subject is yourself".

Edited by fredreload
Posted (edited)
8 minutes ago, fredreload said:

I've decided that I'll start with a dictionary. Do you happen to know a dictionary with a list of words in excel or text or database or anything that I can download?

Google for "scrabble english dictionary download" or so..

 

Edited by Sensei
Posted (edited)
1 hour ago, Strange said:

There is this: http://wordnet.princeton.edu

There may be others...

Looks good, exactly what I want, but I dunno how to use its database. Is there one where it goes in text? I might have to import it into my Oracle database later using Python.

It goes word:play, type:verb, meaning:to do something fun. Keep in mind word and type cannot be null, so there is always a word and always one type that goes with it.
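
A rough sketch of that record layout, with the "cannot be null" rule enforced by the database. The post mentions Oracle; SQLite is swapped in here only because it ships with Python, and the table name `dictionary` plus the second sample row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
conn.execute("""
    CREATE TABLE dictionary (
        word    TEXT NOT NULL,   -- there is always a word
        type    TEXT NOT NULL,   -- and always a type that goes with it
        meaning TEXT
    )
""")

# The example entry from the post.
conn.execute("INSERT INTO dictionary VALUES (?, ?, ?)",
             ("play", "verb", "to do something fun"))

# A row without a type is rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO dictionary VALUES (?, ?, ?)",
                 ("play", None, "a stage show"))
except sqlite3.IntegrityError as err:
    print("rejected:", err)

rows = conn.execute("SELECT word, type, meaning FROM dictionary").fetchall()
print(rows)
```

Only the first row ends up in the table; the same `NOT NULL` idea carries over to other SQL databases.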

P.S. I would map every word to a number based on its type and keep an index for it. For instance, "type (noun) = 1", "type (verb) = 2". Its explanation is a solid rule that incorporates other words, also as numbers: "type (noun): explain a variety for an object". That would be a rule set by the dictionary as "explain a variety of object", and I would further look into "explain (verb) = 3", "a (compound word? I dunno) = 4" and map each to its subsequent dictionary rule, "explain (verb): tries to come up some meaning for the object". These numbers could then be used to analyze a paragraph.

example: I (try to come up some meaning for the object) (on a variety of objects).

machine interpretation: I explain type.

On reverse, you could issue a command: Machine please explain type

machine interpretation: I (try to come up some meaning for the object) (on a variety of objects).

If you train the machine with the article.

Edited by fredreload
Posted
26 minutes ago, fredreload said:

but I dunno how to use its database. Is there one where it goes in text?

Dunno. But they have various formats and tools available on the download pages. Maybe google will help you find some info on using it.

Posted (edited)
Quote

Is there one where it goes in text?

If you would take my above advice, google for "scrabble english dictionary download"  + "raw text", you would get this:

https://raw.githubusercontent.com/jonbcard/scrabble-bot/master/src/dictionary.txt

 

Then you could loop over it in a script, and open a website (using a dozen proxy servers) like

http://www.wordreference.com/definition/[word]

or

http://www.dictionary.com/browse/[word]

where [word] is a single row from dictionary.txt

Parse the received page to find whether it's a noun, verb, etc.

 

Edited by Sensei
Posted (edited)
26 minutes ago, Sensei said:

If you would take my above advice, google for "scrabble english dictionary download"  + "raw text", you would get this:

Is that better than the standard "words" file that comes with Linux, etc?

Just checked: that file seems to have about 179,000 words. /usr/share/dict/words on my machine has about 236,000 and I think there are versions with more than 400,000.
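
If anyone wants to repeat that comparison, here is a minimal sketch, assuming both files hold one word per line. The function names `load_words` and `compare` are made up; words are lowercased before counting, so the totals can come out a little lower than the raw line counts.

```python
from pathlib import Path

def load_words(path):
    """Read one word per line into a lowercase set, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8", errors="ignore").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def compare(path_a, path_b):
    """Return the size of each word list and how many words they share."""
    a, b = load_words(path_a), load_words(path_b)
    return {"a": len(a), "b": len(b), "shared": len(a & b)}
```

Usage would be something like `compare("dictionary.txt", "/usr/share/dict/words")`, with the paths adjusted to wherever the files live on your machine.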

Edited by Strange
Posted
3 hours ago, Strange said:

A good starting point, and test, for coding would be to generate a syntax graph from input sentences. For example:

[Image: DRjy7uuX4AA4Dmu.jpg (example syntax graph)]

Hi Strange, that is precisely what I am looking for, that is why I am looking into Spacy for Python. I am also studying English grammar with regard to pronouns.

Now, I want to generate my own dictionary simply from reading Wikipedia dumps. To begin with, I would get a dictionary for the words, but not for the title of the article. For instance, if the article is about "social anarchy", then the entire article is the key. But if I want just a dictionary entry for the word "social" or "anarchy", I would filter the meaning of the word based on pronouns and verbs such as (am, is, are)?

For example: I like to eat pizza. It is good.

"It" is a pronoun for pizza, and based on "is", the pizza should be good by definition; then filter the results by frequency. While "pizza is good" is quite a broad and general definition, frequency-wise it should stand out? Pizza is good. Pizza is made of cheese. Pizza is delicious, etc.

Let me know what you think, Strange. If this is the case, I would use (am, is, are) to build my own dictionary with frequencies from Wikipedia dumps.
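
To make the idea concrete, here is a rough sketch using a plain regular expression rather than a real parser. The pattern, the function name `copula_definitions`, and the sample text are all made up for illustration. Note that a sentence like "It is good" would be filed under "it", not "pizza", unless the pronoun is resolved first.

```python
import re
from collections import Counter, defaultdict

# One word, then am/is/are, then everything up to the end of the sentence.
PATTERN = re.compile(r"\b(\w+) (?:am|is|are) ([^.!?]+)", re.IGNORECASE)

def copula_definitions(text):
    """Map each subject word to a Counter of the descriptions that follow am/is/are."""
    defs = defaultdict(Counter)
    for subject, description in PATTERN.findall(text):
        defs[subject.lower()][description.strip().lower()] += 1
    return defs

text = ("Pizza is good. Pizza is made of cheese. Pizza is delicious. "
        "I like pizza. Pizza is good.")
defs = copula_definitions(text)

print(defs["pizza"].most_common(1))  # → [('good', 2)]
```

On a large dump, the most frequent descriptions per word would play the role of the "definition" described above.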

 

Posted
30 minutes ago, fredreload said:

that is why I am looking into Spacy for Python

That looks really impressive. It should manage a lot of the language processing things that I thought would be a problem for your project (parsing text and identifying parts of speech, etc).

31 minutes ago, fredreload said:

For example: I like to eat pizza. It is good.

It is a pronoun for pizza

While it may be obvious to you what "it" is referring to, I think it is going to be trickier for a program. That would be a good test: write a program to identify the antecedent for each pronoun in the text. If you can do that with reasonable accuracy, I will be impressed. (Unless Spacy just does it for you :))
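
As a baseline for that test, here is a deliberately naive sketch: pair each pronoun with the most recent noun seen before it. It assumes the nouns are already known (passed in as `known_nouns`, a made-up parameter), which a real program would have to get from a part-of-speech tagger. The second example is the classic trophy and suitcase sentence, where this heuristic picks the wrong word.

```python
import re

PRONOUNS = {"it", "they", "them"}

def naive_antecedents(text, known_nouns):
    """Pair each pronoun with the most recent preceding word from known_nouns."""
    last_noun = None
    pairs = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in known_nouns:
            last_noun = word
        elif word in PRONOUNS:
            pairs.append((word, last_noun))
    return pairs

print(naive_antecedents("I like to eat pizza. It is good.", {"pizza"}))
# → [('it', 'pizza')]

print(naive_antecedents(
    "The trophy did not fit in the suitcase because it was too big.",
    {"trophy", "suitcase"}))
# → [('it', 'suitcase')], although a human reads "it" as the trophy
```

Getting the second kind of sentence right needs knowledge about the world, not just word order, which is why this makes a good test.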

Posted (edited)
Quote

Wikipedia

Transformational grammar (TG) or transformational-generative grammar (TGG) is, in the study of linguistics, part of the theory of generative grammar, especially of naturally evolved languages, that considers grammar to be a system of rules that generate exactly those combinations of words which form grammatical sentences in a given language.

The link to "naturally evolved languages" refers to Natural languages, for example English.

When he started linguistics, scientists grossly underestimated the difficulty of processing natural languages, circa 1960.

Edited by EdEarl
Posted
2 minutes ago, Strange said:

Generative grammar is the "epicycles" of linguistics. 

I looked up epicycles, which is defined for astronomy and mathematics. Please explain.

Posted
5 minutes ago, EdEarl said:

I looked up epicycles, which is defined for astronomy and mathematics. Please explain.

Epicycles were additional cycles that were added to try and model the movements of planets, based on the assumptions that they orbited the Earth and only moved in circles. As more accurate observations were made, more and more epicycles were added.

Generative theory has become increasingly complex as new special cases (kludges) are added every time a new language or feature is found that it can't explain.

Posted

Whereas epicycles have been demonstrated to be incorrect in astronomy, Chomsky's Universal Grammar has not been demonstrated to be incorrect. Some argue that it is not universal, but it cannot be falsified.

I think we should consider your position as possibly correct, but there is no way to know.

Posted
18 minutes ago, EdEarl said:

but it cannot be falsified

That would mean it wasn't science! :)

I am fairly sure that other models of language will turn out to be more successful and that, in a generation or so, Chomsky's work will be of only historical interest. Unfortunately, I won't be around to find out!
