fredreload Posted February 24, 2021 Description: I wrote this AI after leaving my previous company, Pou Chen, where I did a little web scraping with word-frequency testing; that work led to this AI. Below are the videos; the Python scripts are shown in the videos, along with a link to Crunchyroll, posted by me, explaining the content. You can run the scripts and test them out to see what I can improve on. Yes, the scripts are crude with no comments, but I spent 6 months (non-continuous) working on them, so before you say they don't make sense or don't work, please take some time to get used to them, and I will try to answer questions here. You can also test the scripts with a different dictionary once you understand the program. Part 1: Part 2:
Ghideon Posted February 24, 2021 1 hour ago, fredreload said: the scripts are crude with no comments Ok. 1 hour ago, fredreload said: see what I can improve on Suggestion 1: Add comments and documentation to your scripts. Note that Python 2.7 has reached end of life*; migrate to 3.x. *) https://www.python.org/doc/sunset-python-2/
fredreload Posted February 25, 2021 Author 17 hours ago, Ghideon said: Suggestion 1: Add comments and documentation to your scripts. That is a good idea; I will add them in my free time. Quote Note that Python 2.7 has reached end of life*; migrate to 3.x. My scripts technically also run on 64-bit Python 3.x, with maybe some changes to the print statements.
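For reference, the typical change when moving such scripts to 3.x is the print statement becoming a function:

```python
# Python 2 print statement (a SyntaxError under Python 3):
#     print "hello"
# Print function, valid in both 2.7 and 3.x:
print("hello")
```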
fredreload Posted March 1, 2021 Author I modified the last part of the script; it is now more responsive and complete.
fredreload Posted March 1, 2021 Author This Python script is too slow. I think I need to rewrite the program in C#; do you think that would give it a performance boost, Ghideon?
Ghideon Posted March 1, 2021 15 minutes ago, fredreload said: This Python script is too slow. I think I need to rewrite the program in C#; do you think that would give it a performance boost, Ghideon? The information provided is too limited; I will not try to make a prediction. In my experience, performance is the result of many parameters; switching implementation languages may or may not give the required performance boost.
fredreload Posted March 1, 2021 Author 2 hours ago, Ghideon said: The information provided is too limited; I will not try to make a prediction. In my experience, performance is the result of many parameters; switching implementation languages may or may not give the required performance boost. I agree. I am unable to find a suitable disk-based dictionary for C#; I am using Python's shelve module for this program. So I might have to use SQLite if I want to move it to a C# platform, but SQLite is still pretty slow.
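For context, shelve is the standard-library module that exposes a persistent, dict-like object backed by a dbm file on disk. A minimal usage sketch (the filename and key are illustrative, not taken from the actual scripts):

```python
import shelve

# Open (or create) a disk-backed dictionary; keys are strings, values are pickled.
with shelve.open("phrase_freq") as db:
    key = "the cat sat"
    db[key] = db.get(key, 0) + 1  # read-modify-write of a frequency counter
```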
Ghideon Posted March 2, 2021 21 hours ago, fredreload said: I agree. I am unable to find a suitable disk-based dictionary for C#; I am using Python's shelve module for this program. So I might have to use SQLite if I want to move it to a C# platform, but SQLite is still pretty slow. The information provided so far is not enough to comment on what kind of performance issue you are facing, and I have no opinion on whether specific products are suitable or not.
fredreload Posted March 3, 2021 Author 20 hours ago, Ghideon said: The information provided so far is not enough to comment on what kind of performance issue you are facing, and I have no opinion on whether specific products are suitable or not. As you can see, my shelve db files are a few GB in size. Each dictionary entry consists of a phrase and a frequency, and I constantly update the frequency. So for SQLite I would have roughly 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever. By upsert I mean "if the row exists, update it; otherwise insert it". I have never actually tried SQLite's native upsert, so I don't know whether it is faster or slower; a sketch of it is below.
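A minimal sketch of a native SQLite upsert issued from Python; it assumes SQLite 3.24 or newer for the ON CONFLICT clause, and the phrases table and its column names are illustrative, not taken from the actual scripts:

```python
import sqlite3

conn = sqlite3.connect("phrases.db")  # filename is illustrative
conn.execute(
    "CREATE TABLE IF NOT EXISTS phrases ("
    "  phrase TEXT PRIMARY KEY,"
    "  freq   INTEGER NOT NULL DEFAULT 0)"
)

def bump_frequencies(conn, phrases):
    # One statement per phrase: insert with freq=1, or increment on conflict.
    # Batching with executemany inside a single transaction avoids paying one
    # disk sync per statement, which is what usually makes billions of
    # individual updates take forever.
    conn.executemany(
        "INSERT INTO phrases (phrase, freq) VALUES (?, 1) "
        "ON CONFLICT(phrase) DO UPDATE SET freq = freq + 1",
        ((p,) for p in phrases),
    )
    conn.commit()

bump_frequencies(conn, ["the cat sat", "the cat sat", "on the mat"])
```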
Prof Reza Sanaye Posted March 3, 2021 Posted March 3, 2021 The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . . . . .Background (echo) vibration is thus the distinguishing motor . . . . . .Because memory signals ought to go on a differing time-scale vibration . . . .
Ghideon Posted March 3, 2021 3 minutes ago, Prof Reza Sanaye said: The ongoing flow of information makes memory distinguishers impossible the more we fracture time into Nano- and Femto-seconds . ... . . . . .Background (echo) vibration is thus the distinguishing motor . . . . . .Because memory signals ought to go on a differing time-scale vibration . . . . And how is that related to the discussion in this thread?
Prof Reza Sanaye Posted March 3, 2021 Posted March 3, 2021 3 minutes ago, Ghideon said: And how is that related to the discussion in this tread? By turning the temporal Script (fractals) all but continual . . . . .
Ghideon Posted March 3, 2021 2 minutes ago, Prof Reza Sanaye said: By turning the temporal Script (fractals) all but continual . . . . . Your text looks like the output of some algorithm, perhaps based on Markov chains, that generates random sentences. 4 hours ago, fredreload said: As you can see, my shelve db files are a few GB in size. Each dictionary entry consists of a phrase and a frequency, and I constantly update the frequency. So for SQLite I would have roughly 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever. This seems to require more time and effort than I am willing to provide at this time.
Sensei Posted March 4, 2021 On 3/3/2021 at 6:05 PM, fredreload said: As you can see, my shelve db files are a few GB in size. Do you have an SSD? Do you have NVMe? What data-transfer rate do you see during db access? How many GB of memory does your computer have? Try using a virtual disk in memory (a RAM disk) to see whether the speed changes. Quote Each dictionary entry consists of a phrase and a frequency, and I constantly update the frequency. So for SQLite I would have roughly 1 billion upsert statements. I tried running 1 billion inserts, which is fast, but updating the frequencies takes forever. How are you storing, querying and updating the db? Show the SQL query strings for all of them. You can try: - Calculate an MD5 (or similar) hash of the phrase text first; that becomes its hash code. - Phrase table: use that hash as a unique key together with the phrase text. - Frequency table: use the same hash code in a second table, with the quantities/frequencies as integers. Updates should then be faster, since they won't require adding or replacing an entire string. Alternatively, don't store phrases as plain text at all. Keep a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. Then make phrase tables keyed on those indices: one with two columns for word indexes, a second with three columns, and so on; you can add more in the future. A sketch of that layout follows.
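A minimal sketch of the word-index layout described above, in SQLite from Python. All table and column names (words, phrases3, w1..w3, freq) are illustrative, and ON CONFLICT assumes SQLite 3.24 or newer:

```python
import sqlite3

conn = sqlite3.connect("phrases_indexed.db")  # filename is illustrative
conn.executescript("""
    CREATE TABLE IF NOT EXISTS words (
        id   INTEGER PRIMARY KEY,
        word TEXT UNIQUE NOT NULL
    );
    -- One table per phrase length; this one is for 3-word phrases.
    CREATE TABLE IF NOT EXISTS phrases3 (
        w1 INTEGER NOT NULL,
        w2 INTEGER NOT NULL,
        w3 INTEGER NOT NULL,
        freq INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (w1, w2, w3)
    ) WITHOUT ROWID;
""")

def word_id(conn, word):
    # Insert the word if it is new, then return its integer index.
    conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
    return conn.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]

def bump_phrase(conn, w1, w2, w3):
    # The row is keyed on three small integers, so an update never has to
    # add or replace a long phrase string.
    conn.execute(
        "INSERT INTO phrases3 (w1, w2, w3, freq) VALUES (?, ?, ?, 1) "
        "ON CONFLICT(w1, w2, w3) DO UPDATE SET freq = freq + 1",
        (w1, w2, w3),
    )

ids = [word_id(conn, w) for w in ("the", "cat", "sat")]
bump_phrase(conn, *ids)
conn.commit()
```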
fredreload Posted March 5, 2021 Author 20 hours ago, Sensei said: You can try: - Calculate an MD5 (or similar) hash of the phrase text first; that becomes its hash code. - Phrase table: use that hash as a unique key together with the phrase text. - Frequency table: use the same hash code in a second table, with the quantities/frequencies as integers. Alternatively, don't store phrases as plain text at all. Keep a dictionary of words with unique indices; a 4-byte integer is enough for 4.2 billion words. That is a really good idea. First I would index all the words with a list of unique IDs. But the problem I am facing is that SQL is not a dictionary. If I have to do "select * from db where string='123'", the run time is much slower than a dictionary lookup like db["123"]. Is there a way to combine that dictionary-style access with an SQL database?
Sensei Posted March 5, 2021 24 minutes ago, fredreload said: But the problem I am facing is that SQL is not a dictionary. If I have to do "select * from db where string='123'", the run time is much slower than a dictionary lookup like db["123"]. Is there a way to combine that dictionary-style access with an SQL database? Make a cache in memory. Check whether the word is present in a dynamically allocated array or a key-value associative array; if it is, increase the entry's usage counter. If it is not, look it up in the database and put the new entry in the cache. Keep the 1,000 or 10,000 (or so) most used entries, and from time to time flush the least used entries from the cache. The most frequently used words and phrases will then stay cached for the entire execution of the script. You can make separate caches for single words, two-word phrases and three-word phrases, each with a user-configurable maximum number of entries. In an OOP language you would simply write a cache class that wraps the entire database code; a sketch is below.
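A minimal sketch of such a cache class. The class name, the db.get/db.put interface, and the evict-half policy are all illustrative choices, and a real version would also flush everything on shutdown:

```python
from collections import Counter

class PhraseCache:
    # Wraps any store exposing get(key) -> int-or-None and put(key, value).
    def __init__(self, db, max_entries=10_000):
        self.db = db
        self.max_entries = max_entries
        self.entries = {}        # key -> current frequency
        self.usage = Counter()   # key -> how often this cache entry was hit

    def bump(self, key):
        if key not in self.entries:
            # Database lookup happens only on a cache miss.
            self.entries[key] = self.db.get(key) or 0
            if len(self.entries) > self.max_entries:
                self._evict()
        self.entries[key] += 1
        self.usage[key] += 1

    def _evict(self):
        # Flush the least used half of the cache back to the database.
        victims = sorted(self.usage, key=self.usage.get)[: len(self.usage) // 2]
        for key in victims:
            self.db.put(key, self.entries.pop(key))
            del self.usage[key]
```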
fredreload Posted March 5, 2021 Author 1 hour ago, Sensei said: Make a cache in memory. Check whether the word is present in a dynamically allocated array or a key-value associative array; if it is, increase the entry's usage counter. If it is not, look it up in the database and put the new entry in the cache. When I cached everything in memory it used up all 16 GB of my RAM; that was actually the first thing I tried, but my computer cannot handle a cache that big, so I switched to a disk-based dictionary. The appeal of the dictionary is that each key is mapped to its slot by a hash function, so the lookup time for any particular key is O(1) on average. I don't know whether a database lookup could be made dictionary-like in the same way to optimize the run time.
Sensei Posted March 5, 2021 Caching the few thousand most used words in ASCII would take a few dozen kilobytes; for example, 10,000 words at roughly 6 bytes each is about 60 KB. Not MB. Not GB. KB. In Unicode it is 2-4x more. You cache so that you don't have to keep looking up things like "I", "you", "it", etc.
fredreload Posted March 6, 2021 Author 19 hours ago, Sensei said: Caching the few thousand most used words in ASCII would take a few dozen kilobytes. You cache so that you don't have to keep looking up things like "I", "you", "it", etc. Yes, but I am not caching words; I am caching phrases in groups of three.
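For reference, "phrases in groups of three" presumably means a sliding window of three-word phrases over the text; a minimal sketch under that assumption (the function name is illustrative):

```python
def three_word_phrases(words):
    # Sliding window: every run of three consecutive words becomes one phrase.
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

print(three_word_phrases("the cat sat on the mat".split()))
# ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```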