fredreload Posted December 18, 2017

So I've played around with web crawlers and Selenium before. I am currently looking for a way to download web pages to collect texts. I am using Python and am thinking of scraping through IPs. Which I believe ranges from 0.0.0.0 to 999.999.999.999? I think most people have port 80 open? So I would have a loop from 0.0.0.0 to 999.999.999.999 and just use urllib to download the web pages. Let me know if this is correct or if there is a better way, thanks.
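A minimal sketch of the loop described above, assuming Python 2 (since urllib/urllib2 is what the thread uses) and deliberately restricted to a tiny, illustrative slice of the address space rather than the whole range:

import itertools
import urllib2

def fetch(ip, timeout=2):
    # Try to download whatever the host serves as its default page on port 80.
    try:
        return urllib2.urlopen("http://%s/" % ip, timeout=timeout).read()
    except Exception:
        return None  # no web server, connection refused, timed out, ...

# Only a handful of addresses here on purpose: the full IPv4 space is
# ~4.3 billion combinations, which later replies point out is impractical.
for octets in itertools.product(range(1, 4), repeat=4):
    ip = ".".join(str(o) for o in octets)
    page = fetch(ip)
    if page is not None:
        print ip, len(page), "bytes"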
fredreload Posted December 18, 2017

Alright, I've managed to use urllib2 with Python to traverse from 0.0.0.0 to 256.256.256.256, but this does not go to the sub web pages, i.e. 0.0.0.0/subfolder/. So I would like to know if there is a tree traversal into all the contents listed by an IP, or does urllib2 already do that?
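For the sub-page question: urllib2 does not traverse anything on its own, it only downloads the single URL it is given. Reaching pages like 0.0.0.0/subfolder/ means parsing each downloaded page for links and queueing them yourself. A rough sketch, assuming Python 2's urllib2, HTMLParser and urlparse modules:

import urllib2
import urlparse
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects the href value of every <a> tag on a page.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    # Breadth-first traversal: fetch a page, harvest its links, repeat.
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url, timeout=5).read()
        except Exception:
            continue
        pages[url] = html
        parser = LinkExtractor()
        try:
            parser.feed(html)
        except Exception:
            continue  # badly broken HTML; keep the page text, skip its links
        host = urlparse.urlparse(url).netloc
        for link in parser.links:
            absolute = urlparse.urljoin(url, link)
            if urlparse.urlparse(absolute).netloc == host:
                queue.append(absolute)  # stay on the same host
    return pages

Calling crawl("http://127.0.0.1/") (or any reachable URL) returns a dict mapping each visited URL to its HTML, limited here to 50 pages so the sketch terminates.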
Sensei Posted December 18, 2017

9 hours ago, fredreload said: Which I believe ranges from 0.0.0.0 to 999.999.999.999?

OMG, your computer knowledge is near zero... If each octet is an unsigned byte, what will its range be? 0 to 2^8-1 = 0 to 255, so 0.0.0.0 to 255.255.255.255 (NOT 256.256.256.256!). How about starting by reading the Wikipedia pages about IPv4 addresses, for example? You will learn which IP addresses must be skipped because they have special meaning.

Scanning all of IPv4 from the start is a silly idea. It's ~4.3 billion IPs; visiting one per second would take 136 years. The majority of them contain no servers or computers, so you would just be wasting time, and one IP can hide hundreds or thousands of computers.

Connecting to ports 80, 443 and 8080 won't give you much. Virtual servers are typically configured so that they REQUIRE a host name to reveal their content. (Did you ever configure a virtual server in Apache? https://httpd.apache.org/docs/current/vhosts/examples.html )

You should start by collecting host names. That's why web crawlers analyze web pages: to find A HREF HTML tags. Google made a special technique for web admins to reveal which pages should or should not be visited by their bot, but it can also be used to examine which pages are hosted on a server if you pretend to be the Google bot. If you ever set up a website with optimization for Google, using their panel, the instructions on how to optimize and how to deal with 404 errors cover it.
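To make the virtual-host point concrete, here is a small sketch of the difference between asking a server for its bare IP and asking it with a Host header, again assuming Python 2's urllib2. The IP and host name below are placeholders for illustration, not real targets:

import urllib2

ip = "203.0.113.10"            # placeholder IP (TEST-NET-3 documentation range)
hostname = "www.example.com"   # placeholder name of a virtual host on that IP

# Request by bare IP: the server cannot tell which of its virtual hosts
# you want, so it typically serves a default page or an error.
bare = urllib2.urlopen("http://%s/" % ip, timeout=5).read()

# Same IP, but the Host header names the site we actually want.
req = urllib2.Request("http://%s/" % ip, headers={"Host": hostname})
named = urllib2.urlopen(req, timeout=5).read()

print len(bare), len(named)    # usually different content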
fiveworlds Posted December 18, 2017

I would strongly advise against doing this. Some server admins put zip bombs and viruses on their IP address to stop malicious DDoS attacks.
fredreload Posted December 19, 2017

Ya well, you guys are right. I just need a text dump of science articles, and they need to be repeating. For instance, I expect 100 articles talking about the same feature of a lizard. I tried the Wikipedia dump file, but it is non-repeating. So if any of you know of a huge text dump of science articles, let me know. Otherwise I'll have to scrape the IPs.
Strange Posted December 19, 2017

On 18/12/2017 at 10:53 AM, fredreload said: I am currently looking for a way to download web pages to collect texts.

There are large text corpora available without you having to write your own web crawler. For example: https://corpus.byu.edu or https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/index2.html
fredreload Posted December 19, 2017

28 minutes ago, Strange said: There are large text corpora available without you having to write your own web crawler.

Ya, that sounds like you gave up on the whole AI thing.
Strange Posted December 19, 2017

18 minutes ago, fredreload said: Ya, that sounds like you gave up on the whole AI thing.

No, just pointing out that there are sources of language you can use that don't require you to write a web crawler.
Sensei Posted December 19, 2017

6 hours ago, fredreload said: Ya, that sounds like you gave up on the whole AI thing.

There was no AI in your idea from the beginning.
fredreload Posted December 20, 2017

8 hours ago, Sensei said: There was no AI in your idea from the beginning.

Ya well, I thought n-grams were a good way to apply to an AI; based on the discussion, it turns out they are not. So I am going with thought bubbles and a neural network now.