fredreload Posted December 18, 2017

So I've played around with web crawlers and Selenium before. I am currently looking for a way to download web pages to collect texts. I am using Python and am thinking of scraping through IPs. Which I believe ranges from 0.0.0.0 to 999.999.999.999? I think most people have port 80 open? So I would have a loop from 0.0.0.0 to 999.999.999.999 and just use urllib to download the web pages. Let me know if this is correct or if there is a better way, thanks.
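A minimal sketch of the loop described above, assuming Python 2 (since urllib/urllib2 is what the thread uses) and deliberately restricted to a tiny, illustrative slice of the address space rather than the whole range:

import itertools
import urllib2

def fetch(ip, timeout=2):
    # Try to download whatever the host serves as its default page on port 80.
    try:
        return urllib2.urlopen("http://%s/" % ip, timeout=timeout).read()
    except Exception:
        return None  # no web server, connection refused, timed out, ...

# Only a handful of addresses here on purpose: the full IPv4 space is
# ~4.3 billion combinations, which later replies point out is impractical.
for octets in itertools.product(range(1, 4), repeat=4):
    ip = ".".join(str(o) for o in octets)
    page = fetch(ip)
    if page is not None:
        print ip, len(page), "bytes"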
fredreload Posted December 18, 2017

Alright, I've managed to use urllib2 with Python to traverse from 0.0.0.0 to 256.256.256.256, but this does not go to the sub web pages, i.e. 0.0.0.0/subfolder/. So I would like to know if there is a tree traversal into all the contents listed by an IP, or does urllib2 already do that?
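For the sub-page question: urllib2 does not traverse anything on its own, it only downloads the single URL it is given. Reaching pages like 0.0.0.0/subfolder/ means parsing each downloaded page for links and queueing them yourself. A rough sketch, assuming Python 2's urllib2, HTMLParser and urlparse modules:

import urllib2
import urlparse
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects the href value of every <a> tag on a page.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    # Breadth-first traversal: fetch a page, harvest its links, repeat.
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url, timeout=5).read()
        except Exception:
            continue
        pages[url] = html
        parser = LinkExtractor()
        try:
            parser.feed(html)
        except Exception:
            continue  # badly broken HTML; keep the page text, skip its links
        host = urlparse.urlparse(url).netloc
        for link in parser.links:
            absolute = urlparse.urljoin(url, link)
            if urlparse.urlparse(absolute).netloc == host:
                queue.append(absolute)  # stay on the same host
    return pages

Calling crawl("http://127.0.0.1/") (or any reachable URL) returns a dict mapping each visited URL to its HTML, limited here to 50 pages so the sketch terminates.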
Sensei Posted December 18, 2017

9 hours ago, fredreload said: Which I believe ranges from 0.0.0.0 to 999.999.999.999?

OMG, your computer knowledge is near zero... If each octet is an unsigned byte, what will its range be? 0 to 2^8-1 = 0 to 255, so 0.0.0.0 to 255.255.255.255 (NOT 256.256.256.256!). How about starting by reading the Wikipedia pages about IPv4 addresses, for example? You will learn which IP addresses must be skipped because they have special meaning.

Scanning all of IPv4 from the start is a silly idea. It's ~4.3 billion IPs; visiting one per second would take 136 years. The majority of them contain no servers or computers, so you would just be wasting time, and one IP can hide hundreds or thousands of computers.

Connecting to ports 80, 443 and 8080 won't give you much. Virtual servers are typically configured so that they REQUIRE a host name to reveal their content. (Did you ever configure a virtual server in Apache? https://httpd.apache.org/docs/current/vhosts/examples.html )

You should start by collecting host names. That's why web crawlers analyze web pages: to find A HREF HTML tags. Google made a special technique for web admins to reveal which pages should or should not be visited by their bot, but it can also be used to examine which pages are hosted on a server if you pretend to be the Google bot. If you ever set up a website with optimization for Google, using their panel, the instructions on how to optimize and how to deal with 404 errors cover it.
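To make the virtual-host point concrete, here is a small sketch of the difference between asking a server for its bare IP and asking it with a Host header, again assuming Python 2's urllib2. The IP and host name below are placeholders for illustration, not real targets:

import urllib2

ip = "203.0.113.10"            # placeholder IP (TEST-NET-3 documentation range)
hostname = "www.example.com"   # placeholder name of a virtual host on that IP

# Request by bare IP: the server cannot tell which of its virtual hosts
# you want, so it typically serves a default page or an error.
bare = urllib2.urlopen("http://%s/" % ip, timeout=5).read()

# Same IP, but the Host header names the site we actually want.
req = urllib2.Request("http://%s/" % ip, headers={"Host": hostname})
named = urllib2.urlopen(req, timeout=5).read()

print len(bare), len(named)    # usually different content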
fiveworlds Posted December 18, 2017

I would strongly advise against doing this. Some server admins put zip bombs and viruses on their IP address to stop malicious DDoS attacks.
fredreload Posted December 19, 2017

Ya well, you guys are right. I just need a text dump of science articles, and they need to be repeating. For instance, I expect 100 articles talking about the same feature of a lizard. I tried the Wikipedia dump file, but it is non-repeating. So if any of you know of a huge text dump of science articles, let me know. Otherwise I'll have to scrape the IPs.
Strange Posted December 19, 2017

On 18/12/2017 at 10:53 AM, fredreload said: I am currently looking for a way to download web pages to collect texts.

There are large text corpora available without you having to write your own web crawler. For example: https://corpus.byu.edu or https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/index2.html
fredreload Posted December 19, 2017

28 minutes ago, Strange said: There are large text corpora available without you having to write your own web crawler.

Ya, that sounds like you gave up on the whole AI thing.
Strange Posted December 19, 2017

18 minutes ago, fredreload said: Ya, that sounds like you gave up on the whole AI thing.

No, just pointing out that there are sources of language you can use that don't require you to write a web crawler.
Sensei Posted December 19, 2017

6 hours ago, fredreload said: Ya, that sounds like you gave up on the whole AI thing.

There was no AI in your idea from the beginning.
fredreload Posted December 20, 2017

8 hours ago, Sensei said: There was no AI in your idea from the beginning.

Ya well, I thought n-grams were a good way to apply to an AI; based on the discussion, it turns out they are not. So I am going with thought bubbles and a neural network now.