Popcorn Sutton Posted June 9, 2014 Hey guys! I feel like I'm exhausting my resources again and I'm not getting much information from the experts on this one. I'm trying to get data from a server... legally. Can you guys list the methods I can use to get data from servers such as Apache, Apache Tomcat, Microsoft IIS/7.5, and SQL? I've switched to Linux (Ubuntu) by this point, but I also have a Windows PC sitting next to me in my office now. I've tried a few methods, and it feels like I'm getting closer, but every time I get close something goes wrong. So, the question is as stated above: what methods can I use to get into these servers? (The information is public for every server I'm trying to get into, so I don't think there will be any legal issues as long as I'm not extracting personal information.) Please, if you respond, elaborate as much as you can, and let me know what language (if any) I should be using. I like to use the terminal (command prompt).
Greg H. Posted June 9, 2014 First of all, you're not going to get a terminal connection to a web server unless the owner of the server is an idiot about security (or it is a public terminal server). Second, one of the server types you listed is a database server, not a web server, which means that in order to extract any information from it you would need to execute a properly formatted SQL statement (and what you're allowed to run is likely to be severely limited, unless you happen to have system access to the server in question). So let's back up and start from the beginning: what information, specifically, are you trying to get from the server?
Popcorn Sutton (Author) Posted June 9, 2014 It seems that you're right about accessing the information from the terminal. All I want is a list of every inmate in a few jails, plus any circuit court cases involving those inmates. I want to do this as efficiently as possible. A friend (or maybe not a friend) of mine replied to one of my Facebook posts saying that my method, which takes approximately five hours to gather the information, is horrible because it takes so long and any other programmer could do it in less than an hour. I think he might just be a little jealous that I have the job I do without a degree in Computer Science, but I think it's worth looking into because he does have a degree in Computer Science. He suggested using freelancers, and I do not want to do that because I need to learn how to do these things myself. I don't think I should post the links to the webpages here, but I will PM you the links. The ones I'm trying to get through to at this point are running Microsoft IIS/7.5 and Microsoft IIS/6.0. I can provide IP addresses as well, but only over PM.
Popcorn Sutton (Author) Posted June 9, 2014 I guess I can rephrase my question. How do I download all the inmate and case information without the help of a download link?
Popcorn Sutton (Author) Posted June 9, 2014 I guess I can rephrase it again... (heheh) I want to download an entire website. It has a search function for finding specific inmates; if there's a method to download the whole site along with everything its search can reach, then I've got what I need.
Sensei Posted June 9, 2014 You need an application that recursively analyzes HTML pages, finds the links in them, and downloads whatever they point to, such as wget. Edited June 9, 2014 by Sensei
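A minimal sketch of the kind of recursive wget run Sensei is describing, wrapped in Python since that is the language used later in the thread; the starting URL is a placeholder and the flags shown are just the usual mirroring options, not anything specific to the sites discussed here:

```python
# Sketch only: mirror a site by letting wget follow the links it finds.
# Assumes wget is installed; the URL is a placeholder, not a real target.
import subprocess

subprocess.run([
    "wget",
    "--recursive",   # follow links found inside downloaded pages
    "--level=3",     # limit how deep the recursion goes
    "--no-parent",   # never climb above the starting directory
    "--wait=1",      # pause between requests to be polite to the server
    "http://www.example.com/inmates/",
], check=True)
```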
Popcorn Sutton (Author) Posted June 9, 2014 wget doesn't preserve the search function. I don't know what to do. I have programs downloading the data, but if I could just get a snapshot of the entire website and every possible search for the day/week/month, then I could really do something with that.
Cap'n Refsmmat Posted June 9, 2014 You're not going to be able to download all the data unless the people who built the website specifically added features to do so. Otherwise you have to scrape every web page and extract the data the hard way.
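To make "the hard way" a little more concrete, here is a minimal Python sketch that fetches one page and collects the text from its table cells; the URL is a placeholder and the table layout is an assumption, since every site needs extraction logic written for its own markup:

```python
# Sketch only: download one page and gather the text found in its <td> cells.
# The URL is a placeholder; real sites need site-specific extraction logic.
import urllib.request
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

page = urllib.request.urlopen("http://www.example.com/inmates?page=1").read()
parser = CellCollector()
parser.feed(page.decode("utf-8", "replace"))
print(parser.cells)  # raw cell text; it still has to be grouped into records
```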
Popcorn Sutton (Author) Posted June 9, 2014 Thanks Cap'n. That's really all I needed to hear.
Sensei Posted June 9, 2014 Quoting "wget doesn't preserve the search function": I have no idea what search function you're talking about. wget simply downloads whatever you tell it to download. As for getting a snapshot of every possible search for the day/week/month: you can generate your own HTML page containing nothing but links, e.g.
http://www.server.com/2014/01/01/
http://www.server.com/2014/01/31/
http://www.server.com/2014/12/01/
http://www.server.com/2014/12/31/
and then run wget on your own page; it will find the links and download whatever they point to. Or prepare a script with many lines of the form "wget [args] [some url]" and execute that dynamically generated script. Personally I use the .NET Framework's System.Net.WebClient, with DownloadFile() or DownloadString(). You can also keep a list of proxy HTTP servers (1000+, say) and send each request through a different proxy instead of your own IP address, so the web server's owner won't realize you're downloading a large quantity of data. Edited June 10, 2014 by Sensei
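As an illustration of the "generate the links yourself" idea, here is a short Python sketch that walks a range of dates and downloads one page per day; the date-based URL pattern is taken from Sensei's example links and is hypothetical, so it would have to be replaced with whatever structure the real site uses:

```python
# Sketch of generating date-based URLs and downloading each one directly,
# instead of building an HTML page of links for wget.
import urllib.error
import urllib.request
from datetime import date, timedelta

day, end = date(2014, 1, 1), date(2014, 12, 31)
while day <= end:
    url = day.strftime("http://www.server.com/%Y/%m/%d/")  # hypothetical pattern
    out = day.strftime("page-%Y-%m-%d.html")
    try:
        urllib.request.urlretrieve(url, out)  # save the page to disk
    except urllib.error.URLError:
        pass  # a missing day is simply skipped
    day += timedelta(days=1)
```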
Popcorn Sutton (Author) Posted June 9, 2014 I've been programming and working in the interpreter with Python, though I do use Perl modules pretty often now that I'm on Linux. I'm not sure what you're getting at with that method, though.
barfbag Posted June 11, 2014 I program and have several websites, but saving a webpage in a searchable format can be done simply by copying and pasting it into Word or a PDF; both of those have search features. Having read the previous posts, I'm still having trouble grasping what you want from the servers; it seems like it's website info you're after. I don't know.
Popcorn Sutton (Author) Posted June 11, 2014 I am seeking website info, but I'm trying to see if there is a more efficient method. Hacking seems like it would be the most efficient, because that way you could just copy all the information you need directly. At this point my program is not hacking anything; it's just going through every possible search and gathering every result (using language processing, machine learning, and operation synthesis). Is there an easier way you know of to get every search result, other than running every possible query? (When I say every possible query, I mean every possible minimal query.) Edited June 11, 2014 by Popcorn Sutton
barfbag Posted June 11, 2014 I'm still shy of understanding your needs here. Now it sounds like you want your own spider to track a few websites' content. Something like http://download.cnet.com/Internet-Spider-Download/3000-12512_4-10300592.html maybe?
Popcorn Sutton (Author) Posted June 11, 2014 Yup. Just like that. I'll look into it.
Genecks Posted June 12, 2014 It looks like you want a fast processor and a fast connection. I believe I understand what you're doing. You're looking through court dockets and attempting to collect as much information as you can between defendant and plaintiff.
Popcorn Sutton (Author) Posted June 12, 2014 And the behavior of judges. What I'm after is not just the docket (although it's probably a good idea to get the docket); I'm trying to get EVERY register of actions for EVERY inmate possible.
Sensei Posted June 13, 2014 Show the website URL that you want to get data from. I have scanned many websites for data, and the most appropriate solution always depends on the website's structure. The easiest are ones like http://website/[article number].html, where you can just write a for() loop. Many times there is no need to look for direct links to know where the data are. Google's crawlers can't be tailored to every website in the world (there are simply too many websites), so search engine bots must rely on links; a purpose-built tool doesn't necessarily have to. Today I will be gathering data from Wikipedia. I have a database of element names. The script will download the pages http://en.wikipedia.org/wiki/Isotopes_of_[name of element from my database], then search them for isotope data such as protons, neutrons, mass, etc., and write it to a CSV file. Edited June 13, 2014 by Sensei
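A rough Python version of the Wikipedia run Sensei describes might look like the sketch below. Sensei's actual script isn't shown in the thread, so the three-element list stands in for the real database of names and the very naive regular expression stands in for a real parser of the isotope tables:

```python
# Sketch only: fetch each "Isotopes of <element>" page and write whatever
# isotope labels we can spot (e.g. "helium-3") to a CSV file.
import csv
import re
import urllib.request

elements = ["hydrogen", "helium", "lithium"]  # stand-in for a real database of names

with open("isotopes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["element", "isotope"])
    for name in elements:
        url = "http://en.wikipedia.org/wiki/Isotopes_of_" + name
        html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        # Naive extraction: labels like "helium-3" found anywhere in the page.
        labels = re.findall(r"\b%s-\d+\b" % name, html, flags=re.IGNORECASE)
        for isotope in sorted({label.lower() for label in labels}):
            writer.writerow([name, isotope])
```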
Popcorn Sutton (Author) Posted June 13, 2014 The websites I am working with are extremely secure, and they don't change the URL for any search. On top of that, when I check the host, it shows that the website we use on the front end is actually an alias, and when I go to the real URL, that URL doesn't change either. I just tried running host on the server again and it's failing at the moment. The point is that I can't use the for loop without actually using the webpage, but that's no problem. I went to a conference yesterday and everyone seemed pretty comfortable with hacking... they didn't mention it often, but it was kind of an unspoken thing. Sooooo I'm probably going to be getting a little deeper into hacking. I'm open to using actual web crawlers, but only if they're cost effective.
Sensei Posted June 14, 2014 The URL might be hidden through frames, iframes, AJAX, or by using $_POST[] instead of $_GET[]. Sending POST arguments is harder, but not impossible. Edited June 14, 2014 by Sensei
Sensei Posted June 14, 2014 It's calling itself with the POST method, and it changes behavior depending on the HTTP request method: GET shows the website as normal, POST performs the search. Do you know how to create a POST request? Check this: http://stackoverflow.com/questions/5647461/how-do-i-send-a-post-request-with-php You will need to analyze all the POST variables it sends (use a packet traffic viewer, e.g. a good firewall), and then prepare such a request yourself. Better to experiment first with your own server and your own website that uses a POST form, to see what you need to do to mimic a web browser sending such a request. If you can't fool your own code, you won't manage it with a third-party website either. BTW, they update the database often; if you duplicate their DB on your own disk somehow, you won't know about updates anymore, so you potentially risk losing some valuable data. Edited June 14, 2014 by Sensei
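For anyone wanting a Python equivalent of that PHP example, a minimal sketch of submitting a search form via POST follows; the URL and the form field names ("last_name", "submit") are placeholders, since the real ones would have to be read out of the site's HTML or captured with a traffic sniffer, as Sensei suggests:

```python
# Sketch only: send a POST request that mimics a search form submission.
# The URL and the field names are placeholders, not the real site's.
import urllib.parse
import urllib.request

form = urllib.parse.urlencode({
    "last_name": "Smith",
    "submit": "Search",
}).encode("ascii")

req = urllib.request.Request(
    "http://www.example.com/inmate-search",
    data=form,  # supplying data makes urllib issue a POST instead of a GET
    headers={"User-Agent": "Mozilla/5.0"},  # look like an ordinary browser
)
with urllib.request.urlopen(req) as resp:
    results_html = resp.read().decode("utf-8", "replace")
print(len(results_html), "bytes of search results")
```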
Popcorn Sutton (Author) Posted June 17, 2014 Ya, I just decided that the method I'm using is suitable.