Cap'n Refsmmat Posted April 23, 2006 Posted April 23, 2006 I've spotted a very interesting new concept for a search engine. Simply put, it uses distributed computing rather than one centralized cluster of servers doing the crawling. This means that the system is getting ridiculous amounts of new URLs every day (because there are numerous crawlers running at once, rather than a few big ones) and lots of new data. I think it's a rather nice idea. Unfortunately, their alpha search engine component (the bit for actually searching the stuff the crawlers have gotten) is a bit lacking. A lot of searches turn up Microsoft as the first result - no idea why - along with other irrelevant things. They have multiple algorithms, however, so I think progress is being made there. Thoughts? I think it could be much better than a centralized engine at crawling as much internet content as possible. link: http://www.majestic12.co.uk/
Rasori Posted April 23, 2006 Posted April 23, 2006 Wow. Just to test, I searched "star wars" on both this and Google. This new site got me 1,345,560 results, google gets me "about 144,000." Granted, I would never search through all the results, but that shows that there's likely a better chance of finding specific things you may be looking for under a subject.
Cap'n Refsmmat Posted April 23, 2006 Author Posted April 23, 2006 Google gets me 149,000,000 for "star wars", and this new one gets me just over a million. Not actually comparable. But there are many more being crawled daily. I've done about 30,000 URLs so far today.
alt_f13 Posted April 23, 2006 Posted April 23, 2006 I'm sure that's a record, CR. Any reason why you two would get differing results for google? There is more than one method of searching in that new alpha.
Cap'n Refsmmat Posted April 23, 2006 Author Posted April 23, 2006 No, the individual with the most URLs for today has 4,869,963. There are multiple ways because you can fiddle with your own algorithm and see if you can make it better than the default one. The default is rather pathetic for relevancy, although that's what the owner plans on improving now that the crawler works well.
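To illustrate what "fiddling with your own algorithm" could mean in practice, here's a rough sketch of pluggable ranking: the same candidate result set scored by interchangeable relevance functions. Everything here (the URLs, the stats, the scoring formulas) is hypothetical for illustration, not Majestic-12's actual interface:

```python
# Hypothetical sketch: rank the same result set with swappable scoring functions.

def rank(results, score):
    """Sort (url, stats) pairs by a caller-supplied scoring function, best first."""
    return sorted(results, key=lambda r: score(r[1]), reverse=True)

# Made-up per-page stats: how often the search term appears on the page,
# and how many other pages link to it.
results = [
    ("http://spam.example/", {"term_count": 50, "inlinks": 1}),
    ("http://good.example/", {"term_count": 5, "inlinks": 40}),
]

naive = lambda s: s["term_count"]                    # raw term frequency
weighted = lambda s: s["term_count"] * s["inlinks"]  # reward well-linked pages

best_naive = rank(results, naive)[0][0]        # keyword-stuffed page wins
best_weighted = rank(results, weighted)[0][0]  # well-linked page wins
```

The point of a design like this is that the relevancy problem becomes a matter of trying different `score` functions against the same index, which is roughly what letting users test their own algorithms amounts to.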
bluesmudge Posted April 23, 2006 Posted April 23, 2006 Isn't that the future of computing? Distributed systems - what with that BBC climate program (which is rubbish at process management) hooking onto the idea that the world itself makes up a bigger computer than the world's current most powerful machine. By the way, what is the world's most powerful computer these days? A few years ago I was told it was the mainframe being installed in the Met Office's new Exeter head office, though obviously that won't be true anymore!
alt_f13 Posted April 24, 2006 Posted April 24, 2006 Can't wait to play Doom on that thing. I'll be the best! 350TFLOPS!!! Joking aside, Holy Crap!
Rasori Posted April 24, 2006 Posted April 24, 2006 Cap'n and I got different results because when I put "star wars" I included the quotes, I believe. I get 174 million or so on google when I don't. Oddly enough, retesting that, Majestic-12 gives me 679,774 results for "star wars" this time...
bascule Posted April 24, 2006 Posted April 24, 2006 The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems? I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term.
Cap'n Refsmmat Posted April 24, 2006 Author Posted April 24, 2006 The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems? I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term. The actual indexing is done on the server. All the client does is gather up URLs and their content. Only the server can decide what the content is, and what searches it will show up in.
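That split - clients only fetch pages and report back, while the server alone builds the index - can be sketched in miniature. This is a hypothetical illustration of the division of labour, not MJ12's actual code; the "web" is a local dict so the sketch runs without a network, with `pages.get` standing in for an HTTP fetch:

```python
import re

def crawl(fetch, seed_urls, max_pages=100):
    """Client side: walk links breadth-first and collect raw (URL, content)
    pairs. `fetch` stands in for an HTTP GET (here, a dict lookup)."""
    seen, queue, results = set(), list(seed_urls), []
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        content = fetch(url)
        if content is None:
            continue
        results.append((url, content))
        # Extract href="..." links to feed back into the crawl queue.
        queue.extend(re.findall(r'href="([^"]+)"', content))
    return results

def build_index(submissions):
    """Server side: only the server decides which terms each submitted
    page shows up under - clients never touch the index."""
    index = {}
    for url, content in submissions:
        for word in set(re.findall(r"[a-z]+", content.lower())):
            index.setdefault(word, []).append(url)
    return index

# Toy two-page "web" standing in for the real internet.
pages = {
    "http://a.example/": '<a href="http://b.example/">star wars</a>',
    "http://b.example/": "guinea pig feeding",
}
index = build_index(crawl(pages.get, ["http://a.example/"]))
```

Because `build_index` runs entirely server-side, a malicious client can only lie about page content for URLs it was asked to crawl - it can't directly inject entries into the index, which is what limits the botnet-spam scenario described above.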
RyanJ Posted April 24, 2006 Posted April 24, 2006 It's an interesting idea, and with enough participation it could easily outmatch the Google search engine and its current search algorithms. A very interesting idea indeed. Cheers, Ryan Jones
drochaid Posted May 2, 2006 Posted May 2, 2006 I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue, whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect. But regardless, searching needs to become a great deal more personalised, and I have some designs in mind on how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method.
chemfreak Posted May 3, 2006 Posted May 3, 2006 I like it. But maybe you could try to NOT omit letters like "I" in the search for "I-pod" :confused:
Cap'n Refsmmat Posted May 3, 2006 Author Posted May 3, 2006 I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue, whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect. If a search engine could get a huge index and relevant results, it would be a world-beater. But regardless, searching needs to become a great deal more personalised, and I have some designs in mind on how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method. The problem with personalization is that sometimes people break out of the personalized "mold." If, for example, they always search for pet information, the engine might personalize to bring up more relevant results, but then their evil sibling gets on and tries to find instructions for nuclear weapons and only gets guinea pig feeding directions.