Cap'n Refsmmat Posted April 23, 2006 Posted April 23, 2006 I've spotted a very interesting new concept for a search engine. Simply put, it uses distributed computing rather than one centralized cluster of servers doing the crawling. This means that the system is getting ridiculous amounts of new URLs every day (because there are numerous crawlers running at once, rather than a few big ones) and lots of new data. I think it's a rather nice idea. Unfortunately, their alpha search engine component (the bit for actually searching the stuff the crawlers have gotten) is a bit lacking. A lot of searches turn up Microsoft as the first result - no idea why - along with other irrelevant things. They have multiple algorithms, however, so I think progress is being made there. Thoughts? I think it could be much better than a centralized engine at crawling as much internet content as possible. link: http://www.majestic12.co.uk/
Rasori Posted April 23, 2006 Posted April 23, 2006 Wow. Just to test, I searched "star wars" on both this and Google. This new site got me 1,345,560 results, google gets me "about 144,000." Granted, I would never search through all the results, but that shows that there's likely a better chance of finding specific things you may be looking for under a subject.
Cap'n Refsmmat Posted April 23, 2006 Author Posted April 23, 2006 Google gets me 149,000,000 for "star wars", and this new one gets me just over a million. Not actually comparable. But there are many more being crawled daily. I've done about 30,000 URLs so far today.
alt_f13 Posted April 23, 2006 Posted April 23, 2006 I'm sure that's a record, CR. Any reason why you two would get differing results for google? There is more than one method of searching in that new alpha.
Cap'n Refsmmat Posted April 23, 2006 Author Posted April 23, 2006 No, the individual with the most URLs for today has 4,869,963. There are multiple ways because you can fiddle with your own algorithm and see if you can make it better than the default one. The default is rather pathetic for relevancy, although that's what the owner plans on improving now that the crawler works well.
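To illustrate what "fiddling with your own algorithm" could mean in practice, here's a rough sketch of pluggable ranking: the same candidate result set scored by interchangeable relevance functions. Everything here (the URLs, the stats, the scoring formulas) is hypothetical for illustration, not Majestic-12's actual interface:

```python
# Hypothetical sketch: rank the same result set with swappable scoring functions.

def rank(results, score):
    """Sort (url, stats) pairs by a caller-supplied scoring function, best first."""
    return sorted(results, key=lambda r: score(r[1]), reverse=True)

# Made-up per-page stats: how often the search term appears on the page,
# and how many other pages link to it.
results = [
    ("http://spam.example/", {"term_count": 50, "inlinks": 1}),
    ("http://good.example/", {"term_count": 5, "inlinks": 40}),
]

naive = lambda s: s["term_count"]                    # raw term frequency
weighted = lambda s: s["term_count"] * s["inlinks"]  # reward well-linked pages

best_naive = rank(results, naive)[0][0]        # keyword-stuffed page wins
best_weighted = rank(results, weighted)[0][0]  # well-linked page wins
```

The point of a design like this is that the relevancy problem becomes a matter of trying different `score` functions against the same index, which is roughly what letting users test their own algorithms amounts to.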
bluesmudge Posted April 23, 2006 Posted April 23, 2006 Isn't that the future of computing? Distributed systems - what with that BBC climate program (which is rubbish at process management) hooking onto the idea that the world itself makes up a bigger computer than the world's current most powerful machine. By the way, what is the world's most powerful computer these days? A few years ago I was told it was the mainframe being installed in the Met Office's new Exeter head office, though obviously that won't be true anymore!
alt_f13 Posted April 24, 2006 Posted April 24, 2006 Can't wait to play Doom on that thing. I'll be the best! 350TFLOPS!!! Joking aside, Holy Crap!
Rasori Posted April 24, 2006 Posted April 24, 2006 Cap'n and I got different results because when I put "star wars" I included the quotes, I believe. I get 174 million or so on google when I don't. Oddly enough, retesting that, Majestic-12 gives me 679,774 results for "star wars" this time...
bascule Posted April 24, 2006 Posted April 24, 2006 The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems? I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term.
Cap'n Refsmmat Posted April 24, 2006 Author Posted April 24, 2006 The problem with delegating the construction of a search engine index is verifying the authenticity of the data returned. I think such a system would become immensely vulnerable to spam... think of how many spammers already control botnets with tens or hundreds of thousands of infected machines. How could you possibly protect a distributed search engine index from spam attacks from these systems? I predict a search in such a system would yield results for porn and online gambling sites for virtually every search term. The actual indexing is done on the server. All the client does is gather up URLs and their content. Only the server can decide what the content is, and what searches it will show up in.
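That split - clients only fetch pages and report back, while the server alone builds the index - can be sketched in miniature. This is a hypothetical illustration of the division of labour, not MJ12's actual code; the "web" is a local dict so the sketch runs without a network, with `pages.get` standing in for an HTTP fetch:

```python
import re

def crawl(fetch, seed_urls, max_pages=100):
    """Client side: walk links breadth-first and collect raw (URL, content)
    pairs. `fetch` stands in for an HTTP GET (here, a dict lookup)."""
    seen, queue, results = set(), list(seed_urls), []
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        content = fetch(url)
        if content is None:
            continue
        results.append((url, content))
        # Extract href="..." links to feed back into the crawl queue.
        queue.extend(re.findall(r'href="([^"]+)"', content))
    return results

def build_index(submissions):
    """Server side: only the server decides which terms each submitted
    page shows up under - clients never touch the index."""
    index = {}
    for url, content in submissions:
        for word in set(re.findall(r"[a-z]+", content.lower())):
            index.setdefault(word, []).append(url)
    return index

# Toy two-page "web" standing in for the real internet.
pages = {
    "http://a.example/": '<a href="http://b.example/">star wars</a>',
    "http://b.example/": "guinea pig feeding",
}
index = build_index(crawl(pages.get, ["http://a.example/"]))
```

Because `build_index` runs entirely server-side, a malicious client can only lie about page content for URLs it was asked to crawl - it can't directly inject entries into the index, which is what limits the botnet-spam scenario described above.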
RyanJ Posted April 24, 2006 Posted April 24, 2006 It's an interesting idea, and with enough participation it could easily outmatch the Google search engine and its current search algorithms. A very interesting idea indeed. Cheers, Ryan Jones
drochaid Posted May 2, 2006 Posted May 2, 2006 I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue, whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect. But regardless, searching needs to become a great deal more personalised, and I have some designs in mind on how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method.
chemfreak Posted May 3, 2006 Posted May 3, 2006 I like it. But maybe you could try to NOT omit letters like "I" in the search for "I-pod" :confused:
Cap'n Refsmmat Posted May 3, 2006 Author Posted May 3, 2006 I honestly believe this is placing the wrong emphasis on where search engines need to go. The number of sites trawled isn't that much of an issue, whether you have lots of standalone machines dotted about or a smaller number of central clusters. The real issue is how the data is transformed into information that is actually usable, and Google is still by far the best on that front, albeit far from perfect. If a search engine could get a huge index and relevant results, it would be a world-beater. But regardless, searching needs to become a great deal more personalised, and I have some designs in mind on how to achieve this over the next 10 years. Not a replacement for Google and the like, just an additional method. The problem with personalization is that sometimes people break out of the personalized "mold." If, for example, they always search for pet information, the engine might personalize to bring up more relevant results, but then their evil sibling gets on and tries to find instructions for nuclear weapons and only gets guinea pig feeding directions.