I already wrote about how there are too many fracking bots on the Internet. In February, bots downloaded twice as many pages on Cyde Weys Musings as people did, with many thousands of hits each coming from the big three search engines: Google, Microsoft, and Yahoo.
So I’m wondering, why can’t Google, Microsoft, and Yahoo collaborate? The reason they need to crawl the Internet at all (rather than just having my blogging software notify them each time an entry is posted, edited, or commented on) is that they cannot trust individual sites. Spammers are always trying to break the rules, and if the search engines didn’t actually come out and crawl sites themselves, they’d be overloaded with false information.
So that explains the need for crawling, but it doesn’t explain why Google, Yahoo, and Microsoft will all crawl the same page within hours of each other, when it’s extremely unlikely that anything has changed. That’s just wasted traffic. While they can’t trust individual sites, they can certainly trust each other (or effectively deal with an abuse of that trust if necessary). For each page it would crawl, rather than hitting the site immediately, a crawler should automatically ask its two peer search engines whether they’ve crawled that page recently, and if so, transfer the crawled page data directly rather than making another hit on the site’s webserver.
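The check-with-your-peers-first idea above can be sketched in a few lines of Python. This is purely illustrative: the `fetch_from_peer` and `crawl_directly` callables and the 24-hour freshness window are all assumptions of mine, not anything the search engines actually expose.

```python
import time

# Hypothetical freshness window: a peer's copy younger than this is "recent
# enough" to reuse instead of re-crawling the page ourselves.
FRESHNESS_WINDOW = 24 * 60 * 60  # seconds


def get_page(url, peers, crawl_directly, fetch_from_peer):
    """Ask peer search engines for a recent copy before hitting the site.

    fetch_from_peer(peer, url) is assumed to return (crawl_timestamp, html)
    if that peer has the page, or None if it doesn't.
    """
    now = time.time()
    for peer in peers:
        cached = fetch_from_peer(peer, url)
        if cached is not None:
            crawled_at, html = cached
            if now - crawled_at < FRESHNESS_WINDOW:
                return html  # reuse the peer's copy; no hit on the site
    return crawl_directly(url)  # nobody has a fresh copy, so crawl it ourselves
```

The point of the sketch is the ordering: the expensive operation (hitting the site’s webserver) only happens after every peer has been asked and come up empty.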
This would also save Yahoo, Google, and Microsoft lots of money on bandwidth, because they could build their own dedicated network for exchanging crawl data. That traffic would be much cheaper than traffic over the public Internet. It would make owners of individual sites a lot happier too: they’d pay less for bandwidth while still staying current in the three major search engines.
Or Yahoo, Google, and Microsoft could go a step further. They could go in together, three ways, on a colossal data center that would do all of the crawling. Then each would have its own (massive) connection into the database of crawled pages and could request re-crawls as necessary. Each crawled page would be immediately accessible to all three of them, saving individual sites’ bandwidth. They could even sell access to the crawled pages to other, lesser search engines, recouping some of their costs.
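A toy sketch of that shared-data-center idea: one crawl cache that any number of engines read from, with on-demand re-crawls when an engine believes a page has changed. The class and method names here are invented for illustration, not a real system.

```python
import time


class SharedCrawlStore:
    """A single shared cache of crawled pages (hypothetical).

    `crawler` is any callable mapping a URL to page content; it stands in
    for the data center's actual crawl machinery.
    """

    def __init__(self, crawler):
        self.crawler = crawler
        self.cache = {}  # url -> (crawl_timestamp, content)

    def get(self, url):
        """Return the cached copy, crawling only if we have none at all."""
        if url not in self.cache:
            self.recrawl(url)
        return self.cache[url][1]

    def recrawl(self, url):
        """Force a fresh crawl, e.g. when an engine thinks the page changed."""
        self.cache[url] = (time.time(), self.crawler(url))
```

The key property is that however many engines call `get()` on the same URL, the site itself is only hit once until someone explicitly requests a re-crawl.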
Unfortunately, this has about a snowball’s chance in hell of succeeding. Even though it would benefit each company in the form of lower bandwidth expenses and fewer required servers, the companies will never go for it because it would require cooperation. Each probably thinks it can come out on top eventually, and none of them will want in on a deal that helps them if it also helps their competitors.
It’s too bad. My server was looking forward to some relief from the constant pounding. And a single centralized bot cluster on the net would really be a nifty thing.