The highest-editing zombie bot on Wikipedia

Monday, May 26th, 2008

I stopped actively editing Wikipedia more or less one year ago. Naturally, I haven’t stopped editing completely, as I still read Wikipedia nearly every day in the pursuit of my own edification. But I no longer seek out thankless administrative tasks to perform, nor do I browse articles solely to find a way to contribute some writing. In that way I’m much more like the casual reader who occasionally fixes a typo, though the casual reader also doesn’t have the ability to delete articles, block users, and protect pages (ah, the privileges of being an administrator). But I don’t much use those abilities anymore, so it matters little.

In addition to doing lots of editing and administrative tasks (page may take a while to load), I also spent a good amount of time hacking on programs for Wikipedia. Some, such as the userbox generator (don’t even ask), were purposefully silly. Others, such as my work on the PyWikipediaBot free software project, were more useful. In addition to my work on that bot framework, I wrote quite a few bots, which are programs for making automated edits. By the time I (mostly) retired from Wikipedia, I had put many hours into those bots, and I couldn’t bear to just shut them down. So I left them running. They’ve been running now for over a year, unattended for the most part, and have been remarkably error-free, all things considered. I have variously forgotten about them for months at a time, only remembering them when my network connection chugs for an extended period of time (long “Categories for deletion” backlog) or when my server’s CPU utilization pegs (a bot process stuck in an endless loop). So yes, there is a zombie bot editing Wikipedia, and it even has administrative rights that it uses quite frequently!

All of these bot programs that I wrote run under one Wikipedia user account, Cydebot. That account was the first account on any Wikipedia project to break one million edits. The total currently stands at somewhere around a million and a quarter (proof), though it has since been out-edited by one other bot account. But just think about the sheer size of that number. At one point Cydebot accounted for a single-digit percentage of all edits to the English Wikipedia. You can’t say that’s not impressive, especially considering how ridiculously massive Wikipedia is. Yet being a bot operator was largely unsung work. The only time I really got noticed for all the effort I was putting into it (and never mind the network resources involved, especially when I was running AntiVandalBot, which downloaded and analyzed the text of every single edit to Wikipedia in real time) was when yet another person thought they were the first to realize that Cydebot was using administrative tools and deemed it necessary to yell at me about it. Wikipedia has this cargo cult rule that “admin bots aren’t allowed” — even though people have been running them for years. I’ll grant that it’s schizophrenic.

So after continuing to run Cydebot for this long, I’m not going to stop now. I haven’t put any effort into Cydebot for over a year besides occasionally updating the pyWikipediaBot framework from SVN, killing pegged bot processes, and rarely modifying the batch files for my bots when someone points out that the associated pages on Wikipedia have changed. I don’t have the time (nor the desire) to put any further serious development work into Cydebot, so at some point things will finally break and Cydebot will no longer be able to do any work. But it’s already gone over a year performing all sorts of thankless tasks on Wikipedia that no human wants to be bothered with; why not let it keep going and see how much longer my favorite zombie bot can last?

If you want to track the continuing edits of a zombie bot on Wikipedia, you can do so here. So the next time you are idly reading Wikipedia, remember that not only are there bots behind the scenes making millions of automated edits, but some of them are zombies that have been running largely unattended for months, if not years. Wikipedia is built, in no small part, upon zombie labor.

Robotic collaboration on the net

Friday, February 23rd, 2007

I already wrote about how there are too many fracking bots on the Internet. In February, bots downloaded twice as many pages on Cyde Weys Musings as people did, with many thousands of hits each from the big three search engines: Google, Microsoft, and Yahoo.

So I’m wondering, why can’t Google, Microsoft, and Yahoo collaborate? The reason they need to crawl the Internet at all (rather than just my blogging software updating them each time an entry is posted, edited, or commented on) is that they cannot trust individual sites. Spammers are always trying to break the rules, and if the search engines didn’t crawl sites themselves, they’d be overloaded with false information.

So that explains the need for crawling, but it doesn’t explain why Google, Yahoo, and Microsoft will all crawl the same page within hours of each other (when it’s extremely unlikely that anything has changed). That’s just wasted traffic. While they can’t trust individual sites, they can certainly trust each other (or effectively deal with an abuse of that trust if necessary). For each page that it would crawl, rather than hitting the site immediately, the crawler should automatically ask its two peer search engines whether they’ve crawled the page recently, and if so, just transfer the crawled page data directly rather than making another hit on the site’s webserver.
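To make that concrete, here’s a minimal sketch of the kind of peer check I have in mind. The peer-cache interface and the six-hour freshness window are made up for illustration — no such shared API actually exists between the search engines:

```python
import time
import urllib.request

PEERS = ["google", "yahoo", "microsoft"]  # hypothetical peer crawl caches
FRESHNESS_WINDOW = 6 * 60 * 60  # treat copies younger than six hours as fresh


def fetch_page(url, peer_cache):
    """Return page content, preferring a recent copy from a peer crawler.

    peer_cache is a hypothetical interface: peer_cache.lookup(peer, url)
    returns (timestamp, content), or None if that peer has no copy.
    """
    now = time.time()
    for peer in PEERS:
        cached = peer_cache.lookup(peer, url)
        if cached is not None:
            crawled_at, content = cached
            if now - crawled_at < FRESHNESS_WINDOW:
                # A peer crawled this page recently; reuse its copy and
                # spare the site's webserver another hit.
                return content
    # No fresh copy anywhere, so crawl the site directly.
    with urllib.request.urlopen(url) as response:
        return response.read()
```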

This would also save Yahoo, Google, and Microsoft lots of money in bandwidth, because they could build their own dedicated internet just for communicating web crawl data. That traffic would be much cheaper than traffic on the public Internet. It would make the owners of individual sites a lot happier too, because they’d be paying less for bandwidth while still staying up to date in the three major search engines.

Or Yahoo, Google, and Microsoft could go a step further. They could go in three ways on a colossal data center which would do all of the crawling. Then they’d have individual (massive) connections into the database of crawled pages, and could request re-crawls as necessary. Each crawled page would be immediately accessible to all three of them, saving individual sites’ bandwidth. They could even sell access to the crawled pages to other lesser search engines, recouping some of their costs.

Unfortunately, this has about a snowball’s chance in hell of succeeding. Even though it would benefit each company in the form of lower bandwidth expenses and fewer required servers, the companies will never go for it because it would require cooperation. Each probably thinks it can come out on top eventually, and none of them will want to go in on a deal that helps them if it also helps their competitors.

It’s too bad. My server was looking for some relief from the constant pounding. And a single centralized bot cluster on the net would really be a nifty thing.

Wikipedia gets CAPTCHAs for anonymous edits

Thursday, February 22nd, 2007

Yesterday, image CAPTCHAs were enabled for all anonymous edits on all Wikimedia Foundation wikis (including the popular encyclopedia Wikipedia). I noticed this by chance because I’m in a computer lab right now and found some vandalism on an article linked from the main page, but didn’t want to take the time to log in first. However, by the time I finished typing in the CAPTCHA, an admin had already reverted the vandalism. Drat.

The reason for the CAPTCHA is that we’ve been having some spam problems on-wiki recently, with spammers using automated bots to add links to dozens of pages before they end up being blocked. We have a global spam blacklist that does a good job of stopping spammers dead, but all of their edits still have to be manually reverted, which is a pain. Hopefully this new change will alleviate some of that. This change will basically stop all anonymous bot edits (including legitimate bots that get logged out by accident). It will also stop vandalism bots that are running anonymously, which we’ve seen a few of.

Unfortunately, this change still doesn’t do anything against spamming/vandalism being done using registered user accounts. Yes, you do have to pass an image CAPTCHA to register an account too, but that’s only once per account rather than on every edit, so people could conceivably manually register a bunch of accounts and then hand the account details off to their bots.

What I’d like to see is CAPTCHAs on the first twenty edits of each new user (in addition to each anonymous edit). This would make fully automated spam/vandalism essentially impossible.
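In rough Python, the trigger rule I’m proposing would look something like this. The user object and its fields are hypothetical stand-ins for illustration, not actual MediaWiki internals:

```python
def captcha_required(user):
    """Proposed rule: challenge anonymous editors on every edit, and new
    accounts on each of their first twenty edits.

    `user` is a hypothetical object with `.is_anonymous` and `.edit_count`;
    this illustrates the rule, it is not MediaWiki code.
    """
    NEW_ACCOUNT_EDIT_THRESHOLD = 20
    return user.is_anonymous or user.edit_count < NEW_ACCOUNT_EDIT_THRESHOLD
```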

One thing I’m worried about though — are we making the barriers to editing too high? Anonymous edits contribute significantly toward writing the encyclopedia. There’s a trade-off between making things hard for automated ne’er-do-wells and putting a burden on legitimate editors who just can’t be bothered to log in or register an account. I hope we haven’t gone too far in one direction.

Update: It looks like CAPTCHAs have been disabled; read the comments for more information.

Too many fracking bots

Thursday, February 22nd, 2007

The Internet is infested with way too many bots. Bots are currently downloading twice as many pages on this site as actual humans. It’s nearly absurd. Do I really need to be indexed twice as often as people actually read my stuff? Just looking at the Apache logs, I’ve been visited thousands of times this month by each of MSN Search’s bot, Google’s bot, and Yahoo’s bot. Throw in Google’s Feedfetcher bot and AdSense bot and Yahoo’s feed seeker bot for good measure. And don’t forget to add a pinch of the Internet Archive’s bot and Technorati’s bot.
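If you want to see how bad it is on your own site, a quick-and-dirty tally of the access log does the trick. Something like the following — the log path and the user-agent substrings are assumptions for illustration, so adjust them for your own setup:

```python
from collections import Counter

# Substrings that identify the crawlers mentioned above; exact user-agent
# strings vary over time, so these are rough, illustrative patterns.
BOT_PATTERNS = ["msnbot", "Googlebot", "Yahoo! Slurp", "Feedfetcher",
                "Mediapartners-Google", "ia_archiver", "Technorati"]

bot_hits = Counter()
human_hits = 0

# /var/log/apache2/access.log is an assumed path; point this at your own log.
with open("/var/log/apache2/access.log") as log:
    for line in log:
        for pattern in BOT_PATTERNS:
            if pattern in line:
                bot_hits[pattern] += 1
                break
        else:
            # No known bot pattern matched, so count it as a (mostly) human hit.
            human_hits += 1

print("Bot hits:", sum(bot_hits.values()), dict(bot_hits))
print("Other (mostly human) hits:", human_hits)
```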

It’s crazy! I’m up to my neck in bots over here! And there’s not a damn thing I can do about it. Well, theoretically I could just make my site unindexable using a robots.txt file, but that’d be like curing a roach infestation with nuclear weapons. Given the choice between too many bots and nobody new being able to find my site, I think I’ll choose the bots, thank you very much. Google and Yahoo’s RSS feed bots don’t even respect robots.txt anyway.
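For the record, the nuclear option is just the standard blanket disallow in robots.txt (which, as I said, the feed fetchers would largely ignore anyway):

```
User-agent: *
Disallow: /
```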

I suspect that a graph of bot activity percentage versus traffic numbers looks sort of like an inverted V-curve. Below a certain threshold bots don’t even care about a site (it is doomed to obscurity), so what few visits you get are mostly human. And at the high end, when you’re getting hundreds of human visitors an hour, the bot visit numbers are dwarfed. But somewhere in the middle, corresponding to a small-to-medium-sized site, is the domain where bots rule. That’s where I am right now.

To be fair, I don’t really hate bots. I realize that they are a necessary evil, and that the 1,100 incoming hits from search results so far this month couldn’t possibly have come without them. It’s just annoying that nearly half of my bandwidth is being eaten up by non-sentient entities that get the same level of robotic satisfaction from consuming my site’s bits as they do from the common splog’s.

My experiences as an open source developer

Monday, January 8th, 2007

In the second part of a series on experiences of mine that are not widely shared (and are hopefully interesting), I will talk about my experiences as an open source software developer (the first part of this series was about being a newspaper columnist).

I am a developer on the Python Wikipedia Bot Framework, which is a collection of programs that perform automated tasks on wikis based on the MediaWiki software. Wikipedia and the other Wikimedia projects are by far the largest users of MediaWiki, but there are plenty of other MediaWiki wikis out there too, and pyWiki is used by many people for a variety of tasks.

Before I go over my experiences in-depth, I’ll start with an overview of everything I’ve done on pyWiki. Skip ahead to after the list if these details are too technical.

  • delete.py – A bot that deletes a list of pages. I wrote it from scratch.
  • templatecount.py – A bot that counts how many times any given number of templates are used. I wrote it from scratch.
  • category.py – A bot that handles category renaming and deletion. I made some changes to it and some libraries, catlib.py and wikipedia.py, to make it more flexible, more automated, and to handle English Wikipedia-specific “Categories for discussion” (CFD) tagging.
  • template.py – A bot that handles template substitution, renaming, and deletion. I made some changes to it (and the library pagegenerators.py) to handle operations on multiple templates simultaneously, as well as increasing flexibility. I will admit, I added the capability to delete any number of templates in one run with the hope that I would some day be able to use it on userboxes.
  • replace.py – A bot that uses regular expressions to modify text on pages. I modified it to handle case insensitive matching, amongst other things.
  • wow.py – An unreleased bot that I used to anonymize thousands of vandal userpages to prevent glorification of vandalism. I wrote it from scratch.
  • catmove.pl – A metabot* written in Perl that parses a list of category changes and does them all in one run. I wrote it from scratch.
  • cfd.pl – An unreleased automatic version of catmove.pl that pulls down the list of category changes directly from the wiki, parses them, and executes them, in one single command. I wrote it from scratch. Hopefully I will be able to release it soon (it may have some security issues that I want to make sure are entirely resolved first).

Cfd.pl is the “secret sauce” that lets Cydebot do its magic. To date, Cydebot has over 160,000 edits, most of them category-related. I attribute this to cfd.pl, which allows me, with a single command, to handle dozens of CFDs simultaneously, whereas people using other bots have to input each one manually. It’s no surprise that everyone else pretty much gave up on CFD, leaving my highly efficient bot to handle it all on its own.
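Cfd.pl itself is written in Perl and unreleased, but the general shape of it is roughly this Python sketch. The line format it parses and the rename_category helper here are simplified stand-ins for illustration, not the real code:

```python
import re


def parse_cfd_instructions(wikitext):
    """Yield (old, new) category pairs from a working page of closed CFDs.

    Assumes lines like '* [[:Category:Old name]] to [[:Category:New name]]';
    the real working pages use more varied markup than this.
    """
    pattern = re.compile(
        r"\[\[:Category:(?P<old>[^\]]+)\]\] to \[\[:Category:(?P<new>[^\]]+)\]\]")
    for line in wikitext.splitlines():
        match = pattern.search(line)
        if match:
            yield match.group("old"), match.group("new")


def run_batch(wikitext, rename_category):
    """Execute every parsed rename in one run.

    rename_category is a stand-in for whatever actually retags the member
    pages and deletes the old category (in my case, category.py does the
    heavy lifting).
    """
    for old, new in parse_cfd_instructions(wikitext):
        print(f"Renaming Category:{old} -> Category:{new}")
        rename_category(old, new)
```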

I also had some involvement with Vandalbot, which is a Python anti-vandal bot that uses pyWiki. I ran a Vandalbot clone called AntiVandalBot off of my own server for many months, until somewhat recently, when AntiVandalBot was switched over to being hosted on a Wikimedia Foundation server. If you add up all of the edits committed by both Cydebot and AntiVandalBot, then I have the highest number of bot edits on the English Wikipedia — of course, it’s not just my work. I merely came up with the account name and hosted it for a while; Joshbuddy is the one who actually wrote the vast majority of Vandalbot, and Tawker is the one who hosted the original Tawkerbot2 for a while (and who now hosts it on the Wikimedia Foundation server).

Working on an open source project is very fun, and rather unlike a programming job for pay in the “real world”. For one, it’s entirely volunteer. I work at my leisure, when I feel like it, or when I have a functionality that I or someone else needs. Programming can actually be relaxing and cathartic when there are no deadlines and I am undertaking a coding project simply for the sake of writing something.

All of the developers on pyWiki are very relaxed and they definitely “get” the open source movement. There’s no expectation that anyone has to get anything done. This can have its downsides, in that it might take a while for something to be taken care of, but it also doesn’t scare off anyone who is worried about a large time commitment. To become a developer on pyWiki, all I had to do was ask the project head, and I was given write access to the CVS repository within a few days, even though I had never used Python before. That amount of trust is very refreshing, and I definitely feel an impetus not to let the other guys down by uploading code with bugs in it (so my testing is always rigorous).

There gets to be a point with computer languages where learning another one is simply no big deal. I wouldn’t want to estimate how many languages I’ve used by now, but it’s probably somewhere around a dozen. After the first few, though, one simply knows how to program, and learning another language is a simple matter of looking through the documentation to find the specific code to do exactly what one has in mind. That was the situation I was in with pyWiki; although I had never used Python before, I knew exactly what I wanted to accomplish and how to accomplish it: I merely needed to know how to do it in Python. Within a week I was hacking away at the code, adding significant new functionality. It should be noted that working on an existing project in a new language is much, much easier than trying to make something from scratch in a new language.

I would say that pyWiki is a medium-size open source project, which is probably exactly the right size for a first-time developer. It’s not so small that it ever goes stagnant; there are code changes submitted every day, and the mailing list is very active. Any reasonable message posted to it will get a response within a day, if not hours. On the other hand, pyWiki is not too large. It has no barriers to entry; anyone can get started hacking on it right away and submitting code changes. Larger projects necessarily have large bureaucracies (large projects need to be managed, there’s no way around it), which means there’s an approval process for code changes, and it’s unlikely that anything written by a novice will actually end up making it into a release. Trying to work on a large project right off the bat can be disheartening because there’s very little one can actually do that doesn’t require an expert level of knowledge. Compare this to pyWiki, which lacks a lot of functionality that even a novice would be able to code up (delete.py wasn’t hard at all; it’s simply that no one had done it yet).

I would encourage anyone who is interested in programming to find an open source project they enjoy that they can contribute to. It’s great experience, and it much more closely resembles what actually happens in industry than hacking away on a solo project. I’m sure it’s a great resume line item. The key is to find a project you want to work on because it benefits you. In my case, I was writing new functionality that I needed to get stuff done on Wikipedia.

And there’s just something very compelling about contributing to the worldwide corpus of free software. It’s a way to leave your mark on the world — a way to say, “I was here.”
