Archive for the 'Programming' Category

Reminiscing about the naïve, spam-free days of the web

Tuesday, July 21st, 2009

Remember a long time ago when the web was free of spam? I’m not talking about email, which has had spam problems for awhile, I’m talking about the web. Nowadays, the web is festering with link-crawling spambots. Anyone with a blog, Twitter account, or heck, even a webpage with a simple submit form with some text fields on it, knows this. There’s not much that can be done about it besides spam-detection heuristic algorithms and CAPTCHAs.

Well, I just recently found some code that I wrote way back in 2002 that displays a blissfully unaware naïvité of what was to come. That code was part of my website Fyre’s Domain, which I have since put an archived copy of online. I had just been learning Perl CGI and I wanted to write a simple guestbook/comments form that readers could use to give me feedback without having to use a mailto: link. This was in the era before blogging software was commonplace — what I was running was a home-brew blog, but before the word “blog” was even invented. I basically copied the format from one of the first chat rooms I ever used, Paddynet, way back in 1995 or so. The “chat room” consisted of an HTML form that would dump unvalidated input (including HTML tags) into a chat buffer displayed on the page that would display the last 30 or so messages.

Paddynet was around long before spambots, but my site was started right when they began appearing in the wild, and the code proceeded to run for another 7 years until I just shut it off.

You can probably guess what happened.

The only reason I even re-discovered this code is because I happened to notice it was getting an unusual number of hits in my web analytics software. And those hits were anything but benign. My poor naïve Perl CGI comments submission form has accumulated 26 MB worth of text over the years, all of it spam. And since I figure it may be interesting to someone to see exactly what seven years of web spam looks like, you can download it for yourself (a text file thankfully compressed down to just 1.8 MB). If anyone finds any interesting trends in popular spam topics over the years in there, do let me know.

So those are the dangers of trusting user input on the web these days. Revel in the blissful simplicity of the following code, which was all it took to implement a comment submission system back in the day. Nowadays you couldn’t get away with anything even close to it. As my data proves, you’ll be eaten alive by spambots.

#!/usr/bin/perl

use CGI qw(:standard);

print header;
print start_html('Leave Comments on Fyre'),
	h1('Leave Comments on Fyre'),
	start_form,
	"<i>Note, all fields are optional, but empty comments will be ignored.</i><br>",
	"Name: ", textfield(-name=>'name',-default=>''),
	"E-mail: ", textfield(-name=>'e-mail',-default=>''),
	"Your Comments: <br>", textarea(-name=>'Comments',-rows=>10,-columns=>50,-default=>''),
	'<br>',
	submit('Submit'), reset,
	end_form,
	p,
	hr;

if (param() && param('name') ne '' && param('Comments') ne '') {
	$date = `date '+%H:%M %m/%d/%Y'`;

	print '<i>Your comment has been posted.</i><hr><br>';
	@foo = "\n\n" . '<br><b>' . param('name') . '</b> ' . "\n" .
	'<u>' . param('e-mail') . '</u> ' . "\n" . '<i>' . $date . '</i>' .
	"\n" . '<table><tr><td width = "100%">' . param('Comments') .
	'</td></tr></table><hr>';
	push @foo, `cat mk.txt`;
	open CFILE, ">mk.txt" or die "Failed to open comments file!";
	print CFILE @foo;
	close CFILE;
}

@foo = `cat mk.txt`; print @foo;

print 'This program is open source, and released under the GPL by Ben McIlwain, 2002.  See
the source <a href = "mk_script.txt">here</a>.';
print end_html;

A Python script to auto-follow all Twitter followers

Tuesday, March 10th, 2009

In my recent fiddling around with Twitter I came across the Twitter API, which is surprisingly feature-complete. Since programming is one of my hobbies (as well as my occupation), I inevitably started fooling around with it and have already come up with something useful. I’m posting it here, so if you need to do the same thing that I am, you won’t have to reinvent the wheel.

One common thing that people do on Twitter is they follow everyone that follows them. This is good for social networking (or just bald self-promotion), as inbound links to your Twitter page show in the followers list of everyone that you’re following. You’d think Twitter itself would have a way to do this, but alas, it does not. So what I wanted to do is use a program to automatically follow everyone following me instead of having to manually follow each person.

Other sites that interface with Twitter will do it for you (such as TweetLater), but I’m not interested in signing up for another service, and I’m especially not interested in giving out my Twitter login credentials to anyone else. So I needed software that ran locally. A Google search turned up an auto-follow script written in Perl, but the download link requires registration with yet another site. I didn’t want to do that so I decided to program it for myself, which ended up being surprisingly simple.

My Auto-Follow script is written in Python. I decided to use Python because of the excellent Python Twitter library. It provides an all-Python interface to the Twitter API. You’ll need to download and install Python-Twitter (and its dependency, python-simplejson, if you don’t have it already; sudo apt-get install python-simplejson does the trick on Ubuntu GNU/Linux). Just follow the instructions on the Python-Twitter page; it’s really simple.

Now, create a new Python script named auto_follow.py and copy the following code into it:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#(c) 2009 Ben McIlwain, released under the terms of the GNU GPL v3.
import twitter
from sets import Set

username = 'your_username'
password = 'your_password'
api = twitter.Api(username=username, password=password)

following = api.GetFriends()
friendNames = Set()
for friend in following:
    friendNames.add(friend.screen_name)

followers = api.GetFollowers()
for follower in followers:
    if (not follower.screen_name in friendNames):
        api.CreateFriendship(follower.screen_name)

Yes, it really is that simple. I’d comment it, but what’s the point? I can summarize its operation in one sentence: It gets all of your friends and all of your followers, and then finds every follower that isn’t a friend and makes them a friend. Just make sure to edit the script to give it your actual username and password so that it can sign in.

Run the script and you will now be following all of your followers. Pretty simple, right? But you probably don’t want to have to keep running this program manually. Also, I’ve heard rumors that the Twitter API limits you to following 70 users per hour (as an anti-spam measure, I’m guessing), so if you have more than 70 followers you’re not following, you won’t be able to do it all at once. Luckily, there’s a solution for both problems: add the script as an hourly cronjob. This will keep who you follow synced with your followers over time, and if you have a large deficit in who you follow at the start (lucky bastard), it’ll slowly chip away at it each hour until they do get in sync. In Ubuntu GNU/Linux, adding the following line to a text file in /etc/cron.d/ (as root) should do it:

0 * * * * username python /path/to/auto_follow.py >/dev/null 2>&1

This will run the auto_follow script at the top of each hour. You’ll need to set the username to the user account you want the job to run under — your own user account is fine — and set the path to wherever you saved the auto_follow script. Depending on your GNU/Linux distribution and which cron scheduler you have installed, you may not need the username field, and this line might go in a different file (such as /etc/crontab). Refer to your distro’s documentation for more information.

So that’s it. That’s all it takes to automatically auto-follow everyone who’s following you — a dozen or so lines of Python, one crontab entry, and one excellent library and API. Enjoy.

Here’s a pretty bad Unicode WTF

Tuesday, March 3rd, 2009

I’m doing some research on Unicode and compression algorithms right now for a side-project I’m working on, and I came across a highly ranked Google search result for a UTF-8 munging code snippet that is so idiotic I couldn’t let it pass without comment. If this post helps even one person who would’ve otherwise followed the linked advice, it is worth it.

First, some background. UTF-8 is a character encoding format that can pretty much handle any character under the Sun, from the English alphabet to Japanese kanji to obscure extinct languages. It even includes thousands of esoteric symbols used in smaller fields of study that you’ve probably never even heard of before. But the nice thing about UTF-8 is that it is variable-length. Standard ASCII characters (including everything on a standard English keyboard) only take one byte to represent. All of the common characters from other widely used languages typically take just two bytes to encode. It’s only the really obscure characters that require more than two bytes.

So now you see why the linked “solution” is so stupid. This guy says he is “designing a little client/server binary message format” and wants “a simple and quick way to encode strings”. Well, duh — use UTF-8, no ifs, ands, or buts about it. It’s simple, quick, and already implemented in any programming language you can think of, so it requires no additional coding. There are all sorts of dumb ways to unnecessarily reinvent the wheel in sotware engineering, but trying to come up with your own character encoding is particularly idiotic. It’s really tricky to get right because there are so many corner cases you’ll never even know existed until they cause your application to break. The Unicode Consortium exists for a reason — what they do is hard.

This guy even confesses that his expected input will probably not contain Unicode characters that are longer than 2 bytes. So there is no justification at all for what he does next — he creates a mangled version of UTF-8 that turns all Unicode characters 3 bytes and longer into question marks, instead of just leaving them as is. So instead of allowing a rare character to take an additional byte or two, it gets mangled. And to accomplish this, he has to create his own custom encoding solution that is an intentionally broken version of UTF-8. That’s the worst part — he’s wasting time creating completely unnecessary code, that will need to be maintained, that will need to be debugged — and for what?

Of course, none of the people responding to his thread point out that what he is trying to do is stupid. They just smile and hand him some rope.

What C# does better than Java

Thursday, October 16th, 2008

I spend 90% of the development time at my job using either Java or C# .NET, with a roughly equal split between the two. So I know a fair amount about both languages — enough to feel qualified to comment on the differences between them, anyway. Now being a Free Software guy, my obvious preference is for Java, which is actually Free as in freedom (finally, anyway) and runs on a large variety of platforms. Given the choice of which to use on my personal projects, Java is a no-brainer. The best IDE for Java, Eclipse, is absolutely Free. The best IDE for C#, Visual Studio is … well, it’s several hundred dollars and proprietary to boot. And it has the limitation of not running on or compiling for GNU/Linux; since I use Ubuntu as my home desktop operating system, that’s a deal breaker.

But just on a pure comparison between the languages, I have to say that C# is the better of the two. It’s not a fair comparison because C# is many years younger and was able to learn from all of Java’s mistake, but then again, that old canard about life not being fair still holds true. C# is the better language. It has lots of features that simply make it more pleasant to code in. One feature I would’ve killed for in Java while writing a recent project at work is properties. Here’s a sample of the code I wrote in Java:

writeOut(data.getAccount().getContract().getAddress().getAddress1());
writeOut(data.getAccount().getContract().getAddress().getAddress2());
writeOut(data.getAccount().getContract().getAddress().getCity());
writeOut(data.getAccount().getContract().getAddress().getZipCode());
writeOut(data.getAccount().getClient().getCoSigner().getFullName());

And it went on and on for dozens of lines; you get the drift. This is getter and parentheses overload. There’s no real reason the code has to be this messy. And with C#, it isn’t. Here’s how the same code would look in C#:

writeOut(data.Account.Contract.Address.Address1);
writeOut(data.Account.Contract.Address.Address2);
writeOut(data.Account.Contract.Address.City);
writeOut(data.Account.Contract.Address.ZipCode);
writeOut(data.Account.Client.CoSigner.FullName);

And yes, you could accomplish the latter in Java by making all member variables public, but that’s a bad idea. In C# you don’t have to make all of the member variables public to do this — you simply define them as properties, which allows for fine-grained control on who can get and set each property, and without all of the messiness of having to corral dozens of different getter and setter functions for each member variable.

So if nothing else mattered, I would recommend and use C# to the exclusion of Java. But since other issues matter a lot more than programming conveniences, like software freedom, I do still recommend and use Java to the exclusion of C#. But Microsoft did put up a good effort.

A first-hand lesson in software optimization

Sunday, October 5th, 2008

The player is a striped bass, one of the arch predators in the bay.

The player is a striped bass, one of the arch predators in the bay.


One of my major computer science projects in college was creating an educational videogame for elementary school children called A Day in the Bay. We had eight people on the team, though only three of us (myself included) were programmers. It was a self-directed project, and we pretty much figured out things on our own as we went along. I’m not kidding about that last part — check out the CVS revisions for all the gory details. We not-so-seamlessly morphed our custom game engine through a couple completely different game types as our project evolved, and by the time we completed the final project, the overall structure of the code was rather incompatible with how it was being used.

But despite all of the problems, the most important programming lesson I learned from all of college came out of A Day in the Bay. That lesson was how to seriously optimize software, and it would’ve been entirely impossible to learn in the classroom, or even in a month-long programming project.

When we started writing our game, we were very laid back about it. We developers had gaming-caliber computers, and we only cut back on features when the game started lagging for us. So you can imagine our shock when we first tried testing the game on the many-year-old computers in a local elementary school’s computer lab. Oh, and our game was written in Java. Beginning to see some problems now?

The game was completely unplayable. We’re talking frame rates in the low single digits. We had to heavily optimize the game for performance in a short period of time. We did so by cutting out most of the game’s simulationist features. Our game, as originally envisioned, simulated an entire cross-section of the Chesapeake Bay, including plants, small fish, the larger fish eating them, etc. We had to keep track of thousands of fish and have all of them take actions in real time. We even had a state diagram for each fish which determined if it was running for its life, hunting for food, or just chilling out. Needless to say, it wasn’t working on the school system’s hardware.

So our first optimization was to establish a virtual boundary that was roughly one-and-a-half screen lengths in diameter around the viewable area. No processing time was spent on any of the creatures outside that rectangle. They were frozen, and only simulated when the player swam near them. This optimization wasn’t enough, though — we still had to keep track of thousands of in-game objects, which was overflowing the memory available to us on the classroom computers.

Read the rest of this entry »

Drawing enlightening parallels between computer science and bacon

Wednesday, September 17th, 2008

My housemate has become rather obsessed with bacon in recent days. He’s been reading the Bacon Subreddit religiously and filling me in on all of the breaking news and happenings in the burgeoning online bacon community. This past weekend, the bacon fever reached critical levels, because he went to the grocery store, bought a lot of bacon, and started cooking it. This was a big step for him, trust me.

So the first time he made the bacon he put some olive oil in the pan first (I know, bacon purists are wincing at the travesty). But by the time the bacon was well underway, it had already made so much of its own grease to cook in that the puny amount of olive oil was proven completely unnecessary. Bacon thus contains everything within it necessary to cook it. How amazing is that? So the second time he made bacon, he didn’t bother with oil. He just turned the heat down at the beginning while the bacon was generating its own grease (so as to not burn it), then it was off to the races.

In other words, to borrow a term from computer science, bacon is self-hosting. Much like how the compilers of mature programming languages are written in the language they compile, bacon produces all of the greases necessary to cook itself. Bacon isn’t just scrumptious, it’s a smart choice.

But that’s not only the comparison that can be drawn to computer science. Bacon can also bootstrap other bacon. Too impatient to wait for bacon to generate its own grease to cook in? Just save up the bacon grease from the previous bacon cooking session in a Tupperware container and use it to jump-start the next batch. The parallels here with using an already-compiled compiler executable to compile the first compiler on a new system are uncanny, are they not?

I’m sure there are lots more analogies to be drawn between bacon and computer science, and I’d love to see them fleshed out in the comments below. In fact, bacon would make a great standard analogy to be used in introductory computer science classes, alongside Alice and Bob and “think of your computer as an automobile”. Not only is this a great idea because bacon is easily understood by everyone, but also because it’s delicious.

Quantity trumps quality

Tuesday, August 5th, 2008

Jeff Atwood relates an insightful anecdote about quantity over quality that you may initially find counter-intuitive. A pottery-making class was broken up into two groups, with one half graded on the basis of creating a single perfect pot while the other half graded on how much pottery they produced (it was literally weighed, and grades given out for ranges in pounds). At the end of the class, the group that was producing the best pottery was the group that was going for quantity, because they had created such a large number of pots that the experience they gained overshadowed the pain-staking analysis of the other group.

Naturally, there are all sorts of parallels between this anecdote and many other areas, but I’d really like to relate it to software engineering. I’ve always been a “get straight to coding” kind of guy, doing just the bare minimum of planning necessary to start coding, and then writing the documentation along with the code. And after many years of doing this, my code consistently turns out pretty well. A big reason, I suspect, is simply because this approach leads to so much coding. I write programs for all sorts of little random fun things that I would never get around to if I had to spend a bunch of time painstakingly planning out each program beforehand. The best way to become a better coder is not to plan out how to do it, it’s to actually do it, a maxim which also applies to any other activity, including pottery.

Code commenting: one of the casualties of outsourcing

Monday, July 21st, 2008

During college I worked as a computer programmer intern at the National Institute of Standards and Technology. I had the opportunity to work on all sorts of nifty cutting-edge physics simulations using some serious science. Unfortunately, everything was written in VB 6, C++ .NET, or Fortran, but you can’t have it all, and .NET is actually pretty decent compared to some of the alternatives.

One of the programs I worked on was originally written by a Korean researcher working at NIST, thus technically not making it outsourcing, but the problems I’m about to describe are relevant nonetheless. The code was rather hard to understand, especially the variable names, which followed some kind of naming convention that was completely foreign to me. Luckily, the code was actually decently commented. In Korean. Not that it would’ve helped me if I was able to read Korean, because sometime between the original writing of the code and when it got to me, all of the nice UTF-8 comments were corrupted down to ASCII-128. So they appeared as complete gibberish that wouldn’t be understandable by anyone — if you’ve ever viewed binary executable data as text, you know what I’m talking about.

My best guess is that another American maintenance programmer before me edited the program in an IDE that wasn’t set up to understand UTF-8. He must’ve not noticed when all of the nicely formatted Korean comments turned into gibberish — or maybe he didn’t care. Either way, by the time the comments got to me, they were thoroughly worthless. Well, not quite. Their presence at least alerted me to sections of the code that required extra attention, because they were generally non-trivial.

Code maintainability is thus one of the biggest casualties of outsourcing. If the coders you’re outsourcing to don’t speak English, or if they at least don’t bother to comment the code in English, you’ll be facing significantly higher code maintenance costs down the line. That’s just something to keep in mind. In the long run, you save money by hiring local programmers. At least that’s the official line I’m sticking with, seeing as how doing so directly benefits me (hey, did I ever say I wasn’t a biased blogger?).

A better solution to the FizzBuzz interview problem

Wednesday, July 2nd, 2008

Many months ago, I wrote about a simple programming problem that I was administering to interviewees at work to assess their programming skills. The basic problem is this: loop through a range of integers, outputting different strings depending on whether each integer is divisible by one given number, by another given number, by both given numbers, or by neither. It’s a very basic weed-out problem that can be solved easily by any applicant with a basic understanding of control structures and modular arithmetic.

Last week, the comments section on that post erupted with new activity. I posted my own solution, which I was then criticized for because it used more than two division operations per integer (and as you should know, division is by far the most expensive basic arithmetic operation on a computer). So I got to brainstorming and I quickly came up with a solution that uses two divisions total, no matter how large the range is. I’m pretty convinced that my solution is very close to optimal (minus some minor fudging regarding how the if-statements are laid out and how the modular incrementing is handled). Here’s the solution in Java, simply because Java is the language most job applicants seemed to want to solve the problem in:

public class FizzBuzz {
	// I got lazy and coded in the parameters as constants instead of as arguments.
	static final int x = 3; // The first integer to test divisibility on
	static final int y = 5; // The second integer to test divisibility on
	static final int a = 27; // The starting number in the range to process
	static final int b = 74; // The ending number in the range to process

	public static void main(String[] args) {
		int x1 = a % x; // These are our only
		int y1 = a % y; // two divides!!
		for (int i = a; i <= b; i++) {
			if (x1 == 0) System.out.print("Fizz");
			if (y1 == 0) System.out.print("Buzz");
			if (x1 != 0 && y1 != 0) System.out.print(i);
			System.out.println("");
			x1 = ( x1 == x - 1 ? 0 : x1 + 1);
			y1 = ( y1 == y - 1 ? 0 : y1 + 1);
		}
	}
}

Pretty nifty, eh? Instead of having to perform expensive division operations on each integer in the loop, we’re simply incrementing two modular counters. This solution immediately suggests a good part two for the FizzBuzz interview question. After the applicant demonstrates basic familiarity with how programming works by writing up the naive solution, test their analytical abilities by asking them to come up with a more efficient version of the same program. If they don’t know it already, explain that division operations are expensive, and then ask them to minimize the total number of division operations necessary.

There will be a fair number of people who can get through the first part of the problem but won’t be able to get through the second part of the problem without a lot of hints. It’s these people that you may want to avoid hiring, because being able to improve on the naive algorithm and find more optimized ways of doing things is very important in programming. So, if you happen to be in a hiring position at a programming company, do try this problem out, and let me know how it goes!

In search of stream-based desktop metaphors

Thursday, June 26th, 2008

I just ran across an excellent article comparing two competing desktop worldviews, documents and streams. The author argues that everything in our desktop environments is set up to support a document-based metaphor, when actually what is more relevant to the majority of our work these days is streams. He makes a very persuasive argument:

The prevailing UI paradigm today is built around the notion of document authoring. It expects that the main thing you do is create spreadsheets, word documents, presentations, and so on. There is a task bar to remind you of what documents you’re editing, there is cross-application cut and paste so you can put pieces of one document into another. You can place documents on your desktop surface itself, so you can organize your work. You can define which applications to use for which types of docs. You can set up a default printer to put your documents to hard copy. You can set up system-wide fonts to use in documents. You can put icons to apps and even documents onto your panel. And on and on. [...]

Really, what I mostly do today is stream management. And I suspect this is true for the vast majority of people. I don’t deal with writing documents, but with changes to documents. I put comments onto things. I slap patches onto things. I tweak the states of things. Once in a rare while I may author a completely new thingee, but even there I usually end up working with it as a stream of changes that I build up over time (and usually in collaboration with a few other people who stream changes to me).

I’m sold.

The problem is, our virtual desktops (and pretty much all OSes fail equally at this) do not support stream-centric interfaces to data. I can create discrete files just fine, even organize them into nice little directories, but what about my precious streams? I’m talking about my constantly updating server logs, the weather, stocks quotes, news, emails, instant messages, IRC messages, downloads, and more. Everything is handled separately and discordantly.

I can use an ugly hackish little program that outputs system log tails directly to my desktop. I have a Firefox plugin that tells me the current weather and a couple days’ forecasts. My investing service offers a streaming stock quote desktop application, but it only runs on Windows. Mozilla Thunderbird and Azureus pop up email notices and download completion notices, respectively. Instant messages are handled by Pidgin while incoming IRC messages are handled by X-Chat, both of which blink in my taskbar. As for the news — I can use a KDE plugin called Knewsticker that snarfs up RSS feeds. And I haven’t yet found a good way to track, say, SVN commits to the pyWikipediaBot project, so I’m stuck with getting a new email on every commit. Brilliant.

Read the rest of this entry »