Archive for the 'Programming' Category

Python’s negative list indexing is confusing and shouldn’t be taught to novice programmers

Saturday, March 8th, 2014

I’ve been programming for a long time, in a wide variety of languages, but never did I see negative list indexing until I learned Python nine years ago to play around with making automated edits on Wikipedia. For those who don’t know, negative list indexing allows you to index into a list from the end instead of the beginning. Whereas arr[0] returns the first item in a list, arr[-1] returns the last item in a list, arr[-2] the penultimate item, etc. It seemed like such a no-brainer; why haven’t all languages implemented this feature? It’s useful, right? That’s what I thought until I recently interviewed a spate of internship program candidates for the upcoming summer, most of whom are learning programming in Python, and I saw firsthand the confusion that negative list indexing causes in novice developers.

I do a fair amount of interviewing at my job, with an average of three or so interviews per week. We’re more often hiring developers than not. Now that I’m leading a development team, I have a huge vested interest in finding the best possible talent because inferior developers directly affect the quality of my team’s output, thus making me look bad. The main thing I focus on during interviews is programming problems. You’d be amazed at how many people do just fine in the talking portion but then fail miserably when asked to actually code something. So I use a progression of increasingly difficult programming problems, the first of which is usually “Write a function that takes a sentence and a number as input, and returns the re-arranged sentence with the specified number of words moved from the back to the front”. So for the example sentence “That is a dark brown fox” and the number 4, the output would be “a dark brown fox That is”. I ask the interviewees to solve the problem in whatever language they’re most comfortable with, because I’d rather get good developers in general than someone with certain specific language skills.

A typical solution in C#, which I get a plurality of responses in, looks like this:

public string Rearrange(string sentence, int numWords)
{
    string[] words = sentence.Split(' ');
    string answer = "";
    for (int i = words.Length - numWords; i < words.Length; i++)
    {
        answer += words[i] + " ";
    }
    for (int j = 0; j < words.Length - numWords; j++)
    {
        answer += words[j] + " ";
    }
    return answer.Trim();
}

It's not an amazing solution; there's a faster and more efficient way to do it that doesn't require splitting the input into a words array at all (I'll leave that as an exercise for the reader). But it gets the job done, and most people end up thinking about the problem in this way. Yet a curious thing happened when I started interviewing current undergraduate students for the intern positions: most of them used Python, and every one of the Python solutions used negative list indexing and went off the rails because of it. I understand why they immediately thought to use negative indexing; we are asking them to grab things off the end of a sentence, after all. But it consistently sowed confusion, never clarity.

There were two main problems that I saw when people used negative list indexing. One problem was that the interviewees would start with -1 and then decrement to -2, then -3, etc. This seems logical, but you end up reversing the order of the words in the rear segment, such that your output sentence from the above example becomes "fox brown dark a That is". What you really need to do is count "down" through the words in reverse, in the order -4, -3, -2, -1. But that's a bit harder to wrap your mind around than avoiding negative indexes entirely and simply counting up through the words in the order 2, 3, 4, 5. In particular I saw a lot of reversed parameters to the range function, which ends up executing the inside of the for-loop zero times. The problem looks like it's dealing with the end of the list, but it's really not; it's dealing with the middle of the list and counting up through to the end of it, for which using positive indexes is a lot more handy. But as soon as you say "end" to a novice programmer who's learned the "trick" of negative list indexes, that's where their mind immediately goes, and they tend to get stuck.
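
The reversed-range pitfall is easy to demonstrate in a few lines; here's a minimal sketch (the variable names are mine, not any candidate's):

```python
words = "That is a dark brown fox".split()
numWords = 4

# Reversed bounds: range() counts upward by default, so this produces
# an empty sequence and the loop body never runs.
assert list(range(-1, -numWords)) == []

# Correct bounds: count up from -numWords to -1 (the 0 stop is exclusive).
assert [words[i] for i in range(-numWords, 0)] == ["a", "dark", "brown", "fox"]
```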

The second main problem with negative list indexing is that it doesn't mesh well with positive list indexing, both of which you end up needing to use to solve this problem. For instance, [0] is the first element in a list, and [1] the second, but [-1] isn't the second-to-last element in the list; it's the last one. This caused a lot of confusion: positive list indexes are zero-indexed but negative list indexes are one-indexed. Fencepost errors abounded, or what you might even call reverse-fencepost errors, where interviewees expressed the index of the start of the second segment as either [-numWords + 1] or [-numWords - 1] (both wrong, in different directions) when it should simply be [-numWords]. The interviewees simply didn't have enough experience using the two together to understand how they worked in unison, whereas when I prompted them to write the boundary conditions of both for-loops using positive indexing, the mental confusion usually went away and they were able to come up with a workable solution.
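
A few lines in the interpreter show the mismatch; positive indexes are zero-based while negative ones are effectively one-based:

```python
words = "That is a dark brown fox".split()
numWords = 4

assert words[0] == "That"   # first element: index 0
assert words[-1] == "fox"   # last element: index -1 (there is no -0)

# The rear segment of numWords words starts exactly at [-numWords];
# no +1 or -1 adjustment is needed.
assert words[-numWords] == "a"
```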

For the record, here is a correct solution in Python that uses negative list indexing. It doesn't look that tricky when you see it presented as an answer, but trust me when I say that no one was coming up with it on their own without access to a Python interpreter. The boundary condition of 0 on the first range in particular was highly unintuitive to all of the interviewees.

def rearrange(sentence, numWords):
    words = sentence.split()
    answer = ""
    # Rear segment: indexes -numWords up through -1 (the 0 stop is exclusive).
    for i in xrange(-numWords, 0):
        answer += words[i] + " "
    # Front segment: indexes 0 up through len(words) - numWords - 1.
    for i in xrange(0, len(words) - numWords):
        answer += words[i] + " "
    return answer.strip()
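
For what it's worth, the idiomatic Python answer sidesteps index arithmetic entirely with slicing, though none of the interviewees reached for it either. A sketch:

```python
def rearrange(sentence, numWords):
    words = sentence.split()
    # words[-numWords:] is the rear segment, words[:-numWords] the front.
    return " ".join(words[-numWords:] + words[:-numWords])

assert rearrange("That is a dark brown fox", 4) == "a dark brown fox That is"
```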

And there's one final point that I can't help but bring up: reverse traversal of lists is cache-inefficient. Granted, it doesn't matter for this problem, and it probably won't matter for most things you do in Python, but if you're trying to write highly performant code in C/C++, say for scientific computation or a game engine, you're going to cause a lot more cache misses by trawling through memory in reverse order.

Using LINQ in C# to easily get a list of most often used words

Monday, May 20th, 2013

A pretty common programming interview question is to parse an input sentence and return the list of unique words used in that sentence. A further elaboration on that problem, the one that this post will be addressing, is to additionally calculate the number of occurrences of each word, and then return the top K words for some input value K. I’ll be demonstrating a simple solution to that problem in C#, both because I’ve been using it a lot recently and also because the choice of C# gives us access to LINQ, which is a powerful C# language feature that allows queries on collections using a SQL-like syntax. The top K problem is incredibly easy to solve in SQL, and boils down to SELECT TOP @K ..... FROM Words ORDER BY OccurrenceCount DESC. The C# solution is similarly easy.

First, I’ll clarify the assumptions that I’m using (and that would be wise for an interviewee to address if none of these are made explicit):

  • I’m writing my solution to be case-insensitive. “case”, “Case”, and “CASE” will all thus count as the same word.
  • I’m not dealing with ties. If you ask for the top 3 words by occurrence and there are 5 words that are all used the most in equal numbers, then you’re still only going to get three words of those five, selected in no particular manner.
  • I’m going to use the C# language spec’s definition of a non-word character in regular expressions to separate the input sentence into words. The naive solution would be to split the string on only spaces, but then you’re not handling punctuation correctly.
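
One subtlety worth flagging with the non-word-character split (illustrated here in Python, since the behavior mirrors C#'s Regex.Split): consecutive separators, such as a comma followed by a space, produce empty strings in the result, and those have to be filtered out or they get counted as a "word":

```python
import re

pieces = re.split(r"\W", "The quick, brown fox!")
# The comma/space pair and the trailing "!" each leave an empty string.
assert pieces == ["The", "quick", "", "brown", "fox", ""]

# Filtering out the empty strings gives the actual word list:
words = [p for p in pieces if p]
assert words == ["The", "quick", "brown", "fox"]
```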

And for the solution:

Dictionary<string, int> GetTopKWords(string input, int k)
{
	string[] words = Regex.Split(input, @"\W");
	var occurrences = new Dictionary<string, int>();

	foreach (var word in words)
	{
		// Regex.Split leaves empty strings between adjacent separators; skip them.
		if (word.Length == 0)
			continue;
		string lowerWord = word.ToLowerInvariant();
		if (!occurrences.ContainsKey(lowerWord))
			occurrences.Add(lowerWord, 1);
		else
			occurrences[lowerWord]++;
	}
	return occurrences.OrderByDescending(kvp => kvp.Value)
		.Take(k)
		.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
}

The vast majority of this code is responsible simply for finding the list of unique words and the number of occurrences of each one. Once that is known, finding the top K words is a single line of code thanks to the power of LINQ. All we’re doing is ordering the words by frequency in descending order and taking the top K. Without LINQ, there would be significantly more book-keeping code required to do this, which would make a good exercise for the reader (e.g. solve this problem in Java). The first roadblock you’ll probably run into is that you can’t simply flip the keys and values of the dictionary, because the frequency counts aren’t unique. The best I’ve come up with is to construct a list of tuples from the dictionary of words, order it on the occurrence-count part of each tuple, and then extract the first K elements from the resulting ordered list and return them. Let me know in the comments if you have a better solution.
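
As a point of comparison (not the Java exercise suggested above), Python gets the same brevity from collections.Counter, whose most_common method does the ordering and truncation in one call. A sketch of an equivalent solution:

```python
import re
from collections import Counter

def get_top_k_words(text, k):
    # Lowercase, split on runs of non-word characters, and drop empties.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    # most_common(k) returns the k highest-count (word, count) pairs.
    return dict(Counter(words).most_common(k))
```

Ties are broken arbitrarily here too, matching the assumption stated above.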

Oh, and here’s an example input/output for the program, handling the display of output using LINQPad:

var input = "the quick brown fox is brown and jumps over the brown log over the long fire and quickly jumps to a brown fire fox";
GetTopKWords(input, 10);

outputs:

[Screenshot: Top K Words sample run]

Fixing an error with being unable to add user fields in Drupal 7

Wednesday, April 4th, 2012

Drupal 7 has a nice built-in Fields functionality which can be used to add fields to any entity. As applied to users, this replaces the previous Profile module which was used in Drupal 6 to add fields to users. However, right after creating a new Drupal 7 site, I couldn’t figure out how to add fields to users.

I followed these instructions:

Just go to:

Configuration > People > Account Settings

Then click on the Manage Fields tab, and then just manage the fields just like you would for a Content Type.

But there wasn’t a “Manage Fields” tab on that page. Going to the URL directly, admin/config/people/accounts/fields, was redirecting me back to the Account Settings page. After banging my head against the wall for ten minutes I finally realized that the reason I couldn’t see this tab was because the “Field UI” module wasn’t enabled. Go to the List Modules page, enable that module (and its dependencies), and now you should be able to add fields to users.

Reminiscing about the naïve, spam-free days of the web

Tuesday, July 21st, 2009

Remember a long time ago when the web was free of spam? I’m not talking about email, which has had spam problems for a while; I’m talking about the web. Nowadays, the web is festering with link-crawling spambots. Anyone with a blog, Twitter account, or heck, even a webpage with a simple submit form with some text fields on it, knows this. There’s not much that can be done about it besides spam-detection heuristic algorithms and CAPTCHAs.

Well, I just recently found some code that I wrote way back in 2002 that betrays a blissful naïveté about what was to come. That code was part of my website Fyre’s Domain, which I have since put an archived copy of online. I had just been learning Perl CGI and I wanted to write a simple guestbook/comments form that readers could use to give me feedback without having to use a mailto: link. This was in the era before blogging software was commonplace — what I was running was a home-brew blog, but before the word “blog” was even invented. I basically copied the format from one of the first chat rooms I ever used, Paddynet, way back in 1995 or so. The “chat room” consisted of an HTML form that would dump unvalidated input (including HTML tags) into a chat buffer displayed on the page, showing the last 30 or so messages.

Paddynet was around long before spambots, but my site was started right when they began appearing in the wild, and the code proceeded to run for another 7 years until I just shut it off.

You can probably guess what happened.

The only reason I even re-discovered this code is because I happened to notice it was getting an unusual number of hits in my web analytics software. And those hits were anything but benign. My poor naïve Perl CGI comments submission form has accumulated 26 MB worth of text over the years, all of it spam. And since I figure it may be interesting to someone to see exactly what seven years of web spam looks like, you can download it for yourself (a text file thankfully compressed down to just 1.8 MB). If anyone finds any interesting trends in popular spam topics over the years in there, do let me know.

So those are the dangers of trusting user input on the web these days. Revel in the blissful simplicity of the following code, which was all it took to implement a comment submission system back in the day. Nowadays you couldn’t get away with anything even close to it. As my data proves, you’ll be eaten alive by spambots.

#!/usr/bin/perl

use CGI qw(:standard);

print header;
print start_html('Leave Comments on Fyre'),
	h1('Leave Comments on Fyre'),
	start_form,
	"<i>Note, all fields are optional, but empty comments will be ignored.</i><br>", 
	"Name: ", textfield(-name=>'name',-default=>''),
	"E-mail: ", textfield(-name=>'e-mail',-default=>''),
	"Your Comments: <br>", textarea(-name=>'Comments',-rows=>10,-columns=>50,-default=>''),
	'<br>',
	submit('Submit'), reset,
	end_form,
	p,
	hr;

if (param() && param('name') ne '' && param('Comments') ne '') {
	$date = `date '+%H:%M %m/%d/%Y'`;

	print '<i>Your comment has been posted.</i><hr><br>';
	@foo = "\n\n" . '<br><b>' . param('name') . '</b> ' . "\n" .
	'<u>' . param('e-mail') . '</u> ' . "\n" . '<i>' . $date . '</i>' . 
	"\n" . '<table><tr><td width = "100%">' . param('Comments') . 
	'</td></tr></table><hr>';
	push @foo, `cat mk.txt`;
	open CFILE, ">mk.txt" or die "Failed to open comments file!";
	print CFILE @foo;
	close CFILE;
}

@foo = `cat mk.txt`; print @foo;

print 'This program is open source, and released under the GPL by Ben McIlwain, 2002.  See 
the source <a href = "mk_script.txt">here</a>.';
print end_html;

A Python script to auto-follow all Twitter followers

Tuesday, March 10th, 2009

In my recent fiddling around with Twitter I came across the Twitter API, which is surprisingly feature-complete. Since programming is one of my hobbies (as well as my occupation), I inevitably started fooling around with it and have already come up with something useful. I’m posting it here, so if you need to do the same thing that I am, you won’t have to reinvent the wheel.

One common thing that people do on Twitter is they follow everyone that follows them. This is good for social networking (or just bald self-promotion), as inbound links to your Twitter page show in the followers list of everyone that you’re following. You’d think Twitter itself would have a way to do this, but alas, it does not. So what I wanted to do is use a program to automatically follow everyone following me instead of having to manually follow each person.

Other sites that interface with Twitter will do it for you (such as TweetLater), but I’m not interested in signing up for another service, and I’m especially not interested in giving out my Twitter login credentials to anyone else. So I needed software that ran locally. A Google search turned up an auto-follow script written in Perl, but the download link requires registration with yet another site. I didn’t want to do that so I decided to program it for myself, which ended up being surprisingly simple.

My Auto-Follow script is written in Python. I decided to use Python because of the excellent Python Twitter library. It provides an all-Python interface to the Twitter API. You’ll need to download and install Python-Twitter (and its dependency, python-simplejson, if you don’t have it already; sudo apt-get install python-simplejson does the trick on Ubuntu GNU/Linux). Just follow the instructions on the Python-Twitter page; it’s really simple.

Now, create a new Python script named auto_follow.py and copy the following code into it:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#(c) 2009 Ben McIlwain, released under the terms of the GNU GPL v3.
import twitter
from sets import Set

username = 'your_username'
password = 'your_password'
api = twitter.Api(username=username, password=password)

following = api.GetFriends()
friendNames = Set()
for friend in following:
    friendNames.add(friend.screen_name)

followers = api.GetFollowers()
for follower in followers:
    if (not follower.screen_name in friendNames):
        api.CreateFriendship(follower.screen_name)

Yes, it really is that simple. I’d comment it, but what’s the point? I can summarize its operation in one sentence: It gets all of your friends and all of your followers, and then finds every follower that isn’t a friend and makes them a friend. Just make sure to edit the script to give it your actual username and password so that it can sign in.

Run the script and you will now be following all of your followers. Pretty simple, right? But you probably don’t want to have to keep running this program manually. Also, I’ve heard rumors that the Twitter API limits you to following 70 users per hour (as an anti-spam measure, I’m guessing), so if you have more than 70 followers you’re not following, you won’t be able to do it all at once. Luckily, there’s a solution for both problems: add the script as an hourly cronjob. This will keep who you follow synced with your followers over time, and if you have a large deficit in who you follow at the start (lucky bastard), it’ll slowly chip away at it each hour until they do get in sync. In Ubuntu GNU/Linux, adding the following line to a text file in /etc/cron.d/ (as root) should do it:

0 * * * * username python /path/to/auto_follow.py >/dev/null 2>&1

This will run the auto_follow script at the top of each hour. You’ll need to set the username to the user account you want the job to run under — your own user account is fine — and set the path to wherever you saved the auto_follow script. Depending on your GNU/Linux distribution and which cron scheduler you have installed, you may not need the username field, and this line might go in a different file (such as /etc/crontab). Refer to your distro’s documentation for more information.

So that’s it. That’s all it takes to automatically auto-follow everyone who’s following you — a dozen or so lines of Python, one crontab entry, and one excellent library and API. Enjoy.

Here’s a pretty bad Unicode WTF

Tuesday, March 3rd, 2009

I’m doing some research on Unicode and compression algorithms right now for a side-project I’m working on, and I came across a highly ranked Google search result for a UTF-8 munging code snippet that is so idiotic I couldn’t let it pass without comment. If this post helps even one person who would’ve otherwise followed the linked advice, it is worth it.

First, some background. UTF-8 is a character encoding format that can pretty much handle any character under the Sun, from the English alphabet to Japanese kanji to obscure extinct languages. It even includes thousands of esoteric symbols used in smaller fields of study that you’ve probably never even heard of before. But the nice thing about UTF-8 is that it is variable-length. Standard ASCII characters (including everything on a standard English keyboard) take only one byte to represent. Common characters from most other alphabetic scripts take two bytes, most CJK characters take three, and only characters outside the Basic Multilingual Plane need four.
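
The variable-length property is easy to verify for yourself; here's a quick illustration in Python (the byte counts are a property of UTF-8 itself, not of any particular language):

```python
# ASCII characters encode to a single byte.
assert len("A".encode("utf-8")) == 1
# Accented characters from Latin scripts take two bytes.
assert len("é".encode("utf-8")) == 2
# CJK characters such as Japanese kanji take three bytes.
assert len("漢".encode("utf-8")) == 3
# Characters outside the Basic Multilingual Plane take four.
assert len("𝄞".encode("utf-8")) == 4
```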

So now you see why the linked “solution” is so stupid. This guy says he is “designing a little client/server binary message format” and wants “a simple and quick way to encode strings”. Well, duh — use UTF-8, no ifs, ands, or buts about it. It’s simple, quick, and already implemented in any programming language you can think of, so it requires no additional coding. There are all sorts of dumb ways to unnecessarily reinvent the wheel in software engineering, but trying to come up with your own character encoding is particularly idiotic. It’s really tricky to get right because there are so many corner cases you’ll never even know existed until they cause your application to break. The Unicode Consortium exists for a reason — what they do is hard.

This guy even confesses that his expected input will probably not contain Unicode characters that are longer than 2 bytes. So there is no justification at all for what he does next — he creates a mangled version of UTF-8 that turns all Unicode characters 3 bytes and longer into question marks, instead of just leaving them as is. So instead of allowing a rare character to take an additional byte or two, it gets mangled. And to accomplish this, he has to create his own custom encoding solution that is an intentionally broken version of UTF-8. That’s the worst part — he’s wasting time creating completely unnecessary code, that will need to be maintained, that will need to be debugged — and for what?

Of course, none of the people responding to his thread point out that what he is trying to do is stupid. They just smile and hand him some rope.

What C# does better than Java

Thursday, October 16th, 2008

I spend 90% of the development time at my job using either Java or C# .NET, with a roughly equal split between the two. So I know a fair amount about both languages — enough to feel qualified to comment on the differences between them, anyway. Now being a Free Software guy, my obvious preference is for Java, which is actually Free as in freedom (finally, anyway) and runs on a large variety of platforms. Given the choice of which to use on my personal projects, Java is a no-brainer. The best IDE for Java, Eclipse, is absolutely Free. The best IDE for C#, Visual Studio, is … well, it’s several hundred dollars and proprietary to boot. And it has the limitation of not running on or compiling for GNU/Linux; since I use Ubuntu as my home desktop operating system, that’s a deal breaker.

But just on a pure comparison between the languages, I have to say that C# is the better of the two. It’s not a fair comparison because C# is many years younger and was able to learn from all of Java’s mistakes, but then again, that old canard about life not being fair still holds true. C# is the better language. It has lots of features that simply make it more pleasant to code in. One feature I would’ve killed for in Java while writing a recent project at work is properties. Here’s a sample of the code I wrote in Java:

writeOut(data.getAccount().getContract().getAddress().getAddress1());
writeOut(data.getAccount().getContract().getAddress().getAddress2());
writeOut(data.getAccount().getContract().getAddress().getCity());
writeOut(data.getAccount().getContract().getAddress().getZipCode());
writeOut(data.getAccount().getClient().getCoSigner().getFullName());

And it went on and on for dozens of lines; you get the drift. This is getter and parentheses overload. There’s no real reason the code has to be this messy. And with C#, it isn’t. Here’s how the same code would look in C#:

writeOut(data.Account.Contract.Address.Address1);
writeOut(data.Account.Contract.Address.Address2);
writeOut(data.Account.Contract.Address.City);
writeOut(data.Account.Contract.Address.ZipCode);
writeOut(data.Account.Client.CoSigner.FullName);

And yes, you could accomplish the latter in Java by making all member variables public, but that’s a bad idea. In C# you don’t have to make all of the member variables public to do this — you simply define them as properties, which allows for fine-grained control on who can get and set each property, and without all of the messiness of having to corral dozens of different getter and setter functions for each member variable.

So if nothing else mattered, I would recommend and use C# to the exclusion of Java. But since other issues matter a lot more than programming conveniences, like software freedom, I do still recommend and use Java to the exclusion of C#. But Microsoft did put up a good effort.

A first-hand lesson in software optimization

Sunday, October 5th, 2008

The player is a striped bass, one of the arch predators in the bay.

One of my major computer science projects in college was creating an educational videogame for elementary school children called A Day in the Bay. We had eight people on the team, though only three of us (myself included) were programmers. It was a self-directed project, and we pretty much figured out things on our own as we went along. I’m not kidding about that last part — check out the CVS revisions for all the gory details. We not-so-seamlessly morphed our custom game engine through a couple completely different game types as our project evolved, and by the time we completed the final project, the overall structure of the code was rather incompatible with how it was being used.

But despite all of the problems, the most important programming lesson I learned from all of college came out of A Day in the Bay. That lesson was how to seriously optimize software, and it would’ve been entirely impossible to learn in the classroom, or even in a month-long programming project.

When we started writing our game, we were very laid back about it. We developers had gaming-caliber computers, and we only cut back on features when the game started lagging for us. So you can imagine our shock when we first tried testing the game on the many-year-old computers in a local elementary school’s computer lab. Oh, and our game was written in Java. Beginning to see some problems now?

The game was completely unplayable. We’re talking frame rates in the low single digits. We had to heavily optimize the game for performance in a short period of time. We did so by cutting out most of the game’s simulationist features. Our game, as originally envisioned, simulated an entire cross-section of the Chesapeake Bay, including plants, small fish, the larger fish eating them, etc. We had to keep track of thousands of fish and have all of them take actions in real time. We even had a state diagram for each fish which determined if it was running for its life, hunting for food, or just chilling out. Needless to say, it wasn’t working on the school system’s hardware.

So our first optimization was to establish a virtual boundary, roughly one and a half screen lengths across, around the viewable area. No processing time was spent on any of the creatures outside that rectangle. They were frozen, and only simulated when the player swam near them. This optimization wasn’t enough, though — we still had to keep track of thousands of in-game objects, which was overflowing the memory available to us on the classroom computers.
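
In sketch form, the boundary optimization amounted to a cheap rectangle test before each creature's (expensive) update. The names and numbers below are illustrative, not taken from the actual codebase:

```python
SCREEN_W, SCREEN_H = 800, 600
# The active region extends 1.5 screen lengths around the player.
ACTIVE_W, ACTIVE_H = SCREEN_W * 1.5, SCREEN_H * 1.5

def is_active(creature_x, creature_y, player_x, player_y):
    """Only creatures inside the active rectangle get simulated this frame."""
    return (abs(creature_x - player_x) <= ACTIVE_W / 2 and
            abs(creature_y - player_y) <= ACTIVE_H / 2)

# A creature half a screen away is simulated; one three screens away is frozen.
assert is_active(400, 0, 0, 0)
assert not is_active(2400, 0, 0, 0)
```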


Drawing enlightening parallels between computer science and bacon

Wednesday, September 17th, 2008

My housemate has become rather obsessed with bacon in recent days. He’s been reading the Bacon Subreddit religiously and filling me in on all of the breaking news and happenings in the burgeoning online bacon community. This past weekend, the bacon fever reached critical levels, because he went to the grocery store, bought a lot of bacon, and started cooking it. This was a big step for him, trust me.

So the first time he made the bacon he put some olive oil in the pan first (I know, bacon purists are wincing at the travesty). But by the time the bacon was well underway, it had already rendered so much of its own grease to cook in that the puny amount of olive oil proved completely unnecessary. Bacon thus contains everything within it necessary to cook it. How amazing is that? So the second time he made bacon, he didn’t bother with oil. He just turned the heat down at the beginning while the bacon was generating its own grease (so as to not burn it), then it was off to the races.

In other words, to borrow a term from computer science, bacon is self-hosting. Much like how the compilers of mature programming languages are written in the language they compile, bacon produces all of the greases necessary to cook itself. Bacon isn’t just scrumptious, it’s a smart choice.

But that’s not the only comparison that can be drawn to computer science. Bacon can also bootstrap other bacon. Too impatient to wait for bacon to generate its own grease to cook in? Just save up the bacon grease from the previous bacon cooking session in a Tupperware container and use it to jump-start the next batch. The parallels here with using an already-compiled compiler executable to compile the first compiler on a new system are uncanny, are they not?

I’m sure there are lots more analogies to be drawn between bacon and computer science, and I’d love to see them fleshed out in the comments below. In fact, bacon would make a great standard analogy to be used in introductory computer science classes, alongside Alice and Bob and “think of your computer as an automobile”. Not only is this a great idea because bacon is easily understood by everyone, but also because it’s delicious.

Quantity trumps quality

Tuesday, August 5th, 2008

Jeff Atwood relates an insightful anecdote about quantity over quality that you may initially find counter-intuitive. A pottery-making class was broken up into two groups, with one half graded on creating a single perfect pot while the other half was graded on how much pottery they produced (it was literally weighed, and grades given out for ranges in pounds). At the end of the class, the group producing the best pottery was the group going for quantity, because they had created such a large number of pots that the experience they gained overshadowed the painstaking analysis of the other group.

Naturally, there are all sorts of parallels between this anecdote and many other areas, but I’d really like to relate it to software engineering. I’ve always been a “get straight to coding” kind of guy, doing just the bare minimum of planning necessary to start coding, and then writing the documentation along with the code. And after many years of doing this, my code consistently turns out pretty well. A big reason, I suspect, is simply because this approach leads to so much coding. I write programs for all sorts of little random fun things that I would never get around to if I had to spend a bunch of time painstakingly planning out each program beforehand. The best way to become a better coder is not to plan out how to do it, it’s to actually do it, a maxim which also applies to any other activity, including pottery.