Python’s negative list indexing is confusing and shouldn’t be taught to novice programmers

Saturday, March 8th, 2014

I’ve been programming for a long time, in a wide variety of languages, but never did I see negative list indexing until I learned Python nine years ago to play around with making automated edits on Wikipedia. For those who don’t know, negative list indexing allows you to index into a list from the end instead of the beginning. Whereas arr[0] returns the first item in a list, arr[-1] returns the last item in a list, arr[-2] the penultimate item, etc. It seemed like such a no-brainer; why haven’t all languages implemented this feature? It’s useful, right? That’s what I thought until I recently interviewed a spate of internship program candidates for the upcoming summer, most of whom are learning programming in Python, and I saw firsthand the confusion that negative list indexing causes in novices developers.

I do a fair amount of interviewing at my job, with an average of three or so interviews per week. We’re more often hiring developers than not. Now that I’m leading a development team, I have a huge vested interest in finding the best possible talent because inferior developers directly affect the quality of my team’s output, thus making me look bad. The main thing I focus on during interviews is programming problems. You’d be amazed at how many people do just fine in the talking portion but then fail miserably when asked to actually code something. So I use a progression of increasingly difficult programming problems, the first of which is usually “Write a function that takes a sentence and a number as input, and returns the re-arranged sentence with the specified number of words moved from the back to the front”. So for the example sentence “That is a dark brown fox” and the number 4, the output would be “a dark brown fox That is”. I ask the interviewees to solve the problem in whatever language they’re most comfortable with, because I’d rather get good developers in general than someone with certain specific language skills.

A typical solution in C#, which I get a plurality of responses in, looks like this:

public string Rearrange(string sentence, int numWords)
{
    string[] words = sentence.Split(' ');
    string answer = "";
    for (int i = words.Length - numWords; i < words.Length; i++)
    {
        answer += words[i] + " ";
    }
    for (int j = 0; j < words.Length - numWords; j++)
    {
        answer += words[j] + " ";
    }
    return answer.Trim();
}

It's not an amazing solution; there's a faster and more efficient way to do it that doesn't require splitting up the input into a words array at all (I'll leave that as an exercise to the reader). But it gets the job done, and most people end up thinking about the problem in this way. Yet a curious thing happened when I started interviewing current undergraduate students for the intern positions: most of them ended up using Python, and all of Python solutions used negative list indexing and ended up going off the rails because of it. I understand why they immediately thought to use negative indexing; we are asking them to grab things off the end of a sentence, after all. But it consistently sowed confusion, and never clarified.

There were two main problems that I saw when people used negative list indexing. One problem was that the interviewees would start with -1 and then decrement to -2, then -3, etc. This seems logical, but you end up reversing the order of the words in the rear segment, such that your output sentence from the above example becomes "fox brown dark a That is". What you really need to do is count "down" through the words in reverse, in the order -4, -3, -2, -1. But that's a bit harder to wrap your mind around than avoiding negative indexes entirely and simply counting up through the words in the order 2, 3, 4, 5. In particular I saw a lot of reversed parameters to the range function, which ends up executing the inside of the for-loop zero times. The problem looks like it's dealing with the end of the list, but it's really not; it's dealing with the middle of the list and counting up through to the end of it, for which using positive indexes is a lot more handy. But as soon as you say "end" to a novice programmer who's learned the "trick" of negative list indexes, that's where their mind immediately goes, and they tend to get stuck.

The second main problem with negative list indexing is that it doesn't mesh well with positive list indexing, both of which you end up needing to use to solve this problem. For instance, [0] is the first element in a list, and [1] the second, but [-1] isn't the second-to-last element in the list; it's the last one. This caused a lot of confusion. Positive list indexes are zero-indexed but negative list indexes are one-indexed. Fencepost errors abounded, or what you might even call reverse-fencepost errors, where interviewees expressed the index of the start of the second segment as either [-numWords +1] or [-numWords +-1] (both wrong but in different ways) when it should simply be [-numWords]. The interviewees simply didn't have enough experience using the two together to understand how they worked in unison, whereas when I prompted them to write the boundary conditions of both for-loops using positive indexing, the mental confusion usually went away and they were able to come up with a workable solution.

For the record, here is a correct solution in Python that uses negative list indexing. It doesn't look that tricky when you see it presented as an answer, but trust me when I say that no one was coming up with it on their own without access to a Python interpreter. The boundary condition of 0 on the first range in particular was highly un-intuitive to all of the interviewees.

def rearrange(sentence, numWords):
    words = sentence.split()
    answer = ""
    for i in xrange(-numWords, 0):
        answer += words[i] + " "
    for i in xrange(0, len(words) - numWords):
        answer += words[i] + " "
    return answer.strip()

And there's one final point that I can't help but bring up: reverse traversal of lists is cache inefficient. Granted it doesn't matter for this problem, and it probably won't matter for most things you do in Python, but if you're trying to write highly performant code in C/C++, like for scientific computation or a game engine, you're going to cause a lot more cache misses by trawling through memory in reverse order.

A Python script to auto-follow all Twitter followers

Tuesday, March 10th, 2009

In my recent fiddling around with Twitter I came across the Twitter API, which is surprisingly feature-complete. Since programming is one of my hobbies (as well as my occupation), I inevitably started fooling around with it and have already come up with something useful. I’m posting it here, so if you need to do the same thing that I am, you won’t have to reinvent the wheel.

One common thing that people do on Twitter is they follow everyone that follows them. This is good for social networking (or just bald self-promotion), as inbound links to your Twitter page show in the followers list of everyone that you’re following. You’d think Twitter itself would have a way to do this, but alas, it does not. So what I wanted to do is use a program to automatically follow everyone following me instead of having to manually follow each person.

Other sites that interface with Twitter will do it for you (such as TweetLater), but I’m not interested in signing up for another service, and I’m especially not interested in giving out my Twitter login credentials to anyone else. So I needed software that ran locally. A Google search turned up an auto-follow script written in Perl, but the download link requires registration with yet another site. I didn’t want to do that so I decided to program it for myself, which ended up being surprisingly simple.

My Auto-Follow script is written in Python. I decided to use Python because of the excellent Python Twitter library. It provides an all-Python interface to the Twitter API. You’ll need to download and install Python-Twitter (and its dependency, python-simplejson, if you don’t have it already; sudo apt-get install python-simplejson does the trick on Ubuntu GNU/Linux). Just follow the instructions on the Python-Twitter page; it’s really simple.

Now, create a new Python script named auto_follow.py and copy the following code into it:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#(c) 2009 Ben McIlwain, released under the terms of the GNU GPL v3.
import twitter
from sets import Set

username = 'your_username'
password = 'your_password'
api = twitter.Api(username=username, password=password)

following = api.GetFriends()
friendNames = Set()
for friend in following:
    friendNames.add(friend.screen_name)

followers = api.GetFollowers()
for follower in followers:
    if (not follower.screen_name in friendNames):
        api.CreateFriendship(follower.screen_name)

Yes, it really is that simple. I’d comment it, but what’s the point? I can summarize its operation in one sentence: It gets all of your friends and all of your followers, and then finds every follower that isn’t a friend and makes them a friend. Just make sure to edit the script to give it your actual username and password so that it can sign in.

Run the script and you will now be following all of your followers. Pretty simple, right? But you probably don’t want to have to keep running this program manually. Also, I’ve heard rumors that the Twitter API limits you to following 70 users per hour (as an anti-spam measure, I’m guessing), so if you have more than 70 followers you’re not following, you won’t be able to do it all at once. Luckily, there’s a solution for both problems: add the script as an hourly cronjob. This will keep who you follow synced with your followers over time, and if you have a large deficit in who you follow at the start (lucky bastard), it’ll slowly chip away at it each hour until they do get in sync. In Ubuntu GNU/Linux, adding the following line to a text file in /etc/cron.d/ (as root) should do it:

0 * * * * username python /path/to/auto_follow.py >/dev/null 2>&1

This will run the auto_follow script at the top of each hour. You’ll need to set the username to the user account you want the job to run under — your own user account is fine — and set the path to wherever you saved the auto_follow script. Depending on your GNU/Linux distribution and which cron scheduler you have installed, you may not need the username field, and this line might go in a different file (such as /etc/crontab). Refer to your distro’s documentation for more information.

So that’s it. That’s all it takes to automatically auto-follow everyone who’s following you — a dozen or so lines of Python, one crontab entry, and one excellent library and API. Enjoy.

My experiences as an open source developer

Monday, January 8th, 2007

In the second part of a series on my experiences that are not widely-shared and hopefully interesting, I will talk about my experiences as an open source software developer (the first part of this series was about being a newspaper columnist).

I am a developer on the Python Wikipedia Bot Framework, which is a collection of programs that perform automated tasks on wikis based on the MediaWiki software. Wikipedia and other Wikimedia projects are by far the largest users of MediaWiki, but there are lots of other ones out there too, and pyWiki is used by lots of people for various tasks.

Before I go over my experiences in-depth, I’ll start with an overview of everything I’ve done on pyWiki. Skip ahead to after the list if these details are too technical.

  • delete.py – A bot that deletes a list of pages. I wrote it from scratch.
  • templatecount.py – A bot that counts how many times any given number of templates are used. I wrote it from scratch.
  • category.py – A bot that handles category renaming and deletion. I made some changes to it and some libraries, catlib.py and wikipedia.py, to make it more flexible, more automated, and to handle English Wikipedia-specific “Categories for discussion” (CFD) tagging.
  • template.py – A bot that handles template substitution, renaming, and deletion. I made some changes to it (and the library pagegenerators.py) to handle operations on multiple templates simultaneously, as well as increasing flexibility. I will admit, I added the capability to delete any number of templates in one run with the hope that I would some day be able to use it on userboxes.
  • replace.py – A bot that uses regular expressions to modify text on pages. I modified it to handle case insensitive matching, amongst other things.
  • wow.py – An unreleased bot that I used to anonymize thousands of vandal userpages to prevent glorification of vandalism. I wrote it from scratch.
  • catmove.pl – A metabot* written in Perl that parses a list of category changes and does them all in one run. I wrote it from scratch.
  • cfd.pl – An unreleased automatic version of catmove.pl that pulls down the list of category changes directly from the wiki, parses them, and executes them, in one single command. I wrote it from scratch. Hopefully I will be able to release it soon (it may have some security issues that I want to make sure are entirely resolved first).

Cfd.pl is the “secret sauce” that lets Cydebot do its magic. To date, Cydebot has over 160,000 edits, most of them category-related. I attribute this to cfd.pl, which allows me to, with a single command, handle dozens of CFDs simultaneously, whereas people using other bots have to input each one manually. It’s no surprise that everyone else pretty much gave up CFD, leaving my highly-efficient bot to handle it all on its own.

I also had some involvement with Vandalbot, which is a Python anti-vandal bot that uses pyWiki. I ran a Vandalbot clone called AntiVandalBot off of my own server for many months, until somewhat recently, when AntiVandalBot was switched over to being hosted on a Wikimedia Foundation server. If you add up all of the edits committed by both Cydebot and AntiVandalBot then I have the highest number of bot edits on the English Wikipedia — of course, it’s not just my work. I merely came up with the account name and hosted it for awhile; Joshbuddy is the one who actually wrote the vast majority of Vandalbot, and Tawker is the one who hosted the original Tawkerbot2 for awhile (and who now hosts it on the Wikimedia Foundation server).

Working on an open source project is very fun, and rather unlike a programming job for pay in the “real world”. For one, it’s entirely volunteer. I work at my leisure, when I feel like it, or when I have a functionality that I or someone else needs. Programming can actually be relaxing and cathartic when there are no deadlines and I am undertaking a coding project simply for the sake of writing something.

All of the developers on pyWiki are very relaxed and they definitely “get” the open source movement. There’s no expectation of anyone having to get anything done. This can have its downsides, in that it might take awhile for something to be taken care of, but it also doesn’t scare anyone off who is worried about a large time commitment. To become a developer on pyWiki, all I had to do was ask the project head, and I was given write access to the CVS repository within a few days, even though I had never used Python before. The amount of trust is very refreshing, and I definitely feel an impetus not to let the other guys down by uploading code with bugs in it (so my testing is always rigorous).

There gets to be a point with computer languages where learning another one is simply no big deal. I wouldn’t want to estimate how many languages I’ve used by now, but it’s probably somewhere around a dozen. After the first few, though, one simply knows how to program, and learning another language is a simple manner of looking through the documentation to find the specific code to do exactly what one has in mind. That was the situation I was in with pyWiki; although I had never used Python before, I knew exactly what I wanted to accomplish and how to accomplish it: I merely needed to know how to do it in Python. Within a week I was hacking away at the code, adding significant new functionality. It should be noted that working on an existing project in a new language is much, much easier than trying to make something from scratch in a new language.

I would say that pyWiki is a medium-size open source project, which is probably exactly the right size for a first-time developer. It’s not so small that it ever goes stagnant; there are code changes submitted every day, and the mailing list is very active. Any reasonable message posted to it will get a response within a day, if not hours. On the other hand, pyWiki is not too large. It has no barriers to entry; anyone can get started hacking right away it and submitting code changes. Larger projects necessarily have large bureaucracies (large projects need to be managed, there’s no way around it), which means there’s an approval process for code changes, and it’s unlikely that anything written by a novice will actually end up making it into a release. Trying to work on a large project right off the bat can be disheartening because there’s very little one can actually do that doesn’t require an expert level of knowledge. Compare this to pyWiki, which lacks lot of functionality that even a novice would be able to code up (delete.py wasn’t hard at all; it’s simply that no one had done it yet).

I would encourage anyone who is interested in programming to find an open source project they enjoy that they can contribute to. It’s great experience, and it much more closely resembles what actually happens in industry than hacking away on a solo project. I’m sure it’s a great resume line item. The key is to find a project you want to work on because it benefits you. In my case, I was writing new functionality that I needed to get stuff done on Wikipedia.

And there’s just something very compelling about contributing to the worldwide corpus of free software. It’s a way to leave your mark on the world — I way to say, “I was here.”

Read the rest of this entry »