Archive for March, 2014

Python’s negative list indexing is confusing and shouldn’t be taught to novice programmers

Saturday, March 8th, 2014

I’ve been programming for a long time, in a wide variety of languages, but never did I see negative list indexing until I learned Python nine years ago to play around with making automated edits on Wikipedia. For those who don’t know, negative list indexing allows you to index into a list from the end instead of the beginning. Whereas arr[0] returns the first item in a list, arr[-1] returns the last item in a list, arr[-2] the penultimate item, etc. It seemed like such a no-brainer; why haven’t all languages implemented this feature? It’s useful, right? That’s what I thought until I recently interviewed a spate of internship program candidates for the upcoming summer, most of whom are learning programming in Python, and I saw firsthand the confusion that negative list indexing causes in novices developers.

I do a fair amount of interviewing at my job, with an average of three or so interviews per week. We’re more often hiring developers than not. Now that I’m leading a development team, I have a huge vested interest in finding the best possible talent because inferior developers directly affect the quality of my team’s output, thus making me look bad. The main thing I focus on during interviews is programming problems. You’d be amazed at how many people do just fine in the talking portion but then fail miserably when asked to actually code something. So I use a progression of increasingly difficult programming problems, the first of which is usually “Write a function that takes a sentence and a number as input, and returns the re-arranged sentence with the specified number of words moved from the back to the front”. So for the example sentence “That is a dark brown fox” and the number 4, the output would be “a dark brown fox That is”. I ask the interviewees to solve the problem in whatever language they’re most comfortable with, because I’d rather get good developers in general than someone with certain specific language skills.

A typical solution in C#, which I get a plurality of responses in, looks like this:

public string Rearrange(string sentence, int numWords)
{
    string[] words = sentence.Split(' ');
    string answer = "";
    for (int i = words.Length - numWords; i < words.Length; i++)
    {
        answer += words[i] + " ";
    }
    for (int j = 0; j < words.Length - numWords; j++)
    {
        answer += words[j] + " ";
    }
    return answer.Trim();
}

It's not an amazing solution; there's a faster and more efficient way to do it that doesn't require splitting up the input into a words array at all (I'll leave that as an exercise to the reader). But it gets the job done, and most people end up thinking about the problem in this way. Yet a curious thing happened when I started interviewing current undergraduate students for the intern positions: most of them ended up using Python, and all of Python solutions used negative list indexing and ended up going off the rails because of it. I understand why they immediately thought to use negative indexing; we are asking them to grab things off the end of a sentence, after all. But it consistently sowed confusion, and never clarified.

There were two main problems that I saw when people used negative list indexing. One problem was that the interviewees would start with -1 and then decrement to -2, then -3, etc. This seems logical, but you end up reversing the order of the words in the rear segment, such that your output sentence from the above example becomes "fox brown dark a That is". What you really need to do is count "down" through the words in reverse, in the order -4, -3, -2, -1. But that's a bit harder to wrap your mind around than avoiding negative indexes entirely and simply counting up through the words in the order 2, 3, 4, 5. In particular I saw a lot of reversed parameters to the range function, which ends up executing the inside of the for-loop zero times. The problem looks like it's dealing with the end of the list, but it's really not; it's dealing with the middle of the list and counting up through to the end of it, for which using positive indexes is a lot more handy. But as soon as you say "end" to a novice programmer who's learned the "trick" of negative list indexes, that's where their mind immediately goes, and they tend to get stuck.

The second main problem with negative list indexing is that it doesn't mesh well with positive list indexing, both of which you end up needing to use to solve this problem. For instance, [0] is the first element in a list, and [1] the second, but [-1] isn't the second-to-last element in the list; it's the last one. This caused a lot of confusion. Positive list indexes are zero-indexed but negative list indexes are one-indexed. Fencepost errors abounded, or what you might even call reverse-fencepost errors, where interviewees expressed the index of the start of the second segment as either [-numWords +1] or [-numWords +-1] (both wrong but in different ways) when it should simply be [-numWords]. The interviewees simply didn't have enough experience using the two together to understand how they worked in unison, whereas when I prompted them to write the boundary conditions of both for-loops using positive indexing, the mental confusion usually went away and they were able to come up with a workable solution.

For the record, here is a correct solution in Python that uses negative list indexing. It doesn't look that tricky when you see it presented as an answer, but trust me when I say that no one was coming up with it on their own without access to a Python interpreter. The boundary condition of 0 on the first range in particular was highly un-intuitive to all of the interviewees.

def rearrange(sentence, numWords):
    words = sentence.split()
    answer = ""
    for i in xrange(-numWords, 0):
        answer += words[i] + " "
    for i in xrange(0, len(words) - numWords):
        answer += words[i] + " "
    return answer.strip()

And there's one final point that I can't help but bring up: reverse traversal of lists is cache inefficient. Granted it doesn't matter for this problem, and it probably won't matter for most things you do in Python, but if you're trying to write highly performant code in C/C++, like for scientific computation or a game engine, you're going to cause a lot more cache misses by trawling through memory in reverse order.