Python’s negative list indexing is confusing and shouldn’t be taught to novice programmers

Saturday, March 8th, 2014

I’ve been programming for a long time, in a wide variety of languages, but never did I see negative list indexing until I learned Python nine years ago to play around with making automated edits on Wikipedia. For those who don’t know, negative list indexing allows you to index into a list from the end instead of the beginning. Whereas arr[0] returns the first item in a list, arr[-1] returns the last item in a list, arr[-2] the penultimate item, etc. It seemed like such a no-brainer; why haven’t all languages implemented this feature? It’s useful, right? That’s what I thought until I recently interviewed a spate of internship program candidates for the upcoming summer, most of whom are learning programming in Python, and I saw firsthand the confusion that negative list indexing causes in novices developers.

I do a fair amount of interviewing at my job, with an average of three or so interviews per week. We’re more often hiring developers than not. Now that I’m leading a development team, I have a huge vested interest in finding the best possible talent because inferior developers directly affect the quality of my team’s output, thus making me look bad. The main thing I focus on during interviews is programming problems. You’d be amazed at how many people do just fine in the talking portion but then fail miserably when asked to actually code something. So I use a progression of increasingly difficult programming problems, the first of which is usually “Write a function that takes a sentence and a number as input, and returns the re-arranged sentence with the specified number of words moved from the back to the front”. So for the example sentence “That is a dark brown fox” and the number 4, the output would be “a dark brown fox That is”. I ask the interviewees to solve the problem in whatever language they’re most comfortable with, because I’d rather get good developers in general than someone with certain specific language skills.

A typical solution in C#, which I get a plurality of responses in, looks like this:

public string Rearrange(string sentence, int numWords)
    string[] words = sentence.Split(' ');
    string answer = "";
    for (int i = words.Length - numWords; i < words.Length; i++)
        answer += words[i] + " ";
    for (int j = 0; j < words.Length - numWords; j++)
        answer += words[j] + " ";
    return answer.Trim();

It's not an amazing solution; there's a faster and more efficient way to do it that doesn't require splitting up the input into a words array at all (I'll leave that as an exercise to the reader). But it gets the job done, and most people end up thinking about the problem in this way. Yet a curious thing happened when I started interviewing current undergraduate students for the intern positions: most of them ended up using Python, and all of Python solutions used negative list indexing and ended up going off the rails because of it. I understand why they immediately thought to use negative indexing; we are asking them to grab things off the end of a sentence, after all. But it consistently sowed confusion, and never clarified.

There were two main problems that I saw when people used negative list indexing. One problem was that the interviewees would start with -1 and then decrement to -2, then -3, etc. This seems logical, but you end up reversing the order of the words in the rear segment, such that your output sentence from the above example becomes "fox brown dark a That is". What you really need to do is count "down" through the words in reverse, in the order -4, -3, -2, -1. But that's a bit harder to wrap your mind around than avoiding negative indexes entirely and simply counting up through the words in the order 2, 3, 4, 5. In particular I saw a lot of reversed parameters to the range function, which ends up executing the inside of the for-loop zero times. The problem looks like it's dealing with the end of the list, but it's really not; it's dealing with the middle of the list and counting up through to the end of it, for which using positive indexes is a lot more handy. But as soon as you say "end" to a novice programmer who's learned the "trick" of negative list indexes, that's where their mind immediately goes, and they tend to get stuck.

The second main problem with negative list indexing is that it doesn't mesh well with positive list indexing, both of which you end up needing to use to solve this problem. For instance, [0] is the first element in a list, and [1] the second, but [-1] isn't the second-to-last element in the list; it's the last one. This caused a lot of confusion. Positive list indexes are zero-indexed but negative list indexes are one-indexed. Fencepost errors abounded, or what you might even call reverse-fencepost errors, where interviewees expressed the index of the start of the second segment as either [-numWords +1] or [-numWords +-1] (both wrong but in different ways) when it should simply be [-numWords]. The interviewees simply didn't have enough experience using the two together to understand how they worked in unison, whereas when I prompted them to write the boundary conditions of both for-loops using positive indexing, the mental confusion usually went away and they were able to come up with a workable solution.

For the record, here is a correct solution in Python that uses negative list indexing. It doesn't look that tricky when you see it presented as an answer, but trust me when I say that no one was coming up with it on their own without access to a Python interpreter. The boundary condition of 0 on the first range in particular was highly un-intuitive to all of the interviewees.

def rearrange(sentence, numWords):
    words = sentence.split()
    answer = ""
    for i in xrange(-numWords, 0):
        answer += words[i] + " "
    for i in xrange(0, len(words) - numWords):
        answer += words[i] + " "
    return answer.strip()

And there's one final point that I can't help but bring up: reverse traversal of lists is cache inefficient. Granted it doesn't matter for this problem, and it probably won't matter for most things you do in Python, but if you're trying to write highly performant code in C/C++, like for scientific computation or a game engine, you're going to cause a lot more cache misses by trawling through memory in reverse order.

Using LINQ in C# to easily get a list of most often used words

Monday, May 20th, 2013

A pretty common programming interview question is to parse an input sentence and return the list of unique words used in that sentence. A further elaboration on that problem, the one that this post will be addressing, is to additionally calculate the number of occurrences of each word, and then return the top K words for some input value K. I’ll be demonstrating a simple solution to that problem in C#, both because I’ve been using it a lot recently and also because the choice of C# gives us access to LINQ, which is a powerful C# language feature that allows queries on collections using a SQL-like syntax. The top K problem is incredibly easy to solve in SQL, and boils down to SELECT TOP @K ..... FROM Words ORDER BY Word.Name DESC. The C# solution is similarly easy.

First, I’ll clarify the assumptions that I’m using (and that would be wise for an interviewee to address if none of these are made explicit):

  • I’m writing my solution to be case-insensitive. “case”, “Case”, and “CASE” will all thus count as the same word.
  • I’m not dealing with ties. If you ask for the top 3 words by occurrence and there are 5 words that are all used the most in equal numbers, then you’re still only going to get three words of those five, selected in no particular manner.
  • I’m going to use the C#’s language spec definition of a non-word character in regular expressions to separate the input sentence into words. The naive solution would be to split the string on only spaces, but then you’re not handling punctuation correctly.

And for the solution:

Dictionary<string, int> GetTopKWords(string input, int k)
	string[] words = Regex.Split(input, @"\W");
	var occurrences = new Dictionary<string, int>();
	foreach (var word in words)
		string lowerWord = word.ToLowerInvariant();
		if (!occurrences.ContainsKey(lowerWord))
			occurrences.Add(lowerWord, 1);
	return (from wp in occurrences.OrderByDescending(kvp => kvp.Value) select wp).Take(k).ToDictionary (kw => kw.Key, kw => kw.Value);

The vast majority of this code is responsible simply for finding the list of unique words the number of occurrences of each one. Once that is known, finding the top K words is a single line of code thanks to the power of LINQ. All we’re doing is ordering the words by frequency in descending order and taking the top K words. Without LINQ, there would be significantly more book-keeping code required to do this, which would make a good exercise for the reader (e.g. solve this problem in Java). The first roadblock you’ll probably run into is that you can’t simply flip the keys and values of the array, because the frequency counts aren’t unique. The best I’ve come up with is to construct a list of ordered tuples out of the dictionary of words, order it on the occurrence count part of each tuple, and then extract the first K elements from the resulting ordered list and return it. Let me know in the comments if you have a better solution.

Oh, and here’s an example input/output for the program, handling the display of output using LINQPad:

var input = "the quick brown fox is brown and jumps over the brown log over the long fire and quickly jumps to a brown fire fox";
GetTopKWords(input, 10);


Top K Words sample run

Here’s a pretty bad Unicode WTF

Tuesday, March 3rd, 2009

I’m doing some research on Unicode and compression algorithms right now for a side-project I’m working on, and I came across a highly ranked Google search result for a UTF-8 munging code snippet that is so idiotic I couldn’t let it pass without comment. If this post helps even one person who would’ve otherwise followed the linked advice, it is worth it.

First, some background. UTF-8 is a character encoding format that can pretty much handle any character under the Sun, from the English alphabet to Japanese kanji to obscure extinct languages. It even includes thousands of esoteric symbols used in smaller fields of study that you’ve probably never even heard of before. But the nice thing about UTF-8 is that it is variable-length. Standard ASCII characters (including everything on a standard English keyboard) only take one byte to represent. All of the common characters from other widely used languages typically take just two bytes to encode. It’s only the really obscure characters that require more than two bytes.

So now you see why the linked “solution” is so stupid. This guy says he is “designing a little client/server binary message format” and wants “a simple and quick way to encode strings”. Well, duh — use UTF-8, no ifs, ands, or buts about it. It’s simple, quick, and already implemented in any programming language you can think of, so it requires no additional coding. There are all sorts of dumb ways to unnecessarily reinvent the wheel in sotware engineering, but trying to come up with your own character encoding is particularly idiotic. It’s really tricky to get right because there are so many corner cases you’ll never even know existed until they cause your application to break. The Unicode Consortium exists for a reason — what they do is hard.

This guy even confesses that his expected input will probably not contain Unicode characters that are longer than 2 bytes. So there is no justification at all for what he does next — he creates a mangled version of UTF-8 that turns all Unicode characters 3 bytes and longer into question marks, instead of just leaving them as is. So instead of allowing a rare character to take an additional byte or two, it gets mangled. And to accomplish this, he has to create his own custom encoding solution that is an intentionally broken version of UTF-8. That’s the worst part — he’s wasting time creating completely unnecessary code, that will need to be maintained, that will need to be debugged — and for what?

Of course, none of the people responding to his thread point out that what he is trying to do is stupid. They just smile and hand him some rope.

What C# does better than Java

Thursday, October 16th, 2008

I spend 90% of the development time at my job using either Java or C# .NET, with a roughly equal split between the two. So I know a fair amount about both languages — enough to feel qualified to comment on the differences between them, anyway. Now being a Free Software guy, my obvious preference is for Java, which is actually Free as in freedom (finally, anyway) and runs on a large variety of platforms. Given the choice of which to use on my personal projects, Java is a no-brainer. The best IDE for Java, Eclipse, is absolutely Free. The best IDE for C#, Visual Studio is … well, it’s several hundred dollars and proprietary to boot. And it has the limitation of not running on or compiling for GNU/Linux; since I use Ubuntu as my home desktop operating system, that’s a deal breaker.

But just on a pure comparison between the languages, I have to say that C# is the better of the two. It’s not a fair comparison because C# is many years younger and was able to learn from all of Java’s mistake, but then again, that old canard about life not being fair still holds true. C# is the better language. It has lots of features that simply make it more pleasant to code in. One feature I would’ve killed for in Java while writing a recent project at work is properties. Here’s a sample of the code I wrote in Java:


And it went on and on for dozens of lines; you get the drift. This is getter and parentheses overload. There’s no real reason the code has to be this messy. And with C#, it isn’t. Here’s how the same code would look in C#:


And yes, you could accomplish the latter in Java by making all member variables public, but that’s a bad idea. In C# you don’t have to make all of the member variables public to do this — you simply define them as properties, which allows for fine-grained control on who can get and set each property, and without all of the messiness of having to corral dozens of different getter and setter functions for each member variable.

So if nothing else mattered, I would recommend and use C# to the exclusion of Java. But since other issues matter a lot more than programming conveniences, like software freedom, I do still recommend and use Java to the exclusion of C#. But Microsoft did put up a good effort.

A first-hand lesson in software optimization

Sunday, October 5th, 2008

The player is a striped bass, one of the arch predators in the bay.

The player is a striped bass, one of the arch predators in the bay.

One of my major computer science projects in college was creating an educational videogame for elementary school children called A Day in the Bay. We had eight people on the team, though only three of us (myself included) were programmers. It was a self-directed project, and we pretty much figured out things on our own as we went along. I’m not kidding about that last part — check out the CVS revisions for all the gory details. We not-so-seamlessly morphed our custom game engine through a couple completely different game types as our project evolved, and by the time we completed the final project, the overall structure of the code was rather incompatible with how it was being used.

But despite all of the problems, the most important programming lesson I learned from all of college came out of A Day in the Bay. That lesson was how to seriously optimize software, and it would’ve been entirely impossible to learn in the classroom, or even in a month-long programming project.

When we started writing our game, we were very laid back about it. We developers had gaming-caliber computers, and we only cut back on features when the game started lagging for us. So you can imagine our shock when we first tried testing the game on the many-year-old computers in a local elementary school’s computer lab. Oh, and our game was written in Java. Beginning to see some problems now?

The game was completely unplayable. We’re talking frame rates in the low single digits. We had to heavily optimize the game for performance in a short period of time. We did so by cutting out most of the game’s simulationist features. Our game, as originally envisioned, simulated an entire cross-section of the Chesapeake Bay, including plants, small fish, the larger fish eating them, etc. We had to keep track of thousands of fish and have all of them take actions in real time. We even had a state diagram for each fish which determined if it was running for its life, hunting for food, or just chilling out. Needless to say, it wasn’t working on the school system’s hardware.

So our first optimization was to establish a virtual boundary that was roughly one-and-a-half screen lengths in diameter around the viewable area. No processing time was spent on any of the creatures outside that rectangle. They were frozen, and only simulated when the player swam near them. This optimization wasn’t enough, though — we still had to keep track of thousands of in-game objects, which was overflowing the memory available to us on the classroom computers.

Read the rest of this entry »

Quantity trumps quality

Tuesday, August 5th, 2008

Jeff Atwood relates an insightful anecdote about quantity over quality that you may initially find counter-intuitive. A pottery-making class was broken up into two groups, with one half graded on the basis of creating a single perfect pot while the other half graded on how much pottery they produced (it was literally weighed, and grades given out for ranges in pounds). At the end of the class, the group that was producing the best pottery was the group that was going for quantity, because they had created such a large number of pots that the experience they gained overshadowed the pain-staking analysis of the other group.

Naturally, there are all sorts of parallels between this anecdote and many other areas, but I’d really like to relate it to software engineering. I’ve always been a “get straight to coding” kind of guy, doing just the bare minimum of planning necessary to start coding, and then writing the documentation along with the code. And after many years of doing this, my code consistently turns out pretty well. A big reason, I suspect, is simply because this approach leads to so much coding. I write programs for all sorts of little random fun things that I would never get around to if I had to spend a bunch of time painstakingly planning out each program beforehand. The best way to become a better coder is not to plan out how to do it, it’s to actually do it, a maxim which also applies to any other activity, including pottery.

Code commenting: one of the casualties of outsourcing

Monday, July 21st, 2008

During college I worked as a computer programmer intern at the National Institute of Standards and Technology. I had the opportunity to work on all sorts of nifty cutting-edge physics simulations using some serious science. Unfortunately, everything was written in VB 6, C++ .NET, or Fortran, but you can’t have it all, and .NET is actually pretty decent compared to some of the alternatives.

One of the programs I worked on was originally written by a Korean researcher working at NIST, thus technically not making it outsourcing, but the problems I’m about to describe are relevant nonetheless. The code was rather hard to understand, especially the variable names, which followed some kind of naming convention that was completely foreign to me. Luckily, the code was actually decently commented. In Korean. Not that it would’ve helped me if I was able to read Korean, because sometime between the original writing of the code and when it got to me, all of the nice UTF-8 comments were corrupted down to ASCII-128. So they appeared as complete gibberish that wouldn’t be understandable by anyone — if you’ve ever viewed binary executable data as text, you know what I’m talking about.

My best guess is that another American maintenance programmer before me edited the program in an IDE that wasn’t set up to understand UTF-8. He must’ve not noticed when all of the nicely formatted Korean comments turned into gibberish — or maybe he didn’t care. Either way, by the time the comments got to me, they were thoroughly worthless. Well, not quite. Their presence at least alerted me to sections of the code that required extra attention, because they were generally non-trivial.

Code maintainability is thus one of the biggest casualties of outsourcing. If the coders you’re outsourcing to don’t speak English, or if they at least don’t bother to comment the code in English, you’ll be facing significantly higher code maintenance costs down the line. That’s just something to keep in mind. In the long run, you save money by hiring local programmers. At least that’s the official line I’m sticking with, seeing as how doing so directly benefits me (hey, did I ever say I wasn’t a biased blogger?).

A better solution to the FizzBuzz interview problem

Wednesday, July 2nd, 2008

Many months ago, I wrote about a simple programming problem that I was administering to interviewees at work to assess their programming skills. The basic problem is this: loop through a range of integers, outputting different strings depending on whether each integer is divisible by one given number, by another given number, by both given numbers, or by neither. It’s a very basic weed-out problem that can be solved easily by any applicant with a basic understanding of control structures and modular arithmetic.

Last week, the comments section on that post erupted with new activity. I posted my own solution, which I was then criticized for because it used more than two division operations per integer (and as you should know, division is by far the most expensive basic arithmetic operation on a computer). So I got to brainstorming and I quickly came up with a solution that uses two divisions total, no matter how large the range is. I’m pretty convinced that my solution is very close to optimal (minus some minor fudging regarding how the if-statements are laid out and how the modular incrementing is handled). Here’s the solution in Java, simply because Java is the language most job applicants seemed to want to solve the problem in:

public class FizzBuzz {
// I got lazy and coded in the parameters as constants instead of as arguments.
static final int x = 3; // The first integer to test divisibility on
static final int y = 5; // The second integer to test divisibility on
static final int a = 27; // The starting number in the range to process
static final int b = 74; // The ending number in the range to process

public static void main(String[] args) {
int x1 = a % x; // These are our only
int y1 = a % y; // two divides!!
for (int i = a; i <= b; i++) { if (x1 == 0) System.out.print("Fizz"); if (y1 == 0) System.out.print("Buzz"); if (x1 != 0 && y1 != 0) System.out.print(i); System.out.println(""); x1 = ( x1 == x - 1 ? 0 : x1 + 1); y1 = ( y1 == y - 1 ? 0 : y1 + 1); } } } [/code] Pretty nifty, eh? Instead of having to perform expensive division operations on each integer in the loop, we're simply incrementing two modular counters. This solution immediately suggests a good part two for the FizzBuzz interview question. After the applicant demonstrates basic familiarity with how programming works by writing up the naive solution, test their analytical abilities by asking them to come up with a more efficient version of the same program. If they don't know it already, explain that division operations are expensive, and then ask them to minimize the total number of division operations necessary. There will be a fair number of people who can get through the first part of the problem but won't be able to get through the second part of the problem without a lot of hints. It's these people that you may want to avoid hiring, because being able to improve on the naive algorithm and find more optimized ways of doing things is very important in programming. So, if you happen to be in a hiring position at a programming company, do try this problem out, and let me know how it goes!

In search of stream-based desktop metaphors

Thursday, June 26th, 2008

I just ran across an excellent article comparing two competing desktop worldviews, documents and streams. The author argues that everything in our desktop environments is set up to support a document-based metaphor, when actually what is more relevant to the majority of our work these days is streams. He makes a very persuasive argument:

The prevailing UI paradigm today is built around the notion of document authoring. It expects that the main thing you do is create spreadsheets, word documents, presentations, and so on. There is a task bar to remind you of what documents you’re editing, there is cross-application cut and paste so you can put pieces of one document into another. You can place documents on your desktop surface itself, so you can organize your work. You can define which applications to use for which types of docs. You can set up a default printer to put your documents to hard copy. You can set up system-wide fonts to use in documents. You can put icons to apps and even documents onto your panel. And on and on. […]

Really, what I mostly do today is stream management. And I suspect this is true for the vast majority of people. I don’t deal with writing documents, but with changes to documents. I put comments onto things. I slap patches onto things. I tweak the states of things. Once in a rare while I may author a completely new thingee, but even there I usually end up working with it as a stream of changes that I build up over time (and usually in collaboration with a few other people who stream changes to me).

I’m sold.

The problem is, our virtual desktops (and pretty much all OSes fail equally at this) do not support stream-centric interfaces to data. I can create discrete files just fine, even organize them into nice little directories, but what about my precious streams? I’m talking about my constantly updating server logs, the weather, stocks quotes, news, emails, instant messages, IRC messages, downloads, and more. Everything is handled separately and discordantly.

I can use an ugly hackish little program that outputs system log tails directly to my desktop. I have a Firefox plugin that tells me the current weather and a couple days’ forecasts. My investing service offers a streaming stock quote desktop application, but it only runs on Windows. Mozilla Thunderbird and Azureus pop up email notices and download completion notices, respectively. Instant messages are handled by Pidgin while incoming IRC messages are handled by X-Chat, both of which blink in my taskbar. As for the news — I can use a KDE plugin called Knewsticker that snarfs up RSS feeds. And I haven’t yet found a good way to track, say, SVN commits to the pyWikipediaBot project, so I’m stuck with getting a new email on every commit. Brilliant.

Read the rest of this entry »

Why programmers make good editors

Thursday, June 19th, 2008

A couple days ago, whilst reading a post on a well-known blog (though I no longer remember which one), I noticed an unmatched parenthesis. A long parenthetical aside, fully two paragraphs in length, was not terminated with a matching right parenthesis. This is quite an easy mistake for most to make, and I do not fault the author overmuch. The length of a parenthetical aside is inversely proportional to how likely the reader is to remember he is still in a parenthetical aside by the time he reaches its end. In this case, the person doing the reading was also the post’s author going back through to edit it, and he simply missed the missing closing parenthesis.

But I’m the kind of person who notices these errors, and I’m also the kind of person who often thoroughly analyzes situations (note that I did not say “over-analyzes”), so I got to thinking, why do I notice these kinds of errors especially well when many people tend not to? I don’t think it’s just because I personally enjoy using parentheses so much that I keep a careful watch for abused parentheses everywhere I go, like some superheroic defender of downtrodden punctuation. No, that’s not it. Then it hit me. Keeping track of matching syntax is a very important activity in my day job — computer programming. Programmers run into time-consuming compiler errors early and often if they can’t keep their parentheses, angle brackets, curly braces, and square brackets tightly wrangled.

Therefore, it’s worthwhile to keep track of syntax nesting levels in your head as you write or read code, adding a mental “+1” for each opening character you come across and taking off a mental “-1” for each associated closing character. By the end of the chunk of code, you should be back to the number you started off with (for my fellow computer scientists out there, the best representation of this would be a stack (i.e., push a left symbol onto it and pop a right symbol off, and the stack should be empty at the end (this is how compilers work)). I’m not saying I’m perfect at it; when I’m twenty levels of parentheses deep in a particularly ugly Lisp subroutine, I have no choice but to rely on the compiler’s auto-indentation to make matching manageable. But I definitely think I’m better than most, simply because I regularly work in an environment where it matters a lot.

So as I’m reading prose and I encounter a left parenthesis, some kind of state subconsciously switches in my mind, and I go into parenthetical aside mode. I stay in that mode until a right parenthesis is encountered. If one isn’t forthcoming, I quickly scan ahead in the text to see if there even is one, or maybe there was one already that I simply missed. More often than not, the author has simply forgotten the closing parenthesis. It is my experience that long parenthetical asides are more rare than unclosed short ones. This same mental trick even works for parenthetical asides inside of parenthetical asides and even parenthetical asides inside of parenthetical asides inside of parenthetical asides, but that’s about as far as I go. Luckily for me, you don’t tend to see levels of parenthetical nesting four or more deep, and if you do, it’s probably some Lisp programmer forgetting in what medium they’re writing. If the latter is the case, watch out for cars and cdrs as well.

But I don’t just notice mismatching syntax errors in the written word; I tend to notice all errors (so long as I’m reading with the intent of editing, anyway; when I’m speed reading, I often miss errors on account of not seeing them). I’m a good editor — and not to sound vain, I’ll balance that out by saying I’m a terrible bowler. But I can’t help but think that being good at editing comes naturally to computer scientists. Many of the skills — noticing slight deviations from the rules, especially in the form of syntax — are exactly the same. Both the English language and all programming languages have well-defined rules about how words/clauses may or may not be used together. It’s simply a matter of identifying violations of the rules.

I will add one rather large caveat to my thesis: I’ve known many programmers who cannot spell worth a damn (maybe they flee to computer science because it involves very little essay writing?). Some of them have been dyslexic. I don’t know if anyone’s established a correlation between dyslexia and going into computer science, but I definitely think there is one. So I think programmers make good editors, with the exception of the many programmers who cannot spell well. But if the spelling is good, by virtue of their profession, I bet they’ll be darn good at noticing all of the other errors one encounters in prose.

And for those of you following along closely at home, did you notice the mismatched parenthesis in this post? In the comments below, let me know if you noticed it, and whether you are a programmer. Be honest! Let’s try to get some data that, while not conclusive, will at least be one step above anecdotal.