Using LINQ in C# to easily get a list of most often used words

Monday, May 20th, 2013

A pretty common programming interview question is to parse an input sentence and return the list of unique words used in that sentence. A further elaboration on that problem, the one that this post will be addressing, is to additionally calculate the number of occurrences of each word, and then return the top K words for some input value K. I’ll be demonstrating a simple solution to that problem in C#, both because I’ve been using it a lot recently and also because the choice of C# gives us access to LINQ, which is a powerful C# language feature that allows queries on collections using a SQL-like syntax. The top K problem is incredibly easy to solve in SQL, and boils down to SELECT TOP @K ..... FROM Words ORDER BY Word.Name DESC. The C# solution is similarly easy.

First, I’ll clarify the assumptions that I’m using (and that would be wise for an interviewee to address if none of these are made explicit):

  • I’m writing my solution to be case-insensitive. “case”, “Case”, and “CASE” will all thus count as the same word.
  • I’m not dealing with ties. If you ask for the top 3 words by occurrence and there are 5 words that are all used the most in equal numbers, then you’re still only going to get three words of those five, selected in no particular manner.
  • I’m going to use the C#’s language spec definition of a non-word character in regular expressions to separate the input sentence into words. The naive solution would be to split the string on only spaces, but then you’re not handling punctuation correctly.

And for the solution:

Dictionary<string, int> GetTopKWords(string input, int k)
{
	string[] words = Regex.Split(input, @"\W");
	var occurrences = new Dictionary<string, int>();
	
	foreach (var word in words)
	{
		string lowerWord = word.ToLowerInvariant();
		if (!occurrences.ContainsKey(lowerWord))
			occurrences.Add(lowerWord, 1);
		else
			occurrences[lowerWord]++;
	}
	return (from wp in occurrences.OrderByDescending(kvp => kvp.Value) select wp).Take(k).ToDictionary (kw => kw.Key, kw => kw.Value);
}

The vast majority of this code is responsible simply for finding the list of unique words the number of occurrences of each one. Once that is known, finding the top K words is a single line of code thanks to the power of LINQ. All we’re doing is ordering the words by frequency in descending order and taking the top K words. Without LINQ, there would be significantly more book-keeping code required to do this, which would make a good exercise for the reader (e.g. solve this problem in Java). The first roadblock you’ll probably run into is that you can’t simply flip the keys and values of the array, because the frequency counts aren’t unique. The best I’ve come up with is to construct a list of ordered tuples out of the dictionary of words, order it on the occurrence count part of each tuple, and then extract the first K elements from the resulting ordered list and return it. Let me know in the comments if you have a better solution.

Oh, and here’s an example input/output for the program, handling the display of output using LINQPad:

var input = "the quick brown fox is brown and jumps over the brown log over the long fire and quickly jumps to a brown fire fox";
GetTopKWords(input, 10);

outputs:

Top K Words sample run

What C# does better than Java

Thursday, October 16th, 2008

I spend 90% of the development time at my job using either Java or C# .NET, with a roughly equal split between the two. So I know a fair amount about both languages — enough to feel qualified to comment on the differences between them, anyway. Now being a Free Software guy, my obvious preference is for Java, which is actually Free as in freedom (finally, anyway) and runs on a large variety of platforms. Given the choice of which to use on my personal projects, Java is a no-brainer. The best IDE for Java, Eclipse, is absolutely Free. The best IDE for C#, Visual Studio is … well, it’s several hundred dollars and proprietary to boot. And it has the limitation of not running on or compiling for GNU/Linux; since I use Ubuntu as my home desktop operating system, that’s a deal breaker.

But just on a pure comparison between the languages, I have to say that C# is the better of the two. It’s not a fair comparison because C# is many years younger and was able to learn from all of Java’s mistake, but then again, that old canard about life not being fair still holds true. C# is the better language. It has lots of features that simply make it more pleasant to code in. One feature I would’ve killed for in Java while writing a recent project at work is properties. Here’s a sample of the code I wrote in Java:

writeOut(data.getAccount().getContract().getAddress().getAddress1());
writeOut(data.getAccount().getContract().getAddress().getAddress2());
writeOut(data.getAccount().getContract().getAddress().getCity());
writeOut(data.getAccount().getContract().getAddress().getZipCode());
writeOut(data.getAccount().getClient().getCoSigner().getFullName());

And it went on and on for dozens of lines; you get the drift. This is getter and parentheses overload. There’s no real reason the code has to be this messy. And with C#, it isn’t. Here’s how the same code would look in C#:

writeOut(data.Account.Contract.Address.Address1);
writeOut(data.Account.Contract.Address.Address2);
writeOut(data.Account.Contract.Address.City);
writeOut(data.Account.Contract.Address.ZipCode);
writeOut(data.Account.Client.CoSigner.FullName);

And yes, you could accomplish the latter in Java by making all member variables public, but that’s a bad idea. In C# you don’t have to make all of the member variables public to do this — you simply define them as properties, which allows for fine-grained control on who can get and set each property, and without all of the messiness of having to corral dozens of different getter and setter functions for each member variable.

So if nothing else mattered, I would recommend and use C# to the exclusion of Java. But since other issues matter a lot more than programming conveniences, like software freedom, I do still recommend and use Java to the exclusion of C#. But Microsoft did put up a good effort.

Don’t ever be ashamed of your code

Friday, June 13th, 2008

Are you ever ashamed of your code? Don’t be! Being ashamed of your code is harmful, as artfully explained by Ben Collins-Sussman. It’s better to make your mistakes in the open where they can be quickly corrected than in private where they can fester for months, even years. Note that we aren’t necessarily talking about open source code here. Being ashamed of your code could also mean not sharing code with other people at your company.

Ben uses some anecdotes to illustrate just how badly situations can get when programmers (or small groups of programmers) sit on their code for months on end without any outside sanity checking whatsoever. But these anecdotes are more humorous than necessary, as it’s pretty much a truism in computer science that coding off on your own in secret is a bad idea. The people who are doing it know it’s bad, and the only reason they persist is because they are ashamed. Oftentimes they’ll rationalize it by saying “I’ll just clean it up before I let others see it” — which, when combined with procrastination, can mean no one else sees it for months or even years. And if poor architecture decisions have been made, as they often are, the problem is too large for a simple clean up; a partial or full rewrite is necessary. This is not the situation you want to find yourself in.

Luckily, I can’t say I’ve ever felt ashamed of my code. And that’s not for lack of writing some truly terrible programs, either. I just value the feedback I get from others more than any personal attachment I might have to my code. In other words, I don’t take it personally. And to demonstrate that, I’m going to post a truly terrible program I wrote back in high school. My only excuse is that I was young and ignorant.

The program in question is “makeSite”, a program I wrote to create my blog-like website before the word “blog” even existed and before any real blogging software had been developed. I was writing what was effectively a blog at the time (you can see an archive of it here), but I got tired of having to hand-edit the HTML to copy over a previous entry and modify it each time I wrote a new entry. So, naturally enough, I wrote a C++ program to statically compile a bunch of text “data” files containing my own custom pseudo-HTML-like syntax into a website. I won’t defend the decision to do it this way, other than to say that I didn’t know any better. What this effectively meant was that every time I updated any part of my site, even to fix a one-character typo, my entire site had to be re-compiled by re-running the program, a task that, because my program wasn’t very efficient, was taking minutes after my site grew to be rather big. I toyed with the idea of some sort of incremental site compilation, only updating the pages corresponding to the changed data files, but I never got that working.

I think it’ll help to illustrate how bad this program truly is by individually discussing some of the more egregious parts of it.

#include “apstring.h”

For those of you who aren’t familiar with the “apstring” string library, here’s a hint: ap stands for Advanced Placement. That’s right, instead of using a standard, widely used string library (like “string.h”), I used the apstring library (by the College Board), because that’s what we were taught in class. It was just like a real string library, only it didn’t have as many features. Frankly, there’s no excuse for its existence, as tests should conform to reality and not the other way around. If you ever see it in production code, you should run like hell.

headFile.open(“pages.dat”);
output.open(“pages2.dat”);
while (headFile.get(ch))
output < < ch; output.close(); headFile.close();[/code] Yes, you really are looking at a character-by-character copy of a file. Never mind that there's an OS function to do this in one line (and much more efficiently, I might add). But the reason I did it this way is even worse than the way I did it, if that’s possible: I wanted a second copy of the file so I could parse through the original string-by-string, and then when I hit upon a page that was a subpage of another page, I would consult this copy to find out what its parent page was. This was to get around the problem of not being able to have two file handles open to the same file. I suppose the concept of just loading the whole file into memory and parsing through that didn’t occur to me. And notice the hard-coded file names; that’s a nice touch.

Read the rest of this entry »

Pining for the coding fjords

Wednesday, May 14th, 2008

I’m deep into the third week of a technical writing project at work, and boy do I miss coding! It turns out there was a good reason I went into computer science in college and got a job as a programmer afterwards; I really am passionate about it. That feeling just gets lost when I’m doing programming that I don’t really find enjoyable (and let’s be honest: unless you work for a game company, and often not even then, the kind of programming you’re doing isn’t fun).

So I’ve started by writing some simple algorithmic programs in C++: linked list classes, binary search trees; you know, nothing spectacular, but not nearly as trivial as a “Hello world!” either. I just wanted a refresher on C++ because I haven’t used it in awhile. And boy did I forget some things! The programs were algorithmically correct on the first try (which is absolutely not something I would be able to do when I was first learning these data structures back in high school, so at least those skills stuck with me), but the syntax was horrid. Imagine a tsunami of syntax errors impacting a rain forest of pointer errors. Uh, yeah, something like that.

Pride being one of the worst sins in a programmer, I immediately looked up some C++ tutorials on Google to find out what I was doing wrong. I had missing semicolons at the end of class declarations (I forgot all about that quirk; Java doesn’t need them), I was missing function declarations (again, not needed in Java), and I was even accidentally passing a pointer to a function that really wanted a value by reference (don’t ask). But it wasn’t too bad; after a few minutes of cleaning out the errors, the program compiled successfully and ran perfectly on the first try. It was fun using pointers again after these past couple of years of not touching them. They’re fun in a masochistic kind of way. I think it has something to do with working closely enough to the hardware to be able to allocate individual blocks of memory and daisy-chain them all together. It’s cool when you think about what you’re doing.

These programs were still pretty simple though. I don’t have a real desire to make a much more complex project in C++, because it does start getting ugly after you’ve grown accustomed to the niceties of C#, Java, and Python, so now I’m off to code up a game in Java. I have a general idea of what I want to write, which I’ve already made allusions to, but I don’t want to talk more about it in case the nifty idea in my head utterly stumbles upon execution. It’s happened before, so I’ve learned the fine art of hedging.

After some preliminary research, I’ve decided that I’ll use the Lightweight Java Game Library (LWJGL) to code up what I have in mind. I don’t want to make the same mistake that I did back in college when we wrote the game A Day in the Bay for our four-year undergraduate research project. We wrote it from scratch in pure Java AWT (not even using Swing!), so we spent a lot of time reinventing the wheel, and to boot, our wheel wasn’t particularly good because it didn’t have any hardware acceleration*. I don’t want to have to spend so time much time doing performance optimizations this time around, so I’ll just use LWJGL and hopefully get to pure coding a lot more quickly.

Now unfortunately, it can be awfully intimidating first starting off with something as complex as LWJGL. Just look at their API; it doesn’t make much immediate sense even if you are a Java developer. This is the critical juncture where many people get turned off and decide to write something on their own (as we did back in college). This is a big mistake. It’s a lot less effort to learn how to use a new library than to essentially write it from scratch. So how do I learn a new library? Examples! And luckily, LWJGL has some awesome examples of fully-fledged games to peruse. Check out Space Invaders and 3D Asteroids. Using these as a reference, and then the API for other functionality not used by these sample games, I’ll have my game up and running in no time.

So keep your eyes peeled. I’ll eventually have some progress to report back on, and then at some point the fruits of my labor will be downloadable for your enjoyment. I’ve already set up my SVN server and committed a skeleton class file, so there’s no turning back! Programming that’s fun again, here I come!

*While we’re mixing metaphors this badly, I’ll just come out and declare that a wheel with hardware acceleration is an automobile. Or, you know, a motorized unicycle.

Celebrating ten years on the web

Friday, July 20th, 2007

It just dawned on me, I’ve had a website on the World Wide Web pretty much continuously starting from 1997 (when I was in sixth grade). Nowadays that doesn’t sound so special, because most middle schoolers use instant messaging and email and have MySpace pages and whatever, but dammit, ten years ago, that meant something. Most of my schoolmates didn’t even have Internet connections at their homes yet, and so the computer labs at school provided a continuing sense of wonder.

Luckily, I use the squirrel-like archiving system, meaning it may take me awhile to find anything from awhile ago, but by God, I still have it somewhere. And so I’ve been able to dig up every webpage I ever wrote and post it on this server. Yup, really. My most recent website from before university was called Fyre’s Domain. I admit I don’t quite understand the name of the website, because while my server was indeed named Fyre, I never went by that name myself. Fyre’s Domain was also the first website I hosted on my own server, rather than using the web hosting of my ISP.

Fyre’s Domain was actually a bit of a programming project for me. Basically I wanted to write a blog (a format that I used in most of my previous sites before the name itself existed), but no widespread blogging software existed yet. So I wrote my own. In C++. It parsed text that I wrote directly into text files (yes, that’s how I updated the site) and wrote it into proper HTML, making a top page that had the five most recent entries and then a deep archive page. The program didn’t have any incremental updating or dynamic accessing, so every time I fixed even a single typo in one of the text files I had to re-run the program, which would read all of the text files and write out all of the static HTML files again. Hey, I didn’t say it was any good!

As part of my work on Fyre’s Domain, I also compiled archives of all of my older websites, themselves rescued from old archives on one of our old Windows desktops (the chain of custody is now about five computers long). Apparently I wrote four major revisions of a site called Bigmack’s World. This is also strange, as I never went by the online handle Bigmack; that was my dad’s. But we did share a dial-up account, and the site was hosted at something like http://www.erols.com/~bigmack/, so I suppose it makes a bit of sense. Bigmack’s World V1 is a blast to the past, replete with background MIDI music, annoying animated GIFs lifted from other websites, and the main navigation system is an image map! How’s this for an unreadable Web 0.5 background? I made it myself! What’s sad is how many of the links on that site are broken, but it’s also interesting that a fair number of them are still alive, essentially unchanged from ten years ago.

So, pardon my reminiscing. I hope you agree with my assessment that my websites have steadily increased in quality over time, though I don’t profess to be a web design guru by any means. I just don’t like information rot, and as long as I still have the files for all of my old sites, I like to keep hosting them.

BioPerl is an excellent package

Sunday, March 11th, 2007

I’m taking a Bioinformatics Algorithms course here at the University of Maryland. Our first real programming project is quite the doozy. We have to implement our own Smith-Waterman local alignment algorithm, which matches up a given amino acid sequence against a given database of other sequences. The basic algorithm uses dynamic programming and a huge matrix that has as many rows and columns as the two sequences are long. Add on top of that the large number of heuristics that can (and some should) be used to improve the alignment, and it’s quite the programming project.

Luckily, there’s BioPerl. BioPerl is an amazing set of modules for Perl that does just about anything one would want to do in Bioinformatics. Of course, it has a module implementing local alignment (and even BLAST), but we aren’t allowed to use those on our project. Still, it has lots of other modules that we can use. It has classes for storing DNA, RNA, and amino acid sequences, scoring matrices (such as Blosum62), subroutines for reading and writing from a large variety of formats (including Fasta, of course), and much, much more.

We’re given our choice of any programming language to use to complete the assignment, something I’ve never seen before in a class (rarely they’ll give you a choice of two or three, but never “any”). Yet I don’t envy anyone who isn’t using Perl. Yeah, Perl isn’t an optimal language for this kind of thing, especially because, as a scripting language, its performance is much lower than, say, C++. But it has BioPerl. Perl is the most widely-used programming language in the field of Bioinformatics.

The only problem with choosing to use Perl is that I’m not going to win any of the extra credit points. The teacher is giving significant extra credit points to the top three algorithms (measured in time it takes to complete a run), but, by selecting Perl, I’m completely removing myself from the running against the people who are using C/C++. It just won’t even be close. It might be more fair if the programs were ranked against others in the same language, because then there might be some incentive for me to optimize my heuristics, but since even the best heuristics aren’t even going to place using Perl, I have no incentive to do anything beyond the very minimal algorithm.

No blog posts today

Sunday, February 18th, 2007

Sorry, there are no blog posts today. This was a rough weekend, and I just finished a programming assignment where I had to write my own blocking message-sending protocol using sockets in C. So I’m burnt out, and thus, no blog posts today. Ceci n’est pas un blog post.