File extensions and the pigeonhole principle
Earlier today I downloaded some map data from the United States Geological Survey. Its file extension was .adf. Hrmmm, that was something I wasn’t really familiar with, so I headed off to that great oracle of Internet knowledge, Google, to find my answer, and Google did not disappoint.
Apparently ADF is a file format for a bunch of different programs. It can be, among other things:
- A saved game file for A.R.S.E.N.A.L.: Extended Power, a game I’ve never heard of that has a curiously acronymed name. I can’t be bothered to look up what the acronym stands for, but even without looking, I can guarantee it’ll be utterly contrived.
- An “Actual Drawing File”. Your guess is as good as mine. Presumably contains actual drawing data, unlike the “Fake Drawing File” format.
- An “Adapter Description File”, used in some obsoleted suite of IBM software. If I had a nickel for every piece of software that IBM has abandoned by the side of the roadway, I’d be able to afford some actual IBM software.
- An “Admin Config File”. No other information is provided, so presumably it contains such vital information as height, weight, favorite food, preferred Linux distribution, favorite Star Trek movie (not V!), etc.
- An Amiga Disk File. These are still around?!
- An “Antenna Data File”. Seeing as how I just got into ham radio, that’s an interesting coincidence. I bet some ham somewhere out there is using ADFs to design his killer Yagi array.
- A “Dog Creek QC Mask File”. I’m not sure how they got “ADF” out of that name, so your guess is as good as mine.
- A Grand Theft Auto Vice City Radio Station file — alright, finally one I’m familiar with! I actually played that game, and on the computer to boot! So I’ve actually used these ADF files before. Unfortunately, this isn’t quite the kind of thing the USGS makes available for download, though they really should.
- An “I-DEAS Associated Data File”. This is some kind of executable binary code having something to do with Matlab. Yeah, I’ll stay far away from it, for both reasons.
- A “MicroSim PCBoard Photoplot Aperture Definition File”. Man, we’re getting incredibly esoteric here. I wonder how many ADF files of this type there are on the entire planet?
- Something to do with “ReliaSoft ALTA 1”. My guess would be something to do with quilting and shuffleboard (wow, obscure reference?), but apparently ALTA actually stands for something: Quantitative Accelerated Life Testing Analysis. So, uh, murder by spreadsheet it is?
- “Wyatt Technology ASTRA(r)Chromatography”. Finally, one that needs no explanation.
- An ArcView ARC/INFO Coverage Data File. You know, map stuff.
Obviously the last one is the correct one in the context of the file I downloaded, but I think I had more fun looking at the disparate things ADF has been used for in the past than in actually being able to open the file itself. All of these different uses of the same file extension brought to mind the pigeonhole principle, a mathematical principle which says, in essence, that if you have more pigeons than pigeonholes, then at least one pigeonhole must have more than one pigeon in it. It sounds obvious and trivial, but when properly applied in mathematical proofs, it can actually yield some surprising and nontrivial results. We used it in a lot of my theoretical computer science courses at the University of Maryland.
The pigeonholes in this case are all of the possible three letter file extensions. Ignoring the file extensions that use numbers, we have 26^3, or 17,576, possible extensions. That sounds like a lot, but when you consider how many thousands of software development outfits there are out there, you can easily see how the space of possible three letter file extensions isn’t big enough. There are far more pigeons than pigeonholes, as the case of the ADF extension above shows. Picture a dozen pigeons violently squabbling over each pigeonhole, with feathers flying everywhere, and you have the general idea.
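To put rough numbers on the pigeonholes, here is a minimal Python sketch. It counts the letters-only extensions of a given length, then estimates the chance that at least two programs land on the same extension. The function names are made up for illustration, and the estimate rests on a loud assumption: that every program picks its extension uniformly at random, which real developers certainly don't (they cluster around meaningful abbreviations, so real collisions should be more common than this suggests).

```python
def extension_space(length, alphabet_size=26):
    """Number of possible extensions of the given length (letters only by default)."""
    return alphabet_size ** length


def collision_probability(programs, slots):
    """Chance that at least two programs share an extension.

    Assumes each program picks uniformly at random among `slots`
    extensions (a simplification, not how developers really choose).
    """
    if programs > slots:
        # Pigeonhole principle: more pigeons than holes forces a collision.
        return 1.0
    p_all_unique = 1.0
    for already_taken in range(programs):
        p_all_unique *= (slots - already_taken) / slots
    return 1.0 - p_all_unique


if __name__ == "__main__":
    three = extension_space(3)  # 26^3
    four = extension_space(4)   # 26^4
    print("three-letter extensions:", three)
    print("four-letter extensions:", four)
    for n in (100, 500, 2000):
        print(n, "programs, three letters:", round(collision_probability(n, three), 4))
        print(n, "programs, four letters:", round(collision_probability(n, four), 4))
```

Once the number of programs exceeds the number of extensions, the function returns exactly 1.0, which is the pigeonhole principle itself; below that point it is the familiar birthday-problem calculation.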
This can cause all sorts of problems. Let’s say you have both ArcView and Photoplot installed on your computer, and you double-click on a .ADF file. You know what kind of file it is (presumably because you created it), but Windows doesn’t! It will either have ArcView or Photoplot set up as the default viewer for ADF files, but not both. So if you have ADF files associated with ArcView but what you’re trying to open is really a Photoplot file, it will crash and burn, and you’ll have to choose the program to use to open that particular file manually. It’s messy, and frankly, many novice computer users don’t even know about the Right Click->Open With dialog.
These occurrences are rare, but they do come up every so often (.dat is a popular one for collisions), and it’s definitely something to think about. Some file extensions such as .doc and .gif are so widely used that no developer in their right mind would choose one of those as the file extension for their own program, because a collision is essentially guaranteed. But other extensions such as ADF that aren’t nearly so popular will have a larger number of programs using them. Collisions become more abundant.
So the message to software developers is clear: when developing a new application, choose your file extension with care. I would recommend going with a four letter extension, as you’re reducing the chances of a collision by more than 26 times. And also search a file extension database such as Filext before deciding on an extension to use, paying special attention to applications in similar fields to your own. Grand Theft Auto and ArcView are completely unrelated programs, and so don’t pose much of a risk of interfering with one another. But if you were making your own mapping program, ADF would be a terrible choice, since the odds of someone who’s using your program also having ArcView are much better than random, and the next thing you know, you’re dealing with lots of customer support tickets from customers complaining that your files are being opened by the wrong program.
And even that old stand-by of IT support, telling them to reboot, won’t fix the problem!
April 4th, 2008 at 17:38
This is just one (of many) reasons that type identification by extension sucks! Proper operating systems use file magic to figure out what files are from their content. (i.e. like the file command). Sadly, most of the world is not using a proper operating system. ;)
This is why, originally, all the Xiph multimedia codecs were supposed to be stored in files with the “ogg” extension. Vorbis audio? .ogg. Theora video? .ogg. Timesync lyrics? .ogg. Ogg encap midi? .ogg. Lossless Flac audio? Speex audio? Also .ogg. The importance of this is clear when you consider that any of these can be mixed: if you were trying to give separate extensions for every codec mixture, then the number of combinations becomes enormous quickly. The format is set up so that even fairly stupid tools can quickly find out “what’s inside” and pick an appropriate player. Sadly, that just doesn’t fly in Windows world… and it seems that a lot of Windows users want to use different players for files containing video and files not containing video (why can’t the Windows world produce an all around good player? :) )…. and most of them never deal with, or are even aware of, the dozen other codecs and data types supported in the ogg container.
Eventually the developers got tired of hearing the complaints and now we have “oga” and “ogv”. Bleh.
April 5th, 2008 at 14:12
Yes, file identification by ending is one of those horrors of the Windows world. It was only made worse when MS decided to hide all extensions by default in Windows XP. This attitude reached its disturbing and pathetic zenith in the MS Exchange 2000 Web interface, which will not let you download attachments with the extension .xml, as they could be “harmful to your computer.” Try it out if you don’t believe me.
Here in Unix, we’ve had the ability for many years to guess the type of a file (with remarkable accuracy) using the file command. Binaries and text files, among others, historically have no extension, but I have yet to find a Unix system that has trouble distinguishing between the two. Konqueror will even give me preview thumbnails of text files.
So the real solution is to treat the file extension as one minor part of a file’s metadata, in much the same way that your music files probably have some sort of directory structure, but you ordinarily browse them using much richer information in Amarok or iTunes or whatever. This doesn’t negate the point of your post at all, as it’s still a good idea to make a wise choice of extension for when you’re just glancing at a group of files.
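For anyone curious what content-based identification looks like, here is a rough Python sketch of the idea behind the file command: look at the first few bytes, and treat the extension only as a fallback hint. This is not how file itself is implemented (it works from a much larger database of rules); the table below holds just a handful of well-known signatures, and the function names are made up for illustration.

```python
import os

# A few well-known leading byte signatures ("magic numbers").
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"OggS", "Ogg container"),
    (b"\x1f\x8b", "gzip data"),
]


def sniff_bytes(head):
    """Return a label if the leading bytes match a known signature, else None."""
    for magic, label in MAGIC_SIGNATURES:
        if head.startswith(magic):
            return label
    return None


def identify(path):
    """Identify a file by content first; mention the extension only as a fallback."""
    with open(path, "rb") as f:
        head = f.read(16)  # enough for every signature in the table
    label = sniff_bytes(head)
    if label is not None:
        return label
    ext = os.path.splitext(path)[1].lower()
    return f"unknown content (extension {ext or 'none'})"
```

Calling identify("mystery.adf") on a file that starts with the bytes OggS would report an Ogg container no matter what the name claims. A format with no distinctive leading bytes would fall through to the extension, which is about as much trust as an extension deserves.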
April 5th, 2008 at 15:42
% locate .adf | wc -l
377
Most of these date from when I made images of my old Amiga floppies for use with UAE, for packrattitude’s sake.