Thursday, January 13, 2011

Russian Roulette

When I develop software I of course use various development tools.  One of these tools (which shall remain nameless) has been giving me an error dialog every time I run it.  The error was quite cryptic (the prevalence of which is a whole other story) so it didn't exactly prompt me to go looking for a solution.

Eventually, however, I decided that another issue I was seeing with the product might be related to this, and I set out to hunt the problem down.  A quick search on the net gave me the answer - lots of people have had this problem - some random configuration file I knew nothing about had become corrupted.  Or to be more precise, most people had been seeing their file get truncated, which was indeed the case for me.

At this point things were seeming quite familiar, because I have seen this truncated text file issue all too often when using and working on software products.  While I don't have a name for it, it's one of my favourite anti-patterns.

It's pretty simple really:
  • open text file for writing (deleting the old version if it exists)
  • write text file
  • close text file
It's so simple, it's done frequently by programmers.  In fact, I'd go so far as to say that of all the times that programs rewrite configuration files, most of the time it's done like this.

As a programmer you shouldn't have to be the brightest bulb on the ceiling to figure out what happens if your program crashes or otherwise dies in the middle of saving this text file.  The result is a file which is either not there, or truncated - possibly with no way to get the data back.
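
To make the failure concrete, here is a minimal Python sketch of the open/write/close pattern with a simulated crash halfway through the write.  The function names, the settings.xml file name and the fail_midway flag are all made up for illustration, and the raised exception is only a stand-in for a real crash or power loss.

```python
import os
import tempfile


def naive_save(path, text, fail_midway=False):
    """Rewrite `path` in place: open for writing, write, close."""
    # Opening in "w" mode truncates any existing file straight away,
    # so the old contents are gone before a single byte is written.
    with open(path, "w", encoding="utf-8") as f:
        half = len(text) // 2
        f.write(text[:half])
        if fail_midway:
            # Stand-in for a crash, a forced kill or a power loss.
            raise RuntimeError("simulated crash during save")
        f.write(text[half:])


def demo_truncation():
    """Return what is left on disk after a save fails halfway."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "settings.xml")  # hypothetical file name
        naive_save(path, "<settings><a>1</a></settings>")
        try:
            naive_save(path, "<settings><a>2</a></settings>", fail_midway=True)
        except RuntimeError:
            pass
        with open(path, encoding="utf-8") as f:
            return f.read()
```

After the failed save the file holds neither the old version nor the new one, only the first part of the new text.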

There are many ways this can happen:
  • power loss
  • the user having to forcefully kill the program because it doesn't scale well to the amount of data that the user has and startup is taking forever
  • bugs in your program or the underlying platform
And this isn't some in-theory-only kind of thing - I've lost real data multiple times on multiple products due to this.

What are the solutions?  One solution is to use a database.  Databases are reusable data storage solutions that have been tested for reliability.  They typically use a binary format that can be updated in place, so you don't need to erase the entire file in order to update one thing, and they'll scale much better to large amounts of data as a result.

In the end, when I was designing the storage system for Internote3, I did stick to XML.  I did, however, make sure I avoided this anti-pattern.

An alternative is something like this:
  • move the old file (if it exists) to a backup location
  • write the new file using the normal process (open, write, close)
Then when loading:
  • load the main file
  • if the main file isn't there or doesn't parse, load the backup file instead (possibly with an error message)
Am I convinced this is perfect?  Not at all.  In particular, I worry about races when this is placed on network file systems, and similar sorts of things.
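
The save and load steps above could be sketched in Python roughly as follows.  This is an illustration under assumptions, not the actual Internote3 code: the function names and the ".bak" suffix are invented here, and "doesn't parse" is taken to mean that the file is not well-formed XML.

```python
import os
import xml.etree.ElementTree as ET


def save_with_backup(path, text):
    """Move the old file aside, then write the new one the normal way."""
    backup = path + ".bak"  # hypothetical backup location
    if os.path.exists(path):
        # os.replace overwrites any earlier backup.
        os.replace(path, backup)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_with_fallback(path):
    """Return (root element, file actually used)."""
    backup = path + ".bak"
    for candidate in (path, backup):
        try:
            return ET.parse(candidate).getroot(), candidate
        except (OSError, ET.ParseError):
            # Missing, unreadable or truncated: try the next candidate.
            continue
    raise FileNotFoundError("neither %s nor %s could be loaded" % (path, backup))
```

A caller can compare the returned file name with the one it asked for to decide whether to show an error message.  One limitation of this sketch is that a save made while the main file is already damaged replaces the good backup with the damaged file, so checking that the main file parses before moving it may be worth considering.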

But I was thankful that Internote3 operated in this manner when a bug in Firefox was causing occasional failures in my saving code.  The system I had written automatically recovered as it was designed to, before I had had a chance to work around the problem.

As a programmer, you should never use this anti-pattern when you're loading and resaving a configuration file.  It's unfortunate that File I/O APIs don't usually have an easy way to do the safe process, which is probably the main reason why people get this wrong so often.
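
For comparison, a different and widely used technique (not the backup scheme described above) is to write the new contents to a temporary file in the same directory and then rename it over the original.  The sketch below assumes Python; atomic_write_text is a made-up helper name.  The rename is atomic on POSIX file systems, but behaviour on network file systems and after power loss varies, so treat this as a starting point rather than a guarantee.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write `text` to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        # The old file stays intact until this rename succeeds.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file and leave `path` untouched.
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise
```

If the program dies before the rename, the original file is still whole and only a stray temporary file is left behind.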

Losing my users' data is something that keeps me awake at night, and it should be the same for any developer.  Please, stop playing Russian Roulette with your users' data.