Thursday, January 13, 2011

Russian Roulette

When I develop software I of course use various development tools.  One of these tools (which shall remain nameless) had been giving me an error dialog every time I ran it.  The error was quite cryptic (the prevalence of which is a whole other story) so it didn't exactly prompt me to go looking for a solution.

Eventually, however, I decided that another issue I was seeing with the product might be related to this, and I set out to hunt the problem down.  A search on the net quickly gave me the answer, since lots of people have had this problem: some random configuration file I knew nothing about had become corrupted.  Or to be more precise, most people had been seeing their file get truncated, which indeed was the case for me.

At this point things were seeming quite familiar, because I have seen this truncated text file issue all too often when using and working on software products.  While I don't have a name for it, it's one of my favourite anti-patterns.

It's pretty simple really:
  • open text file for writing (deleting the old version if it exists)
  • write text file
  • close text file
It's so simple, it's done frequently by programmers.  In fact, I'd go so far as to say that of all the times that programs rewrite configuration files, most of the time it's done like this.

As a programmer you shouldn't have to be the brightest bulb on the ceiling to figure out what happens if your program crashes or otherwise dies in the middle of saving this text file.  The result is a file which is either not there, or truncated - possibly with no way to get the data back.
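To make the failure concrete, here is a minimal Python sketch of the open/write/close pattern.  The names `unsafe_save`, `demo` and the `crash_after` parameter are invented for illustration; `crash_after` simply simulates the program dying part-way through a save.

```python
import os
import tempfile


def unsafe_save(path, lines, crash_after=None):
    """Rewrite `path` in place using open/write/close (the anti-pattern).

    `crash_after` is a hypothetical knob: if set, the save is interrupted
    once that many lines have been written.
    """
    # Opening with "w" truncates the file: the old contents are gone
    # before a single new line has been written.
    with open(path, "w", encoding="utf-8") as f:
        for i, line in enumerate(lines):
            if crash_after is not None and i == crash_after:
                raise RuntimeError("simulated crash mid-save")
            f.write(line + "\n")


def demo():
    """Save once cleanly, then 'crash' during a second save."""
    path = os.path.join(tempfile.mkdtemp(), "settings.txt")
    unsafe_save(path, ["a=1", "b=2", "c=3"])
    try:
        unsafe_save(path, ["a=1", "b=20", "c=3"], crash_after=1)
    except RuntimeError:
        pass  # the "program" died; see what is left on disk
    with open(path, encoding="utf-8") as f:
        return f.read()
```

After the interrupted second save, `demo()` returns only the first line: neither the old `b=2` and `c=3` nor their replacements survive.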

There are many reasons a save can be interrupted like this:
  • power loss
  • the user having to forcefully kill the program because it doesn't scale well to the amount of data that the user has and startup is taking forever
  • bugs in your program or the underlying platform
And this isn't some in-theory-only kind of thing - I've lost real data multiple times on multiple products due to this.

What are the solutions?  One solution is to use a database.  They're reusable data storage solutions that have been tested for reliability.  They typically store data in a binary format that can be updated in place, so you don't need to erase the entire file in order to update one thing, and they'll scale much better to large amounts of data as a result.

In the end when I was designing the storage system for Internote3, I did stick to XML.  I did however make sure I avoided this anti-pattern.

An alternative is something like this:
  • move the old file (if it exists) to a backup location
  • write the new file as before (open, write, close)
Then when loading:
  • load the main file
  • if the main file isn't there or doesn't parse, load the backup file instead (possibly with an error message)
Am I convinced this is perfect?  Not at all.  In particular, I worry about races when this is placed on network file systems, and similar sorts of things.
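The save and load steps above can be sketched in Python roughly as follows.  This is not Internote3's actual code: the function names and the `.bak` suffix are assumptions made for illustration, and "doesn't parse" is taken to mean that the standard library XML parser rejects the file.

```python
import os
import xml.etree.ElementTree as ET


def backup_path(path):
    # Hypothetical naming scheme for the backup location.
    return path + ".bak"


def save_with_backup(path, xml_text):
    """Move the old file aside, then write the new one the usual way."""
    if os.path.exists(path):
        # os.replace overwrites any previous backup.
        os.replace(path, backup_path(path))
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_text)


def load_with_fallback(path):
    """Return (root_element, used_backup).

    Tries the main file first; if it is missing or does not parse,
    falls back to the backup.
    """
    for candidate, used_backup in ((path, False), (backup_path(path), True)):
        try:
            return ET.parse(candidate).getroot(), used_backup
        except (OSError, ET.ParseError):
            continue  # missing or corrupt: try the next candidate
    raise FileNotFoundError("no readable copy of " + path)
```

A caller can use the `used_backup` flag to show the error message mentioned above.  One gap in this sketch: if a truncated main file is still in place at the next save, it gets moved over the good backup, so a more careful version would probably check that the main file parses before rotating it.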

But I was thankful that Internote3 operated in this manner when a bug in Firefox was causing occasional failures in my saving code: the system I had written recovered automatically, as it was designed to, before I had even had a chance to work around the problem.

As a programmer, you should never use this anti-pattern when you're loading and resaving a configuration file.  It's unfortunate that File I/O APIs don't usually have an easy way to do the safe process, which is probably the main reason why people get this wrong so often.
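For what it's worth, here is one commonly used way to build a safer save out of ordinary file APIs.  It is a different technique from the backup scheme described above: write the new contents to a temporary file in the same directory, then rename it over the original.  The function name `atomic_write_text` is made up for this sketch.  The rename step is atomic on POSIX systems when both files are on the same file system; I wouldn't assume the same guarantee on network file systems.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Write `text` to `path` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives next to the target so that the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        # Until this line runs, the old file is untouched.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file, then re-raise.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

If the program dies before the `os.replace` call, the old file is still complete; the worst case is a stray `.part` file left behind.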

Losing my users' data is something that keeps me awake at night, and it should be the same for any developer.  Please, stop playing Russian Roulette with your users' data.

Thursday, August 12, 2010

Welcome to the Club (Long)

For the past year or so, I've been working on and off on upgrading a Firefox2 extension called Internote, which puts persistent notes on top of web pages. Finally I have a beta version available after tremendous effort, much of which should have been unnecessary.

Sometimes it's easy enough and quite enjoyable to see new work land on your codebase, and Firefox is a great platform to work with. But sometimes it feels like you're being hit over the head with a club.

I realise this is a massive post, but hey, that's proportional to the amount of unnecessary work Firefox made me do.

The problem was that in Firefox3, the platform had changed in a way which broke a lot of extensions. In particular, you could no longer use the old simple way of displaying content over a web page. This broke Internote severely.

I don't come to bury Firefox, as the whole extension mechanism is quite brilliant, and a large driver of Firefox's market share. For that matter, Firefox is certainly not the first platform to have ever introduced a major feature regression.

But I intend to issue a reminder that when platform developers make poorly thought out decisions, they can have substantial impacts on developers of end-user software and therefore end-users.

Firefox uses a technology called XUL, which is an XML language for defining application user interfaces. It also uses JavaScript and CSS. Developing an extension is therefore very similar to developing a modern web application, except you have more access to internals, no need to communicate with servers and only need to worry about browser version differences instead of vendor differences.

Internote was originally written by another author who had lost interest in maintaining the project. Several people have attempted to bring it up to speed, with varying success. Some of these versions even had major security holes. As a consequence thousands of users went without an extension that was important to their productivity for 2 to 3 years, resulting in an enormous amount of complaining on its Mozilla Add-On page.

As an old user of Internote, I bemoaned its loss as much as anyone. Eventually I upgraded to Firefox3, and started to live without it. That is, until one day, when I got a hankering to learn Firefox extension development and fix Internote.

How hard could it be, right? Right? Wrong.

Don't get me wrong, I went through the usual learning process and quickly came up to speed on a new development environment, as I have done countless times before. But when it came to the displaying-content-over-the-page issue, my nightmares had only just begun.

I had reasonably quickly determined (with the help of the relevant IRC channel) that the only secure way of doing this in Firefox3 was to use "noautohide" popups, and so the wrangling of the Firefox popup system began.

On top of seeming like a big kludge to begin with, my initial prototypes were less than impressive. Here are just a few of the shenanigans I encountered, none of which were a problem in Firefox2:
  • Either Firefox or the OS insists on keeping popups on-screen. This makes sense for normal popups, but where the user expectation is that a note has a fixed position on the page, it's a nightmare. So I must adjust the size of the note if the note disappears off the bottom right of the screen, and internally scroll the note if it disappears off the top left.
  • On a related note, having a "popup" straddle monitors is a big problem, because the popup jumps to some incorrect location. Hopefully this only happens in practice when dragging the window between screens. Someday I plan to use two popups split between screens for this issue, and I hope that only being able to have one focused won't be an issue for users in practice.
  • When using Internote, a user can drag around a note. A major problem is that for "anchored popups" (popups which move when you move the window), Firefox's popup move method is entirely broken, and shows no sign of anyone fixing it - no one even responded for a year after I supplied a simple test case demonstrating the issue. The most obvious workaround is to hide and reopen the popup, which is horrendously flickery.
  • It's quite slow to scroll multiple popups when the page is scrolled, resulting in annoying lag.
  • All kinds of focus issues, including an inability to tab between notes, and problems when clicking from another window onto a popup.
  • Various problems with modal dialogs, including popups not being automatically disabled by them, and the dialogs occasionally appearing over popups.
  • The popups aren't really a part of the window's content, so appearance and disappearance is often delayed when you minimise or restore a window. The popups also don't show on Windows7's Aero Peek feature, and don't properly become part of Windows7's minimise and restore animations. I'm sure similar stuff happens on other operating systems too.
  • If the user requests it, Internote's notes are supposed to be translucent (showing the page underneath), however translucency is not supported on Linux.
  • I haven't found a way to control the Z-order of popups, except that reopening a popup moves it to the top.
And believe me, there's more where that came from.
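As an illustration of the first workaround in the list above (shrinking a note that runs off the bottom right and internally scrolling one that runs off the top left), here is a rough Python sketch of the geometry.  It is not Internote's code: every name here is hypothetical, and it assumes a single screen whose origin is at (0, 0).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    x: int          # where the popup is actually opened on screen
    y: int
    width: int      # visible size after clipping to the screen
    height: int
    scroll_x: int   # how far the note's content is scrolled inside
    scroll_y: int   # the popup, hiding the part that is off-screen


def place_note(note_x, note_y, note_w, note_h,
               screen_w, screen_h) -> Optional[Placement]:
    """Clip a note's rectangle to the screen.

    Returns None if the note is entirely off-screen, in which case no
    popup needs to be shown at all.
    """
    left = max(note_x, 0)
    top = max(note_y, 0)
    right = min(note_x + note_w, screen_w)
    bottom = min(note_y + note_h, screen_h)
    if right <= left or bottom <= top:
        return None
    return Placement(
        x=left, y=top,
        width=right - left, height=bottom - top,
        # Off the top or left: keep the popup on-screen and scroll the
        # content so the note still appears fixed to the page.
        scroll_x=left - note_x, scroll_y=top - note_y,
    )
```

For example, a 100x50 note whose left edge is 20 pixels off-screen gets an 80-pixel-wide popup at x = 0 with its content scrolled 20 pixels to the right.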

Luckily, most of these issues can be worked around somehow or are trivial enough to ignore. But those workarounds ended up being an enormous amount of work.

One of my happiest moments was realising I didn't have to use individual popups, but could instead put a single transparent popup pane over the entire page, and place non-transparent content in there. Amazingly, when you click on transparent areas of a popup, Firefox lets the clicks through to the page. It sounds like a bit of a kludge, but it solved the scroll issues, the popup move issues, Z-order issues, and it got the TAB key to work.

On the other hand, one of my saddest moments was realising that this didn't work on the Mac, where the clicks wouldn't get through to the page. So to get the best user experience I must provide both implementations. Luckily long before discovering this issue I realised I should modularise this, which certainly made it a lot simpler.

All of this meant I basically had to rewrite the entire codebase. And since I wasn't getting paid for this, to relieve the monotony I redesigned many components and added many features.

And so, a year later, here we are. I have a beta which works around most of these issues reasonably, at least for Windows and Mac. I haven't figured out Linux yet, which seems to have even more bugs, and I'm still here plotting workarounds for the multiple monitor and flickery-move-on-Mac issues.

It does seem like there is some light at the end of the tunnel. I'm watching Bugzilla carefully for signs that this important functionality might be returning. Bug #313190 looks like what I want, and it appears to depend mostly on bug #130078, which might be fixed for Firefox4.

But even when it arrives, I'm not going to be able to set Firefox4 (or whatever) as a minimum requirement any time soon, so I'll have to do version detection, and add a third implementation for displaying notes. But if it just works like it should, that's something that I'll do with relish.
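Choosing between the three implementations might look something like the Python sketch below.  Everything in it is hypothetical: the strategy names, the function names, and especially `min_native_major`, which stands in for whichever Firefox version (if any) ends up restoring proper content-over-page support.  Whether clicks pass through a transparent popup is passed in as a flag rather than guessed from the platform.

```python
def parse_major(version):
    """Extract the major version number from a string like '3.6.8' or '4.0b3'."""
    return int(version.split(".")[0])


def choose_note_display(version, min_native_major, click_through_works):
    """Pick one of three (hypothetical) note display implementations.

    version             -- the running Firefox version string
    min_native_major    -- first major version with native support,
                           or None if no such version is known yet
    click_through_works -- True where clicks on transparent popup areas
                           reach the page underneath
    """
    if min_native_major is not None and parse_major(version) >= min_native_major:
        return "native-overlay"
    if click_through_works:
        return "single-transparent-pane"
    return "popup-per-note"
```

With no native support known, `choose_note_display("3.6.8", None, True)` picks the single transparent pane, and passing `False` for the last argument falls back to one popup per note.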

Welcome to the club.