Beware of the Train

PSA: Working Effectively With Legacy Code by Michael Feathers

2017-10-18T18:16:45Z

I am only about a hundred pages into this book, but it had paid for itself after fifty. I wish I'd read it ten years ago. Actually, I wish I'd read it nineteen years ago when I started my first programming job, but it hadn't been written then. Techies, if you haven't read it, I strongly advise you to do so at your earliest convenience. It's about how to deal with a Catch-22 that's come up over and over again in my programming career:

I can't safely modify, or even understand, this code, because it has no tests.
I can't test it without modifying it.

Feathers describes techniques for bringing code under test with the minimal amount of disruption, then refactoring it towards maintainability. Some of the advice I'd worked out for myself, but having names and a structure to hang ad-hoc insights on is great. The book concentrates on object-oriented and procedural languages, but a lot of the techniques should generalise to other paradigms.

Check out this book on Goodreads: https://www.goodreads.com/book/show/44919.Working_Effectively_with_Legacy_Code

comments

2015 in review

2015-12-31T23:56:43Z

Last year I made three New Year's Resolutions:

Get better at dealing with money.
Run a marathon.
Make a first ascent in the Greater Ranges.

Number 2 was an obvious success: I finished the Edinburgh Marathon in 4:24:04, and raised nearly £900 for the Against Malaria Foundation. I'd been hoping to get a slightly faster time than that, but I lost several weeks of training to a chest infection near to the end of my training programme, so in the end I was very happy to finish under 4:30. The actual running was... mostly Type II fun, but also much less miserable than many of my training runs, even at mile 21 when I realised that literally everything below my navel hurt. Huge thanks to everyone who sponsored me!

Number 3 was an equally obvious failure. My climbing partner and I picked out an unclimbed mountain in Kyrgyzstan and got a lot of the logistics sorted, but then he moved house and started a new job a month before we were due to get on a plane to Bishkek. With only a few weeks to go and no plane tickets or insurance bought yet (and them both being much more expensive than we'd expected - we'd checked prices months earlier, but forgot how steeply costs rise as time goes on), we regretfully pulled the plug. We're planning to try again in 2016 - let's hope all the good lines don't get nabbed by Johnny-come-lately Guardian readers.

Number 1 was a partial success. I tried a number of suggestions from friends who appear to have their financial shit more together than me (not hard), but couldn't get any of them to stick. I was diagnosed with ADHD at the end of 2014; I don't want to use that as an excuse, but it does mean that some things that come easily to most people are genuinely difficult for me - and financial mismanagement is apparently very common among people with ADHD. The flip-side, though, is that I have a license to do crazy or unusual things if they help me be effective, because I have an actual medical condition.

I've now set up the following system:

my salary (minus taxes and pension contributions) is paid into Account 1;
a couple of days later, most of it is transferred by standing order into Account 2;
all bills are paid from Account 2 by direct debit, and Account 2 should maintain enough of a balance for them to always clear;
money left in Account 1 is available for spending on day-to-day things;
if I pay for something on a credit card, I pay it off from Account 1 (if small) or Account 2 (if big) as soon as possible;
Account 2 pays interest up to a certain ceiling; above that I'm going to transfer money out into a tax-efficient Account 3, which pays less interest but which doesn't have a ceiling.

I'll have to fine-tune the amount left in Account 1 with practice, but this system should ensure that bills get paid, I can easily see how much money I have left to spend for the month, and very little further thought or effort on my part is required.

While I was in there, I took the opportunity to set up a recurring donation to the Against Malaria Foundation for a few percent of my net salary - less than the 10% required to call yourself an Official Good Person by the Effective Altruism movement, but I figure I can work up to it.

It's too early to say whether the system will work out, but setting it up has already been a beneficial exercise - before, I had seven accounts with five different providers, most of them expired and paying almost zero interest (in one file, I found seven years' worth of letters saying "Your investment has expired and is now paying 0.1% gross interest, please let us know what you want us to do with it.") I now have only the three accounts described above, from two different providers, so it should be much easier to keep track of my overall financial position. Interest rates currently suck in general, but Accounts 2 and 3 at least pay a bit.

I've also started a new job that pays more, and wormwood_pearl's writing is starting to bring in some money. We're trying not to go mad and spend our newfound money several times over, but we're looking to start replacing some broken kit over the next few months rather than endlessly patching things up.

What else has happened to us?

I had a very unsuccessful winter climbing season last year; I was ill a lot from the stress of marathon training, and when I wasn't ill the weather was terrible. I had a couple of good sessions at the Glasgow ice-climbing wall, but only managed one actual route. Fortunately, it was the classic Taxus on Beinn an Dothaidh, which I'd been wanting to tick for a while. I also passed the half-way mark on the Munros on a beautiful crisp winter day in Glencoe.

One by one, my former research group's PhD students finished, passed their vivas, submitted their corrections, and went off, hopefully, to glittering academic careers or untold riches in Silicon Valley. Good luck to them all.

In June, I did the training for a Mountain Leadership award, the UK's introductory qualification for leading groups into the hills. Most of the others on the course were much fitter than me and more competent navigators, but the instructor said I did OK. To complete the award, I'll need to log some more Quality Mountain Days and do a week-long assessment.

In July, we went to Mat Brown's wedding in Norfolk, and caught up with some friends we hadn't seen IRL for far too long. Unlike last year, when it felt like we were going to a wedding almost every weekend, we only went to one wedding this year; I'm glad it was such a good one. Also, it was in a field with camping available, which really helped to keep our costs down.

In July, I started a strength-training cycle. I've spent years thinking that my physical peak was during my teens, when I was rowing competitively (albeit badly) and training 15-20 hours a week, so I was surprised to learn that I was able to lift much more now than I could then - 120kg squats versus around 90kg (not counting the 20kg of body weight I've gained since then). Over the next few weeks, I was able to gain a bit more strength, and by the end I could squat 130kg. I also remembered how much I enjoy weight training - so much less miserable than cardio.

In August, we played host to a few friends for the Edinburgh Fringe, and saw some great shows, of which my favourite was probably Jurassic Park.

In September, we went to Amsterdam with friends for a long weekend, saw priceless art and took a canal tour; then I got back, turned around within a day and went north for a long-awaited hiking trip to Knoydart with my grad-school room-mate. There are two ways to get to Knoydart: either you can take the West Highland Line right to the end at Mallaig, then take the ferry, or you can get off at Glenfinnan (best known for the viaduct used in the Harry Potter films) and walk North for three days, sleeping in unheated huts known as bothies. We did the latter, only it took us six days because we bagged all the Munros en route. I'm very glad we did so. The weather was cold but otherwise kind to us, the insects were evil biting horrors from Hell, and the starfields were amazing. It wasn't Kyrgyzstan, but it was the best fallback Europe had to offer.

In October, I started a new job at Red Hat, working on the OpenStack project, which is an open-source datacenter management system. It's a huge, intimidating codebase, and I'm taking longer than I'd like to find my feet, but I like my team and I'm slowly starting to get my head around it.

That's about it, and it's five minutes to the bells - Happy New Year, and all the best for 2016!

comments

Building OCaml with Redo

2015-04-20T21:21:04Z

Context.

Calculating the dependency graph of your build ahead-of-time can be fiendishly difficult, or even impossible. Redo brings two, I think brilliant, insights to bear on this problem.

If you have to build the target in the course of calculating its dependencies, that's totally OK, because that's what you really wanted to do in the first place.
You don't actually have to know the entire build graph when you start building; you only need to know enough dependencies such that
- if a given target T needs to be rebuilt, at least one of the dependencies you know about for T will have been affected;
- in the course of building T, you will discover the remaining dependencies of T and rebuild any stale ones.

Let's try to write do-files for building OCaml modules.

In default.cmi.do:

redo-ifchange $2.mli
ocamlc $2.mli

In default.cmo.do:

redo-ifchange $2.cmi $2.ml
redo-ifchange `ocamldep $2.mli $2.ml`
ocamlc $2.ml

Note: these do-files will not actually work, because redo insists that you write your output to a temporary file called $3 so it can atomically rename the newly-built file into place, and ocamlc is equally insistent that it knows better than you what its output files should be called. However, this annoying interaction of their limitations is irrelevant to the dependency-checking algorithm, so I'll pretend that they do work :-) I'll try to construct a workaround and post it on GitHub. Update: I have now done so!

The redo-ifchange command says "if you know how to build my arguments, then build them; if any of them changes in the future, then the target built by this script will be out-of-date". So to build X.cmi, we observe that it depends on X.mli (which probably won't need building), and then build it. To build X.cmo, we observe that it depends on both X.ml and X.cmi (which will be rebuilt if need be). Then we invoke ocamldep to get a list of other files imported by X.ml, build those if any are out of date, and finally invoke ocamlc on X.ml to produce X.cmo.

Let's see how this plays out in the following scenario:

We build X.cmo.
We try to build it again.
We edit X.ml, adding a new dependency Y.
We rebuild X.cmo.

First, redo runs default.cmo.do, discovers that X.cmo depends on X.ml and X.cmi, recursively builds X.cmi (determining that it depends on X.mli), and finally compiles X.ml, producing X.cmo.

When we run redo-ifchange X.cmo again, redo will check its database of dependencies and observe that X.cmo depends transitively on X.cmi, X.ml and X.mli but that none of them have changed; hence, it will correctly do nothing.

Then we add the dependency, and run redo-ifchange X.cmo. Redo will again check its database of dependencies and note that X.ml has changed, so it must re-run default.cmo.do. First it notes that X.cmo depends on X.cmi and X.ml: it checks its database and sees that X.cmi depends on X.mli, which hasn't changed, so it leaves X.cmi alone. Next it re-runs ocamldep X.mli X.ml and hands the output to redo-ifchange: this tells redo that X.cmo now depends on Y.cmi. Y.cmi doesn't exist yet, so it builds it using the rules in default.cmi.do. Finally it compiles X.ml into X.cmo.

This system should work provided that all your dependencies live within the filesystem, or can be brought within it; however, if this is not the case then you probably have bigger problems :-)

comments

Readings in databases

2015-04-04T14:37:04Z

Learning about compilers made me a more confident programmer - a part of my toolchain that had felt like magic was revealed to be merely a complex (and, sometimes, beautiful) collection of mechanisms, and I gained a much greater understanding of how the code I typed was put into practice by the computer. It occurred to me recently that though I've been using relational databases for years (I first learned about normal forms in 2000ish, from a book called MS Access Unlocked or something similarly unimpressive), I don't actually have much idea of how they work. You enter a SQL query and it gets parsed as normal and then, er... something something query analyser... something B+-tree index query plan AND AS IF BY MAGIC it becomes a nice fast bit of looping and pointer arithmetic.

Clearly this was not good enough. So I asked on Twitter for reading recommendations. Here's what I got.

David Meier, The Theory of Relational Databases, 1983 (PDFs). As the name suggests, this looks heavy on the relational algebra and light on implementation.
C. J. Date, An Introduction to Database Systems (Amazon link to paper book), 2003. Apparently this "contains a lot about internals. The writing style is quite verbose though."
Abiteboul, Hull and Vianu, Foundations of Databases (PDFs). This appears to cover SQL and relational algebra in the first half of the book, and Datalog in the second. Which sounds very interesting, but not quite what I was looking for.
Stratis Viglas, Advanced Databases (PDF). Slides from a 2015 undergraduate course given at the University of Edinburgh. Covers topics like on-disk layout, external sorting, query optimization, transaction processing, B+-trees and hash joins - the stuff I was after, in other words.
Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, 2002 (Amazon link, though the first Google hit is a presumably-illegal PDF of the full text!) The course textbook for Viglas' course, this appears to cover relational algebra, practical SQL programming, the DB implementation stuff in the course, and quite a lot more.
Julia Evans (aka b0rk) wrote a nice sequence of blog posts in which she delves into SQLite internals; I should re-read these.
The SQLite documentation looks pretty good, and includes some information about internals.

It'll clearly take me a while to get through all that! My plan is to read through Viglas' notes, have a go at some of the exercises from his course (which also appear to be online), and then take a look at either Meier's book or AHV. Anything else I should be looking at? Does my strategy sound reasonable?

comments

CouchDB - JSON and B-trees and REST, oh my!

2013-06-02T20:21:12Z

I've been learning about the NoSQL database CouchDB, mainly from the Definitive Guide, but also from the Coursera Introduction to Data Science course and through an informative chat with necaris, who has used it extensively at Esplorio. The current draft of the Definitive Guide is rather out-of-date and has several long-open pull requests on GitHub, which doesn't exactly inspire confidence, but CouchDB itself appears to be actively maintained. I have yet to use CouchDB in anger, but here's what I've learned so far:

CouchDB is, at its core, an HTTP server providing append-only access to B-trees of versioned JSON objects via a RESTful interface. Say what now? Well, you store your data as JavaScript-like objects (which allow you to nest arrays and hash tables freely); each object is indexed by a key; you access existing objects and insert new ones using the standard HTTP GET, PUT and DELETE methods, specifying and receiving data in JavaScript Object Notation; you can't update objects, only replace them with new objects with the same key and a higher version number; and it's cheap to request all the objects with keys in a given range.

The JSON is not by default required to conform to any particular schema, but you can add validation functions to be called every time data is added to the database. These will reject improperly-formed data.

CouchDB is at pains to be RESTful, to emit proper cache-invalidation data, and so on, and this is key to scaling it out: put a contiguous subset of (a consistent hash of) the keyspace on each machine, and build a tree of reverse HTTP proxies (possibly caching ones) in front of your database cluster.

CouchDB's killer feature is probably master-to-master replication: if you want to do DB operations on a machine that's sometimes disconnected from the rest of the cluster (a mobile device, say), then you can do so, and sync changes up and down when you reconnect. Conflicts are flagged but not resolved by default; you can resolve them manually or automatically by recording a new version of the conflicted object. Replication is also used for load-balancing, failover and scaling out: you can maintain one or more machines that constantly replicate the master server for a section of keyspace, and you can replicate only a subset of keyspace onto a new database when you need to expand.

CouchDB doesn't guarantee to preserve all the history of an object, and in particular replications only seem to send the most recent version; I think this precludes Git-style three-way merge from the conflicting versions' most recent common ancestor (and forget about Darcs-style full-history merging!).

The cluster-management story isn't as good as for some other systems, but there are a couple of PaaS offerings.

Queries/views and non-primary indexes are both handled using map/reduce. If you want to index on something other than the primary key - posts by date, say - then you write a map query which emits (date, post) pairs. These are put into another B-tree, which is stored on disk; clever things are done to mark subtrees invalid as new data comes in, and changes to the query result or index are calculated lazily. Since indices are stored as B-trees, it's cheap to get all the objects within a given range of secondary keys: all posts in February, for instance.

CouchDB's reduce functions are crippled: attempting to calculate anything that isn't a scalar or a fixed-size object is considered Bad Form, and may cause your machine(s) to thrash. AFAICT you can't reduce results from different machines by this mechanism: CouchDB Lounge requires you to write extra merge functions in Twisted Python.

Map, reduce and validation functions (and various others, see below) are by default written in JavaScript. But CouchDB invokes an external interpreter for them, so it's easy to extend CouchDB with a new query server. Several such have been written, and it's now possible to write your functions in many different languages.

There's a very limited SQL view engine, but AFAICT nothing like Hive or Pig that can take a complex query and compile it down into a number of chained map/reduce jobs. The aforementioned restrictions on reduce functions mean that the strategy I've been taught for expressing joins as map/reduce jobs won't work; I don't know if this limitation is fundamental. But it's IME pretty rare to require general joins in applications: usually you want to do some filtering or summarisation on at least one side.

CouchDB can't quite make up its mind whether it wants to be a database or a Web application framework. It comes by default with an administration web app called Futon; you can also use it to store and execute code for rendering objects as HTML, Atom, etc. Such code (along with views, validations etc) is stored in special JSON objects called "design documents": best practice is apparently to have one design document for each application that needs to access the underlying data. Since design documents are ordinary JSON objects, they are propagated between nodes by replications.

However, various standard webapp-framework bits are missing, notably URL routing. But hey, you can always use mod_rewrite...

There's a tool called Erica (and an older one called CouchApp) which allows you to sync design documents with more conventional source-code directories in your filesystem.

CouchDB is written in Erlang, and the functional-programming influence shows up in other places: most types of user-defined function are required to be free of side-effects, for instance. Then there's the aforementioned uses of lazy evaluation and the append-only nature of the system as a whole. You can extend it with your own Erlang code or embed it into an Erlang application, bypassing the need for HTTP requests.

tl;dr if you've ever thought "data modelling and synchronisation are hard, let's just stick a load of JSON files in Git" (as I have, on several occasions), then CouchDB is probably a good fit to your needs. Especially if your analytics needs aren't too complicated.

comments

The most evil code I've ever written?

2011-07-26T10:31:17Z

Back in 2003 when I was working for $hateful_defence_contractor, we were dealing with a lot of quantities expressed in dB. Occasionally there was a need to add these things - no, I can't remember why. Total power output from multiple sources, or something. Everyone cursed about this. So I wrote a desk calculator script, along these lines:

#!/usr/bin/perl

print "> ";
while (<>) {
   s|-?\d+(\.\d+)?([eE][-+]?\d+)?|10**($&/10)|oeg;
   print 10*log(eval $_)/log(10)."\n> ";
}

I've always thought of this as The Most Evil Code I've Ever Written. For those of you who don't speak fluent regex, it reads a line of input from the user, interprets everything that looks like a number as a number of decibels, replaces each decibel-number with the non-decibel equivalent, evaluates the resulting string as Perl code, and then converts the result back into decibels. Here's an example session:

> 1 + 1
4.01029995663982
> 10 * 10 
20
> cos(1)
-5.13088257108395

Some of you are no doubt thinking "Well of course that code's evil, it's written in Perl!" But no. Here's the same algorithm written in Python:

#!/usr/bin/python -u
import re, math, sys

def repl(match):
        num = float(match.group(0))
        return str(10**(num/10))

number = re.compile(r'-?\d+(\.\d+)?([eE][-+]?\d+)?')
while 1:
        line = sys.stdin.readline()
        if len(line) == 0:
                break
        line = re.sub(number, repl, line)
        print 10*math.log10(eval(line))

If anything, the Perl version is simpler and has lower accidental complexity. If Perl's not the best imaginable language for expressing that algorithm, it's damn close.

[I also tried to write a Haskell version using System.Eval.Haskell, but I got undefined reference to `__stginit_pluginszm1zi5zi1zi4_SystemziEvalziHaskell_' errors, suggesting my installation of cabal is b0rked. Anyone know what I need to do to fix it? Also, I'm sure my Python style can be greatly improved - suggestions most welcome.]

No, I thought of it as evil because it's doing an ugly thing: superficially altering code with regexes and then using string eval? And who the hell adds decibels, anyway?

Needless to say, it was the most successful piece of code I wrote in the year I spent in that job.

I was talking about string eval to Aaron Crane the other day, and I mentioned this program. His response surprised me:

I disagree; I think it’s a lovely piece of code. It may not be a beautiful jewel of elegant abstractions for a complex data model, true. But it’s small, simple, trivial to write, works on anything with a Perl interpreter (of pretty much any vintage, and with no additional dependencies), and clearly correct once you’ve thought about the nature of the arithmetic you’re doing. While it’s not something you’d ship as part of a safety-critical system, for example, I can’t see any way it could be realistically improved as an internal tool, aimed at users who are aware of its potential limitations.

[Full disclosure: the Perl version above didn't work first time. But the bug was quickly found, and it did work the second time :-)]

The lack of external dependencies (also a virtue of the Python version, which depends only on core modules) was very much intentional: I wrote my program so it could be trivially distributed (by samizdat, if necessary). Most of my colleagues weren't Perl programmers, and if I'd said "First, install Regexp::Common from CPAN...", I'd have lost half my potential audience. As it was, the tool was enthusiastically adopted.

So, what do you all think? Is it evil or lovely? Or both? And what's the most evil piece of code that you've written?

Edit: Aaron also pointed me at this program, which is both lovely and evil in a similar way. If you don't understand what's going on, type

perl -e 'print join q[{,-,+}], 1..9'

and

perl -e 'print glob "1{,-,+}2"'

at the command-line.

comments