pozorvlak: (Default)
Wednesday, October 18th, 2017 07:09 pm
I am only about a hundred pages into this book, but it had paid for itself after fifty. I wish I'd read it ten years ago. Actually, I wish I'd read it nineteen years ago when I started my first programming job, but it hadn't been written then. Techies, if you haven't read it, I strongly advise you to do so at your earliest convenience. It's about how to deal with a Catch-22 that's come up over and over again in my programming career:
  • I can't safely modify, or even understand, this code, because it has no tests.
  • I can't test it without modifying it.
Feathers describes techniques for bringing code under test with the minimal amount of disruption, then refactoring it towards maintainability. Some of the advice I'd worked out for myself, but having names and a structure to hang ad-hoc insights on is great. The book concentrates on object-oriented and procedural languages, but a lot of the techniques should generalise to other paradigms.

Check out this book on Goodreads: https://www.goodreads.com/book/show/44919.Working_Effectively_with_Legacy_Code

Working Effectively with Legacy Code by Michael Feathers
pozorvlak: (Default)
Thursday, December 31st, 2015 08:18 pm

Last year I made three New Year's Resolutions:

  1. Get better at dealing with money.
  2. Run a marathon.
  3. Make a first ascent in the Greater Ranges.

Number 2 was an obvious success: I finished the Edinburgh Marathon in 4:24:04, and raised nearly £900 for the Against Malaria Foundation. I'd been hoping to get a slightly faster time than that, but I lost several weeks of training to a chest infection near to the end of my training programme, so in the end I was very happy to finish under 4:30. The actual running was... mostly Type II fun, but also much less miserable than many of my training runs, even at mile 21 when I realised that literally everything below my navel hurt. Huge thanks to everyone who sponsored me!

Number 3 was an equally obvious failure. My climbing partner and I picked out an unclimbed mountain in Kyrgyzstan and got a lot of the logistics sorted, but then he moved house and started a new job a month before we were due to get on a plane to Bishkek. With only a few weeks to go and no plane tickets or insurance bought yet (and them both being much more expensive than we'd expected - we'd checked prices months earlier, but forgot how steeply costs rise as time goes on), we regretfully pulled the plug. We're planning to try again in 2016 - let's hope all the good lines don't get nabbed by Johnny-come-lately Guardian readers.

Number 1 was a partial success. I tried a number of suggestions from friends who appear to have their financial shit more together than me (not hard), but couldn't get any of them to stick. I was diagnosed with ADHD at the end of 2014; I don't want to use that as an excuse, but it does mean that some things that come easily to most people are genuinely difficult for me - and financial mismanagement is apparently very common among people with ADHD. The flip-side, though, is that I have a license to do crazy or unusual things if they help me be effective, because I have an actual medical condition.

I've now set up the following system:

  • my salary (minus taxes and pension contributions) is paid into Account 1;
  • a couple of days later, most of it is transferred by standing order into Account 2;
  • all bills are paid from Account 2 by direct debit, and Account 2 should maintain enough of a balance for them to always clear;
  • money left in Account 1 is available for spending on day-to-day things;
  • if I pay for something on a credit card, I pay it off from Account 1 (if small) or Account 2 (if big) as soon as possible;
  • Account 2 pays interest up to a certain ceiling; above that I'm going to transfer money out into a tax-efficient Account 3, which pays less interest but which doesn't have a ceiling.

I'll have to fine-tune the amount left in Account 1 with practice, but this system should ensure that bills get paid, I can easily see how much money I have left to spend for the month, and very little further thought or effort on my part is required.

While I was in there, I took the opportunity to set up a recurring donation to the Against Malaria Foundation for a few percent of my net salary - less than the 10% required to call yourself an Official Good Person by the Effective Altruism movement, but I figure I can work up to it.

It's too early to say whether the system will work out, but setting it up has already been a beneficial exercise - before, I had seven accounts with five different providers, most of them expired and paying almost zero interest (in one file, I found seven years' worth of letters saying "Your investment has expired and is now paying 0.1% gross interest, please let us know what you want us to do with it.") I now have only the three accounts described above, from two different providers, so it should be much easier to keep track of my overall financial position. Interest rates currently suck in general, but Accounts 2 and 3 at least pay a bit.

I've also started a new job that pays more, and [profile] wormwood_pearl's writing is starting to bring in some money. We're trying not to go mad and spend our newfound money several times over, but we're looking to start replacing some broken kit over the next few months rather than endlessly patching things up.

What else has happened to us?

I had a very unsuccessful winter climbing season last year; I was ill a lot from the stress of marathon training, and when I wasn't ill the weather was terrible. I had a couple of good sessions at the Glasgow ice-climbing wall, but only managed one actual route. Fortunately, it was the classic Taxus on Beinn an Dothaidh, which I'd been wanting to tick for a while. I also passed the half-way mark on the Munros on a beautiful crisp winter day in Glencoe.

One by one, my former research group's PhD students finished, passed their vivas, submitted their corrections, and went off, hopefully, to glittering academic careers or untold riches in Silicon Valley. Good luck to them all.

In June, I did the training for a Mountain Leadership award, the UK's introductory qualification for leading groups into the hills. Most of the others on the course were much fitter than me and more competent navigators, but the instructor said I did OK. To complete the award, I'll need to log some more Quality Mountain Days and do a week-long assessment.

In July, we went to Mat Brown's wedding in Norfolk, and caught up with some friends we hadn't seen IRL for far too long. Unlike last year, when it felt like we were going to a wedding almost every weekend, we only went to one wedding this year; I'm glad it was such a good one. Also, it was in a field with camping available, which really helped to keep our costs down.

In July, I started a strength-training cycle. I've spent years thinking that my physical peak was during my teens, when I was rowing competitively (albeit badly) and training 15-20 hours a week, so I was surprised to learn that I was able to lift much more now than I could then - 120kg squats versus around 90kg (not counting the 20kg of body weight I've gained since then). Over the next few weeks, I was able to gain a bit more strength, and by the end I could squat 130kg. I also remembered how much I enjoy weight training - so much less miserable than cardio.

In August, we played host to a few friends for the Edinburgh Fringe, and saw some great shows, of which my favourite was probably Jurassic Park.

In September, we went to Amsterdam with friends for a long weekend, saw priceless art and took a canal tour; then I got back, turned around within a day and went north for a long-awaited hiking trip to Knoydart with my grad-school room-mate. There are two ways to get to Knoydart: either you can take the West Highland Line right to the end at Mallaig, then take the ferry, or you can get off at Glenfinnan (best known for the viaduct used in the Harry Potter films) and walk North for three days, sleeping in unheated huts known as bothies. We did the latter, only it took us six days because we bagged all the Munros en route. I'm very glad we did so. The weather was cold but otherwise kind to us, the insects were evil biting horrors from Hell, and the starfields were amazing. It wasn't Kyrgyzstan, but it was the best fallback Europe had to offer.

In October, I started a new job at Red Hat, working on the OpenStack project, which is an open-source datacenter management system. It's a huge, intimidating codebase, and I'm taking longer than I'd like to find my feet, but I like my team and I'm slowly starting to get my head around it.

That's about it, and it's five minutes to the bells - Happy New Year, and all the best for 2016!

pozorvlak: (Default)
Wednesday, May 13th, 2015 06:24 pm

Over the seven years or so I've been using Git, there have been a number of moments where I've felt like I've really levelled up in my understanding or usage of Git. Here, in the hope of accelerating other people's learning curves, are as many as I can remember. I'd like to thank Joe Halliwell for introducing me to Git and helping me over the initial hurdles, and Aaron Crane for many helpful and enlightening discussions over the years - especially the ones in which he cleared up some of my many misunderstandings.

Learning to use a history viewer

Even if all you're doing is commit/push/pull, you're manipulating the history graph. Don't try to imagine it in your mind - get the computer to show it to you! That way you can see the effect your actions have, tightening your feedback loop and improving your learning rate. It's also a lot easier to debug problems when you can see what's going on. I started out using gitk --all to view history; now I mostly use git lg, which is an alias for log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative, but the effect is much the same.
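
If you fancy the same alias, something like this should set it up (watch the shell quoting, and adjust the format string to taste):

git config --global alias.lg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative"
git lg --all   # like gitk --all, but in your terminal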

The really important lessons to internalise at this stage are

  • Git maintains your history as a graph of snapshots (you'll hear people using the term DAG, for "directed acyclic graph").
  • Branches and tags are just pointers into this graph.

If you understand that, then congratulations! You now understand 90% of what's important about Git. Just keep that core model in mind, and it should be possible to reason about the rest.
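
A quick way to convince yourself of the second point (assuming the branch hasn't yet been packed into .git/packed-refs):

# a branch really is just a file containing a commit hash
cat .git/refs/heads/master
# tags and remote-tracking branches live in similar files under .git/refs
ls .git/refs/tags .git/refs/remotes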

Relatedly: you'll probably want to graduate to the command-line eventually, but don't be ashamed to use a graphical client for any remotely unfamiliar task.

Understanding hash-based identifiers

You know those weird c56ab7f identifiers Git uses for commits? Those are actually serving a very important role. Mark Jason Dominus has written a great explanation, and I also suggest having a poke around in the .git/objects directory of one of your repos. MJD talks about "blobs" and "trees" which are also identified with hashes, but you don't actually have to think about them much in day-to-day Git usage: commits are the most important objects.
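
For the poking around, git cat-file is your friend - a quick sketch:

git cat-file -t HEAD            # "commit"
git cat-file -p HEAD            # the commit: tree hash, parent hash(es), author, committer, message
git cat-file -p 'HEAD^{tree}'   # the tree: (mode, type, hash, filename) entries, one per file or subdirectory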

Learning to use git rebase

This shouldn't really qualify as a GLUE - my very first experience with Git was using the git-svn plugin, which forces you to rebase at all times. However, even if you're not interoperating with SVN, you should give rebase a go. What does it do, you ask? Well, the clue's in the name: it takes the sequence of commits you specify, and moves them onto a new base: hence "re-base". See also the MJD talk above, which explains rebasing in more detail.
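
A minimal sketch, with invented branch names:

git checkout feature
git rebase master                             # replay feature's commits on top of master's tip
git rebase --onto master old-base feature     # or move just the commits after old-base onto master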

Note to GUI authors: the obvious UI for this is selecting a load of commits and dragging-and-dropping them to somewhere else in the history graph. Sadly, I don't know of any GUIs that actually do this. Edit: MJD agrees, and is seeking collaborators for work on such a GUI.

Rebasing occasionally gets stuck when it encounters a conflict. The thing to realise here is that the error messages are really good. Keep calm, read the instructions, then follow them, and everything should turn out OK (but if it doesn't, feel free to skip to the next-but-two GLUE).

Giving up on git mergetool

git mergetool allows you to open a graphical diff/merge tool of your choice for handling conflicts. Don't use it. Git's internal merge has complete knowledge of your code's history, which enables it to do a better job of resolving conflicts than any external merge tool I'm aware of. If you use a third-party tool, expect to waste a lot of time manually resolving "conflicts" that the computer could have handled itself; the internal merge algorithm will handle these just fine, and present you with only the tricky cases.

Learning to use git rebase --interactive

I thought this was going to be hard and scary, but it's actually really easy and pleasant to use, I promise! The name is terrible - while git rebase --interactive can do a normal rebase as part of its intended work, this is usually a bad idea. What it's actually for is rewriting history. Have you made two commits that should really be squashed into one? Git rebase --interactive! Want to re-order some commits? Git rebase --interactive! Delete some entirely? Git rebase --interactive! Split them apart into smaller commits? Git rebase --interactive! The interface to this is extremely friendly by Git CLI standards: specify the commit before the first one you want to rewrite, and Git will open an editor window with a list of the commits-to-rewrite. Follow the instructions. If that's not clear enough, read this post.
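
For example, to rewrite your last four commits (hashes and messages invented):

git rebase --interactive HEAD~4
# Git opens an editor containing something like:
#   pick 1a2b3c4 Add frobnicator
#   pick 5d6e7f8 Fix typo in frobnicator
#   pick 9a0b1c2 Add tests for frobnicator
#   pick 3d4e5f6 More frobnicator tests
# Change "pick" to "squash", "fixup", "edit" or "reword", reorder the lines, or delete
# a line to drop that commit entirely; then save and quit, and Git does the rest.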

This should also be represented in a GUI by dragging-and-dropping, but I don't know of any clients that do this.

Learning about git reflog

Git stores a list of every commit you've had checked out. Look at the output of git reflog after a few actions: this allows you to see what's happened, and is thus valuable for learning. More usefully, it also allows you to undo any history rewriting. Here's the thing: Git doesn't actually rewrite history. It writes new history, and hides the old version (which is eventually garbage-collected after a month). But you can still get the old version back if you have a way to refer to it, and that's what git reflog gives you. Knowing that you can easily undo any mistakes allows you to be a lot bolder in your experiments.
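
A sketch of the undo workflow (output abridged, hashes invented):

git reflog
# 3f78685 HEAD@{0}: rebase -i (finish): returning to refs/heads/master
# 9a0b1c2 HEAD@{1}: rebase -i (start): checkout HEAD~4
# c56ab7f HEAD@{2}: commit: the state you actually wanted
git reset --hard 'HEAD@{2}'    # or: git reset --hard c56ab7f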

Here's a thing I only recently learned: Git also keeps per-branch reflogs, which can be accessed using git reflog $branchname. This saves you some log-grovelling. Sadly, these only contain local changes: history-rewritings done by other people won't show up.

Reading the code of git-rebase--interactive

Interactive rebase is implemented as a shell script, which on my system lives at /usr/lib/git-core/git-rebase--interactive. It's pretty horrible code, but it's eye-opening to see how the various transformations are implemented at the low-level "plumbing" layer.

I actually did this as part of a (failed) project to implement the Darcs merge algorithm on top of Git. I still think this would be a good idea, if anyone wants to have a go. See also this related project, which AFAICT has some of the same advantages as the Darcs merge algorithm, but with better asymptotic complexity.

Learning how the recursive merge algorithm works

You don't actually need to know this, but it's IMHO pretty elegant. Here's a good description.

Learning what HEAD actually means

I learned this from reading two great blog posts by Federico Mena Quintero: part 1, part 2. Those posts also cleared up a lot of my confusion about how remotes and remote-tracking branches work.

Learning to read refspec notation

The manual is pretty good here. Once you know how refspec notation works, you'll notice it's used all over and a lot of things will click into place.
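
For instance, here's the default fetch refspec that git clone sets up, and an explicit refspec given to git push (branch name invented):

git config --get-all remote.origin.fetch
# +refs/heads/*:refs/remotes/origin/*
# "+" allows non-fast-forward updates; the part before ":" matches refs on the remote,
# and the part after says where to store them locally (as remote-tracking branches).
git push origin mybranch:refs/heads/mybranch   # for push, local ref on the left, remote ref on the right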

Learning what git reset actually does

I'd been using git reset in stereotyped ways since the beginning: for instance, I knew that git reset --hard HEAD meant "throw away all uncommitted changes in my working directory" and git reset --hard [commit hash obtained from reflog] meant "throw away a broken attempt at history-rewriting". But it turns out that this is not the core of reset. Again, MJD has written a great explanation, but here's the tl;dr:

  1. It points the HEAD ref (aka the current branch - remember, branches are just pointers) at a new 'target' commit, if you specified one.
  2. Then it copies the tree of the HEAD commit to the index, unless you said --soft.
  3. Finally, it copies the contents of the index to the working tree, if you said --hard.

Once you understand this, git reset becomes part of your toolkit, something you can apply to new problems. Is a branch pointing to the wrong commit? There's a command that does exactly what you need, and it's git reset. Here's yet another MJD post, in which he explains a nonobvious usage for git reset which makes perfect sense in light of the above. Here's another one: suppose someone has rewritten history in a remote branch. If you do git pull you'll create a merge commit between your idea of the current commit and upstream's idea; if you later git push it you'll have created a messy history and people will be annoyed with you. No problem! git fetch upstream $branch; git checkout $branch; git reset --hard upstream/$branch.
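
Spelled out as commands (branch, remote and hash names invented):

# current branch points at the wrong commit: move the pointer, leave the working tree alone
git reset --soft 1a2b3c4
# upstream has rewritten history: make your local branch an exact copy of theirs
git fetch upstream mybranch
git checkout mybranch
git reset --hard upstream/mybranch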

git add -p and friends

Late Entry! The git add -p command, which interactively adds changes to the index (the -p is short for --patch), wasn't much of a surprise to me because I was used to darcs record; however, it seems to be a surprise to many people, so (at Joe Halliwell's suggestion) it deserves a mention here. And only tonight Aaron informed me of the existence of its cousins git reset -p, git commit -p and git checkout -p, which are totally going in my toolkit.
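
If you haven't tried it, the workflow is roughly:

git add -p       # shows each unstaged hunk in turn; y stages it, n skips it,
                 # s splits it into smaller hunks, e lets you edit it by hand
git commit       # commits only what you staged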

Whew!

I hope that helped someone! Now, over to those more expert than me - what should be my next enlightenment experience?

pozorvlak: (Default)
Monday, April 20th, 2015 08:29 pm
Context.

Calculating the dependency graph of your build ahead-of-time can be fiendishly difficult, or even impossible. Redo brings two, I think brilliant, insights to bear on this problem.
  1. If you have to build the target in the course of calculating its dependencies, that's totally OK, because that's what you really wanted to do in the first place.
  2. You don't actually have to know the entire build graph when you start building; you only need to know enough dependencies such that
    • if a given target T needs to be rebuilt, at least one of the dependencies you know about for T will have been affected;
    • in the course of building T, you will discover the remaining dependencies of T and rebuild any stale ones.
Let's try to write do-files for building OCaml modules.

In default.cmi.do:
redo-ifchange $2.mli
ocamlc -c $2.mli

In default.cmo.do:
redo-ifchange $2.cmi $2.ml
redo-ifchange `ocamldep $2.mli $2.ml`
ocamlc -c $2.ml
Note: these do-files will not actually work, because redo insists that you write your output to a temporary file called $3 so it can atomically rename the newly-built file into place, and ocamlc is equally insistent that it knows better than you what its output files should be called. However, this annoying interaction of their limitations is irrelevant to the dependency-checking algorithm, so I'll pretend that they do work :-) I'll try to construct a workaround and post it on GitHub. Update: I have now done so!
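
For what it's worth, here's the rough shape a workaround can take (a sketch only, and not necessarily what ended up on GitHub): let ocamlc write where it likes, then move the result to $3 and let redo rename it back into place.

In default.cmi.do:
redo-ifchange $2.mli
ocamlc -c $2.mli
mv $2.cmi $3

In default.cmo.do:
redo-ifchange $2.cmi $2.ml
redo-ifchange `ocamldep $2.mli $2.ml`
ocamlc -c $2.ml
mv $2.cmo $3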

The redo-ifchange command says "if you know how to build my arguments, then build them; if any of them changes in the future, then the target built by this script will be out-of-date". So to build X.cmi, we observe that it depends on X.mli (which probably won't need building), and then build it. To build X.cmo, we observe that it depends on both X.ml and X.cmi (which will be rebuilt if need be). Then we invoke ocamldep to get a list of other files imported by X.ml, build those if any are out of date, and finally invoke ocamlc on X.ml to produce X.cmo.

Let's see how this plays out in the following scenario:
  1. We build X.cmo.
  2. We try to build it again.
  3. We edit X.ml, adding a new dependency Y.
  4. We rebuild X.cmo.
First, redo runs default.cmo.do, discovers that X.cmo depends on X.ml and X.cmi, recursively builds X.cmi (determining that it depends on X.mli), and finally compiles X.ml, producing X.cmo.

When we run redo-ifchange X.cmo again, redo will check its database of dependencies and observe that X.cmo depends transitively on X.cmi, X.ml and X.mli but that none of them have changed; hence, it will correctly do nothing.

Then we add the dependency, and run redo-ifchange X.cmo. Redo will again check its database of dependencies and note that X.ml has changed, so it must re-run default.cmo.do. First it notes that X.cmo depends on X.cmi and X.ml: it checks its database and sees that X.cmi depends on X.mli, which hasn't changed, so it leaves X.cmi alone. Next it re-runs ocamldep X.mli X.ml and hands the output to redo-ifchange: this tells redo that X.cmo now depends on Y.cmi. Y.cmi doesn't exist yet, so it builds it using the rules in default.cmi.do. Finally it compiles X.ml into X.cmo.

This system should work provided that all your dependencies live within the filesystem, or can be brought within it; however, if this is not the case then you probably have bigger problems :-)
pozorvlak: (Default)
Saturday, April 4th, 2015 03:38 pm
Learning about compilers made me a more confident programmer - a part of my toolchain that had felt like magic was revealed to be merely a complex (and, sometimes, beautiful) collection of mechanisms, and I gained a much greater understanding of how the code I typed was put into practice by the computer. It occurred to me recently that though I've been using relational databases for years (I first learned about normal forms in 2000ish, from a book called MS Access Unlocked or something similarly unimpressive), I don't actually have much idea of how they work. You enter a SQL query and it gets parsed as normal and then, er... something something query analyser... something B+-tree index query plan AND AS IF BY MAGIC it becomes a nice fast bit of looping and pointer arithmetic.

Clearly this was not good enough. So I asked on Twitter for reading recommendations. Here's what I got.
  • David Maier, The Theory of Relational Databases, 1983 (PDFs). As the name suggests, this looks heavy on the relational algebra and light on implementation.
  • C. J. Date, An Introduction to Database Systems (Amazon link to paper book), 2003. Apparently this "contains a lot about internals. The writing style is quite verbose though."
  • Abiteboul, Hull and Vianu, Foundations of Databases (PDFs). This appears to cover SQL and relational algebra in the first half of the book, and Datalog in the second. Which sounds very interesting, but not quite what I was looking for.
  • Stratis Viglas, Advanced Databases (PDF). Slides from a 2015 undergraduate course given at the University of Edinburgh. Covers topics like on-disk layout, external sorting, query optimization, transaction processing, B+-trees and hash joins - the stuff I was after, in other words.
  • Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, 2002 (Amazon link, though the first Google hit is a presumably-illegal PDF of the full text!) The course textbook for Viglas' course, this appears to cover relational algebra, practical SQL programming, the DB implementation stuff in the course, and quite a lot more.
  • Julia Evans (aka b0rk) wrote a nice sequence of blog posts in which she delves into SQLite internals; I should re-read these.
  • The SQLite documentation looks pretty good, and includes some information about internals.

It'll clearly take me a while to get through all that! My plan is to read through Viglas' notes, have a go at some of the exercises from his course (which also appear to be online), and then take a look at either Meier's book or AHV. Anything else I should be looking at? Does my strategy sound reasonable?
pozorvlak: (Default)
Monday, August 19th, 2013 09:10 pm
If I ever design a first course in compilers, I'll do it backwards: we'll start with code generation, move on to static analysis, then parse a token stream and finally lex source text. As we go along, students will be required to implement a small compiler for a simple language, using the techniques they've just learned. The (excellent) Stanford/Coursera compilers course does something similar, but they proceed in the opposite direction, following the dataflow in the final compiler: first they cover lexing, then parsing, then semantic analysis, then codegen. The first Edinburgh compilers course follows roughly the same plan of lectures, and I expect many other universities' courses do too.

I think a backwards course would work better for two reasons:
  1. Halfway through the Stanford course, you have a program that can convert source text into an intermediate representation with which you can't do very much. Halfway through the backwards course, you'd have a compiler for an unfriendly source language: you could write programs directly in whatever level of IR you'd got to (I'm assuming a sensible implementation language that doesn't make entering data literals too hard), and compile them to native code using code you'd written. I think that would be pretty motivating.
  2. When I did the Stanford course, all the really good learning experiences were in the back end. Writing a Yacc parser was a fiddly but largely straightforward experience; writing the code generator taught me loads about how your HLL code is actually executed and how runtime systems work. I also learned some less obvious things like the value of formal language specifications¹. Most CS students won't grow up to be compiler hackers, but they will find it invaluable to have a good idea of what production compilers do to their HLL code; it'll be much more useful than knowing about all the different types of parsing techniques, anyway². Students will drop out halfway through the course, and even those who make it all the way through will be tired and stressed by the end and will thus take in less than they did at the beginning: this argues for front-loading the important material.
What am I missing?

¹ I learned this the hard way. I largely ignored the formal part of the spec when writing my code generator, relying instead on the informal description; then towards the end of the allocated period I took a closer look at it and realised that it provided simple answers to all the thorny issues I'd been struggling with.
² The answer to "how should I write this parser?" in an industrial context is usually either "with a parser generator" or "recursive descent". LALR parsers such as those produced by Yacc are a pain to debug if you don't understand the underlying theory, true, but that's IMHO an argument for using a parser generator based on some other algorithm, most of which are more forgiving.
pozorvlak: (Default)
Sunday, June 2nd, 2013 09:15 pm
I've been learning about the NoSQL database CouchDB, mainly from the Definitive Guide, but also from the Coursera Introduction to Data Science course and through an informative chat with [personal profile] necaris, who has used it extensively at Esplorio. The current draft of the Definitive Guide is rather out-of-date and has several long-open pull requests on GitHub, which doesn't exactly inspire confidence, but CouchDB itself appears to be actively maintained. I have yet to use CouchDB in anger, but here's what I've learned so far:

  • CouchDB is, at its core, an HTTP server providing append-only access to B-trees of versioned JSON objects via a RESTful interface. Say what now? Well, you store your data as JavaScript-like objects (which allow you to nest arrays and hash tables freely); each object is indexed by a key; you access existing objects and insert new ones using the standard HTTP GET, PUT and DELETE methods, specifying and receiving data in JavaScript Object Notation; you can't update objects, only replace them with new objects with the same key and a higher version number; and it's cheap to request all the objects with keys in a given range. (There's a short curl sketch of what this looks like on the wire just after this list.)

  • The JSON is not by default required to conform to any particular schema, but you can add validation functions to be called every time data is added to the database. These will reject improperly-formed data.

  • CouchDB is at pains to be RESTful, to emit proper cache-invalidation data, and so on, and this is key to scaling it out: put a contiguous subset of (a consistent hash of) the keyspace on each machine, and build a tree of reverse HTTP proxies (possibly caching ones) in front of your database cluster.

  • CouchDB's killer feature is probably master-to-master replication: if you want to do DB operations on a machine that's sometimes disconnected from the rest of the cluster (a mobile device, say), then you can do so, and sync changes up and down when you reconnect. Conflicts are flagged but not resolved by default; you can resolve them manually or automatically by recording a new version of the conflicted object. Replication is also used for load-balancing, failover and scaling out: you can maintain one or more machines that constantly replicate the master server for a section of keyspace, and you can replicate only a subset of keyspace onto a new database when you need to expand.

  • CouchDB doesn't guarantee to preserve all the history of an object, and in particular replications only seem to send the most recent version; I think this precludes Git-style three-way merge from the conflicting versions' most recent common ancestor (and forget about Darcs-style full-history merging!).

  • The cluster-management story isn't as good as for some other systems, but there are a couple of PaaS offerings.

  • Queries/views and non-primary indexes are both handled using map/reduce. If you want to index on something other than the primary key - posts by date, say - then you write a map query which emits (date, post) pairs. These are put into another B-tree, which is stored on disk; clever things are done to mark subtrees invalid as new data comes in, and changes to the query result or index are calculated lazily. Since indices are stored as B-trees, it's cheap to get all the objects within a given range of secondary keys: all posts in February, for instance.

  • CouchDB's reduce functions are crippled: attempting to calculate anything that isn't a scalar or a fixed-size object is considered Bad Form, and may cause your machine(s) to thrash. AFAICT you can't reduce results from different machines by this mechanism: CouchDB Lounge requires you to write extra merge functions in Twisted Python.

  • Map, reduce and validation functions (and various others, see below) are by default written in JavaScript. But CouchDB invokes an external interpreter for them, so it's easy to extend CouchDB with a new query server. Several such have been written, and it's now possible to write your functions in many different languages.

  • There's a very limited SQL view engine, but AFAICT nothing like Hive or Pig that can take a complex query and compile it down into a number of chained map/reduce jobs. The aforementioned restrictions on reduce functions mean that the strategy I've been taught for expressing joins as map/reduce jobs won't work; I don't know if this limitation is fundamental. But it's IME pretty rare to require general joins in applications: usually you want to do some filtering or summarisation on at least one side.

  • CouchDB can't quite make up its mind whether it wants to be a database or a Web application framework. It comes by default with an administration web app called Futon; you can also use it to store and execute code for rendering objects as HTML, Atom, etc. Such code (along with views, validations etc) is stored in special JSON objects called "design documents": best practice is apparently to have one design document for each application that needs to access the underlying data. Since design documents are ordinary JSON objects, they are propagated between nodes by replications.

  • However, various standard webapp-framework bits are missing, notably URL routing. But hey, you can always use mod_rewrite...

  • There's a tool called Erica (and an older one called CouchApp) which allows you to sync design documents with more conventional source-code directories in your filesystem.

  • CouchDB is written in Erlang, and the functional-programming influence shows up in other places: most types of user-defined function are required to be free of side-effects, for instance. Then there's the aforementioned uses of lazy evaluation and the append-only nature of the system as a whole. You can extend it with your own Erlang code or embed it into an Erlang application, bypassing the need for HTTP requests.
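
To make the first bullet concrete, here's roughly what talking to CouchDB looks like from the command line (a sketch: database, document and view names are invented, the port is CouchDB's default, and the _rev values will differ):

curl -X PUT http://127.0.0.1:5984/recipes                  # create a database
curl -X PUT http://127.0.0.1:5984/recipes/flapjack \
     -H 'Content-Type: application/json' \
     -d '{"title": "Flapjack", "date": "2013-06-02", "ingredients": ["oats", "butter", "syrup"]}'
# -> {"ok":true,"id":"flapjack","rev":"1-..."}
curl http://127.0.0.1:5984/recipes/flapjack                # GET it back, complete with _id and _rev
# "updating" means PUTting a replacement document that names the _rev it supersedes:
curl -X PUT http://127.0.0.1:5984/recipes/flapjack \
     -H 'Content-Type: application/json' \
     -d '{"_rev": "1-...", "title": "Flapjack", "ingredients": ["oats", "butter", "golden syrup"]}'   # substitute the real _rev
# a secondary index is just a map function in a design document, queried over HTTP like everything else:
curl -X PUT http://127.0.0.1:5984/recipes/_design/app \
     -H 'Content-Type: application/json' \
     -d '{"views": {"by_date": {"map": "function(doc) { if (doc.date) emit(doc.date, doc); }"}}}'
curl 'http://127.0.0.1:5984/recipes/_design/app/_view/by_date?startkey=%222013-01-01%22&endkey=%222013-12-31%22'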

tl;dr if you've ever thought "data modelling and synchronisation are hard, let's just stick a load of JSON files in Git" (as I have, on several occasions), then CouchDB is probably a good fit to your needs. Especially if your analytics needs aren't too complicated.
pozorvlak: (Hal)
Thursday, February 7th, 2013 12:41 pm
Quaffing the last of my quickening cup,
I chuck fair Josie, my predatory protégée, behind her ear.
Into my knapsack I place fell Destruction,
my weapon in a thousand fights against the demon Logic
(not to mention his dread ally the Customer
who never knows exactly what she wants, but always wants it yesterday).
He sleeps lightly, but is ready
to leap into action, confounding the foe
with his strings of enchanted rubies and pearls.
To my thigh I strap Cecilweed, the aetherial horn
spun from rare African minerals in far Taiwan
and imbued with subtle magics by the wizards of Mountain View.
Shrugging on my Cuirass of Visibility,
I mount Wellington, my faithful iron steed
his spine wrought in the mighty forge of Diamondback
his innards cast by the cunning smiths of Shimano
and ride off, dodging monsters the height of a house
towards the place the ancients knew as Sràid na Banrighinn
The Street of the Queen.

Just wanna clarify that in lines 5 and 6 I'm not talking about the Growstuff customers, all of whom have been great.
pozorvlak: (Hal)
Thursday, January 24th, 2013 09:59 pm
[Wherein we review an academic conference in the High/Low/Crush/Goal/Bane format used for reviewing juggling conventions on rec.juggling.]

High: My old Codeplay colleague Ally Donaldson's FAT-GPU workshop. He was talking about his GPUVerify system, which takes CUDA or OpenCL programs and either proves them free of data races and synchronisation-barrier conflicts, or finds a potential bug. It's based on an SMT solver; I think there's a lot of scope to apply constraint solvers to problems in compilation and embedded system design, and I'd like to learn more about them.

Also, getting to see the hotel's giant fishtank being cleaned, by scuba divers.

Low: My personal low point was telling a colleague about some of the problems my depression has been causing me, and having him laugh in my face - he'd been drinking, and thought I was exaggerating for comic effect. He immediately apologised when I told him that this wasn't the case, but still, not fun. The academic low point was the "current challenges in supercomputing" tutorial, which turned out to be a thinly-disguised sales pitch for the sponsor's FPGA cards. That tends not to happen at maths conferences...

Crush: am I allowed to have a crush on software? Because the benchmarking and visualisation infrastructure surrounding the Sniper x86 simulator looks so freaking cool. If I can throw away the mess of Makefiles, autoconf and R that serves the same role in our lab I will be very, very happy.

Goal: Go climbing on the Humboldthain Flakturm (fail - it turns out that Central Europe is quite cold in January, and nobody else fancied climbing on concrete at -7C). Get my various Coursera homeworks and bureaucratic form-filling done (fail - damn you, tasty German beer and hyperbolic discounting!). Meet up with [livejournal.com profile] maradydd, who was also in town (fail - comms and scheduling issues conspired against us. Next time, hopefully). See some interesting talks, and improve my general knowledge of the field (success!).

Bane: I was sharing a room with my Greek colleague Chris, who had a paper deadline on the Wednesday. This meant he was often up all night, and went to bed as I was getting up, so every trip into the room to get something was complicated by the presence of a sleeping person. He also kept turning the heating up until it was too hot for me to sleep. Dually, of course, he had to share his room with a crazy Brit who kept getting up as he was going to bed and opening the window to let freezing air in...
pozorvlak: (Hal)
Sunday, December 9th, 2012 09:17 pm
I've been using Mercurial (also known as hg) as the version-control system for a project at work. I'd heard good things about it - a Git-like system with a cleaner UI and better documentation - and was glad of the excuse to try it out. Unfortunately, I was disappointed by what I found. The docs are good, and the UI's a bit cleaner, but it's still got some odd quirks - the difference between hg resolve and hg resolve -m catches me every bloody time, for instance. Unlike Git, you aren't prompted to set missing configuration options interactively. Some of the defaults are crazy, like not sending long output to a pager. And having got used to easy, safe history-rewriting in Git, I was horrified to learn that Mercurial offered no such guarantees of safety: up until version 2.2, the equivalent of a simple commit --amend could cause you to lose work permanently. Easy history-rewriting is a big deal; it means that you never have to choose between committing frequently and only pushing easily-reviewable history.

But I persevered, and with a bit of configuration I was able to make hg more like Git, and thus more comfortable for me. Here's my current .hgrc:
[ui]
username = Pozorvlak <pozorvlak@example.com>
merge = internal:merge
[pager]
pager = LESS='FSRX' less
[extensions]
rebase =
record =
histedit = ~/usr/etc/hg/hg_histedit.py
fetch =
shelve = ~/usr/etc/hg/hgshelve.py
pager =
mq =
color =

You'll need at least the username line, because of the aforementioned lack of interactive configuration. As for the rest:
  • The pager = LESS='FSRX' less and pager = lines send long output to less instead of letting it all spew out and overflow your console scrollback buffer.
  • merge = internal:merge tells it to use its internal merge algorithm as a merge tool, and put ">>>>" gubbins in files in the event of conflicts. Otherwise it uses meld for merges on my machine; meld is very pretty but not history-aware, and history-aware merges are at least 50% of the point of using a DVCS in the first place.
  • The rebase extension allows you to graft a sequence of changesets onto another part of the history graph, like git rebase.
  • The record extension allows you to select only some of the changes in your working copy for committing, like git add -p or darcs record.
  • The fetch extension lets you do pull-and-merge in one operation - confusingly, git pull and git fetch are the opposite way round from hg fetch and hg pull.
  • The mq extension turns on patch queues, which I needed for some hairy operation or other once.
  • The non-standard histedit extension works like git rebase --interactive but not, I believe, as safely - dropped commits are deleted from the history graph entirely rather than becoming unreachable from an active head.
  • The non-standard shelve extension works like git stash, though less conveniently - once you've shelved one change you need to give a name to all subsequent ones.

Perhaps a Mercurial expert reading this can tell me how to delete unwanted shelves? Or about some better extensions or settings I should be using?
pozorvlak: (Hal)
Thursday, December 6th, 2012 11:41 pm

I've been running benchmarks again. The basic workflow is

  1. Create some number of directories containing the benchmark suites I want to run.
  2. Tweak the Makefiles so benchmarks are compiled and run with the compilers, simulators, libraries, flags, etc, that I care about.
  3. Optionally tweak the source code to (for instance) change the number of iterations the benchmarks are run for.
  4. Run the benchmarks!
  5. Check the output; discover that something is broken.
  6. Swear, fix the problem.
  7. Repeat until either you have enough data or the conference submission deadline gets too close and you are forced to reduce the scope of your experiments.
  8. Collate the outputs from the successful runs, and analyse them.
  9. Make encouraging noises as the graduate students do the hard work of actually writing the paper.

Suppose I want to benchmark three different simulators with two different compilers for three iteration counts. That's 18 configurations. Now note that the problem found in stage 5 and fixed in stage 6 will probably not be unique to one configuration - if it affects the invocation of one of the compilers then I'll want to propagate that change to nine configurations, for instance. If it affects the benchmarks themselves or the benchmark-invocation harness, it will need to be propagated to all of them. Sounds like this is a job for version control, right? And, of course, I've been using version control to help me with this; immediately after step 1 I check everything into Git, and then use git fetch and git merge to move changes between repositories. But this is still unpleasantly tedious and manual. For my last paper, I was comparing two different simulators with three iteration counts, and I organised this into three checkouts (x1, x10, x100), each with two branches (simulator1, simulator2). If I discovered a problem affecting simulator1, I'd fix it in, say, x1's simulator1 branch, then git pull the change into x10 and x100. When I discovered a problem affecting every configuration, I checked out the root commit of x1, fixed the bug in a new branch, then git merged that branch with the simulator1 and simulator2 branches, then git pulled those merges into x10 and x100.

Keeping track of what I'd done and what I needed to do was frankly too cognitively demanding, and I was constantly bedevilled by the sense that there had to be a Better Way. I asked about this on Twitter, and Ganesh Sittampalam suggested "use Darcs" - and you know, I think he's right, Darcs' "bag of commuting patches" model is a better fit to what I'm trying to do than Git's "DAG of snapshots" model. The obvious way to handle this in Darcs would be to have six base repositories, called "everything", "x1", "x10", "x100", "simulator1" and "simulator2"; and six working repositories, called "simulator1_x1", "simulator1_x10", "simulator1_x100", "simulator2_x1", "simulator2_x10" and "simulator2_x100". Then set up update scripts in each working repository, containing, for instance

#!/bin/sh
darcs pull ../base/everything
darcs pull ../base/simulator1
darcs pull ../base/x10
and every time you fix a bug, run for i in working/*; do $i/update; done.

But! It is extremely useful to be able to commit the output logs associated with a particular state of the build scripts, so you can say "wait, what went wrong when I used the -static flag? Oh yeah, that". I don't think Darcs handles that very well - or at least, it's not easy to retrieve any particular state of a Darcs repo. Git is great for that, but whenever I think about duplicating the setup described above in Git my mind recoils in horror before I can think through the details. Perhaps it shouldn't - would this work? Is there a Better Way that I'm not seeing?

pozorvlak: (Hal)
Thursday, December 6th, 2012 09:45 pm
Inspired by Falsehoods Programmers Believe About Names, Falsehoods Programmers Believe About Time, and far, far too much time spent fighting autotools. Thanks to Aaron Crane, [livejournal.com profile] totherme and [livejournal.com profile] zeecat for their comments on earlier versions.

It is accepted by all decent people that Make sucks and needs to die, and that autotools needs to be shot, decapitated, staked through the heart and finally buried at a crossroads at midnight in a coffin full of millet. Hence, there are approximately a million and seven tools that aim to replace Make and/or autotools. Unfortunately, all of the Make-replacements I am aware of copy one or more of Make's mistakes, and many of them make new and exciting mistakes of their own.

I want to see an end to Make in my lifetime. As a service to the Make-replacement community, therefore, I present the following list of tempting but incorrect assumptions various build tools make about building software.

All of the following are wrong:
  • Build graphs are trees.
  • Build graphs are acyclic.
  • Every build step updates at most one file.
  • Every build step updates at least one file.
  • Compilers will always modify the timestamps on every file they are expected to output.
  • It's possible to tell the compiler which file to write its output to.
  • It's possible to tell the compiler which directory to write its output to.
  • It's possible to predict in advance which files the compiler will update.
  • It's possible to narrow down the set of possibly-updated files to a small hand-enumerated set.
  • It's possible to determine the dependencies of a target without building it.
  • Targets do not depend on the rules used to build them.
  • Targets depend on every rule in the whole build system.
  • Detecting changes via file hashes is always the right thing.
  • Detecting changes via file hashes is never the right thing.
  • Nobody will ever want to rebuild a subset of the available dirty targets.
  • People will only want to build software on Linux.
  • People will only want to build software on a Unix derivative.
  • Nobody will want to build software on Windows.
  • People will only want to build software on Windows.
    (Thanks to David MacIver for spotting this omission.)
  • Nobody will want to build on a system without strace or some equivalent.
  • stat is slow on modern filesystems.
  • Non-experts can reliably write portable shell script.
  • Your build tool is a great opportunity to invent a whole new language.
  • Said language does not need to be a full-featured programming language.
  • In particular, said language does not need a module system more sophisticated than #include.
  • Said language should be based on textual expansion.
  • Adding an Nth layer of textual expansion will fix the problems of the preceding N-1 layers.
  • Single-character magic variables are a good idea in a language that most programmers will rarely use.
  • System libraries and globally-installed tools never change.
  • Version numbers of system libraries and globally-installed tools only ever increase.
  • It's totally OK to spend over four hours calculating how much of a 25-minute build you should do.
  • All the code you will ever need to compile is written in precisely one language.
  • Everything lives in a single repository.
  • Files only ever get updated with timestamps by a single machine.
  • Version control systems will always update the timestamp on a file.
  • Version control systems will never update the timestamp on a file.
  • Version control systems will never change the time to one earlier than the previous timestamp.
  • Programmers don't want a system for writing build scripts; they want a system for writing systems that write build scripts.

[Exercise for the reader: which build tools make which assumptions, and which compilers violate them?]

pozorvlak: (Default)
Thursday, September 13th, 2012 04:01 pm
I've recently submitted a couple of talk proposals to upcoming conferences. Here are the abstracts.

Machine learning in (without loss of generality) Perl

London Perl Workshop, Saturday 24th November 2012. 25 minutes.

If you read a book or take a course on machine learning, you'll probably spend a lot of time learning about how to implement standard algorithms like k-nearest neighbours or Naive Bayes. That's all very interesting, but we're Perl programmers - all that stuff's on CPAN already. This talk will focus on how to use those algorithms to attack problems, how to select the best ML algorithm for your task, and how to measure and improve the performance of your machine learning system. Code samples will be in Perl, but most of what I'll say will be applicable to machine learning in any language.

Classifying Surfaces

MathsJam: The Annual Conference, 17th-18th November 2012. 5 minutes.

You may already know Euler's remarkable result that if a polyhedron has V vertices, E edges and F faces, then V - E + F = 2. This is a special case of the beautiful classification theorem for closed surfaces. I will state this classification theorem, and give a quick sketch of a proof.
pozorvlak: (Default)
Sunday, September 9th, 2012 01:12 pm
Remember how a few years ago PCs were advertised with the number of MHz or GHz their processors ran at prominently featured? And how the numbers were constantly going up? You may have noticed that the numbers don't go up much any more, but now computers are advertised as "dual-core" or "quad-core". The reason that changed is power consumption. Double the clock speed of a chip, and you more than double its power consumption: with the Pentium 4 chip, Intel hit a clock speed ceiling as their processors started to generate more heat than could be removed.

But Moore's Law continues in operation: the number of transistors that can be placed on a given area of silicon has continued to double every eighteen months, as it has done for decades now. So how can chip makers make use of the extra capacity? The answer is multicore: placing several "cores" (whole, independent processing units) onto the same piece of silicon. Your chip can still do twice as much work as the one from eighteen months ago, but only if you split that work up into independent tasks.

This presents the software industry with a problem. We've been conditioned over the last fifty years to think that the same program will run faster if you put it on newer hardware. That's not true any more. Computer programs are basically recipes for use by particularly literal-minded and stupid cooks; imagine explaining how to cook a complex meal over the phone to someone who has to be told everything. If you're lucky, they'll have the wit to say "Er, the pan's on fire: that's bad, right?". Now let's make the task harder: you're on the phone to a room full of such clueless cooks, and your job is to get them to cooperate in the production of a complex dinner due to start in under an hour, without getting in each other's way. Sounds like a farce in the making? That's basically why multicore programming is hard.

But wait, it gets worse! The most interesting settings for computation these days are mobile devices and data centres, and these are both power-sensitive environments; mobile devices because of limited battery capacity, and data centres because more power consumption costs serious money on its own and increases your need for cooling systems which also cost serious money. If you think your electricity bill's bad, you should see Google's. Hence, one of the major themes in computer science research these days is "you know all that stuff you spent forty years speeding up? Could you please do that again, only now optimise for energy usage instead?". On the hardware side, one of the prominent ideas is heterogeneous multicore: make lots of different cores, each specialised for certain tasks (a common example is the Graphics Processing Units optimised for the highly-parallel calculations involved in 3D rendering), stick them all on the same die, farm the work out to whichever core is best suited to it, and power down the ones you're not using. To a hardware person, this sounds like a brilliant idea. To a software person, this sounds like a nightmare: now imagine that our Hell's Kitchen is full of different people with different skills, possibly speaking different languages, and you have to assign each task to the person best suited to carrying it out.

The upshot is that heterogeneous multicore programming, while currently a niche field occupied mainly by games programmers and scientists running large-scale simulations, is likely to get a lot more prominent over the coming decades. And hence another of the big themes in computer science research is "how can we make multicore programming, and particularly heterogeneous multicore programming, easier?" There are two aspects to this problem: what's the best way of writing new code, and what's the best way of porting old code (which may embody complex and poorly-documented requirements) to take advantage of multicore systems? Some of the approaches being considered are pretty Year Zero - the functional programming movement, for instance, wants us to write new code in a tightly-constrained way that is more amenable to automated mathematical analysis. Others are more conservative: for instance, my colleague Dan Powell is working on a system that observes how existing programs execute at runtime, identifies sections of code that don't interfere with each other, and speculatively executes them in parallel, rolling back to a known-good point if it turns out that they do interfere.

This brings us to the forthcoming Coursera online course in Heterogeneous Parallel Programming, which teaches you how to use the existing industry-standard tools for programming heterogeneous multicore systems. As I mentioned earlier, these are currently niche tools, requiring a lot of low-level knowledge about how the system works. But if I want to contribute to projects relating to this problem (and my research group has a lot of such projects) it's knowledge that I'll need. Plus, it sounds kinda fun.

Anyone else interested?
pozorvlak: (polar bear)
Wednesday, June 13th, 2012 06:01 pm
1. Start tracking my weight and calorie intake again, and get my weight back down to a level where I'm comfortable. I've been very slack on the actual calorie-tracking, but I have lost nearly a stone, and at the moment I'm bobbing along between 11st and about 11st 4lb. It would be nice to be below 11st, but I find I'm actually pretty comfortable at this weight as long as I'm doing enough exercise. So, I count that as a success.

2. Start making (and testing!) regular backups of my data. I'm now backing up my tweets with TweetBackup.com, but other than that I've made no progress on this front. Possibly my real failure was in not making all my NYRs SMART, so they'd all be pass/fail; as it is, I'm going to declare this one not yet successful.

3. Get my Gmail account down to Inbox Zero and keep it there. This one's a resounding success. Took me about a month and a half, IIRC. Next up: Browser Tab Zero.

4. Do some more Stanford online courses. There was a long period at the beginning of the year where they weren't running and we wondered if the Stanford administrators had stepped in and quietly deep-sixed the project, but then they suddenly started up again in March or so. Since then I've done Design and Analysis of Algorithms, which was brilliant; Software Engineering for Software as a Service, which I dropped out of 2/3 of the way through but somehow had amassed enough points to pass anyway; and I'm currently doing Compilers (hard but brilliant) and Human-Computer Interaction, which is way outside my comfort zone and on which I'm struggling. Fundamentals of Pharmacology starts up in a couple of weeks, and Cryptography starts sooner than that, but I don't think I'll be able to do Cryptography before Compilers finishes. Maybe next time they offer it. Anyway, I think this counts as a success.

5. Enter and complete the Meadows Half-Marathon. This was a definite success: I completed the 19.7km course in 1 hour and 37 minutes, and raised over £500 for the Against Malaria Foundation.

6. Enter (and, ideally, complete...) the Lowe Alpine Mountain Marathon. This was last weekend; my partner and I entered the C category. Our course covered 41km, gained 2650m of height, and mostly consisted of bog, large tufts of grass, steep traverses, or all three at once; we completed it in 12 hours and 33 minutes over two days and came 34th out of a hundred or so competitors. I was hoping for a faster time, but I think that's not too bad for a first attempt. Being rained on for the last two hours was no fun at all, but the worst bit was definitely the goddamn midges, which were worse than either of us had ever seen before. The itching's now just about subsided, and we're thinking of entering another one at a less midgey time of year: possibly the Original Mountain Marathon in October or the Highlander Mountain Marathon next April. Apparently the latter has a ceilidh at the mid-camp, presumably in case anyone's feeling too energetic. Anyway, this one's a success.

5/6 - I'm quite pleased with that. And I'm going to add another one (a mid-year resolution, if you will): I notice that my Munro-count currently stands at 136/284 (thanks to an excellent training weekend hiking and rock climbing on Beinn a' Bhuird); I hereby vow to have climbed half the Munros in Scotland by the end of the year. Six more to go; should be doable.
pozorvlak: (kittin)
Saturday, April 7th, 2012 12:20 am
I've been doing some work with Wordpress off and on for the last couple of weeks - migrating a site that uses a custom CMS onto a Wordpress installation - and a couple of times I've run into the following vexing problem when setting up a local Wordpress installation for testing. I couldn't find anything about it on the web, and it took me several hours to debug, so here's a writeup in case someone else has the same problem.

Steps to reproduce: install Wordpress 3.0.5 (as provided by Ubuntu). Using the command-line mysql client, load in a database dump from a Wordpress 3.3.1 site. Visit http://localhost/wordpress (or wherever you've got it installed).

Symptoms: instead of your deathless prose, you see an entirely blank browser window. HTTP headers are sent correctly, but no page content is produced. However, http://localhost/wordpress/wp-admin is displayed correctly, and all your content is in the database.

What's actually going on: Wordpress has decided that the TwentyTen theme is broken, so it reverts to the default theme. It therefore looks for a theme called "Wordpress Default" - but the default theme is actually just called "Default". So no theme is found, and, since display is handled by the theme files, nothing gets displayed.
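
If you want to check which theme Wordpress thinks is active without going through the admin interface, the active theme is recorded in the "template" and "stylesheet" rows of the wp_options table. Here's a rough diagnostic sketch using Perl's DBI - the database name, credentials and the wp_ table prefix are assumptions, so adjust them for your own installation:

```perl
#!/usr/bin/perl
# Rough diagnostic sketch, not part of the fix below: print the theme
# Wordpress currently has recorded as active. The database name, user,
# password and the wp_ table prefix are assumptions - change them to
# match your local installation.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    'DBI:mysql:database=wordpress;host=localhost',
    'wp_user',
    'wp_password',
    { RaiseError => 1 },
);

my $sth = $dbh->prepare(q{
    SELECT option_name, option_value
    FROM wp_options
    WHERE option_name IN ('template', 'stylesheet')
});
$sth->execute;

while ( my ($name, $value) = $sth->fetchrow_array ) {
    # Both rows should name a directory under wp-content/themes that
    # actually exists; if not, Wordpress will try to fall back.
    print "$name = $value\n";
}

$dbh->disconnect;
```

If either row names a theme that isn't present in wp-content/themes, that's consistent with the symptoms above.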

How to fix it: go into the admin interface, and select Appearance->Themes. Change the theme to "Default". Your blog is now visible again!

If you wish, you can now change the theme back to TwentyTen: it turns out that it's not actually broken at all.

Thanks to Konstantin Kovshenin for suggesting I set WP_DEBUG to true in wp-config.php. That eventually allowed me to track down the problem (though, annoyingly, the "theme not found" error was only displayed on the admin page, so I didn't see it for a while).

Next question: this is clearly a bug, but it's a bug in a superseded version. Where should I report it?

Edit: on further thought, I think this may have more to do with the fact that the site whose dump I was loading in used a theme that I don't have installed. In which case, the bug may well affect the latest version of Wordpress. But I haven't yet proved this to my satisfaction.
pozorvlak: (pozorvlak)
Tuesday, January 3rd, 2012 12:58 am
I don't normally make New Year's resolutions, but what the hell.

1. Start tracking my weight and calorie intake again, and get my weight back down to a level where I'm comfortable. This morning it was 12st 1.9lb - not terribly high in the scheme of things, but it's almost as high as it was when I first started dieting (though I think a bit more of it may be muscle now) and it's definitely high enough to negatively impact my sense of well-being.

What went wrong? Well, I'm gonna quote from Hyperbole and a Half: "trying to use willpower to overcome the apathetic sort of sadness that accompanies depression is like a person with no arms trying to punch themselves until their hands grow back. A fundamental component of the plan is missing and it isn't going to work." A scheme for weight loss that depends on willpower is similarly doomed if you're too depressed to stick to it. So this time I'm going to try to make changes to my eating habits that require less willpower. Any suggestions would be most welcome.

2. Start making (and testing!) regular backups of my data. I lost several years of mountain photographs last year when the external hard drive I was keeping them on died: I don't want that to happen again.

3. Get my Gmail account down to Inbox Zero and keep it there. It's currently at Inbox 1713, most of which is junk, but it's just *easier* to deal with an empty inbox, and not have to re-scan the same old things to look for the interesting new stuff.

I have a few more Ambitious Plans, but they don't really count as resolutions:

1. Do some more Stanford online courses. I'm currently signed up to Human-Computer Interaction, Design and Analysis of Algorithms, Software Engineering for Software as a Service, and Information Theory. Fortunately they don't all run concurrently!

[BTW, they're not all computing courses: [livejournal.com profile] wormwood_pearl is signed up to Designing Green Buildings, for instance.]

2. Enter (and complete!) the Meadows Half-Marathon in March. I started training for this back in December, but then I got ill and Christmas happened, so today was my first run for a while and it wasn't much fun. Never mind; I've got time to get back on course.

3. If that goes well, enter (and, ideally, complete...) the Lowe Alpine Mountain Marathon. As I understand things, it's basically two 20km-ish fell runs back-to-back, with a night camping in between. Oh, and you have to carry all your camping kit with you. In the high classes people do the whole thing at a run, but in the lower classes (which I'd be entering) there's apparently a bit more run/walk/run going on. Philipp and I did nearly 40km in one day on the South Glen Shiel ridge in November, and then went for another hike the next day, so I should be able to at least cover the distance. Providing I don't get too badly lost, of course :-)



Am I biting off more than I can chew? It's the only way to progress in anything. The trick, of course, is not biting off enough to cause you damage.
pozorvlak: (Default)
Thursday, December 29th, 2011 12:08 pm
I've been writing some Perl code recently, and a couple of ideas have occurred to me.
  1. Closures are great when you have several behaviours which can all vary independently; objects are great when you have several behaviours that must all vary in sync with one another. Put another way: if your problem is a combinatorial explosion of objects, try using closures instead; if your problem is ensuring consistency among different behaviours, try grouping them into objects. Languages like Perl make it relatively easy to do either, so use the best tool for the job. The trick is to spot that you're using the wrong representation before you either create a kajillion Strategy classes or knock together a half-arsed object system out of hashes of closures. (There's a rough sketch of both approaches below this list.)
  2. In the correct quantity, housekeeping tasks can be your friends. By "housekeeping tasks" I mean tasks that don't directly contribute to solving the problem at hand, but which still add some value. An example from yesterday was converting a small class hierarchy to use the Moose object system. Other examples might be writing per-method documentation, cleaning up your version control history, and minor refactorings. If there's too much of this stuff to do, it can get dispiriting - you want to be solving the problem, not mucking about with trivia! But if there's not too much, it can be helpful: you have something to do while you're stuck on the main problem, and the housekeeping work is close enough to the main problem that it stays in your brain's L2 cache, where a background process can work away at it. If you're so stuck that you literally can't make any progress, your choices are (a) to think very very hard and get depressed, or (b) to go and do something totally different, forgetting lots of important details in the process. I find both of these less productive.
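
To make point 1 a bit more concrete, here's a rough sketch of both styles. The logger and codec are invented examples rather than anything from my actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Closures: the formatter and the destination vary independently, so we
# compose two coderefs instead of writing one Strategy class for every
# formatter/destination combination.
sub make_logger {
    my ($format, $emit) = @_;
    return sub { $emit->( $format->(@_) ) };
}

my $timestamped = sub { scalar(localtime) . ' ' . join(' ', @_) };
my $to_stderr   = sub { print STDERR "$_[0]\n" };

my $log = make_logger($timestamped, $to_stderr);
$log->('starting up');

# Objects: encode and decode must stay in sync - they have to agree on
# the separator - so they live together in one object and share its state.
package Codec;

sub new    { my ($class, $sep) = @_; return bless { sep => $sep }, $class }
sub encode { my ($self, @fields) = @_; return join $self->{sep}, @fields }
sub decode { my ($self, $line)   = @_; return split /\Q$self->{sep}\E/, $line }

package main;

my $codec  = Codec->new('|');
my @fields = $codec->decode( $codec->encode('foo', 'bar') );   # ("foo", "bar")
print "@fields\n";
```

The closure version scales by composition (any formatter with any destination); the object version makes it hard for encode and decode to drift apart, because the separator lives in one place.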

[If you're interested, my code is of course on GitHub: some fixes to the CPAN module List::Priority, and some code for benchmarking all the priority queues on CPAN. Any suggestions or patches would be very welcome!]
pozorvlak: (Default)
Wednesday, October 26th, 2011 06:55 pm
Man, maintaining our Makefiles by hand really sucks. I think I'm going to write a Makefile generator. Do you think that's a good idea?

Certainly not! Go and wash your computer out with soap.

But then my computer will be ruined, and I won't be able to code!

Yes. That's the idea.
pozorvlak: (kittin)
Wednesday, October 26th, 2011 11:28 am
I've been doing Stanford's online Introduction to Artificial Intelligence and Machine Learning courses (AIClass and MLClass respectively). Both have their good points and bad points: here are my comments so far.

[The Introduction to Databases course looks interesting too, but I only have a finite number of hours in the day!]

Calibration: I have a first-class undergraduate degree from Oxford and a PhD from Glasgow, both in pure mathematics (mostly algebra/category theory). I've taught first- and second-year undergraduate mathematics. I've spent about four years working as a professional programmer, but I've been programming for long before that as an amateur. I have some limited dabblings in ML, and none in AI.

Logo: Friendly robot in a mortar-board versus scary cyborg with glowing red eyes. MLClass wins this one hands-down. This may sound like a trivial point, but I find it makes a surprising difference to how keen I am to visit each site.

Expected background: I've been really surprised by how unwilling both courses (but especially MLClass) are to assume understanding of what I consider to be high-school mathematics - matrix multiplication, log functions, conditional probability, simple calculus. Aren't Stanford undergrads expected to know this stuff? On the other hand, not assuming much mathematical background makes the material accessible to more people, furthering their (very noble!) goal of opening up first-class education, so I'm willing to give them a pass on this one.

Pace: Both courses are a lot slower than I'm used to from Oxford and Glasgow. AIClass is going faster, though, so this round goes to them.

Technology: MLClass easily wins this one. Their videos are available for download and off-line viewing, can be viewed at different speeds, and don't start until you're ready for them. They have a nifty system for submitting programming assignments. AIClass's reliance on YouTube means its videos are inaccessible in countries which block that site. The AIClass servers have gone down every time a homework assignment has been due (though to be fair, they have over twice as many students as MLClass).

AIClass: please adopt the MLClass technology if you run the course again next year.

Quizzes: I really like the way both courses use quizzes embedded in the videos to test your understanding of the material. The AIClass quizzes are more frequent and more testing, but I'm going to give this round to MLClass because they make sure each quiz shows enough information on-screen to answer the question. With AIClass I usually have to re-wind the video several times in order to make sure I fully understand the problem.

Programming exercises: MLClass has programming exercises, AIClass doesn't (though you're encouraged to write your own programs to help you answer the homework assignments and quizzes). Point to MLClass.

Lecturers: I really like Andrew Ng's lecturing style in MLClass. He's bright and engaging. For AIClass, Sebastian Thrun is pretty good, but Peter Norvig (while obviously a giant in his field) reminds me inescapably of Ferris Bueller's economics teacher. Point to MLClass.

I make that 5-2 to MLClass. But both courses are fascinating, and it's exciting to be involved in such a bold experiment in distributed learning.