pozorvlak: (Default)
Wednesday, May 13th, 2015 06:24 pm

Over the seven years or so I've been using Git, there have been a number of moments where I've felt like I've really levelled up in my understanding or usage of Git. Here, in the hope of accelerating other people's learning curves, are as many as I can remember. I'd like to thank Joe Halliwell for introducing me to Git and helping me over the initial hurdles, and Aaron Crane for many helpful and enlightening discussions over the years - especially the ones in which he cleared up some of my many misunderstandings.

Learning to use a history viewer

Even if all you're doing is commit/push/pull, you're manipulating the history graph. Don't try to imagine it in your mind - get the computer to show it to you! That way you can see the effect your actions have, tightening your feedback loop and improving your learning rate. It's also a lot easier to debug problems when you can see what's going on. I started out using gitk --all to view history; now I mostly use git lg, which is an alias for log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative, but the effect is much the same.
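For reference, that alias can be set up like so (a sketch; tweak the format string to taste):

```shell
# Define 'git lg' globally; the format string is the one quoted above.
git config --global alias.lg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset' --abbrev-commit --date=relative"
git lg --all    # view the whole history graph in the terminal, like gitk --all
```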

The really important lessons to internalise at this stage are

  • Git maintains your history as a graph of snapshots (you'll hear people using the term DAG, for "directed acyclic graph").
  • Branches and tags are just pointers into this graph.

If you understand that, then congratulations! You now understand 90% of what's important about Git. Just keep that core model in mind, and it should be possible to reason about the rest.
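You can see the second point for yourself in any repo (this assumes a branch called master, and that the ref hasn't yet been packed):

```shell
# A branch is just a file containing a commit hash - a pointer into the graph.
git rev-parse master           # the commit the branch points at
cat .git/refs/heads/master     # the branch itself: the very same hash
# (if the second command fails, the ref has been packed; see .git/packed-refs)
```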

Relatedly: you'll probably want to graduate to the command-line eventually, but don't be ashamed to use a graphical client for any remotely unfamiliar task.

Understanding hash-based identifiers

You know those weird c56ab7f identifiers Git uses for commits? Those are actually serving a very important role. Mark Jason Dominus has written a great explanation, and I also suggest having a poke around in the .git/objects directory of one of your repos. MJD talks about "blobs" and "trees" which are also identified with hashes, but you don't actually have to think about them much in day-to-day Git usage: commits are the most important objects.
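A quick way to have that poke around, run from inside any repo:

```shell
git cat-file -t HEAD            # prints "commit"
git cat-file -p HEAD            # the commit: its tree hash, parent(s), author, message
git cat-file -p 'HEAD^{tree}'   # the tree: one line per file or subdirectory
```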

Learning to use git rebase

This shouldn't really qualify as a GLUE (a Git level-up experience) - my very first experience with Git was using the git-svn plugin, which forces you to rebase at all times. However, even if you're not interoperating with SVN, you should give rebase a go. What does it do, you ask? Well, the clue's in the name: it takes the sequence of commits you specify, and moves them onto a new base: hence "re-base". See also the MJD talk above, which explains rebasing in more detail.
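In its simplest form (branch names invented for the example):

```shell
# Replay the commits on 'feature' on top of the current tip of 'master',
# giving a linear history instead of a merge commit.
git checkout feature
git rebase master
```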

Note to GUI authors: the obvious UI for this is selecting a load of commits and dragging-and-dropping them to somewhere else in the history graph. Sadly, I don't know of any GUIs that actually do this. Edit: MJD agrees, and is seeking collaborators for work on such a GUI.

Rebasing occasionally gets stuck when it encounters a conflict. The thing to realise here is that the error messages are really good. Keep calm, read the instructions, then follow them, and everything should turn out OK (but if it doesn't, feel free to skip to the next-but-two GLUE).
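When it does get stuck, the loop looks like this (the file name is a placeholder):

```shell
git status               # lists the conflicted ("both modified") files
$EDITOR some/file.c      # fix the <<<<<<< ======= >>>>>>> markers by hand
git add some/file.c      # tell Git the conflict is resolved
git rebase --continue    # replay the remaining commits
# or, if it has all gone horribly wrong:
git rebase --abort       # restore everything to its pre-rebase state
```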

Giving up on git mergetool

git mergetool allows you to open a graphical diff/merge tool of your choice for handling conflicts. Don't use it. Git's internal merge has complete knowledge of your code's history, which enables it to do a better job of resolving conflicts than any external merge tool I'm aware of. If you use a third-party tool, expect to waste a lot of time manually resolving "conflicts" that the computer could have handled itself; the internal merge algorithm will handle these just fine, and present you with only the tricky cases.

Learning to use git rebase --interactive

I thought this was going to be hard and scary, but it's actually really easy and pleasant to use, I promise! The name is terrible - while git rebase --interactive can do a normal rebase as part of its intended work, this is usually a bad idea. What it's actually for is rewriting history. Have you made two commits that should really be squashed into one? Git rebase --interactive! Want to re-order some commits? Git rebase --interactive! Delete some entirely? Git rebase --interactive! Split them apart into smaller commits? Git rebase --interactive! The interface to this is extremely friendly by Git CLI standards: specify the commit before the first one you want to rewrite, and Git will open an editor window with a list of the commits-to-rewrite. Follow the instructions. If that's not clear enough, read this post.
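To make the interface concrete: the editor window contains a todo list, and editing that list is the whole game. The list can even be edited by a script instead of a human; the sketch below (which assumes GNU sed and a branch with at least four commits) squashes the middle of the last three commits into the one before it.

```shell
# git rebase --interactive HEAD~3 opens an editor on a todo list like:
#   pick a1b2c3d oldest of the three commits
#   pick d4e5f6a middle commit
#   pick 789abcd newest commit
# You change 'pick' to 'squash'/'reword'/'edit', delete lines, or reorder them.
# Scripted version: the "editor" is sed, turning line 2's pick into squash.
GIT_SEQUENCE_EDITOR="sed -i '2s/^pick/squash/'" GIT_EDITOR=true \
    git rebase --interactive HEAD~3
```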

This should also be represented in a GUI by dragging-and-dropping, but I don't know of any clients that do this.

Learning about git reflog

Git stores a list of every commit you've had checked out. Look at the output of git reflog after a few actions: this allows you to see what's happened, and is thus valuable for learning. More usefully, it also allows you to undo any history rewriting. Here's the thing: Git doesn't actually rewrite history. It writes new history, and hides the old version (which is eventually garbage-collected after a month). But you can still get the old version back if you have a way to refer to it, and that's what git reflog gives you. Knowing that you can easily undo any mistakes allows you to be a lot bolder in your experiments.
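The undo recipe is short (a sketch; pick whichever reflog entry predates the mistake):

```shell
git reflog                     # e.g. "a1b2c3d HEAD@{1}: commit: the good state"
git reset --hard 'HEAD@{1}'    # point the current branch back there
```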

Here's a thing I only recently learned: Git also keeps per-branch reflogs, which can be accessed using git reflog $branchname. This saves you some log-grovelling. Sadly, these only contain local changes: history-rewritings done by other people won't show up.

Reading the code of git-rebase--interactive

Interactive rebase is implemented as a shell script, which on my system lives at /usr/lib/git-core/git-rebase--interactive. It's pretty horrible code, but it's eye-opening to see how the various transformations are implemented at the low-level "plumbing" layer.

I actually did this as part of a (failed) project to implement the Darcs merge algorithm on top of Git. I still think this would be a good idea, if anyone wants to have a go. See also this related project, which AFAICT has some of the same advantages as the Darcs merge algorithm, and better asymptotic complexity.

Learning how the recursive merge algorithm works

You don't actually need to know this, but it's IMHO pretty elegant. Here's a good description.

Learning what HEAD actually means

I learned this from reading two great blog posts by Federico Mena Quintero: part 1, part 2. Those posts also cleared up a lot of my confusion about how remotes and remote-tracking branches work.

Learning to read refspec notation

The manual is pretty good here. Once you know how refspec notation works, you'll notice it's used all over and a lot of things will click into place.

Learning what git reset actually does

I'd been using git reset in stereotyped ways since the beginning: for instance, I knew that git reset --hard HEAD meant "throw away all uncommitted changes in my working directory" and git reset --hard [commit hash obtained from reflog] meant "throw away a broken attempt at history-rewriting". But it turns out that this is not the core of reset. Again, MJD has written a great explanation, but here's the tl;dr:

  1. It points the HEAD ref (aka the current branch - remember, branches are just pointers) at a new 'target' commit, if you specified one.
  2. Then it copies the tree of the HEAD commit to the index, unless you said --soft.
  3. Finally, it copies the contents of the index to the working tree, if you said --hard.
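Those three steps can be watched happening, one at a time, in a scratch repo with two commits (a sketch):

```shell
git reset --soft HEAD~1  # step 1 only: the branch moves back; your change stays staged
git reset                # adds step 2: index reset to HEAD's tree; change now unstaged
git reset --hard         # adds step 3: working tree reset to the index; change gone
```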

Once you understand this, git reset becomes part of your toolkit, something you can apply to new problems. Is a branch pointing to the wrong commit? There's a command that does exactly what you need, and it's git reset. Here's yet another MJD post, in which he explains a nonobvious usage for git reset which makes perfect sense in light of the above. Here's another one: suppose someone has rewritten history in a remote branch. If you do git pull you'll create a merge commit between your idea of the current commit and upstream's idea; if you later git push it you'll have created a messy history and people will be annoyed with you. No problem! git fetch upstream $branch; git checkout $branch; git reset --hard upstream/$branch.

git add -p and friends

Late Entry! The git add -p command, which interactively adds changes to the index (the -p is short for --patch), wasn't much of a surprise to me because I was used to darcs record; however, it seems to be a surprise to many people, so (at Joe Halliwell's suggestion) it deserves a mention here. And only tonight Aaron informed me of the existence of its cousins git reset -p, git commit -p and git checkout -p, which are totally going in my toolkit.


I hope that helped someone! Now, over to those more expert than me - what should be my next enlightenment experience?

pozorvlak: (Hal)
Sunday, December 9th, 2012 09:17 pm
I've been using Mercurial (also known as hg) as the version-control system for a project at work. I'd heard good things about it - a Git-like system with a cleaner UI and better documentation - and was glad of the excuse to try it out. Unfortunately, I was disappointed by what I found. The docs are good, and the UI's a bit cleaner, but it's still got some odd quirks - the difference between hg resolve and hg resolve -m catches me every bloody time, for instance. Unlike Git, you aren't prompted to set missing configuration options interactively. Some of the defaults are crazy, like not sending long output to a pager. And having got used to easy, safe history-rewriting in Git, I was horrified to learn that Mercurial offered no such guarantees of safety: up until version 2.2, the equivalent of a simple commit --amend could cause you to lose work permanently. Easy history-rewriting is a big deal; it means that you never have to choose between committing frequently and only pushing easily-reviewable history.

But I persevered, and with a bit of configuration I was able to make hg more comfortable. Here's my current .hgrc:

[ui]
username = Pozorvlak <pozorvlak@example.com>
merge = internal:merge

[pager]
pager = LESS='FSRX' less

[extensions]
rebase =
record =
histedit = ~/usr/etc/hg/hg_histedit.py
fetch =
shelve = ~/usr/etc/hg/hgshelve.py
pager =
mq =
color =

You'll need at least the username line, because of the aforementioned lack of interactive configuration. The pager = LESS='FSRX' less and pager = lines send long output to less instead of letting it all spew out and overflow your console scrollback buffer. merge = internal:merge tells it to use its internal merge algorithm as a merge tool, and put ">>>>" gubbins in files in the event of conflicts. Otherwise it uses meld for merges on my machine; meld is very pretty but not history-aware, and history-aware merges are at least 50% of the point of using a DVCS in the first place. The rebase extension allows you to graft a sequence of changesets onto another part of the history graph, like git rebase; the record extension allows you to select only some of the changes in your working copy for committing, like git add -p or darcs record; the fetch extension lets you do pull-and-merge in one operation - confusingly, git pull and git fetch are the opposite way round from hg fetch and hg pull. The mq extension turns on patch queues, which I needed for some hairy operation or other once. The non-standard histedit extension works like git rebase --interactive but not, I believe, as safely - dropped commits are deleted from the history graph entirely rather than becoming unreachable from an active head. The non-standard shelve extension works like git stash, though less conveniently - once you've shelved one change you need to give a name to all subsequent ones. Perhaps a Mercurial expert reading this can tell me how to delete unwanted shelves? Or about some better extensions or settings I should be using?
pozorvlak: (Hal)
Thursday, December 6th, 2012 11:41 pm

I've been running benchmarks again. The basic workflow is

  1. Create some number of directories containing the benchmark suites I want to run.
  2. Tweak the Makefiles so benchmarks are compiled and run with the compilers, simulators, libraries, flags, etc, that I care about.
  3. Optionally tweak the source code to (for instance) change the number of iterations the benchmarks are run for.
  4. Run the benchmarks!
  5. Check the output; discover that something is broken.
  6. Swear, fix the problem.
  7. Repeat until either you have enough data or the conference submission deadline gets too close and you are forced to reduce the scope of your experiments.
  8. Collate the outputs from the successful runs, and analyse them.
  9. Make encouraging noises as the graduate students do the hard work of actually writing the paper.

Suppose I want to benchmark three different simulators with two different compilers for three iteration counts. That's 18 configurations. Now note that the problem found in stage 5 and fixed in stage 6 will probably not be unique to one configuration - if it affects the invocation of one of the compilers then I'll want to propagate that change to nine configurations, for instance. If it affects the benchmarks themselves or the benchmark-invocation harness, it will need to be propagated to all of them. Sounds like this is a job for version control, right? And, of course, I've been using version control to help me with this; immediately after step 1 I check everything into Git, and then use git fetch and git merge to move changes between repositories. But this is still unpleasantly tedious and manual. For my last paper, I was comparing two different simulators with three iteration counts, and I organised this into three checkouts (x1, x10, x100), each with two branches (simulator1, simulator2). If I discovered a problem affecting simulator1, I'd fix it in, say, x1's simulator1 branch, then git pull the change into x10 and x100. When I discovered a problem affecting every configuration, I checked out the root commit of x1, fixed the bug in a new branch, then git merged that branch with the simulator1 and simulator2 branches, then git pulled those merges into x10 and x100.

Keeping track of what I'd done and what I needed to do was frankly too cognitively demanding, and I was constantly bedevilled by the sense that there had to be a Better Way. I asked about this on Twitter, and Ganesh Sittampalam suggested "use Darcs" - and you know, I think he's right, Darcs' "bag of commuting patches" model is a better fit to what I'm trying to do than Git's "DAG of snapshots" model. The obvious way to handle this in Darcs would be to have six base repositories, called "everything", "x1", "x10", "x100", "simulator1" and "simulator2"; and six working repositories, called "simulator1_x1", "simulator1_x10", "simulator1_x100", "simulator2_x1", "simulator2_x10" and "simulator2_x100". Then set up update scripts in each working repository, containing, for instance

darcs pull ../base/everything
darcs pull ../base/simulator1
darcs pull ../base/x10
and every time you fix a bug, run for i in working/*; do (cd $i && ./update); done.

But! It is extremely useful to be able to commit the output logs associated with a particular state of the build scripts, so you can say "wait, what went wrong when I used the -static flag? Oh yeah, that". I don't think Darcs handles that very well - or at least, it's not easy to retrieve any particular state of a Darcs repo. Git is great for that, but whenever I think about duplicating the setup described above in Git my mind recoils in horror before I can think through the details. Perhaps it shouldn't - would this work? Is there a Better Way that I'm not seeing?

pozorvlak: (babylon)
Monday, October 17th, 2011 12:38 pm
I was recently delighted to receive an email from someone saying that he'd just started a PhD with my old supervisor, and did I have any advice for him? You'll be unsurprised to learn that I did; I thought I'd post it here in the hope that someone else might find it useful. Some of what follows is specific to my supervisor, my field, or my discipline; most is more general. Your mileage may vary.
  • Your main problem for the next three or four years will be maintaining morale. Don't beat yourself up for slow/no progress. Do make sure you're eating, sleeping and exercising properly. Consider doing some reading about cognitive behavioural therapy so you can spot negative thought-patterns before they start to paralyse you.
  • Try to get some structure in your life. Weekly meetings are a minimum. Set yourself small deadlines. Don't worry overly if you miss them: if this stuff were easy to schedule, they wouldn't call it "research".
  • Sooner or later you'll discover that something you're working on has already been done, probably by Kelly. Do not panic. Chances are that one of the following is true:
    • his technique applies in some different domain (actually check this, because folklore often assigns greater utility to theorems than they actually possess)
    • your technique is obviously different (so there's an equivalence theorem to prove - or maybe not...)
    • your technique can be generalised or specialised or reapplied in some way that his can't.
  • Start writing now. I know everyone says this, but it's still good advice. It doesn't matter if you don't think you've got anything worth writing up yet. Write up background material. Write up rough notes. The very act of writing things up will suggest new ideas. And it will get you familiar with TeX, which is never a bad thing. As a category theorist, you will probably need to become more familiar with TeX than the average mathematician. And writing is mostly easier than doing mathematics - important, since you'll need something to do on those days when you just don't have enough energy for actual research.
  • Even if you don't start writing, you should certainly start maintaining a bibliography file, with your own notes in comments.
  • Speaking of fluctuating energy, you should read Terry Tao's advice on time management for mathematicians.
  • Keep your TeX source in version control. It's occasionally very helpful to be able to refer back and find out what changed when and why, and using a properly-designed system avoids the usual mess of thesis.old.tex.bak files lying around in your filesystem. I like Git, but other systems exist. Mercurial is meant to be especially nice if you haven't used version control before.
  • Make sure you have up-to-date backups (perhaps via a source-code hosting site like GitHub or BitBucket). And try to ensure you have access to a spare machine. You don't want to be futzing around with screwdrivers and hard drive enclosures when you've got a deadline.
  • Tom's a big fan of using rough sheets of paper to write on in supervision meetings [and perhaps your supervisor will be too, O reader]. You'll need to find a way of filing these or otherwise distilling them so that they can be referred to later. I never managed this.
  • For my own rough working, I like paper notebooks, which I try to carry around with me at all times. Your mileage may vary. Some people swear by a personal wiki, and in particular the TiddlyWiki/Dropbox combo.
  • Speaking of filing: the book Getting Things Done (which I recommend, even if I don't manage to follow most of its advice myself) recommends a simple alphabetical filing system for paper documents, with those fold-over cardboard folders (so you can pick up your whole file for a given topic and cart it around with you). I find this works pretty well. Make sure you have some spare folders around so you can easily spin up new files as needed.
  • Don't be afraid to read around your field, even if your supervisor advises you not to. I really wish I'd ignored mine and read more about rewriting systems, for instance.
  • Try to seize that surge of post-conference inspiration. My major theorem was proved in the airport on the way back from a conference. Also, airports make great working environments at 2am when hardly anybody's around :-)
  • Don't forget that if things get too bad, you can quit. Sometimes that's the best choice. I know several people who've dropped out of PhD programmes and gone on to happy lives.
  • The supply of newly-minted PhDs now outstrips the number of academic jobs available to them, and category theory's a niche and somewhat unfashionable field (in maths, at least - you may well have more luck applying to computer science departments. Bone up on some type theory). When you get to the end of your studies, expect finding an academic job to take a long time and many iterations. Try to have a backup plan in case nothing comes up. Let's hope the economy's picked up by then :-)
pozorvlak: (Default)
Monday, May 30th, 2011 06:44 pm
I have just written the following commit message:
"Branch over unconditional jump" hack for arbitrary-length brcc.
 - brcc (branch and compare) instructions can have static branch-prediction
   hints, but can only jump a limited distance.
 - Calculating the distance of a jump at expand-time is hard.
 - So instead of emitting just a brcc instruction, we emit an unconditional
   jump to the same place and a branch over it.
 - I added a simple counter to emit/state to number the labels thus introduced.
 - With this commit, I forfeit all right to criticise the hackiness of anyone
   else's code.
 - Though I doubt that will stop me.
pozorvlak: (Default)
Tuesday, April 5th, 2011 01:06 pm
Here are some bits of code I've released recently:

UK mountain weather forecast aggregator

The Mountain Weather Information Service do an excellent job, providing weather forecasts for all the mountain areas in the UK - most weather forecast sites only give forecasts for inhabited areas, and the weather at sea level often differs in interesting ways from the nearby weather at 1000m. However, their site's usability could be better. They assume that you're already in an area and want to know what the weather's going to be like for the next couple of days¹, but it's more normal for me to know what day I'm free to go hillwalking, and to want to know where I'll get the best weather.

So I decided to write a screen-scraper to gather and collate the information for me. I'd heard great things about Python's BeautifulSoup library and its ability to make sense of non-compliant, real-world HTML, so this seemed like a great excuse to try it out; unfortunately, BeautifulSoup completely failed me, only returning the head of the relevant pages. Fortunately, Afternoon and [livejournal.com profile] ciphergoth were on hand with Python advice; they told me that BeautifulSoup is now largely deprecated in favour of lxml. This proved much better: now all I needed to handle was the (lack of) structure of the pages...

There's a live copy running at mwis.assyrian.org.uk; you can download the source code from GitHub. There are a bunch of improvements that could be made to this code:
  1. The speed isn't too bad, but it could be faster. An obvious improvement is to stop doing eight HTTP GETs in series!
  2. There's no API.
  3. Your geographic options are limited: either the whole UK, or England & Wales, or Scotland. Here in the Central Belt, I'm closer to the English Lake District than I am to the North-West Highlands.
  4. The page design is severely functional, to put it kindly. Any design experts wanna suggest improvements? Readability on mobile devices is a major bonus.
  5. MWIS is dependent on sponsorship for their website-running costs, and for the English and Welsh forecasts. I don't want to take bread out of their mouths, so I should probably add yet more heuristics to the scraper to pull out the "please visit our sponsors" links.
  6. Currently all HTML is generated with raw print statements. It would be nicer to use a templating engine of some sort.
A possible solution to (1) and (2) is to move the scraper itself to ScraperWiki, and replace my existing CGI script with some JavaScript that pulls JSON from ScraperWiki and renders it. Anyway, if anyone feels like implementing any of these features for me, I'll gratefully accept your patches :-)


While I was developing the MWIS scraper, I found it was annoying to push to GitHub and then ssh to my host (or rather, switch to a window in which I'd already ssh'ed to my host) and pull my changes. So I wrote the World's Simplest Deployment Script. I've been finding it really useful, and you're welcome to use it yourself.

[In darcs, of course, one would just push to two different repos. Git doesn't really like you pushing to non-bare repositories, so this isn't such a great idea. If you want to know what an industrial-strength deployment setup would look like, I suggest you read this post about the continuous deployment setup at IMVU.]

bfcc - BrainF*** to C compiler

I was on the train, looking through the examples/ directory in the LLVM source tree, and noticed the example BrainF*** front-end. For some reason, it hadn't previously occurred to me quite how simple it would be to write a BF compiler. So I started coding, and had one working by the time I got back to Glasgow (which may sound a long time, but I was on my way back from an Edinburgh.pm meeting and was thus somewhat drunk). You can get it here. [livejournal.com profile] aaroncrane suggested a neat hack to provide O(1) arithmetic under certain circumstances: I should add this, so I can claim to have written an optimising BF compiler :-)

All of these programs are open source: share and enjoy. They're all pretty much trivial, but I reckon that creating and releasing something trivial is a great improvement over creating or releasing nothing.

¹ Great Britain is a small, mountainous island on the edge of the North Atlantic. Long-term weather forecasting is a lost cause here.
pozorvlak: (Default)
Wednesday, March 30th, 2011 10:57 pm
In The Art of Unix Programming, Eric Raymond lists among his basics of the Unix philosophy the "Rule of Generation":
14. Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
He goes into this idea in more detail in chapter 9 of the same book.

I used to believe this was a good idea, and in many situations (here's a great example) it is. But my current work project, which makes heavy use of code generators and custom minilanguages, has been a crash course (sometimes literally) in the downsides. Here's the latest example.

I've recently been merging in some code a colleague wrote about a year ago, just before I started. As you'd expect, with a year's drift this was a non-trivial exercise, but I eventually got all the diffs applied in (I thought) the right places. Protip: if forced to interact with a Subversion repository, use git as your client. It makes your life so much less unpleasant. Anyway, I finished the textual part of the merge, and compiled the code.

Screens-full of error messages. Oh well, that's not so unexpected.

I'm a big fan of Tilton's Law: "solve the first problem". The chances are good that the subsequent problems are just cascading damage from the first problem; no sense in worrying about them until you've fixed that one. Accordingly, I looked only at the first message: "The variable 'state' has not been declared at line 273".

Hang on...

Git checkout colleagues-branch. Make. No errors.

Git checkout merge-branch. Make. Screens-full of errors.

Git checkout colleagues-branch. Grep for a declaration of "state". None visible.

Clearly, there was some piece of voodoo that I'd failed to merge correctly.

I spent days looking through diffs for something, anything, that I'd failed to merge properly that might be relevant to this problem. I failed.

I then spent some serious quality time with the code-generation framework's thousand-page manual, looking for some implicit-declaration mechanism that might explain why "state" was visible in my colleague's branch, but not in mine. I failed.

Finally, I did what I probably should have done in the first place, and took a closer look at the generated code. The error messages that I was seeing referred to the DSL source code rather than the generated C code, because the code-generator emitted #line directives to reset the C compiler's idea of the current file and line; I could therefore find the relevant section of generated code by grepping for the name of the buggy source file in the gen/ directory.

The framework uses code generators for all sorts of things (my favourite generator being the shell script that interprets a DSL to build a Makefile which is used to build another Makefile), but this particular one was used to implement a form of polymorphism: the C snippet you provide is pasted into a honking great switch statement, which switches on some kind of type tag.

I found the relevant bit of generated code, and searched back to the beginning of the function. Yep, "state" was indeed undeclared in that function. And the code generator had left a helpful comment to tell me which hook I needed to use to declare variables or do other setup at the beginning of the function. So that was the thing I'd failed to merge properly!

Git checkout colleagues-branch. Grep for the hook. No results.

And then it hit me.

Like all nontrivial compilers, ours works by making several transformation passes over the code. The first pass parses your textual source-code and spits out a machine-independent tree-structured intermediate representation (IR). There then follow various optimization and analysis passes, which take in IR and return IR. Then the IR is expanded into a machine-specific low-level IR, and finally the low-level IR is emitted as assembly language.

The code that was refusing to compile was part of the expansion stage. But at the time that code was written, the expansion stage didn't exist: we went straight from the high-level IR to assembly. Adding an expansion stage had been my first task on being hired. Had we been using a language that supported polymorphism natively, that wouldn't have been a problem: the code would have been compiled anyway, and the errors would have been spotted; a smart enough compiler would have pointed out that the function was never called. But because we were using a two-stage generate-and-compile build process, we were in trouble. Because there was no expansion stage in my colleague's branch, the broken code was never pasted into a C file, and hence never compiled. My colleague's code was, in fact, full of compile-time errors, but appeared not to be, because the C compiler never got a look at it.

And then I took a closer look at the screens-full of error messages, and saw that I could have worked that out right at the beginning: subsequent error messages referred to OUTFILE, and the output file isn't even open at the expansion stage. Clearly, the code had originally been written to run in the emit phase (when both state and OUTFILE were live), and he'd got half-way through converting it to run at expansion-time before having to abandon it.

Lessons learned:
  1. In a generated-code scenario, do not assume that any particular snippet has been compiled successfully just because the whole codebase builds without errors.
  2. Prefer languages with decent native abstraction mechanisms to code generators.
  3. At least skim the subsequent error messages before dismissing them and working on the first bug: they may provide useful context.
  4. Communication: if I'd enquired more carefully about the condition of the code to be merged I could have saved myself a lot of time.
  5. Bear in mind the possibility that you might not be the guilty one.
  6. Treat ESR's pronouncements with even greater caution in future. Same goes for Kenny Tilton, or any other Great Prognosticator.
Any more?

Edit: two more:
  7. If, despite (2), you find yourself writing a snippet-pasting code generator, give serious thought to providing "this snippet is unused" warnings.
  8. Learn to spot when you're engaged in fruitless activity and need to step back and form a better plan. In my case, the time crawling through diffs was wasted, and I probably could have solved the problem much quicker if I'd rolled up my sleeves and tried to understand the actual code.
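The failure mode, and the "unused snippet" warning suggested above, can both be sketched with a toy generator (all file names here are invented for illustration):

```shell
# Toy snippet-pasting generator: only snippets named in the manifest are
# pasted into the generated C file, so a broken but unused snippet is
# never seen by the C compiler.
set -e
dir=$(mktemp -d) && cd "$dir"
mkdir snippets
echo 'int good(void) { return 42; }'     > snippets/good.c
echo 'int broken(void) { return oops; }' > snippets/broken.c  # would never compile
echo 'good' > manifest                   # 'broken' is never requested

# The "generator": paste the listed snippets together...
for s in $(cat manifest); do cat "snippets/$s.c"; done > generated.c

# ...and the cheap fix: warn about snippets no manifest entry mentions.
for f in snippets/*.c; do
    s=$(basename "$f" .c)
    grep -qx "$s" manifest || echo "warning: snippet '$s' is never used"
done
```

generated.c builds cleanly, the broken snippet is silently dead, and only the warning loop tells you so.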
Thanks to [livejournal.com profile] gareth_rees and jerf.
pozorvlak: (Default)
Sunday, February 20th, 2011 09:49 pm
My post about unit-testing Agda code using elisp was posted to the dependent types subreddit (of which I was previously unaware). There, the commenter stevana pointed out that you could achieve the same effect using the following Agda code:
test1 : 2 + 1 == 3
test1 = refl

test2 : 3 + 0 == 3
test2 = refl
If your test fails, you'll get a compile-time error on that line. Aaaaaaaargh. I do wish someone had told me that before I spent a week mucking about with elisp. Well, that's not quite true - I learned a lot about Emacs, I had some fun writing the code, I learned some things about Agda that I might otherwise not have done, I got some code-review from [livejournal.com profile] aaroncrane, and I had a chance to try out using darcs and patch-tag for a collaborative project (the experience, I'm afraid, didn't measure up well against git and GitHub).

What I'd failed to realise, of course, is that the == type constructor describes computable equality: a == b is inhabited if the compiler can compute a proof that a is equal to b. "They both normalise to the same thing" is such a proof, and obviously one that the compiler's smart enough to look for on its own.

Ordinarily, a testcase is better than a compile-time assertion with the same meaning, because you can attach a debugger to a failing test and investigate exactly how your code is broken. This doesn't mean that types are stupid - types draw their power from their ability to summarise many different testcases. But in this case, I don't think there's much reason to prefer our approach to stevana's: the slight increase in concision that agda-test allows is balanced out by its inability to coerce both expressions to the same type, which often means the user has to type a lot more code. Stevana's approach does not suffer from this problem.

This also makes me more positive about dependent typing: if simple testcases can be incorporated into your type-system so easily, then maybe the "types are composable tests" slogan has some truth to it.

But seriously, guys, macros.
pozorvlak: (Default)
Friday, February 4th, 2011 02:11 pm
On Monday I went to the first day of Conor McBride's course Introduction to Dependently-Typed programming in Agda. "What's dependently-typed programming?" you ask. Well, when a compiler type-checks your program, it's actually (in a sense which can be made precise) proving theorems about your code. Assignments of types to expressions correspond to proofs of statements in a formal logical language; the precise logical language in which these statements are expressed is determined by your type system (union types correspond to "or", function types correspond to "if... then", that kind of thing). This correspondence goes by the fancy name of "the Curry-Howard isomorphism" in functional programming and type-theory circles. In traditional statically-typed languages these theorems are mostly pretty uninteresting, but by extending your type system so it corresponds to a richer logical language you can start to state and prove some more interesting theorems by expressing them as types, guaranteeing deep properties of your program statically. This is the idea behind dependent typing. A nice corollary of this approach is that types in dependently-typed languages (such as Agda, the language of the course) can be parametrised by values (and not just by other types, as in Haskell), so you can play many of the same type-level metaprogramming games as in C++ and Ada, but in a hopefully less crack-fuelled way. I spent a bit of time last year playing around with Edwin Brady's dependently-typed systems language Idris, but found the dependent-typing paradigm hard to wrap my head around. So I was very pleased when Conor's course was announced.

The course is 50% lab-based, and in these lab sessions I realised something important: fancy type-system or no, I need to test my code, particularly when I'm working in such an unfamiliar language. Dependent typing may be all about correctness by construction, but I'm not (yet?) enlightened enough to work that way - I need to see what results my code actually produces. I asked Conor if there were any way to evaluate individual Agda expressions, and he pointed me at the "Evaluate term to normal form" command in Emacs' Agda mode (which revealed that I had, indeed, managed to produce several incorrect but well-typed programs). Now, that's better than nothing, but it's clearly inadequate as a testing system - you have to type the expression to evaluate every time, and check the result by eye. I asked Conor if Agda had a more extensive unit-testing framework, and he replied "I'm not aware of such a thing. The culture is more 'correctness by construction'. Testing is still valuable."

So I wrote one.

I've written - or at least hacked on - a test framework of some sort at almost every programming job I've had (though these days I try to lean on Test::More and friends as much as possible). It gets hella tedious. This one was a bit different, though. One message that came out of Conor's lectures was that the correct way to represent equality in dependently-typed languages is somewhat controversial; as a beginner, I didn't want to dip a toe into these dangerous waters until I had a clearer idea of the issues. But the basic operation of any testing system is "running THIS should yield THAT". Fortunately, there was a way I could punt on the problem. Since Agda development seems to be closely tied to the interactive Emacs mode, I could deliver my system as a set of Emacs commands; the actual testing could be done by normalising the expression under consideration and testing the normalised form for string-equality with the expected answer.

This was less easy than I'd expected; it turns out that agda-mode commands work by sending commands to a slave GHCi process, which generates elisp to insert the results into the appropriate buffer. I'm sure that agda-mode's authors had some rationale for this rather bizarre design, but it makes agda-mode a lot harder to build on than it could be. However (in close collaboration with Aaron Crane, who both contributed code directly and guided me through the gargantuan Emacs API like a Virgil to my Dante) I eventually succeeded. There are two ways to get our code:
darcs get http://patch-tag.com/r/pozorvlak/agda-test
git clone git://github.com/pozorvlak/agda-test.git
Then load the file agda-test.el in your .emacs in the usual way. Having done so, you can add tests to your Agda file in specially-formatted comments. For instance,
{- test 2+1: (suc (suc zero)) +N (suc zero) is (suc (suc (suc zero)));
   test 3+0: (suc (suc (suc (zero)))) +N zero is (suc (suc zero)); -}
When you then invoke the function agda2-test-all (via C-u C-c C-v for "verify"), you should be presented with a new buffer called *Agda test results*, containing the text
ok 1 - 2+1
not ok 2 - 3+0
    got 3
    expected 2
[ Except, er, I get "expected _175" for that last test instead. I don't think that's a bug in my elisp code, because I get the same result when I evaluate that expression manually with C-c C-n. Halp pls?]

You should recognise the output as the Test Anything Protocol; it should be easy to plug in existing tools for aggregating and displaying TAP results.

There are a lot of things I'd like to do in the future:
  • Add commands to only run a single test, or just the tests in a given comment block, or a user-specified ad-hoc test group.
  • Highlight failing tests in the Agda buffer.
  • Allow the user to specify a TAP aggregator for the results.
  • Do something about the case where the test expressions don't compile cleanly.
If you'd like to send patches (which would be very welcome!), the darcs repository is currently considered the master one, so please submit patches there. I'm guessing that most Agda people are darcs users, and the slight additional friction of using darcs as my VCS is noise next to the friction of using Emacs as my editor :-) The git repository is currently just a mirror of the darcs one, but it would be easy to switch over to having it as the master if enough potential contributors would prefer git to darcs. Things might get confusing if I get patches submitted in both places, though :-)
pozorvlak: (pozorvlak)
Wednesday, January 12th, 2011 08:28 pm
I think that djb redo will turn out to be the Git of build systems.

pozorvlak: (Default)
Wednesday, December 1st, 2010 03:03 pm
After my git/darcs talk, some of the folks on darcs-users were kind enough to offer constructive criticism. In particular, Stephen Turnbull mentioned an interesting use-case which I want to discuss further.

As I tried to stress, the key insight required to translate between git-think and darcs-think is
In git, the natural thing to pull is a branch (an ordered list of commits, each of which is assumed to depend on those before it); in darcs, the natural thing to pull is a patch (a named change whose dependencies are calculated optimistically by the system).
Stephen's use-case is this: you're a release manager, and one of your hackers has written some code you want to pull in. However, you don't want all their code. Suppose the change is nicely isolated into a single commit. In darcs, you can pull in just that commit (and a minimal set of prior commits required for it to apply cleanly). This is as far as my thinking had got, but Stephen points out that the interesting part of the story is what happens next: if you subsequently pull in other changes that depend on that commit, then darcs will note that it's already in your repository and all will be well. This is true in git if the developer has helpfully isolated that change into a branch: you can pull that branch, and subsequent merges will take account of the fact that you've done so. However, if the developer hasn't been so considerate, then you're potentially in trouble: you can cherry-pick that commit (creating a new commit with the same effect), but if you subsequently pull a branch containing it then git will not take account of your having cherry-picked it earlier. If either of you have changed any of the lines affected by that diff, then you'll get conflicts.
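That trap is easy to reproduce; here's a minimal sketch (all names invented) in which a cherry-picked commit later collides with a merge of its home branch:

```shell
# Cherry-pick a commit, tweak the lines it touched, then merge the
# branch it came from: git has no record that the pick "is" that commit,
# so the merge ends in a content conflict.
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name 'You'
echo 'original line' > file.txt
git add file.txt && git commit -qm 'base'

git checkout -q -b feature
echo 'fixed line' > file.txt
git commit -qam 'isolated fix'           # the commit the release manager wants
echo 'other feature work' >> file.txt
git commit -qam 'more feature work'

git checkout -q -                        # back to the original branch
git cherry-pick feature~1                # a *new* commit with the same diff
echo 'fixed line, since tweaked' > file.txt
git commit -qam 'tweak the fixed line'

git merge feature || echo 'conflict, as predicted'
```

The final merge fails with a content conflict in file.txt, exactly as described above; had the fix lived on its own branch and been merged rather than cherry-picked, git would have known it was already present.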

Thinking about this further, this means that I was even righter than I realised. In the git view of the world, the fact that that commit is not in its own branch is an assertion that it only makes sense in the context of the rest of the branch. Attempting to pull it in on its own is therefore not useful. You can do it, of course - it's git, you can do anything - but you're making a rod for your own back. As I tried to emphasize in the talk, git-cherry-pick is a low-level, hackish tool, only really intended for use in drastic situations or in the privacy of your own local repo. If you want something semantically meaningful, only pull branches.

Git-using release managers, therefore, have to rely on developers to package atomic features sensibly into branches. If your developers can't be trusted to do this, you may have a problem. But note that darcs has the dual problem: if you can't trust your developers to specify semantic (non-textual) dependencies with darcs record --ask-deps, then you're potentially going to be spending a lot of time tracking down semantic dependencies by hand. Having been a release manager under neither system, I don't have any intuition for which is worse - can anyone here shed any light?

[The cynic in me suggests that any benefits to the Darcs approach would only become apparent in projects which are large enough to rule out the use of Darcs with its current performance, but I could, as ever, be completely wrong. And besides, not being very useful right now doesn't rule out its ultimately proving to be a superior solution.]

On another note: Mark Stosberg (who's written quite a lot on the differences between darcs and git himself) confirms that people actually do use spontaneous branches, with ticket numbers as the "branch" identifiers. Which got me thinking. Any git user can see that spontaneous branches are more work for the user than real branches, because you have to remember your ticket number and type it into your commit message every time. Does that sound like a trivial amount of work to complain about? That's because you have no idea how easy branching and merging is in git. But it's also work that can be automated away with some tool support. Stick a file somewhere in _darcs containing the name of the current ticket, and somehow prepend that to your commit messages. I have just written a Perl script to do that (GitHub repo, share and enjoy).
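For the git analogue of that trick, the same automation can be sketched as a prepare-commit-msg hook (the .git/current-ticket file is a convention invented for this demo; my actual script is Perl and darcs-specific):

```shell
# Self-contained demo: a prepare-commit-msg hook that prepends the
# ticket number in .git/current-ticket (a made-up convention) to every
# commit message.
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name 'You'

cat > .git/hooks/prepare-commit-msg <<'EOF'
#!/bin/sh
ticket_file="$(git rev-parse --git-dir)/current-ticket"
[ -f "$ticket_file" ] || exit 0
ticket=$(cat "$ticket_file")
msg=$(cat "$1")
printf '%s: %s\n' "$ticket" "$msg" > "$1"
EOF
chmod +x .git/hooks/prepare-commit-msg

echo 1234 > .git/current-ticket    # "I'm working on ticket 1234"
echo hello > file.txt
git add file.txt
git commit -qm 'add file'
git log -1 --format=%s             # the message now starts with the ticket
```

Switching tickets is then just overwriting one file, which is about as close to git's cheap branching as you can get without actual branches.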

Now we just need the ability to easily back-out and restore incomplete tickets without creating a whole new repo, and they'll be as convenient as git topic branches :-)
pozorvlak: (Default)
Friday, November 12th, 2010 10:37 am
Last night was Glasgow.pm's second technical meeting, and I finally gave a version of the DVCS-comparison talk I've been thinking about doing for at least a year. I could have done with a lot more rehearsal, something that would have been helped by finishing off the slides before (checks git-log) 5.33pm - the meeting started at 6.15 - but I think it went OK.

The idea of the talk was to
  • explain a little bit of what git's doing under the hood
  • explain how the darcs model differs
  • cut through some of the tangle of conflicting terminology
  • explain why git users think branching's such a big deal, and why darcs users think that cherry-picking's such a big deal (spoilers: the answers are the same!).
I didn't try to explain the details of how you use either system, because in both cases that's fairly easy to work out once you have a decent handle on what's going on under the hood. The audience was mostly full of git users, so I spent more time explaining the darcs model; hopefully darcs users wanting to learn git will also find the slides helpful. For a more detailed introduction to how git works, I strongly recommend mjd's recent talk on the subject, and for more on the (much ameliorated) "exponential merge" problem in darcs see here. Edit: and for the details of how to use git from a darcsy perspective, try the GHC wiki's git for darcs users page.

By the way, there's a useful consequence of git's design which neither mjd's slides nor mine mention, but which I mentioned in the talk, namely the reflog. It's a list of every commit you've visited, and when you were there. This means you can say things like "Show me the state of the repository at 11.15 last Monday"; more usefully, it lets you track down and recover commits that have been orphaned by some botched attempt at history rewriting. This is not a feature that you need often, but when you do need it it's an absolute lifesaver. Git's "directed graph of snapshots" model makes this feature almost trivial to add (and because git's built on a content-addressable filesystem, jumping to those orphaned commits is fast), but darcs' "bag of patches" model makes it much harder to add such a feature (though they're thinking about possible approaches that make more sense than storing your darcs repo in git).
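A minimal demonstration of the rescue (names invented):

```shell
# Recover a commit orphaned by an over-enthusiastic reset, via the reflog.
set -e
dir=$(mktemp -d) && cd "$dir"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name 'You'
echo one > f.txt && git add f.txt && git commit -qm 'first commit'
echo two > f.txt && git commit -qam 'precious work'

git reset -q --hard HEAD~1      # oops: 'precious work' is now orphaned
git reflog                      # every position HEAD has visited, with times
                                # (date syntax like 'HEAD@{last monday}' works too)
git branch rescued 'HEAD@{1}'   # where HEAD was just before the reset
git log -1 --format=%s rescued  # the "lost" commit, safely on a branch again
```
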

Thanks very much to Eric Kow and Guillaume Hoffman for answering my questions about darcs. Any errors remaining are of course my own.

Anyway, you can get the slides here (slightly cleaned-up). Please let me know if they don't make sense on their own. Or, if you really care, you can browse the history here. To build the TeX source, you'll need the prosper, xypic and graphicx packages installed.

Edit: some of the people on darcs-users were kind enough to comment, prompting some further thought. I've written a followup post in which I respond to some of the things they said.
pozorvlak: (Default)
Sunday, December 13th, 2009 03:16 pm
We've been using git for version control at work for the last couple of months, and I'm really impressed with it. My favourite thing about git is what might be termed its unfuckability: no matter what you do to a repository to fuck it up, it seems, it's always possible to unfuck¹ it, usually simply by keeping calm and reading the error messages. I've managed to lose data with an ill-advised git reset --hard, but that was before I knew about the reflog, and I've always been able to recover "lost" work in every other case. And then there's the rest: cheap local branching², the index, the raw speed, git-bisect, git-gui and gitk (which has rapidly become an indispensable part of my development toolchain)³.

The merge algorithm pretty much Just Works: we get the occasional merge conflict, sure, but (so far) never without good reason. So I was surprised to learn (from Mark Shuttleworth) of a really simple case where git's merge algorithm does the Wrong Thing.

Bazaar obviously gets this right, otherwise Mark Shuttleworth wouldn't have written his post. Commenters there suggest that Darcs gets this right too, but after spending a while looking through the Darcs wiki I discover that I really, really can't be arsed to work out how to do the necessary branching to test it. Hopefully some helpful Darcs user (I know you're still out there...) will be able to post the relevant transcript in a comment. [Edit: I realised belatedly that you don't need branches for this. Transcript here.]

Overall, I don't think this is a show-stopper, or even a reason to seriously think about switching to another DVCS, but it's certainly worth remembering and watching out for.

¹ Why yes, I have been watching the excellent Generation Kill. How did you guess? :-)
² If you're not used to systems in which this is possible, it probably sounds really scary and difficult, and like the kind of thing you'd never need or use. It's not. It's actually really simple and incredibly useful. The article that made it click for me, Jonathan Rockway's Git Merging By Example, is no longer online, but I'm sure there are equally good ones out there. You'll probably find Aaron Crane's "branch name in your shell prompt" utility helpful.
³ Just to clarify: I'm not saying that Your Favourite VCS sucks, or that it's impossible to get these features using it: I'm just saying that git has them, and they're really, really helpful.
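A minimal bash sketch of the branch-in-prompt idea from footnote ² (the real utility also shows dirty state and suppresses the parentheses outside a repo):

```shell
# Put the current git branch name in your bash prompt.
git_branch() {
  # prints the branch, or nothing when outside a repo / on a detached HEAD
  git symbolic-ref --short -q HEAD 2>/dev/null
}
PS1='\w ($(git_branch)) \$ '
```
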