January 2017

1 234567

Style Credit

Expand Cut Tags

No cut tags
pozorvlak: (Default)
Monday, April 20th, 2015 08:29 pm

Calculating the dependency graph of your build ahead-of-time can be fiendishly difficult, or even impossible. Redo brings two, I think brilliant, insights to bear on this problem.
  1. If you have to build the target in the course of calculating its dependencies, that's totally OK, because that's what you really wanted to do in the first place.
  2. You don't actually have to know the entire build graph when you start building; you only need to know enough dependencies such that
    • if a given target T needs to be rebuilt, at least one of the dependencies you know about for T will have been affected;
    • in the course of building T, you will discover the remaining dependencies of T and rebuild any stale ones.
Let's try to write do-files for building OCaml modules.

In default.cmi.do:
redo-ifchange $2.mli
ocamlc $2.mli

In default.cmo.do:
redo-ifchange $2.cmi $2.ml
redo-ifchange `ocamldep $2.mli $2.ml`
ocamlc $2.ml
Note: these do-files will not actually work, because redo insists that you write your output to a temporary file called $3 so it can atomically rename the newly-built file into place, and ocamlc is equally insistent that it knows better than you what its output files should be called. However, this annoying interaction of their limitations is irrelevant to the dependency-checking algorithm, so I'll pretend that they do work :-) I'll try to construct a workaround and post it on GitHub. Update: I have now done so!

The redo-ifchange command says "if you know how to build my arguments, then build them; if any of them changes in the future, then the target built by this script will be out-of-date". So to build X.cmi, we observe that it depends on X.mli (which probably won't need building), and then build it. To build X.cmo, we observe that it depends on both X.ml and X.cmi (which will be rebuilt if need be). Then we invoke ocamldep to get a list of other files imported by X.ml, build those if any are out of date, and finally invoke ocamlc on X.ml to produce X.cmo.

Let's see how this plays out in the following scenario:
  1. We build X.cmo.
  2. We try to build it again.
  3. We edit X.ml, adding a new dependency Y.
  4. We rebuild X.cmo.
First, redo runs default.cmo.do, discovers that X.cmo depends on X.ml and X.cmi, recursively builds X.cmi (determining that it depends on X.mli), and finally compiles X.ml, producing X.cmo.

When we run redo-ifchange X.cmo again, redo will check its database of dependencies and observe that X.cmo depends transitively on X.cmi, X.ml and X.mli but that none of them have changed; hence, it will correctly do nothing.

Then we add the dependency, and run redo-ifchange X.cmo. Redo will again check its database of dependencies and note that X.ml has changed, so it must re-run default.cmo.do. First it notes that X.cmo depends on X.cmi and X.ml: it checks its database and sees that X.cmi depends on X.mli, which hasn't changed, so it leaves X.cmi alone. Next it re-runs ocamldep X.mli X.ml and hands the output to redo-ifchange: this tells redo that X.cmo now depends on Y.cmi. Y.cmi doesn't exist yet, so it builds it using the rules in default.cmi.do. Finally it compiles X.ml into X.cmo.

This system should work provided that all your dependencies live within the filesystem, or can be brought within it; however, if this is not the case then you probably have bigger problems :-)
pozorvlak: (Hal)
Thursday, December 6th, 2012 11:41 pm

I've been running benchmarks again. The basic workflow is

  1. Create some number of directories containing the benchmark suites I want to run.
  2. Tweak the Makefiles so benchmarks are compiled and run with the compilers, simulators, libraries, flags, etc, that I care about.
  3. Optionally tweak the source code to (for instance) change the number of iterations the benchmarks are run for.
  4. Run the benchmarks!
  5. Check the output; discover that something is broken.
  6. Swear, fix the problem.
  7. Repeat until either you have enough data or the conference submission deadline gets too close and you are forced to reduce the scope of your experiments.
  8. Collate the outputs from the successful runs, and analyse them.
  9. Make encouraging noises as the graduate students do the hard work of actually writing the paper.

Suppose I want to benchmark three different simulators with two different compilers for three iteration counts. That's 18 configurations. Now note that the problem found in stage 5 and fixed in stage 6 will probably not be unique to one configuration - if it affects the invocation of one of the compilers then I'll want to propagate that change to nine configurations, for instance. If it affects the benchmarks themselves or the benchmark-invocation harness, it will need to be propagated to all of them. Sounds like this is a job for version control, right? And, of course, I've been using version control to help me with this; immediately after step 1 I check everything into Git, and then use git fetch and git merge to move changes between repositories. But this is still unpleasantly tedious and manual. For my last paper, I was comparing two different simulators with three iteration counts, and I organised this into three checkouts (x1, x10, x100), each with two branches (simulator1, simulator2). If I discovered a problem affecting simulator1, I'd fix it in, say, x1's simulator1 branch, then git pull the change into x10 and x100. When I discovered a problem affecting every configuration, I checked out the root commit of x1, fixed the bug in a new branch, then git merged that branch with the simulator1 and simulator2 branches, then git pulled those merges into x10 and x100.

Keeping track of what I'd done and what I needed to do was frankly too cognitively demanding, and I was constantly bedevilled by the sense that there had to be a Better Way. I asked about this on Twitter, and Ganesh Sittampalam suggested "use Darcs" - and you know, I think he's right, Darcs' "bag of commuting patches" model is a better fit to what I'm trying to do than Git's "DAG of snapshots" model. The obvious way to handle this in Darcs would be to have six base repositories, called "everything", "x1", "x10", "x100", "simulator1" and "simulator2"; and six working repositories, called "simulator2_x1", "simulator2_x10", "simulator2_x100", "simulator2_x1", "simulator2_x10" and "simulator2_x100". Then set up update scripts in each working repository, containing, for instance

darcs pull ../base/everything
darcs pull ../base/simulator1
darcs pull ../base/x10
and every time you fix a bug, run for i in working/*; do $i/update; done.

But! It is extremely useful to be able to commit the output logs associated with a particular state of the build scripts, so you can say "wait, what went wrong when I used the -static flag? Oh yeah, that". I don't think Darcs handles that very well - or at least, it's not easy to retrieve any particular state of a Darcs repo. Git is great for that, but whenever I think about duplicating the setup described above in Git my mind recoils in horror before I can think through the details. Perhaps it shouldn't - would this work? Is there a Better Way that I'm not seeing?

pozorvlak: (Hal)
Thursday, December 6th, 2012 09:45 pm
Inspired by Falsehoods Programmers Believe About Names, Falsehoods Programmers Believe About Time, and far, far too much time spent fighting autotools. Thanks to Aaron Crane, [livejournal.com profile] totherme and [livejournal.com profile] zeecat for their comments on earlier versions.

It is accepted by all decent people that Make sucks and needs to die, and that autotools needs to be shot, decapitated, staked through the heart and finally buried at a crossroads at midnight in a coffin full of millet. Hence, there are approximately a million and seven tools that aim to replace Make and/or autotools. Unfortunately, all of the Make-replacements I am aware of copy one or more of Make's mistakes, and many of them make new and exciting mistakes of their own.

I want to see an end to Make in my lifetime. As a service to the Make-replacement community, therefore, I present the following list of tempting but incorrect assumptions various build tools make about building software.

All of the following are wrong:
  • Build graphs are trees.
  • Build graphs are acyclic.
  • Every build step updates at most one file.
  • Every build step updates at least one file.
  • Compilers will always modify the timestamps on every file they are expected to output.
  • It's possible to tell the compiler which file to write its output to.
  • It's possible to tell the compiler which directory to write its output to.
  • It's possible to predict in advance which files the compiler will update.
  • It's possible to narrow down the set of possibly-updated files to a small hand-enumerated set.
  • It's possible to determine the dependencies of a target without building it.
  • Targets do not depend on the rules used to build them.
  • Targets depend on every rule in the whole build system.
  • Detecting changes via file hashes is always the right thing.
  • Detecting changes via file hashes is never the right thing.
  • Nobody will ever want to rebuild a subset of the available dirty targets.
  • People will only want to build software on Linux.
  • People will only want to build software on a Unix derivative.
  • Nobody will want to build software on Windows.
  • People will only want to build software on Windows.
    (Thanks to David MacIver for spotting this omission.)
  • Nobody will want to build on a system without strace or some equivalent.
  • stat is slow on modern filesystems.
  • Non-experts can reliably write portable shell script.
  • Your build tool is a great opportunity to invent a whole new language.
  • Said language does not need to be a full-featured programming language.
  • In particular, said language does not need a module system more sophisticated than #include.
  • Said language should be based on textual expansion.
  • Adding an Nth layer of textual expansion will fix the problems of the preceding N-1 layers.
  • Single-character magic variables are a good idea in a language that most programmers will rarely use.
  • System libraries and globally-installed tools never change.
  • Version numbers of system libraries and globally-installed tools only ever increase.
  • It's totally OK to spend over four hours calculating how much of a 25-minute build you should do.
  • All the code you will ever need to compile is written in precisely one language.
  • Everything lives in a single repository.
  • Files only ever get updated with timestamps by a single machine.
  • Version control systems will always update the timestamp on a file.
  • Version control systems will never update the timestamp on a file.
  • Version control systems will never change the time to one earlier than the previous timestamp.
  • Programmers don't want a system for writing build scripts; they want a system for writing systems that write build scripts.

[Exercise for the reader: which build tools make which assumptions, and which compilers violate them?]

pozorvlak: (Default)
Wednesday, October 26th, 2011 06:55 pm
Man, maintaining our Makefiles by hand really sucks. I think I'm going to write a Makefile generator. Do you think that's a good idea?

Certainly not! Go and wash your computer out with soap.

But then my computer will be ruined, and I won't be able to code!

Yes. That's the idea.
pozorvlak: (Default)
Saturday, July 2nd, 2011 06:37 pm
I'm currently running a lot of benchmarks in my day job, in the hope of perhaps collecting some useful data in time for an upcoming paper submission deadline - this is the "science" part of "computer science". Since getting a given benchmark suite built and running is often needlessly complex and tedious, one of my colleagues has written an abstraction layer in the form of a load of Makefiles. By issuing commands like "make build-eembc2", "make run-utdsp" or "make distclean-dspstone" you can issue the correct command (build/run/distclean) to whichever benchmark suite you care about. The lists of individual benchmarks are contained in .mk files, so you can strip out any particular benchmark you're not interested in.

I want to use benchmark runs as part of the fitness function for a genetic algorithm, so it's important that it run fast, and simulating another processor (as we're doing) is inherently a slow business. Fortunately, benchmark suites consist of lots of small programs, which can be run in parallel if you don't care about measuring wallclock seconds. And make already has support for parallel builds, using the -j option.

But it's always worth measuring these things, so I copied the benchmark code up onto our multi-core number crunching machine, and did two runs-from-clean with and without the -j flag. No speedup. Checking top, I found that only one copy of the simulator or compiler was ever running at a time. What the hell? Time to look at the code:
TARGETS=build run collect clean distclean

%-eembc2: eembc-2.0
        @for dir in $(BMARKS_EEMBC2) ; do \
          if test -d eembc-2.0/$$dir ; then \
            ${MAKE} -C eembc-2.0/$$dir $* ; \
          fi; \
Oh God. Dear colleague, you appear to have taken a DSL explicitly designed to provide parallel tracking of dependencies, and then deliberately thrown that parallelism away. What were you thinking?¹ But it turns out that Dominus' Razor applies here, because getting the desired effect without sacrificing parallelism is actually remarkably hard... )

Doing it in redo instead )

Time to start teaching my colleagues about redo? I think it might be...

¹ He's also using recursive make, which means we're doing too much work if there's much code shared between different benchmarks. But since the time taken to run a benchmark is utterly dominated by simulator time, I'm not too worried about that.