pozorvlak: (Default)
Tuesday, April 5th, 2011 01:06 pm
Here are some bits of code I've released recently:

UK mountain weather forecast aggregator


The Mountain Weather Information Service do an excellent job, providing weather forecasts for all the mountain areas in the UK - most weather forecast sites only give forecasts for inhabited areas, and the weather at sea level often differs in interesting ways from the nearby weather at 1000m. However, their site's usability could be better. They assume that you're already in an area and want to know what the weather's going to be like for the next couple of days¹, but it's more normal for me to know what day I'm free to go hillwalking, and to want to know where I'll get the best weather.

So I decided to write a screen-scraper to gather and collate the information for me. I'd heard great things about Python's BeautifulSoup library and its ability to make sense of non-compliant, real-world HTML, so this seemed like a great excuse to try it out; unfortunately, BeautifulSoup completely failed me, only returning the head of the relevant pages. Fortunately, Afternoon and ciphergoth were on hand with Python advice; they told me that BeautifulSoup is now largely deprecated in favour of lxml. This proved much better: now all I needed to handle was the (lack of) structure of the pages...

There's a live copy running at mwis.assyrian.org.uk; you can download the source code from GitHub. There are a bunch of improvements that could be made to this code:
  1. The speed isn't too bad, but it could be faster. An obvious improvement is to stop doing eight HTTP GETs in series!
  2. There's no API.
  3. Your geographic options are limited: either the whole UK, or England & Wales, or Scotland. Here in the Central Belt, I'm closer to the English Lake District than I am to the North-West Highlands.
  4. The page design is fugly, sorry, severely functional. Any design experts wanna suggest improvements? Readability on mobile devices is a major bonus.
  5. MWIS is dependent on sponsorship for their website-running costs, and for the English and Welsh forecasts. I don't want to take bread out of their mouths, so I should probably add yet more heuristics to the scraper to pull out the "please visit our sponsors" links.
  6. Currently all HTML is generated with raw print statements. It would be nicer to use a templating engine of some sort.
A possible solution to (1) and (2) is to move the scraper itself to ScraperWiki, and replace my existing CGI script with some JavaScript that pulls JSON from ScraperWiki and renders it. Anyway, if anyone feels like implementing any of these features for me, I'll gratefully accept your patches :-)
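For (1), one way to stop doing the GETs in series is a thread pool. What follows is a minimal sketch in modern Python, not the scraper's actual code: the `fetch_all` helper and its parameters are made up for illustration, and the fetcher is passed in as an argument so the sketch can be tried without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Fetch one page and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL concurrently; results come back in input order."""
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(urls))) as pool:
        # Executor.map preserves the order of its input, so results[i]
        # corresponds to urls[i] even if the requests finish out of order.
        return list(pool.map(fetcher, urls))
```

With eight forecast pages and eight workers, the total wait should be roughly that of the slowest single request rather than the sum of all of them.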

git-deploy


While I was developing the MWIS scraper, I found it was annoying to push to GitHub and then ssh to my host (or rather, switch to a window in which I'd already ssh'ed to my host) and pull my changes. So I wrote the World's Simplest Deployment Script. I've been finding it really useful, and you're welcome to use it yourself.
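The push-then-pull idea fits in a few lines. This is only an illustration of the approach described above, not the released script: the host, directory, remote and branch names are all placeholders.

```python
import shlex
import subprocess


def deploy_commands(host, remote_dir, git_remote="origin", branch="master"):
    """Build the two commands: push locally, then pull on the host over ssh."""
    remote_cmd = "cd {} && git pull {} {}".format(
        shlex.quote(remote_dir), shlex.quote(git_remote), shlex.quote(branch)
    )
    return [
        ["git", "push", git_remote, branch],
        ["ssh", host, remote_cmd],
    ]


def deploy(host, remote_dir, **kwargs):
    """Run each command in turn, stopping at the first failure."""
    for command in deploy_commands(host, remote_dir, **kwargs):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the running makes it easy to print the commands first and check them before letting the script loose on a real host.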

[In darcs, of course, one would just push to two different repos. Git doesn't really like you pushing to non-bare repositories, so this isn't such a great idea. If you want to know what an industrial-strength deployment setup would look like, I suggest you read this post about the continuous deployment setup at IMVU.]

bfcc - BrainF*** to C compiler


I was on the train, looking through the examples/ directory in the LLVM source tree, and noticed the example BrainF*** front-end. For some reason, it hadn't previously occurred to me quite how simple it would be to write a BF compiler. So I started coding, and had one working by the time I got back to Glasgow (which may sound a long time, but I was on my way back from an Edinburgh.pm meeting and was thus somewhat drunk). You can get it here. aaroncrane suggested a neat hack to provide O(1) arithmetic under certain circumstances: I should add this, so I can claim to have written an optimising BF compiler :-)
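To show quite how simple a BF-to-C translator can be, here is a rough sketch in Python. It is not bfcc itself: the function name, the tape size and the shape of the emitted C are assumptions made for illustration, and it does no optimisation at all.

```python
# Each BF instruction maps to a fixed fragment of C.
C_FOR = {
    ">": "++p;",
    "<": "--p;",
    "+": "++*p;",
    "-": "--*p;",
    ".": "putchar(*p);",
    ",": "*p = getchar();",
    "[": "while (*p) {",
    "]": "}",
}


def bf_to_c(source, tape_size=30000):
    """Translate BF source text to a C program, ignoring non-BF characters."""
    lines = [
        "#include <stdio.h>",
        "",
        "int main(void) {",
        "    static unsigned char tape[%d];" % tape_size,
        "    unsigned char *p = tape;",
    ]
    depth = 1  # current indentation level inside main()
    for ch in source:
        if ch not in C_FOR:
            continue  # anything else is a comment in BF
        if ch == "]":
            depth -= 1
            if depth < 1:
                raise SyntaxError("unmatched ']'")
        lines.append("    " * depth + C_FOR[ch])
        if ch == "[":
            depth += 1
    if depth != 1:
        raise SyntaxError("unmatched '['")
    lines += ["    return 0;", "}"]
    return "\n".join(lines) + "\n"
```

The whole compiler is a lookup table plus bracket checking; the C compiler does the rest.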



All of these programs are open source: share and enjoy. They're all pretty much trivial, but I reckon that creating and releasing something trivial is a great improvement over creating or releasing nothing.

¹ Great Britain is a small, mountainous island on the edge of the North Atlantic. Long-term weather forecasting is a lost cause here.
pozorvlak: (Default)
Wednesday, March 30th, 2011 10:57 pm
In The Art of Unix Programming, Eric Raymond lists among his basics of the Unix philosophy the "Rule of Generation":
14. Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
He goes into this idea in more detail in chapter 9 of the same book.

I used to believe this was a good idea, and in many situations (here's a great example) it is. But my current work project, which makes heavy use of code generators and custom minilanguages, has been a crash course (sometimes literally) in the downsides. Here's the latest example.

I've recently been merging in some code a colleague wrote about a year ago, just before I started. As you'd expect, with a year's drift this was a non-trivial exercise, but I eventually got all the diffs applied in (I thought) the right places. Protip: if forced to interact with a Subversion repository, use git as your client. It makes your life so much less unpleasant. Anyway, I finished the textual part of the merge, and compiled the code.

Screens-full of error messages. Oh well, that's not so unexpected.

I'm a big fan of Tilton's Law: "solve the first problem". The chances are good that the subsequent problems are just cascading damage from the first problem; no sense in worrying about them until you've fixed that one. Accordingly, I looked only at the first message: "The variable 'state' has not been declared at line 273".

Hang on...

Git checkout colleagues-branch. Make. No errors.

Git checkout merge-branch. Make. Screens-full of errors.

Git checkout colleagues-branch. Grep for a declaration of "state". None visible.

Clearly, there was some piece of voodoo that I'd failed to merge correctly.

I spent days looking through diffs for something, anything, that I'd failed to merge properly that might be relevant to this problem. I failed.

I then spent some serious quality time with the code-generation framework's thousand-page manual, looking for some implicit-declaration mechanism that might explain why "state" was visible in my colleague's branch, but not in mine. I failed.

Finally, I did what I probably should have done in the first place, and took a closer look at the generated code. The error messages that I was seeing referred to the DSL source code rather than the generated C code, because the code-generator emitted #line directives to reset the C compiler's idea of the current file and line; I could therefore find the relevant section of generated code by grepping for the name of the buggy source file in the gen/ directory.
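For anyone who hasn't met `#line` before, here is a small sketch of the mechanism in Python. It is not the framework's generator: the helper names and the `expand.dsl` filename in the usage note are invented for illustration.

```python
def with_line_directive(snippet, source_file, source_line):
    """Prefix a pasted snippet with a C #line directive.

    The C compiler then reports errors in the snippet as coming from
    source_file, starting at source_line, not from the generated file.
    """
    return '#line {} "{}"\n{}\n'.format(
        source_line, source_file, snippet.rstrip("\n")
    )


def find_pasted(generated_text, source_file):
    """Return the 1-based line numbers in the generated C text at which
    snippets from source_file begin (the line after each #line directive)."""
    needle = '"{}"'.format(source_file)
    hits = []
    for number, line in enumerate(generated_text.splitlines(), start=1):
        if line.startswith("#line") and needle in line:
            hits.append(number + 1)
    return hits
```

Calling `with_line_directive("state = 1;", "expand.dsl", 273)` yields a `#line 273 "expand.dsl"` line followed by the snippet, and `find_pasted` is the in-memory equivalent of grepping the gen/ directory for the source file's name.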

The framework uses code generators for all sorts of things (my favourite generator being the shell script that interprets a DSL to build a Makefile which is used to build another Makefile), but this particular one was used to implement a form of polymorphism: the C snippet you provide is pasted into a honking great switch statement, which switches on some kind of type tag.
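A snippet-pasting generator of that kind might look roughly like the sketch below. The real framework's generator is not shown in this post, so everything here is an assumption: the function name, the `struct node` type, the `tag` field and the case labels are all hypothetical.

```python
def generate_dispatch(function_name, snippets):
    """Build a C function that switches on a type tag.

    snippets maps a case label (for example an enum constant) to the
    C code the user supplied for that case; it is pasted in verbatim.
    """
    lines = [
        "void {}(struct node *n) {{".format(function_name),
        "    switch (n->tag) {",
    ]
    for label, body in snippets.items():
        lines.append("    case {}: {{".format(label))
        for body_line in body.splitlines():
            lines.append("        " + body_line)
        lines.append("        break;")
        lines.append("    }")
    lines.append("    default:")
    lines.append("        break;")
    lines.append("    }")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Note that the generator treats each snippet as opaque text. Nothing checks that it is valid C until the generated file reaches the C compiler.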

I found the relevant bit of generated code, and searched back to the beginning of the function. Yep, "state" was indeed undeclared in that function. And the code generator had left a helpful comment to tell me which hook I needed to use to declare variables or do other setup at the beginning of the function. So that was the thing I'd failed to merge properly!

Git checkout colleagues-branch. Grep for the hook. No results.

And then it hit me.

Like all nontrivial compilers, ours works by making several transformation passes over the code. The first pass parses your textual source-code and spits out a machine-independent tree-structured intermediate representation (IR). There then follow various optimization and analysis passes, which take in IR and return IR. Then the IR is expanded into a machine-specific low-level IR, and finally the low-level IR is emitted as assembly language.
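The shape of that pipeline can be shown with a toy. The sketch below is not our compiler: the reverse-Polish input language, the tuple-based IR and the stack-machine "assembly" are all invented to keep the example short.

```python
def parse(source):
    """Pass 1: turn RPN text such as 'x 2 3 + *' into a tree-shaped IR."""
    ops = {"+": "add", "*": "mul"}
    stack = []
    for token in source.split():
        if token in ops:
            right = stack.pop()
            left = stack.pop()
            stack.append((ops[token], left, right))
        elif token.isdigit():
            stack.append(("const", int(token)))
        else:
            stack.append(("var", token))
    if len(stack) != 1:
        raise SyntaxError("expected exactly one expression")
    return stack[0]


def fold_constants(ir):
    """An IR-to-IR pass: evaluate operations whose operands are constants."""
    if ir[0] in ("const", "var"):
        return ir
    op, left, right = ir[0], fold_constants(ir[1]), fold_constants(ir[2])
    if left[0] == "const" and right[0] == "const":
        value = left[1] + right[1] if op == "add" else left[1] * right[1]
        return ("const", value)
    return (op, left, right)


def expand(ir):
    """Lower the tree IR to a flat, stack-machine-specific low-level IR."""
    if ir[0] == "const":
        return [("PUSH", ir[1])]
    if ir[0] == "var":
        return [("LOAD", ir[1])]
    return expand(ir[1]) + expand(ir[2]) + [(ir[0].upper(),)]


def emit(low_level_ir):
    """Final pass: print the low-level IR as assembly text."""
    return "\n".join(
        " ".join(str(part) for part in instr) for instr in low_level_ir
    )


def compile_source(source, passes=(fold_constants,)):
    """Run the whole pipeline: parse, optimise, expand, emit."""
    ir = parse(source)
    for optimisation in passes:
        ir = optimisation(ir)
    return emit(expand(ir))
```

Here `compile_source("x 2 3 + *")` folds `2 3 +` into a single constant before expansion, so the emitted code is `LOAD x`, `PUSH 5`, `MUL`.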

The code that was refusing to compile was part of the expansion stage. But at the time that code was written, the expansion stage didn't exist: we went straight from the high-level IR to assembly. Adding an expansion stage had been my first task on being hired. Had we been using a language that supported polymorphism natively, that wouldn't have been a problem: the code would have been compiled anyway, and the errors would have been spotted; a smart enough compiler would have pointed out that the function was never called. But because we were using a two-stage generate-and-compile build process, we were in trouble. Because there was no expansion stage in my colleague's branch, the broken code was never pasted into a C file, and hence never compiled. My colleague's code was, in fact, full of compile-time errors, but appeared not to be, because the C compiler never got a look at it.

And then I took a closer look at the screens-full of error messages, and saw that I could have worked that out right at the beginning: subsequent error messages referred to OUTFILE, and the output file isn't even open at the expansion stage. Clearly, the code had originally been written to run in the emit phase (when both state and OUTFILE were live), and he'd got half-way through converting it to run at expansion-time before having to abandon it.

Lessons learned:
  1. In a generated-code scenario, do not assume that any particular snippet has been compiled successfully just because the whole codebase builds without errors.
  2. Prefer languages with decent native abstraction mechanisms to code generators.
  3. At least skim the subsequent error messages before dismissing them and working on the first bug: they may provide useful context.
  4. Communication: if I'd enquired more carefully about the condition of the code to be merged I could have saved myself a lot of time.
  5. Bear in mind the possibility that you might not be the guilty one.
  6. Treat ESR's pronouncements with even greater caution in future. Same goes for Kenny Tilton, or any other Great Prognosticator.
Any more?

Edit: two more:
  7. If, despite (2), you find yourself writing a snippet-pasting code generator, give serious thought to providing "this snippet is unused" warnings.
  8. Learn to spot when you're engaged in fruitless activity and need to step back and form a better plan. In my case, the time crawling through diffs was wasted, and I probably could have solved the problem much quicker if I'd rolled up my sleeves and tried to understand the actual code.
Thanks to gareth_rees and jerf.
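The "this snippet is unused" warning suggested above need not be complicated. Here is a minimal sketch, assuming a generator that knows which hooks exist in the current build; the function and the hook names are hypothetical, not taken from the framework.

```python
import warnings


def paste_snippets(snippets, active_hooks):
    """Paste snippets into the hooks the build actually uses.

    snippets maps a hook name to user-supplied code; active_hooks is the
    set of hooks that exist in this build. A snippet whose hook is not
    active is never pasted, so the C compiler never sees it: warn loudly.
    """
    pasted = {}
    for hook, code in snippets.items():
        if hook in active_hooks:
            pasted[hook] = code
        else:
            warnings.warn(
                "snippet for hook {!r} is unused and will not be compiled".format(hook)
            )
    return pasted
```

In the story above, a build with no expansion stage would have warned about the expansion snippet on every run, instead of silently skipping it.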