pozorvlak: (Default)
Monday, August 19th, 2013 09:10 pm
If I ever design a first course in compilers, I'll do it backwards: we'll start with code generation, move on to static analysis, then parse a token stream and finally lex source text. As we go along, students will be required to implement a small compiler for a simple language, using the techniques they've just learned. The (excellent) Stanford/Coursera compilers course does something similar, but they proceed in the opposite direction, following the dataflow in the final compiler: first they cover lexing, then parsing, then semantic analysis, then codegen. The first Edinburgh compilers course follows roughly the same plan of lectures, and I expect many other universities' courses do too.

I think a backwards course would work better for two reasons:
  1. Halfway through the Stanford course, you have a program that can convert source text into an intermediate representation with which you can't do very much. Halfway through the backwards course, you'd have a compiler for an unfriendly source language: you could write programs directly in whatever level of IR you'd got to (I'm assuming a sensible implementation language that doesn't make entering data literals too hard), and compile them using code you'd written to native code. I think that would be pretty motivating.
  2. When I did the Stanford course, all the really good learning experiences were in the back end. Writing a Yacc parser was a fiddly but largely straightforward experience; writing the code generator taught me loads about how your HLL code is actually executed and how runtime systems work. I also learned some less obvious things like the value of formal language specifications¹. Most CS students won't grow up to be compiler hackers, but they will find it invaluable to have a good idea of what production compilers do to their HLL code; it'll be much more useful than knowing about all the different types of parsing techniques, anyway². Students will drop out halfway through the course, and even those who make it all the way through will be tired and stressed by the end and will thus take in less than they did at the beginning: this argues for front-loading the important material.
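As a concrete (and entirely hypothetical) illustration of what "writing programs directly in the IR" might look like at the halfway point, here's a sketch in Python: expression trees entered as tuples, compiled to stack-machine code and executed. A real course would target native code, but the shape is the same, and the IR is perfectly writable by hand.

```python
# Hypothetical sketch: a tiny expression IR entered directly as Python
# tuples, compiled to stack-machine instructions. This stands in for the
# "unfriendly source language" you'd have halfway through the course.

def compile_ir(node):
    """Compile ('add', l, r) / ('mul', l, r) / ('const', n) trees
    to a list of stack-machine instructions."""
    op, *args = node
    if op == 'const':
        return [('push', args[0])]
    left, right = args
    return compile_ir(left) + compile_ir(right) + [(op,)]

def run(code):
    """Execute the stack code; this plays the role of the CPU."""
    stack = []
    for instr in code:
        if instr[0] == 'push':
            stack.append(instr[1])
        elif instr[0] == 'add':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif instr[0] == 'mul':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[0]

# (2 + 3) * 4, written directly in the IR:
prog = ('mul', ('add', ('const', 2), ('const', 3)), ('const', 4))
```

A sensible implementation language keeps entering those tuple literals painless, which is the assumption the argument above relies on.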
What am I missing?

¹ I learned this the hard way. I largely ignored the formal part of the spec when writing my code generator, relying instead on the informal description; then towards the end of the allocated period I took a closer look at it and realised that it provided simple answers to all the thorny issues I'd been struggling with.
² The answer to "how should I write this parser?" in an industrial context is usually either "with a parser generator" or "recursive descent". LALR parsers such as those produced by Yacc are a pain to debug if you don't understand the underlying theory, true, but that's IMHO an argument for using a parser generator based on some other algorithm, most of which are more forgiving.
Wednesday, October 19th, 2011 12:08 pm
Previously on posts tagged with 'angst' (mostly friendslocked): our hero became depressed, sought help, and is now (mostly) feeling better. Now read on...

John Walker, in his excellent book The Hacker's Diet, distinguishes two approaches to tackling problems, which he calls the Manager's approach and the Engineer's approach. Management is about monitoring and ameliorating chronic problems to keep their symptoms to within an acceptable level; engineering is about solving problems outright. Most difficult problems, he claims, must be tackled using a combination of both approaches.

My problem, which might be summarised as "I became depressed because I suck at my job", has so far responded well to a management approach: leaning on friends, attending CBT workshops, prioritizing exercise, talking to my supervisor about my problems. I'm very grateful to all of you for your support, and to the Glasgow NHS mental health team. Now that I'm feeling a bit more spoonful, it's time to apply the engineer's approach, and suck less at my job.

[A perhaps more honourable alternative would be to find another job at which I wouldn't suck, but that would be fraught with risk, and my current job has much to recommend it. Besides, I'm not sure that such a job exists.]

I have three basic problems:
  1. I'm not a good enough programmer;
  2. I don't know enough about the problem domain;
  3. I am now effectively an experimental scientist, but I don't know anything about experiment design.


On the first point, I'm reasonably happy with the readability of my code, but I'm unhappy with its testability and correctness, and I'm very unhappy with the time it takes me to produce it. I'm frequently struck by analysis paralysis. I've spent most of my programming career working with high-level languages, so I'm not very good at lower-level programming. I think the only solution to this problem is to write (hundreds of) thousands of lines of code, at as low a level as possible.

On the second point: before starting this job, I'd previously worked at a compiler vendor and at a web startup which did some machine-learning; back in high school, I'd done some assembly programming for an embedded system. I'd also done a bit of background reading on compiler theory. It turned out that this was insufficient preparation for a job using machine-learning to improve the state of the art in compilers targeting embedded systems. Astonishing, I know.

There used to be a cute slide in the Edinburgh University first compilers course:

In the front end, everything's polynomial. In the back end, everything's NP-complete. In the middle-end, everything's uncomputable.

Now, that's true for compiler theory, and it explains why compiler research is still a going concern after sixty years, but it doesn't explain why day-to-day hacking on compilers is hard. For me at least, that's because hacking on compilers is systems programming. You need to know, at least to the level of "understand the basic principles of and can look up the details if needed", about things like addressing modes, instruction set architecture design, executable layout, and calling conventions. Forget the fat guys who know C++ who apparently keep Silicon Valley running; I work with guys who know Verilog and the GNU ld control language.

[Betcha didn't know that the GNU linker embeds a Turing-equivalent programming language :-)]

Now, none of this stuff is especially difficult as far as I can see. But it's a lot of background knowledge to have, and if you lack it then you'll be constantly hitting problems that you lack the mental toolkit to address. So here's what I'm doing about it:
  • To address the patchiness of my knowledge about machine-learning, I went to the library and got out a couple of ML textbooks, one of which I've now read. I've also signed up to the free online machine-learning class from Stanford, and, while I was at it, the Introduction to AI class too.
  • To address my ignorance of architectural issues, I'm auditing the computer design undergrad course at Edinburgh; when I've finished that, I'll audit the follow-on course in computer architecture. So far it's mostly been the sort of boolean algebra and digital electronics that I learned at my mother's knee, but there's a lot of stuff I don't know later on in the course.
  • Linkers are a particular problem for me; I think linker errors are the Universe's way of telling me what it's like for a non-techie who thinks they need to press the "any" key. zeecat kindly lent me a copy of Levine's book Linkers and Loaders, which I am now reading; as an unexpected bonus, one of the early chapters is a 30,000ft tour of computer architecture. To my delight, the book walks you through the construction of a simple linker in Perl.
  • To address my lack of C and assembly language experience, to solidify my understanding of basic compiler theory, and to give me a testbed for implementing some optimisation algorithms later on, I started writing a simple compiler in C. Currently it accepts a simple term/factor/expression grammar and outputs x86 assembly; the plan is to extend it so it accepts a subset of C. "Compiler" is currently a bit of a joke; it is technically a compiler, but one that does really aggressive constant folding :-) I haven't hacked on this for a while, because work got in the way, but intend to pick it up again soon.
  • To address my ignorance of C minutiae, I started reading through the comp.lang.c FAQ, at Aaron Crane's suggestion. This is also stalled, as is a project to convert said FAQ to a collection of Mnemosyne flashcards.
  • The world seems to be standardising on Hadoop as a clustering solution; I should try that out at some point.
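For what it's worth, the term/factor/expression compiler described in that list fits in a few dozen lines; here's a sketch of the same idea in Python (the real one is in C), which parses the grammar and, true to the joke, "optimises" by folding the entire expression to a constant before emitting x86-flavoured assembly.

```python
# A Python sketch of the toy compiler described above: a recursive-descent
# parser for a term/factor/expression grammar whose only "optimisation"
# is really aggressive constant folding. Since the language contains
# nothing but integer literals, every program folds to a single number.
import re

def tokenize(src):
    return re.findall(r'\d+|[-+*/()]', src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):          # expression := term (('+'|'-') term)*
        value = self.term()
        while self.peek() in ('+', '-'):
            if self.next() == '+':
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                # term := factor (('*'|'/') factor)*
        value = self.factor()
        while self.peek() in ('*', '/'):
            if self.next() == '*':
                value *= self.factor()
            else:
                value //= self.factor()
        return value

    def factor(self):              # factor := NUMBER | '(' expression ')'
        tok = self.next()
        if tok == '(':
            value = self.expression()
            self.next()            # consume ')'
            return value
        return int(tok)

def compile_expr(src):
    """'Compile' src: fold it to a constant, emit x86-ish assembly."""
    value = Parser(tokenize(src)).expression()
    return f"movl ${value}, %eax\nret"
```

Extending it towards a subset of C mostly means adding productions (statements, variables) and giving up on folding everything, at which point you need real code generation.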

So, anything else I should be doing? Anything in that list I should not be doing?

I have no real idea how to start addressing my third problem, namely my ignorance of experimental design. Go to the library and get another book out, I guess. Anyone got any recommendations?
Monday, May 30th, 2011 06:44 pm
I have just written the following commit message:
"Branch over unconditional jump" hack for arbitrary-length brcc.
    
 - brcc (branch and compare) instructions can have static branch-prediction
   hints, but can only jump a limited distance.
 - Calculating the distance of a jump at expand-time is hard.
 - So instead of emitting just a brcc instruction, we emit an unconditional
   jump to the same place and a branch over it.
 - I added a simple counter to emit/state to number the labels thus introduced.
 - With this commit, I forfeit all right to criticise the hackiness of anyone
   else's code.
 - Though I doubt that will stop me.
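For anyone who hasn't met the trick, the idea in that commit can be sketched like this (the mnemonics and label scheme here are invented, not our actual ISA): a short-range conditional branch with the condition inverted hops over an unconditional jump, and the unconditional jump can reach any distance.

```python
# Sketch (not the actual compiler's code) of the "branch over
# unconditional jump" hack: invert the condition, branch a short
# distance over a jmp, and let the jmp cover the long distance.
_counter = 0  # stands in for the label counter added to emit/state

INVERT = {'eq': 'ne', 'ne': 'eq', 'lt': 'ge', 'ge': 'lt'}

def emit_brcc(cond, target):
    """Emit a long-range conditional branch as inverted brcc + jmp."""
    global _counter
    _counter += 1
    skip = f".Lskip{_counter}"
    return [
        f"brcc.{INVERT[cond]} {skip}",  # short branch over the jump
        f"jmp {target}",                # unconditional: any distance
        f"{skip}:",
    ]
```

The cost is that you lose the static branch-prediction hint on the rewritten branch, which is presumably part of what made the original so hacky.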
Tuesday, April 5th, 2011 01:06 pm
Here are some bits of code I've released recently:

UK mountain weather forecast aggregator


The Mountain Weather Information Service do an excellent job, providing weather forecasts for all the mountain areas in the UK - most weather forecast sites only give forecasts for inhabited areas, and the weather at sea level often differs in interesting ways from the nearby weather at 1000m. However, their site's usability could be better. They assume that you're already in an area and want to know what the weather's going to be like for the next couple of days¹, but it's more normal for me to know what day I'm free to go hillwalking, and to want to know where I'll get the best weather.

So I decided to write a screen-scraper to gather and collate the information for me. I'd heard great things about Python's BeautifulSoup library and its ability to make sense of non-compliant, real-world HTML, so this seemed like a great excuse to try it out; unfortunately, BeautifulSoup completely failed me, only returning the head of the relevant pages. Fortunately, Afternoon and ciphergoth were on hand with Python advice; they told me that BeautifulSoup is now largely deprecated in favour of lxml. This proved much better: now all I needed to handle was the (lack of) structure of the pages...

There's a live copy running at mwis.assyrian.org.uk; you can download the source code from GitHub. There are a bunch of improvements that could be made to this code:
  1. The speed isn't too bad, but it could be faster. An obvious improvement is to stop doing eight HTTP GETs in series!
  2. There's no API.
  3. Your geographic options are limited: either the whole UK, or England & Wales, or Scotland. Here in the Central Belt, I'm closer to the English Lake District than I am to the North-West Highlands.
  4. The page design is fugly (or, to be charitable, severely functional). Any design experts wanna suggest improvements? Readability on mobile devices is a major bonus.
  5. MWIS is dependent on sponsorship for their website-running costs, and for the English and Welsh forecasts. I don't want to take bread out of their mouths, so I should probably add yet more heuristics to the scraper to pull out the "please visit our sponsors" links.
  6. Currently all HTML is generated with raw print statements. It would be nicer to use a templating engine of some sort.
A possible solution to (1) and (2) is to move the scraper itself to ScraperWiki, and replace my existing CGI script with some JavaScript that pulls JSON from ScraperWiki and renders it. Anyway, if anyone feels like implementing any of these features for me, I'll gratefully accept your patches :-)
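For the curious, the core of the scraping job looks something like this. The real scraper uses lxml; this sketch uses the stdlib HTMLParser instead so it's self-contained, and the tag and class names are invented rather than MWIS's actual markup.

```python
# Illustrative sketch of the scraping step (invented markup, not MWIS's
# real pages): walk the HTML and collect the text of forecast blocks.
from html.parser import HTMLParser

class ForecastParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_forecast = False
        self.forecasts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'div' and ('class', 'forecast') in attrs:
            self.in_forecast = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_forecast = False

    def handle_data(self, data):
        if self.in_forecast and data.strip():
            self.forecasts.append(data.strip())

def scrape(html):
    p = ForecastParser()
    p.feed(html)
    return p.forecasts
```

The hard part, as noted above, isn't the parsing machinery but coping with pages that have no reliable structure to hang selectors on.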

git-deploy


While I was developing the MWIS scraper, I found it was annoying to push to GitHub and then ssh to my host (or rather, switch to a window in which I'd already ssh'ed to my host) and pull my changes. So I wrote the World's Simplest Deployment Script. I've been finding it really useful, and you're welcome to use it yourself.

[In darcs, of course, one would just push to two different repos. Git doesn't really like you pushing to non-bare repositories, so this isn't such a great idea. If you want to know what an industrial-strength deployment setup would look like, I suggest you read this post about the continuous deployment setup at IMVU.]
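The real git-deploy is a trivial shell script, but the idea fits in a few lines of Python too. The host and path below are invented, and the dry_run flag just returns the commands instead of running them:

```python
# Sketch of the "push, then pull on the server" idea (not the actual
# git-deploy script; example.org and ~/mwis are made-up placeholders).
import subprocess

def deploy(remote_host="example.org", remote_path="~/mwis", dry_run=False):
    commands = [
        ["git", "push", "origin", "master"],
        ["ssh", remote_host, f"cd {remote_path} && git pull"],
    ]
    if dry_run:
        return [" ".join(cmd) for cmd in commands]
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop on the first failure
```

The same caveat applies: pulling into a checked-out working tree is fine, but pushing into one is what git objects to.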

bfcc - BrainF*** to C compiler


I was on the train, looking through the examples/ directory in the LLVM source tree, and noticed the example BrainF*** front-end. For some reason, it hadn't previously occurred to me quite how simple it would be to write a BF compiler. So I started coding, and had one working by the time I got back to Glasgow (which may sound a long time, but I was on my way back from an Edinburgh.pm meeting and was thus somewhat drunk). You can get it here. aaroncrane suggested a neat hack to provide O(1) arithmetic under certain circumstances: I should add this, so I can claim to have written an optimising BF compiler :-)
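To show quite how simple: here's a sketch of a BF-to-C translator in Python (not the released bfcc, and with none of the promised optimisation). Each BF instruction maps to one line of C, and the bracket pair maps straight onto a C while loop, so you don't even need to track jump targets yourself.

```python
# Minimal BrainF***-to-C translation, sketched in Python. Every BF
# instruction becomes one C statement; '[' and ']' become a while loop.
OPS = {
    '>': '++p;',
    '<': '--p;',
    '+': '++*p;',
    '-': '--*p;',
    '.': 'putchar(*p);',
    ',': '*p = getchar();',
    '[': 'while (*p) {',
    ']': '}',
}

def bf_to_c(src):
    body = '\n'.join('    ' + OPS[c] for c in src if c in OPS)
    return ('#include <stdio.h>\n'
            'int main(void) {\n'
            '    static char tape[30000];\n'
            '    char *p = tape;\n'
            f'{body}\n'
            '    return 0;\n'
            '}\n')
```

Pipe the output through a C compiler and you have a (wildly unoptimised) native-code BF implementation.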



All of these programs are open source: share and enjoy. They're all pretty much trivial, but I reckon that creating and releasing something trivial is a great improvement over creating or releasing nothing.

¹ Great Britain is a small, mountainous island on the edge of the North Atlantic. Long-term weather forecasting is a lost cause here.
Wednesday, March 30th, 2011 10:57 pm
In The Art of Unix Programming, Eric Raymond lists among his basics of the Unix philosophy the "Rule of Generation":
14. Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
He goes into this idea in more detail in chapter 9 of the same book.

I used to believe this was a good idea, and in many situations (here's a great example) it is. But my current work project, which makes heavy use of code generators and custom minilanguages, has been a crash course (sometimes literally) in the downsides. Here's the latest example.

I've recently been merging in some code a colleague wrote about a year ago, just before I started. As you'd expect, with a year's drift this was a non-trivial exercise, but I eventually got all the diffs applied in (I thought) the right places. Protip: if forced to interact with a Subversion repository, use git as your client. It makes your life so much less unpleasant. Anyway, I finished the textual part of the merge, and compiled the code.

Screens-full of error messages. Oh well, that's not so unexpected.

I'm a big fan of Tilton's Law: "solve the first problem". The chances are good that the subsequent problems are just cascading damage from the first problem; no sense in worrying about them until you've fixed that one. Accordingly, I looked only at the first message: "The variable 'state' has not been declared at line 273".

Hang on...

Git checkout colleagues-branch. Make. No errors.

Git checkout merge-branch. Make. Screens-full of errors.

Git checkout colleagues-branch. Grep for a declaration of "state". None visible.

Clearly, there was some piece of voodoo that I'd failed to merge correctly.

I spent days looking through diffs for something, anything, that I'd failed to merge properly that might be relevant to this problem. I failed.

I then spent some serious quality time with the code-generation framework's thousand-page manual, looking for some implicit-declaration mechanism that might explain why "state" was visible in my colleague's branch, but not in mine. I failed.

Finally, I did what I probably should have done in the first place, and took a closer look at the generated code. The error messages that I was seeing referred to the DSL source code rather than the generated C code, because the code-generator emitted #line directives to reset the C compiler's idea of the current file and line; I could therefore find the relevant section of generated code by grepping for the name of the buggy source file in the gen/ directory.

The framework uses code generators for all sorts of things (my favourite generator being the shell script that interprets a DSL to build a Makefile which is used to build another Makefile), but this particular one was used to implement a form of polymorphism: the C snippet you provide is pasted into a honking great switch statement, which switches on some kind of type tag.
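Schematically, that snippet-pasting generator looks something like this (all names invented). Note that a snippet whose type tag never appears in the input is simply never pasted, and so never seen by the C compiler, which is exactly the trap this story turns on.

```python
# Sketch (invented names, not the real framework) of the snippet-pasting
# pattern: user-supplied C snippets are pasted into one big switch on a
# type tag, with #line directives so compiler errors point back at the
# DSL source rather than at the generated file.
def generate_dispatch(snippets):
    """snippets: list of (type_tag, dsl_file, dsl_line, c_code) tuples."""
    out = ['switch (type_tag) {']
    for tag, dsl_file, dsl_line, code in snippets:
        out.append(f'case {tag}:')
        out.append(f'#line {dsl_line} "{dsl_file}"')  # error messages
        out.append(f'    {code}')                     # pasted verbatim
        out.append('    break;')
    out.append('}')
    return '\n'.join(out)
```

The #line directives are also why the error messages in this story referred to the DSL source file rather than anything under gen/.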

I found the relevant bit of generated code, and searched back to the beginning of the function. Yep, "state" was indeed undeclared in that function. And the code generator had left a helpful comment to tell me which hook I needed to use to declare variables or do other setup at the beginning of the function. So that was the thing I'd failed to merge properly!

Git checkout colleagues-branch. Grep for the hook. No results.

And then it hit me.

Like all nontrivial compilers, ours works by making several transformation passes over the code. The first pass parses your textual source-code and spits out a machine-independent tree-structured intermediate representation (IR). There then follow various optimization and analysis passes, which take in IR and return IR. Then the IR is expanded into a machine-specific low-level IR, and finally the low-level IR is emitted as assembly language.
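That pipeline, reduced to Python stubs just to show the shape: the middle passes are functions from IR to IR, so adding a pass (like the expansion stage mentioned below) is a matter of slotting another function into the chain.

```python
# The pass structure described above, as stubs (nothing like the real
# compiler): front end, IR -> IR middle passes, expansion, then emission.
def parse(source):                 # text -> machine-independent IR
    return ('ir', source)

def constant_fold(ir):             # an optimisation pass: IR in, IR out
    return ir

def expand(ir):                    # high-level IR -> machine-specific IR
    return ('low_ir', ir)

def emit(low_ir):                  # low-level IR -> assembly text
    return f"; assembly for {low_ir!r}"

def compile_program(source, passes=(constant_fold,)):
    ir = parse(source)
    for p in passes:               # optimisation/analysis passes
        ir = p(ir)
    return emit(expand(ir))
```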

The code that was refusing to compile was part of the expansion stage. But at the time that code was written, the expansion stage didn't exist: we went straight from the high-level IR to assembly. Adding an expansion stage had been my first task on being hired. Had we been using a language that supported polymorphism natively, that wouldn't have been a problem: the code would have been compiled anyway, and the errors would have been spotted; a smart enough compiler would have pointed out that the function was never called. But because we were using a two-stage generate-and-compile build process, we were in trouble. Because there was no expansion stage in my colleague's branch, the broken code was never pasted into a C file, and hence never compiled. My colleague's code was, in fact, full of compile-time errors, but appeared not to be, because the C compiler never got a look at it.

And then I took a closer look at the screens-full of error messages, and saw that I could have worked that out right at the beginning: subsequent error messages referred to OUTFILE, and the output file isn't even open at the expansion stage. Clearly, the code had originally been written to run in the emit phase (when both state and OUTFILE were live), and my colleague had got half-way through converting it to run at expansion-time before having to abandon it.

Lessons learned:
  1. In a generated-code scenario, do not assume that any particular snippet has been compiled successfully just because the whole codebase builds without errors.
  2. Prefer languages with decent native abstraction mechanisms to code generators.
  3. At least skim the subsequent error messages before dismissing them and working on the first bug: they may provide useful context.
  4. Communication: if I'd enquired more carefully about the condition of the code to be merged I could have saved myself a lot of time.
  5. Bear in mind the possibility that you might not be the guilty one.
  6. Treat ESR's pronouncements with even greater caution in future. Same goes for Kenny Tilton, or any other Great Prognosticator.
Any more?

Edit: two more:
  1. If, despite (2), you find yourself writing a snippet-pasting code generator, give serious thought to providing "this snippet is unused" warnings.
  2. Learn to spot when you're engaged in fruitless activity and need to step back and form a better plan. In my case, the time crawling through diffs was wasted, and I probably could have solved the problem much quicker if I'd rolled up my sleeves and tried to understand the actual code.
Thanks to gareth_rees and jerf.
Thursday, March 4th, 2010 01:56 pm
As part of my preparation for the PASTA job interview, I spent some time refreshing my knowledge of basic compiler theory, and learning about some of the bits that I'd never really got my head around. Here, in the hope that someone else will find them useful, are the notes I took about parsing theory. They're loosely based on the Edinburgh lecture course on Compiling Techniques. You'll need to be familiar with the notions of formal language and formal grammar (links go to Sam Hughes' highly approachable introductions to the subjects); you'll also need to vaguely know what a regular expression ("regex") is, and to know that every classical regex (ie, one without backreferences) can be mechanically converted to a Deterministic Finite Automaton (DFA) that accepts exactly the same set of strings. All code is for illustrative purposes only and probably won't even compile.
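As a tiny illustration of that last fact, here's the DFA corresponding to the classical regex a(b|c)*, hand-built in Python: once you have the automaton, acceptance is a single pass over the input with no backtracking.

```python
# The classical regex a(b|c)* as a hand-built DFA: transitions are a
# lookup table, and acceptance is one loop over the input string.
DFA = {
    'transitions': {
        (0, 'a'): 1,   # initial 'a' moves to the accepting state
        (1, 'b'): 1,   # any number of 'b's and 'c's stay there
        (1, 'c'): 1,
    },
    'start': 0,
    'accepting': {1},
}

def accepts(dfa, s):
    state = dfa['start']
    for ch in s:
        key = (state, ch)
        if key not in dfa['transitions']:
            return False           # implicit dead state: reject
        state = dfa['transitions'][key]
    return state in dfa['accepting']
```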

Now read on...

Comments? Corrections? Clarifications? Questions? You know what to do.