Several of my fellow PhD students here have to do a substantial amount of programming as part of their PhDs: unfortunately, most of them haven't done any programming before. The usual procedure, alas, is to hand them a Fortran compiler and tell them to get on with it, often hacking on a large mass of code written by someone else who was "taught" the same way. I try to do what I can to help, but there's a limit to the amount of time I can devote to someone else's project (and a limit to the amount of time they'd want me to devote, I suspect). But still, I see some horror stories: yesterday, for instance, an office-mate finally tracked down a bug due to a magic number which had been bothering her for longer than she cared to say, and which had been seriously undermining her confidence in her actual model. Not using magic numbers is basic programming practice, but nobody had told her this.
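For anyone wondering what a magic number is: it's a bare numeric literal scattered through the code instead of a named constant. A minimal illustration (the physics and names here are invented for the example):

```python
# Bad: 9.81 is a "magic number". If it's repeated in ten places and
# mistyped in one of them, you get exactly the kind of subtle,
# confidence-destroying bug described above.
def fall_time_bad(height):
    return (2 * height / 9.81) ** 0.5

# Better: name the constant once, near the top of the module.
GRAVITY = 9.81  # m/s^2, standard gravity

def fall_time(height):
    """Time in seconds for an object to fall `height` metres from rest."""
    return (2 * height / GRAVITY) ** 0.5
```

Now there is exactly one place to check, and one place to change.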
So I've been thinking about an introductory course on programming aimed at maths/science grad students. The emphasis would be on writing maintainable code and modern programming practices: modularity, use of libraries wherever possible, test-first programming, use of debuggers, source control systems and profilers, optimising later (if you have to at all), use of high-level languages, documentation, and so on. My real aim would be to break the cycle of abuse whereby each new generation of grad students is told to write 1000-line-to-a-function, opaque, untested, copy-and-paste Fortran by their supervisors, because it was good enough for their supervisors, and on and on...
Here's a first cut at a course catalogue entry for this fantasy course: I'd be very interested to hear everyone's comments.
Practical Computer Programming for Scientists
The use of computers is now widespread in mathematics and science, but all too few scientists are aware of the techniques that are standard in industry for creating correct, maintainable code. This course is a ground-up introduction to computer programming emphasising code clarity and maintainability, and the use of standard tools like debuggers, profilers, test frameworks and version control systems.
The language of instruction will be Python, a modern multi-paradigm language famed for its simplicity, but most of the lessons of the course (and all of the important ones) will transfer easily to any reasonably mainstream language. Indeed, if you want to learn to program in language X it will almost certainly be faster to learn to program in Python and then learn language X. In addition, Python is a powerful and useful general-purpose programming language in its own right. The differences between Python and other languages will be explained at appropriate points throughout the course.
The course will consist of $weeks_in_term two-hour lab sessions, as follows:
Lab 1: Basics
Hello, World; creating and running Python programs; input/output; variables; loops; conditionals; use and creation of functions; recursion; functions are data; interaction with the filesystem.
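("Functions are data" is less mysterious than it sounds; something like this would do as a first taste - all the names here are invented for the example:)

```python
def square(x):
    return x * x

def apply_twice(f, x):
    # Functions can be passed around like any other value...
    return f(f(x))

# ...and stored in data structures, looked up by name, and so on.
operations = {"square": square, "negate": lambda x: -x}

result = apply_twice(operations["square"], 3)  # square(square(3)) == 81
```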
Lab 2: Structured data
Strings, integers and floats; lists; dictionaries; trees; graphs; recursion on data-structures; objects and classes; reflection; pickling (serialisation).
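(The "recursion on data-structures" bit might look something like this: a tree represented as nested lists, summed recursively. The representation is just one possible choice for the lab:)

```python
# A leaf is a number; an internal node is a list of subtrees.
def tree_sum(t):
    if isinstance(t, list):
        # Recursive case: sum the subtrees.
        return sum(tree_sum(child) for child in t)
    # Base case: a leaf.
    return t

tree = [1, [2, 3], [[4], 5]]  # tree_sum(tree) == 15
```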
Lab 3: Modules
Using standard modules; creating modules of your own; documenting your modules; scope; some useful modules from the standard library; finding and installing new modules from the Web; interfacing to other languages.
Lab 4: Testing and debugging
PyUnit and friends; unit testing versus functional testing; white-box versus black-box testing; what to test; test-first programming; testing versus proofs of correctness; coverage analysis; debugging with print statements; use of the debugger.
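(PyUnit lives in the standard library as the unittest module; a first test case might look like this - mean is just an invented example function:)

```python
import unittest

def mean(xs):
    """Arithmetic mean, raising on empty input rather than dividing by zero."""
    if not xs:
        raise ValueError("mean of empty sequence")
    return sum(xs) / float(len(xs))

class TestMean(unittest.TestCase):
    def test_simple(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_rejected(self):
        # Testing the error path is as important as testing the happy path.
        self.assertRaises(ValueError, mean, [])
```

Run with `python -m unittest <module>`; failures are collected and reported together.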
Lab 5: Version control
Basic concepts of version control; use of (cvs|subversion|darcs|whatever we have available); branching and merging; regression testing.
Lab 6: Text munging
String manipulation; globs; regular expressions; parser generators and Backus-Naur Form; parsing and manipulating XML; analysing data in textual form; code that writes code.
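(For instance, a regular expression that pulls name = value pairs out of a small config file - a hypothetical but typical scientist's chore:)

```python
import re

# Match lines like "alpha = 0.05"; comments and blank lines won't match.
line_re = re.compile(r"^\s*(\w+)\s*=\s*([\d.eE+-]+)\s*$")

def parse_config(text):
    params = {}
    for line in text.splitlines():
        m = line_re.match(line)
        if m:
            params[m.group(1)] = float(m.group(2))
    return params

config = "alpha = 0.05\nbeta=2.5\n# a comment\n"
# parse_config(config) == {"alpha": 0.05, "beta": 2.5}
```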
Lab 7: GUI programming and event-driven programming
Writing Graphical User Interfaces with [Python folks: what's the best GUI toolkit to use for this? There seem to be so many...]; event-driven programming for GUIs; event-driven programming in other contexts, including SAX.
Lab 8: Numeric and array processing
Array programming; SciPy and its capabilities; limitations of floating-point arithmetic; IEEE special values.
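(The floating-point pitfalls deserve a concrete demonstration, since they bite scientists constantly:)

```python
# Floating-point arithmetic is binary and finite-precision: 0.1 has no
# exact binary representation, so familiar decimal identities fail.
a = 0.1 + 0.2
print(a == 0.3)               # False!
print(abs(a - 0.3) < 1e-12)   # True: compare floats with a tolerance

# IEEE special values
inf = float("inf")
nan = float("nan")
print(inf + 1 == inf)         # True: infinity absorbs finite values
print(nan == nan)             # False: NaN compares unequal even to itself
```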
Lab 9: Optimisation
When to optimise; use of profilers; basic complexity theory.
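(A sketch of what a profiler session might look like, using the standard cProfile module on an invented hot loop:)

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Invented hot spot: in real code this might be a naive inner loop
    # that a library call could replace outright.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# Show the five most expensive calls: measure first, optimise second.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```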
Lab 10: Round-up
Summary of good programming practice; anything we missed.
Notes:
- Python really feels like the obvious choice for this. I want something high-level, to teach by contrast the unnecessary pain of using low-level languages. Java, Ruby and (especially) Perl have syntax that's too complex: I want to spend the absolute minimum time on syntax and the absolute maximum on concepts, and Python's the most syntactically simple mainstream language that I know of. Java is also too tightly-wedded to the OO paradigm. They'd probably have a better shot at Ultimate 1337th if I started them off on Haskell, or Scheme, or C, or J, but that's not the aim: I want to get them to the stage where they can write useful code in support of their actual work without shooting themselves in the foot more than necessary. Hence we need a language that's reasonably similar to (while still obviously better than) what they'll actually be using (probably C/C++, or Matlab, or (spit!) Fortran). And Python has an excellent standard library, which would be a great help in teaching the lesson that you should rely on your libraries wherever possible.
- It would be nice to have some overall goal for the course: write a small Lisp interpreter, or a game, or a raytracer, or something.
- There's no multiprogramming in there. It's not something I know a lot about, but it's probably pretty useful for people who'll end up doing hardcore numeric stuff. Where should it go? What should be in the multiprogramming lecture?
- There's nothing specifically about Web programming in there. Or databases. Databases would be a good one for another lab if $weeks_in_term > 10. Or maybe it should kick something out. But what? GUI programming? Text munging?
- Lab 1 might be a bit over-full, and Labs 5 and 9 might be a bit short.
- I think Lab 3 (modules) should occur after we've used a couple of standard modules (here, sys and pickle). But the precise ordering of Labs 2-5 is tricky. Maybe I should move use and creation of functions to the Modules lab, and re-name it (The Religion of) Modularity. And they're certainly going to have done some debugging already by the time they get to Lab 4.
- What "useful modules" should go in the Modules lecture? Or should I spread the standard modules throughout the course?
- Where should "scope" go?
- Lab 6 would really be a more general lab on data-munging, but examples would mostly be textual - sequences of DNA bases, Unix config files, and so on. Maybe I could get them to write a symbolic differentiator or something.
Tags:
no subject
I think he's said before that mathematicians who've done absolutely no coding before (like, um, me) would be better off going straight to functional languages, as they don't have any object-oriented preconceptions in the way, and it fits better with a mathematical intuition. Or something. Will see what he actually says :-)
no subject
no subject
Python seems a good fit to these to me (but then I'm neither a CS or a mathmo and FP scares me).
I would be tempted to say that Labs 1-5 and possibly 9 & 10 form a "core course", with 6-8 being more specialist (not everyone will need to program a GUI). I would also be tempted to add, towards the end (Lab 6 at the earliest), a lab or labs where students try using the things they've learned in the language they'll actually be using. Which of course encourages them to use Python for an easy ride ;) This could be joined with Lab 9, with tips on how different languages need different optimisations (e.g. Matlab is slow as buggery at loops, but fast as a quite fast thing if you re-write it to be array manipulation).
Oh, and regexes are vital - if only so that they know it's not line noise if they see it in someone else's code.
no subject
That's an interesting point. I don't have a reference to hand, but someone was lamenting a while back that it's quite difficult for someone to just start out with a simple program nowadays. Back in my youth, my first program (in BBC BASIC) was something like this:
10 PRINT "JOHN IS COOL"
20 GOTO 10
So, two lines, entered straight at the command prompt, then I could run it. I've gradually worked up from that to more complex environments, but it would be unfair to expect a young kid to get to grips with something like the Visual Studio IDE straightaway, and the time spent on that would get in the way of actual coding.
There's a book I'm reading at the moment (Programming Visual Basic 2005: The Language) which favours console applications (a bit like DOS programs, but designed to run at a Windows command prompt), because that keeps things simple, and you can create a "Hello World" app quite easily with just a text editor and a command line compiler.
no subject
no subject
no subject
no subject
no subject
Secondly, I concur with your suspicion that multiprogramming needs a mention. It might fit OK with optimisation, or it might have to get bumped to lecture 10 I guess. I'm not sure you need to say that much, just that the easiest way to take advantage of multiple cores, CPUs or machines is to split your task (ideally just your data) up into a number of independent chunks and launch multiple versions of your program. Mention that sometimes batch processing systems are provided to help you, and maybe explain a bit about them. Possibly talk a bit about IO-bound and CPU-bound processes? And obviously point out that some tasks are more suitable for this kind of parallel processing than others. Oh, and mention that as a language with an interpreter, a Python script is a great way to manage the running of multiple tasks. :-)
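Something like this sketch of the chunk-and-farm-out pattern, say, using Python's multiprocessing module (analyse is a stand-in for whatever the real per-chunk computation is):

```python
from multiprocessing import Pool

def analyse(chunk):
    # Stand-in for an expensive computation that depends only on its chunk.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000))
    # Split the data into independent chunks...
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    # ...and farm them out to worker processes; combine the partial results.
    with Pool(4) as pool:
        partials = pool.map(analyse, chunks)
    total = sum(partials)
```

The same shape works for batch systems: replace the Pool with "submit one job per chunk".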
Great idea, btw.
no subject
That's roughly the scenario I had in mind. Maybe some exercises on "What does this fragment of code do? What's the bug in it?", where the fragment might be in some other language (Ada, or PL/I, or something else obscure so nobody has an unfair advantage...) I suppose if there's a specific set of libs that everyone's going to use, it would make sense to work with it.
Combining multiprogramming with optimisation makes sense. I guess it depends on how long it takes to explain Big-O notation.
no subject
I've never used Python, so I can't comment on how good it is. We did something similar in Durham, using Modula-2 for our course; this had the same basic concepts as C, while avoiding some of the quirks (e.g. "=" vs "=="). The main downside is that I've never come across any companies that use it, so it's not a particularly marketable skill. (This will probably be less relevant to people who aren't specifically programmers.)
Then again, I'd say that there's quite a big difference between imperative and functional languages, so I'm not sure whether the same concepts really apply to both; declarative languages (e.g. Prolog) are a separate thing again. So, this will depend on what people actually want to do. (I'm planning to write a set of LJ posts on this subject.)
Lab 2 looks very full: I think you could fill up an entire 2 hour session just on reflection, and it's a pretty advanced topic. On the other hand, I think that you could potentially skip "interaction with the filesystem" from lab 1; several of my apps don't do that at all (or it gets handled outside the app, e.g. by redirecting output).
I'd be inclined to start out with something on basic algorithms. For instance, we started out the CompSci degree by doing an algorithm for a vending machine. I'd say that the choice of algorithm is normally more important than the choice of language, so this might help people who already know language X. You don't have anything about sorting there, which I think would be useful; even if they use a library function rather than implementing BubbleSort, they may still need to do custom comparison functions for their own classes.
no subject
And the correct way to do sorting is to use list.sort() :-)
no subject
http://blogs.msdn.com/oldnewthing/archive/2003/10/23/55408.aspx
He's talking about issues that can arise when you write your own comparison function, and the stuff about partial order vs total order would probably quite appeal to mathematicians.
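The key-function idiom sidesteps much of that: instead of writing a comparator that might accidentally describe only a partial order, you map each record to something with a well-defined total order. For example (the element data is invented for illustration):

```python
records = [("helium", 4.0026), ("hydrogen", 1.0079), ("carbon", 12.011)]

# Sort by atomic mass: the key function maps each record to a float,
# and floats (NaN aside) have a total order, so the sort can't be
# confused the way an inconsistent hand-written comparator can.
by_mass = sorted(records, key=lambda r: r[1])
# [("hydrogen", 1.0079), ("helium", 4.0026), ("carbon", 12.011)]
```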
no subject
On web-programming: I suppose, with knowledge of databases and text munging web-programming is trivial :-)
By the way - where are all the algorithms, such as sorting, search and, maybe, graphs?
no subject
In the standard libraries, where they belong :-) One of the explicit problems I've noticed and would like to correct is that people waste far too much time and energy re-implementing nontrivial but standard functions badly. Such as sorting, or numeric integration, or fast Fourier transforms, or whatever. Remember, I'm trying to teach scientists, not computer scientists.
Web programming: sure, but it would be nice to show them a modern web framework like Django or something.
no subject
BTW, it's a good demonstration of the problem you noted - "Guys, maybe you tried to reimplement the reimplementable framework. Now you've got a framework you can't just remake, so find out how to use it" :-)
no subject
no subject
I think that an algorithm like BubbleSort is fairly easy to understand, so it would be reasonable to ask people to go away and implement that in their chosen language, just as a practice; this would also be an opportunity to teach them about proper test cases (e.g. the different arrays of integers that they could pass to their function). It would also be relevant for lab 9, if you're planning to do "Big O" notation for time complexity, particularly if you can then compare it to something like Mergesort. (Quicksort might be too complicated for an introductory course like this.)
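A sketch of the exercise, with the sort of test cases that would go along with it:

```python
def bubble_sort(xs):
    """O(n^2) bubble sort -- fine as a teaching exercise, never as a library."""
    xs = list(xs)  # work on a copy; don't mutate the caller's list
    for end in range(len(xs) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
                swapped = True
        if not swapped:  # no swaps means already sorted: O(n) best case
            break
    return xs

# The test cases are the other half of the exercise:
assert bubble_sort([]) == []
assert bubble_sort([1]) == [1]
assert bubble_sort([3, 1, 2]) == [1, 2, 3]
assert bubble_sort([2, 2, 1]) == [1, 2, 2]  # duplicates
assert bubble_sort([1, 2, 3]) == [1, 2, 3]  # already sorted
```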
no subject
I coded C under protest this term.
no subject
Step two should be a rewrite (basically version 2.0) from the ground up, now that you know where you're going. No matter how good you are at programming, you always find that choices you made at the beginning hinder you later.
It is rare for a program that has just been grown layer upon layer to ever really be suited to the final task, unless it was so diverse to start with that it can shift with changing targets. Most software of the latter form never seems to get finished, as the target was too broad.
The trouble is that in a lot of grad systems, step two would only happen after finishing the PhD (if we assume the person has no programming skills at the start). Also, since step one has been layered up over many PhDs, the result is a poorly commented mess, which would take too long to clean up while working out why previous students made the choices they did.
Once I had to go back to some fairly extensive VBA code for an Excel spreadsheet at a company over a year later. I knew I had written it, and swore at some of the sloppy parts in it. Thankfully, comments like "You're going to hate the next bit, it's meant to do x -> y" meant I could at least forgive my past self for admitting the mistakes and warning where traps might lie. But I had only been called in to add features, with no time to correct the lack of housekeeping caused by rushing the project out the first time. VBA isn't really programming, I know, but this script had to have its own extra error handlers to stop the user being able to see behind it: in effect it had to have zero bugs that do anything visible, and all bugs had to leave it in a valid state. Amazingly, that spreadsheet is used over most of an international company, with as yet no big failings.
I still only see it as an early "here's what I can do in a rush" solution, and one day it will probably be patched over the top of. It started with good intentions, with nearly everything as a function, but over time more and more functions became specific, and so were copied and pasted to make new functions that did something slightly different, rather than taking a parameter.
Well, I wittered on there; all I'm really saying is that people don't make these mistakes because the resources aren't there, they make them because they don't have the practice and foresight.
no subject
Ground-up rewrites have their own problems: it's usually better to take something messy, make it self-testing, and then refactor it into acceptability. Speaking of which, there's nothing about refactoring in there...
no subject
I think that's all important stuff to teach - but I reckon some of it should be implicit from the beginning (and made explicit later). So the first lab can begin with the instruction "type darcs get $URL", the end of each exercise can say "type darcs record and enter the tag Question $n", and the end of the lab instruction can be "type darcs send".
You don't have to explain what those commands are for at this stage - it's just the system you're using to make marking easier, and people are used to that sort of arbitrariness. But later, when you come to explain VCS, it won't be some weird abstract concept - it'll be a real tool they've been using. You could even lecture it like "Most of you guys made a common mistake in lecture 1 - let's use our cool software to go back in time and fix it!".
I agree with soundwell - it's a lot to teach. This is a real problem, and not one we can solve in one course. We can't turn out master blacksmiths in one term, so the temptation is to introduce them to a few tools (so they know which the sharp end is), give them a few important safety tips, and hope that they'll successfully teach themselves the actual smithery they need in their own time.
So, what I'm wondering is if it's worth making this "self-teaching" thing more explicit. Not a fully thought through idea, but let's roll with it:
Rather than teaching programming (which is impossible in so short a time), we're now teaching "self-teaching programming"...
The first word mentioned in the first lecture/lab/whatever could be "google". We roughly follow the plan marked out by pozorvlak, but rather than making the source of knowledge this course (which is time limited - after a term the course will be no more for these people), we make the source of knowledge the net - and the software communities in it.
So, when you want to sort something, the first thing you do is google python sort: "Python lists have a built-in sort() method." - cool, we can use that. Now maybe ask the attendees to figure out how to sort something using whatever system their lab have them using? Hm - that might get too confusing and unpredictable :)
We could mention the various help forums - encourage people to participate, etc... We still mention the safety tips, of course - always use VCS, always split things up, never use magic numbers - but we don't have to go into so much detail on specific techniques like file access and so on. We still mention them, and how to find out about them, but we don't teach them. The result is that there are fewer things to remember from the course. There might be a key list like:
Have I missed anything?
Dunno if that approach would work in practice, but it might be fun to give it a try.
no subject
Of course, 15 seconds after typing that html, I hit "post" without testing the html - and I'm about to do it again. Hm :)
no subject
I guess I'm really thinking of applied mathematicians or physicists as my students; the real problem with teaching them functional programming is that they'd then have to go to work on something imperative, for which their functional background would be little help. But I'd certainly want to show them some stuff with higher-order functions (even if it's only nonstandard sort) and some cool tricks with closures, and say "there's this thing called functional programming that's getting increasingly important, and you should definitely check it out some time..."
no subject
With programming, I certainly think that internet resources are a useful tool. The main snag is that it can lead to people blindly copying/pasting code without really understanding it, which then leads to problems in the long run. This could be because the code has come from an unreliable source, e.g. people who are relying on undocumented behaviour, so the app will suddenly stop working in the future. Or it could be because the sample code deliberately left out error checking in order to be concise (to illustrate particular functionality), and the person who copied it didn't realise that they needed to make it robust. The flipside of that is when I've done maintenance programming and discovered that 80% of a function is completely useless: it probably did something important in its original location, but it's not needed in the new program.
no subject
The only cure for this kind of thing is to actually teach people to program, I think.
no subject
no subject
We had a course like this last year (run in Matlab, as that was where all the relevant libs were - but the more advanced people were expected to progress to Python and [C, Java] (can't remember which) later on). It was excellent, and drummed a lot of good style into me that I knew I /ought/ to be using but wasn't, by the simple expedient of insisting that /all/ code for any coding-based practical (all the way up to our first year report) had to be handed in, and you got a bollocking if there was poor coding style (like magic numbers) or not enough comments. I could probably dig out some of the notes if you like.
And, boy, am I grateful for it now when I need to go back and edit code I wrote 6 months ago :)
no subject
Comments are good, but they're all too often used to cover up things that would be better fixed in the code. A maxim I heard a while ago: "all comments should begin with the word 'because'". In other words, they should explain why something is the way it is, rather than what it means - if you need to do that, your code's not clear enough :-)
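A tiny (contrived) illustration of the difference:

```python
raw_reading = 421.5  # invented sensor value, for illustration only

# Bad: the comment restates what the code already says.
doubled = raw_reading * 2  # multiply the reading by two

# Better: the code says *what*, the comment says *why*.
# Because the (hypothetical) sensor reports half-wavelengths, we double
# the raw reading to recover the true wavelength before any comparison.
wavelength = raw_reading * 2
```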
no subject
no subject
As for comments within the code, I remember writing a project during my first year as an undergrad which contained about 3 lines of comments, because I thought that everything else was self-explanatory. I then came back to this code a year later (wanting to re-use it in a new project), and thought "Gah, what does this all mean?!", so after that I tended to have lots of comments. In my FP code, I twiddled the options so that I'd use the > prefix for code rather than comments, on the grounds that more than half of my source code files were comments. Nowadays, I tend to be a bit more spartan, e.g. one comment per loop, but I still emphasise "what" as well as "why".
This can be particularly useful if you're reading code in a language that you're not familiar with. For instance, 10 years ago I worked at the Corporation of London, where I converted Clipper DOS apps into VB Windows apps; in my current job, I'm looking at converting Clarion code to VB.NET. I'm sure that other people have been involved in replacing old COBOL code, particularly in the wake of the Millennium Bug. In a situation like this, I'm not intending to ever use "Ancient Language X" again, so I'm not particularly interested in becoming the Leet-master in that language.
One thing I'm particularly keen on nowadays is that I start lines with code words. For instance, "TODO: ", "HACK: ", and "BUGBUG: " (I think this looks iffy, but I'm not going to change it until I'm sure about what it's supposed to do). The Visual Studio IDE can be configured to recognise any given prefix, and show them up in the task list; I assume that there's similar functionality in other languages.
Of course, this may not apply in a university context; I'm only speaking from my own experience. However, even if your students aren't going to be doing maintenance programming, some other poor sod may be stuck dealing with their code in the future, so comments are really for their benefit.
no subject
This is what
WRT the danger of docs and code getting out of sync - this is why many haskell coders write down the types of important functions even though the compiler doesn't need them. They're useful documentation that can't get out of sync with the code. You can use runtime assertions to do something similar in dynamically typed code (or to supplement type annotations in statically typed code). I'm sure I've heard of various tools to help make these things look more like doc and less like code in the source.
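A sketch of the assertion idea in Python (dot is just an invented example function):

```python
def dot(xs, ys):
    # Executable documentation: unlike a comment, these assertions can't
    # drift out of sync with the code, because they fail loudly the
    # moment they stop being true.
    assert len(xs) == len(ys), "vectors must have equal length"
    result = sum(x * y for x, y in zip(xs, ys))
    assert isinstance(result, (int, float))
    return result
```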
no subject
Or ... possibly my latest code is going to be submitted for publication.
Or ... what way around do I want to say this, anyway?
no subject
no subject
no subject
This is a bit different to the docstrings that
no subject
* (describe 'coerce-float-to-rational)
COERCE-FLOAT-TO-RATIONAL is an internal symbol in #<PACKAGE "COMMON-LISP-USER">.
Function: #<FUNCTION COERCE-FLOAT-TO-RATIONAL>
Its associated name (as in FUNCTION-LAMBDA-EXPRESSION) is
COERCE-FLOAT-TO-RATIONAL.
The function's arguments are: (X)
Its defined argument types are:
(FLOAT)
Its result type is:
(VALUES RATIONAL &OPTIONAL)
Note that SBCL observed that the declare applied to an argument, so typed the arguments in the description. And what I think is much cooler (I only just noticed it), SBCL correctly deduced the return type even though I never told it. Because Lisp is dynamic the compiler can't make these inferences in general, but in this case it can.
no subject
By the way, I like your blog. The echo */ trick is going in my toolbox...
no subject
Aha, I'm right. I can spread universal truth and wisdom through portable shell programming. :)
no subject
Decent event-based programming in Python is basically Twisted. My experience of teaching Twisted to computer scientists and electronics students indicates that you will not be able to squeeze enough conceptual work into one or two labs; it's very difficult to grasp the concept of deferred execution.
Relational databases could be a subject for a Lab, but the downside is that you inevitably have to learn SQL, which can be a bit much in such a compressed course. Maybe an ORM such as the Django ORM is a better way to go about things.
I seriously doubt you can cover enough about GUI programming in one lab to be useful. Maybe replace it with web programming; it's easier to cover the essentials.
As software goes, pick an IDE and stick with it. I rather like eric3 on unix (based on pyqt), and pyscripter on Windows (based on python 4 delphi). Your mileage may vary on the unix one, but there are very few good IDEs that interact correctly with Windows.
Suggested text: How to Think like a Computer Scientist - it's only ever so slightly out of date, but it covers most of the concepts you want to go over. The most glaring omission in that text is generators, but I gather it's being rewritten at the moment, so it may well be included in the next edition.
no subject
Somewhere along the line I seem to have stopped using the subjunctive about this course...
no subject
no subject
*covers head with hands*
no subject
Lab 1: wouldn't bother with: recursion; functions are data. Add "Python on the web", «help(str)» etc.
Lab 2: wouldn't bother with: graphs; reflection; pickling. Introduce recursion here with trees. That there exist more datastructures beyond the array will be an eye-opener for many scientists. Given that they're going to have to produce files to be eaten by some 50 year old Fortran program good text and binary IO is going to be more important than pickling.
Lab 3: seems really key. Your telescope produces images in a weird file format? Try looking on the web for a Python module to handle it.
Lab 6: wouldn't bother with: parser-generators; code that writes code.
Lab 7: wouldn't bother with it. (aside, since you ask: I've written one GUI program in my life, for money, and it was in wxWidgets in Python, never having used wxWidgets before. It was fine)
Scope? Maybe next to recursion?
You have lots and lots of very sensible comments and suggestions already.
no subject