September 19th, 2005

dotat

Echsim

Last week, a discussion started on exim-users about how Exim's excessive number of little languages could be rationalized (link). I have thought about this problem to some extent, so I wrote the following...

There are two sets of languages that are relevant to Exim: configuration languages (of which I count a generous handful) and extension languages (currently two: Perl via ${perl, and C via local_scan() and ${dlfunc). Configuration languages are important because they are the user interface of the program, and everyone has to live with them. Exim's problem is that it has too many sub-languages: two filter languages (Exim's own, plus Sieve); the ACL language; the driver language for routers etc.; the list match language; the string expansion language; and regular expressions. This count is rather inflated: it's a bit of a cheat to count regexes separately, because nowadays they're part of every decent language, and list matching and string expansion function as the expression syntax to the other languages' statement syntax. But there's a lot of overlap and non-orthogonality, so there is plenty of room for improvement.

Time for a bit of terminology. Configuration languages are a subset of "domain-specific languages". The scope of the term is quite broad, and is fairly well illustrated by the "little languages" of the traditional Unix tools: typesetting languages like troff, tbl, pic; compiler generation tools like lex and yacc; text-processing languages like sed, awk; command languages like make and the shell; configuration languages like crontab, inetd.conf, printcap, termcap; etc. These may or may not be usable as general-purpose languages; the point is that they are targeted at a specific domain (i.e. a specific purpose).

There is an observation that DSLs, especially for complicated pieces of software, either need to be programmable, or they become programmable as they accumulate features. The latter has happened to Exim twice (one, two). This leads to the argument that programmability should be designed in from the start; furthermore, if you base the DSL on an existing programming language then you don't have to do the language implementation yourself and can concentrate on the domain-specific code. Hence the idea of "embedded domain-specific languages": DSLs that are implemented within the framework of a programming language.

We were speculating about replacing Exim's configuration language with a DSL designed for programmability, and I suggested making it an EDSL. Then we got into an argument about which language should be the host for the embedding. So what makes a good host language?

I think the most important thing is extensible flow control operators. The reason for this is that Exim's declarative configuration style hides quite a lot of flow complexity: many decisions are four-way (accept/reject/defer/pass) or more, and there is implicit short-cutting and iteration over addresses. The EDSL configuration should preserve this hiding of complexity, which means that configuration keywords like drop/deny/defer/accept have to be able to affect the control flow without requiring boilerplate from the user.

This is even more important in the routers, where instead of dropping back into Exim's core, you usually want to skip to the next router. It's better to make the whole chain of routers a single routine (rather than one per router) because then the postmaster can code complicated routing decisions beyond the usual sequencing, but this in turn makes difficult demands of the host language.
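To make the four-way decision and the "skip to the next router" behaviour concrete, here is a rough sketch in Haskell of a router chain written as a single routine. Everything in it is hypothetical and invented for illustration (the Outcome type, orElse, onlyIf, domainIs, the transport name and the example domains); none of it is Exim's actual API or configuration.

```haskell
module Main where

-- Hypothetical sketch: these names are illustrative, not part of Exim.

-- The four outcomes a routing step can produce.
data Outcome
  = Accept String   -- hand the address to the named transport
  | Reject String   -- fail the address with a message
  | Defer String    -- temporary problem, try again later
  | Pass            -- no opinion, let the next router decide
  deriving (Eq, Show)

type Address = String
type Router  = Address -> Outcome

-- Run the first router; only a Pass falls through to the second.
orElse :: Router -> Router -> Router
orElse r1 r2 addr = case r1 addr of
  Pass  -> r2 addr
  other -> other

-- Guard a router with a condition; if the condition fails, Pass.
onlyIf :: (Address -> Bool) -> Router -> Router
onlyIf cond r addr = if cond addr then r addr else Pass

-- True if the part after the first '@' equals the given domain.
domainIs :: String -> Address -> Bool
domainIs d addr = drop 1 (dropWhile (/= '@') addr) == d

-- The whole chain of routers as one routine.
routeMail :: Router
routeMail =
  onlyIf (domainIs "example.org") (const (Accept "local_delivery"))
    `orElse` onlyIf (domainIs "down.example") (const (Defer "host unavailable"))
    `orElse` const (Reject "unrouteable address")

main :: IO ()
main = mapM_ (print . routeMail)
  ["fanf@example.org", "x@down.example", "y@elsewhere.example"]
```

The postmaster only writes routeMail; the fall-through on Pass lives entirely in orElse. Because routers are ordinary values, more complicated decisions than plain sequencing can be built by writing further combinators of the same shape.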

Tcl is of course famously designed to be a host for EDSLs (such as expect). It isn't a particularly nice language in itself, with its clumsy variable assignment and expression evaluation commands, but this is less of a problem if your common commands have rich semantics, and it's compensated for by Tcl's brilliance at non-standard flow control. This is mainly because it's easy to quote blocks of code for evaluation now or later, so unlike many languages it's trivial to define your own if command, because order of evaluation is not rigid. In if [test] {then} {else} the [] specifies evaluation now and the {} specifies evaluation later, so the if command's implementation can just look at the value of its first argument and evaluate its second or third accordingly.

Lisp is another big EDSL host - in fact this is part of the culture of Lisp: when writing a Lisp program you first design an EDSL, then you code the solution in your new language. Lisp is less nice than Tcl as the basis for a configuration language, though, because of the irritating superfluous parentheses. Still, Emacs makes a plausible existence proof.

However, both these languages suffer from a lack of static checking (in the case of Tcl even at the level of basic syntax), which imposes on the postmaster a burden of testing that in an ideal world would be performed automatically. This is why (apart from personal aesthetic preference) I suggest Haskell as the host language. Like Lisp, it has a culture of EDSLs. However, these tend to focus on sets of "combinators" that are used to tie bits of code together - exactly the kind of do-it-yourself flow control we want to be able to do. In addition to that, the "monad" concept is brilliant for tucking away the implicit state manipulation and the short-cutting flow control that Exim does all the time, without cluttering the syntax seen by the user.
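As a rough illustration of how a monad can tuck this away, here is a minimal sketch of an ACL-like check. The Acl type, the Message record, the verbs accept and deny, and the conditions hostIn, recipientDomain and always are all made up for this example; they are assumptions about what such an EDSL might look like, not an existing library and not Exim's ACL semantics in detail. The monad carries the message being checked (the implicit state) and stops at the first verb whose condition holds (the short-cutting).

```haskell
module Main where

import Data.List (isSuffixOf)

-- Hypothetical sketch: every name below is invented for illustration.

data Verdict = Accepted | Denied String | Deferred String
  deriving (Eq, Show)

data Message = Message { senderHost :: String, recipient :: String }

-- An ACL step reads the message and either stops with a verdict (Left)
-- or carries on with a value (Right).
newtype Acl a = Acl { runAcl :: Message -> Either Verdict a }

instance Functor Acl where
  fmap f (Acl g) = Acl (fmap f . g)

instance Applicative Acl where
  pure x = Acl (\_ -> Right x)
  Acl f <*> Acl g = Acl (\m -> f m <*> g m)

instance Monad Acl where
  Acl g >>= k = Acl (\m -> case g m of
    Left v  -> Left v              -- a verdict was reached: skip the rest
    Right x -> runAcl (k x) m)     -- otherwise continue, same message

-- Read something from the message under consideration.
asks :: (Message -> a) -> Acl a
asks f = Acl (Right . f)

-- If the condition holds, stop with the given verdict.
verdictIf :: Verdict -> Acl Bool -> Acl ()
verdictIf v cond = do
  hit <- cond
  if hit then Acl (\_ -> Left v) else pure ()

accept :: Acl Bool -> Acl ()
accept = verdictIf Accepted

deny :: String -> Acl Bool -> Acl ()
deny msg = verdictIf (Denied msg)

hostIn :: [String] -> Acl Bool
hostIn hs = asks ((`elem` hs) . senderHost)

recipientDomain :: String -> Acl Bool
recipientDomain d = asks (\m -> ('@' : d) `isSuffixOf` recipient m)

always :: Acl Bool
always = pure True

-- What the postmaster would write: no explicit message, no explicit exits.
checkRcpt :: Acl ()
checkRcpt = do
  deny "blocked host" (hostIn ["spam.example"])
  accept (recipientDomain "example.org")
  deny "relay not permitted" always

-- This sketch treats running off the end of an ACL as a denial.
evalAcl :: Acl () -> Message -> Verdict
evalAcl acl m = either id (const (Denied "no verdict")) (runAcl acl m)

main :: IO ()
main = mapM_ (print . evalAcl checkRcpt)
  [ Message "spam.example" "a@example.org"
  , Message "mx.example.net" "a@example.org"
  , Message "mx.example.net" "b@elsewhere.example"
  ]
```

The point to notice is that checkRcpt never mentions the message or the early exits: both are handled by the Monad instance, and a mistyped verb or condition is caught by the type checker before the configuration is ever run.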

Functional programming has a reputation for taking "hair shirt" purity too far, to the extent of being useless for practical purposes. However, at least one plausible Internet server application has been written in Haskell, and Pugs is showing that it isn't a completely indigestible language for Perl hackers.

But at the moment this is just idle speculation - though I do have a cute name for the idea ("Elegant Configuration using Haskell for Internet Mail") - but it's unlikely to actually happen until I find some extra tuits...