fanf

Generalized literal syntax for programming languages

5th Feb 2010 | 20:13

For a while I have had a fantasy programming language syntax idea that I have been failing to write up. This week I found out that it has already been implemented twice, which is cheering news :-)

The main inspiration is the special support for regular expression literals in languages such as Perl and Javascript. It's annoying that regexes are privileged with special syntax, but library writers can't define domain-specific syntax for their own purposes.

In fact Perl has several special literal syntaxes: single-quotes and q{} for verbatim strings; double quotes and qq{} for backslash-escaped interpolated strings; backticks and qx{} for shell commands; qw{} for word lists; slashes, m{}, s{}{}, and qr{} for regular expressions; tr{}{} for character substitution; and << for "here" documents.

The D programming language also has some extra flavourful string literals: r"" or `` for verbatim strings; "" for backslash-escaped strings; x"" for hex-encoded data; and bare backslash escape sequences.

What I'd like to be able to do is define a library for handling my special syntax. It would work as a plugin for the compiler, parsing and checking my special literals at compile time (no run-time syntax errors!) and emitting code that implements their special semantics. With this framework, features that would otherwise have to be built into the language can instead be provided by libraries: converting backslash escape sequences into control characters, turning interpolated strings into a series of concatenations, arranging for regular expressions to be compiled once, and so forth.
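
A minimal sketch of what the plugin interface for a literal compiler might look like, written here in Haskell (the names and types are hypothetical, not any real compiler's API):

    -- A literal compiler is registered under a short name.  At compile
    -- time it is handed the raw text of the literal and either reports
    -- a syntax error or emits an expression in the host language.
    data LiteralCompiler expr = LiteralCompiler
      { literalName    :: String                       -- e.g. "re" or "sql"
      , compileLiteral :: String -> Either String expr -- error or code
      }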

You could then provide support for (say) XML literals, XPath expressions, better pattern matching syntax, or whatever else you might fancy.

One possibility that is very enticing is to make string interpolation context-aware, so that interpolated strings can be automatically escaped properly. The mixture of languages and syntaxes in a web page makes this fiendishly complicated, and different SQL engines have different escaping requirements. If this kind of hard-won knowledge could be implemented once in a library, security vulnerabilities such as cross-site scripting and SQL injection would be easier to avoid. The Caja project includes a proposal for secure string interpolation in Javascript which explains these issues very well.
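
To make the escaping idea concrete, here is a toy version in Haskell. It runs the template at run time, whereas a literal compiler would split the template into chunks at compile time, but the escaping rule is the interesting part (all the names here are made up):

    -- Context-aware interpolation for HTML: literal chunks pass through
    -- untouched, interpolated values are escaped, so a value such as
    -- "<script>" cannot break out of its context.
    data Chunk = Lit String | Var String

    escapeHtml :: String -> String
    escapeHtml = concatMap esc
      where
        esc '&' = "&amp;"
        esc '<' = "&lt;"
        esc '>' = "&gt;"
        esc '"' = "&quot;"
        esc c   = [c]

    render :: [(String, String)] -> [Chunk] -> String
    render env = concatMap chunk
      where
        chunk (Lit s) = s
        chunk (Var v) = escapeHtml (maybe "" id (lookup v env))

    -- render [("name", "<script>")] [Lit "<p>", Var "name", Lit "</p>"]
    --   == "<p>&lt;script&gt;</p>"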

The syntax I had in mind is inspired by Perl (and D's letter prefixes are along similar lines): for example, re$/^ *#/ is a regular expression for matching comment lines. A literal starts with an identifier that names the literal's compiler. These identifiers have their own namespace, so they can be very terse without clashing with variable names. The identifier is followed by a $, which indicates that this is a string. The contents of the literal are delimited in the same way as Perl's generic literals, either with nesting {} () <> [] brackets or with a matching pair of punctuation characters. There is no way to escape delimiters within the literal, so that its text can be passed unmodified to its compiler. For longer literals, or if single-character delimiters are awkward, you can use $$ instead of $ and the rest of the line forms the delimiter, a bit like a here document. A literal compiler may or may not support interpolation, and defines its own syntax for doing so.
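
A few illustrative forms, following the rules above (the sql and xml literal compilers are hypothetical, like re):

    re$/^ *#/               slashes as matching punctuation
    re${^ *#}               the same literal with nesting brackets
    sql$(select * from t)   a different literal compiler, different brackets
    xml$$END                here-document style: after $$ the rest of
    <p>hello, world</p>     the line ("END") is the delimiter
    END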

This week I was pleased to find out that the Glasgow Haskell Compiler supports my generalized literal idea. They call it "quasiquoting" after Lisp. The syntax it supports looks like [$re|^ *#|] though this is likely to change to [re|^ *#|]. The brackets are based on Template Haskell's [| |] quotation brackets. The quasiquoting paper has lots of good rationale for the feature and great examples of how easy Haskell makes it to write certain kinds of literal compilers. It also allows literal compilers to define their own interpolation syntax.
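
For a flavour of what writing one looks like, here is a toy quasiquoter (field names as in current GHC; the 2010 version had only the expression and pattern quoters). The balanced-parenthesis check stands in for real regex parsing, and a real quoter would splice in a compiled regex rather than a plain string:

    module Re (re) where

    import Language.Haskell.TH (stringE)
    import Language.Haskell.TH.Quote (QuasiQuoter (..))

    re :: QuasiQuoter
    re = QuasiQuoter
      { quoteExp  = \s ->
          if balanced 0 s
            then stringE s        -- checked; embed the text as-is
            else fail ("unbalanced parens in regex: " ++ s)
      , quotePat  = unsupported
      , quoteType = unsupported
      , quoteDec  = unsupported
      }
      where
        unsupported _ = fail "re: only usable as an expression"
        balanced :: Int -> String -> Bool
        balanced n []       = n == 0
        balanced n ('(':cs) = balanced (n + 1) cs
        balanced n (')':cs) = n > 0 && balanced (n - 1) cs
        balanced n (_:cs)   = balanced n cs

    -- In another module, with {-# LANGUAGE QuasiQuotes #-}:
    --   comments = [re|^ *#|]   -- any syntax error is a compile error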

The E programming language has a feature called "quasi-literals" which looks like rx`^ *#` with back-ticks for delimiters. The secure string interpolation document criticizes them for being processed at run time, not compile time, so perhaps they aren't quite what I have in mind; also, I hope they are wrong to say that the feature can only compile quasi-literals to parse trees. The E documentation is sparse, so it's hard to tell.

I should also mention Lisp here, since it's the king of metaprogramming. I expect you could implement a feature like this using Common Lisp reader macros...

Do any other languages have a feature like this?


Comments (14)

Just a random swede

from: vatine
date: 5th Feb 2010 20:30 (UTC)

I was just seconds away from saying "CL reader macros", but you did it for me. When I implemented a (partial) elisp library for Portable Hemlock, I used reader macros quite successfully to implement the special read syntax that elisp uses for various things. Extending it so it dispatches one step further would be just a SMOP.

from: anonymous
date: 6th Feb 2010 09:45 (UTC)

In passing, E and Caja share some developers.

Tony Finch

from: fanf
date: 6th Feb 2010 11:58 (UTC)

Yes. Sadly there are fewer people working on capability-secure systems than the technology deserves.

pozorvlak

from: pozorvlak
date: 6th Feb 2010 11:58 (UTC)

Factor has this, and Forth probably does too. Regular expressions are handled by the R/ parsing word, which scans forward until the next unescaped / and compiles everything it finds into a DFA. You can also call the regex-AST-constructing functions directly if you want to build up a regex programmatically. I don't think it's possible to do variable interpolation in Factor regexes (there's certainly no equivalent of Perl's (?{ code }) feature), but it's possible to do interpolation in other contexts: see here.

Pluggable compiler extensions are also one of the major design drivers behind Perl 6, AIUI.

There are problems with extensible syntax: for instance, somehow you need to tell your tools about the extensions.

Tony Finch

from: fanf
date: 6th Feb 2010 12:05 (UTC)

The tools problem is why the literal delimiters need to be independent of the extension. That way, a tool that doesn't know about the extension can just treat it like a string literal. Lisp reader macros and Factor parsing words don't have this property.

Gerald the cuddly duck

from: gerald_duck
date: 6th Feb 2010 14:31 (UTC)

This seems to rhyme with something I've wanted for at least a couple of decades.

There are a great many programming languages in the world, following a variety of idioms: declarative, imperative, formal-logical, functional, stack-based, mathematical, object-oriented, event-based, logic-simulation, text-transformation, data-translation, etc. In general, any given programming language will have a few of these idioms that are regarded as its forte and some other general strengths. Then bits and pieces of the other idioms will get bolted on by those who want them.

When the problem truly spans domains that demand different programming languages, people write software in a mixture of languages and link them together. This has been supported for ages, and things like Microsoft's .net and the GNU Compiler Collection try to make it easy.

However, what I want is a meta-language — in a far broader sense than is intended by the authors of ML. As well as a consistent runtime that governs procedure call standards, memory management, datum ownership, representations of fundamental data types, and record/struct packing, I want a consistent metasyntax that defines at least default rules for:
  • Choice of character set, character classes, binary representation
  • Token boundaries
  • Comments
  • Whitespace
  • Permitted identifiers
  • Nesting
  • Escaping
  • Idiom transition
  • Names for standard types
  • Textual program-code substitution mechanisms ("pre-processing")
GCC allows use of inline assembler in C/C++ code; Perl has special syntax for inline regexps and literal strings; and so on.

Armed with this mechanism, if you knew how to write a regular expression in the object-oriented programming language, you'd also know how to write a regular expression in your logic simulator (and then someone can work out how to synthesise the regexp into an FPGA…). Similarly, you could write a new idiom and have it work from within any of the other idioms.

For me the next interesting question is: how might this be implemented? Which idiom would be used for the API between the different idiom implementations? Given the ideal that programming languages can be implemented in terms of themselves, presumably the compiler would be a mixed-idiom program where each idiom implemented itself? (-8

Tony Finch

from: fanf
date: 6th Feb 2010 22:04 (UTC)

A worthy goal.

(Aside: remember that ML was the metalanguage for a computational logic system. It's a general purpose programming language, not a general purpose metalanguage.)

Your lower level goals (enumerated in the paragraph before your bulleted list) are answered by the CLR (deliberately) and the JVM (by accident). But they don't address source language commonality. cf. clojure, scala, groovy, C#, F#, etc.

The difficulty is that different styles of language have conflicting traditions. OO languages try to provide abstract data types, whereas functional languages expose the representation of types because pattern matching is so nice. Dynamic languages are great for loosely-coupled systems, whereas static types find mistakes sooner.

When language designers tackle this problem they usually try to come up with an over-arching design that will accommodate the programming styles that they like, rather than coming up with a pluggable language framework. I think this is because pluggable frameworks only work when the framework designer has a good idea of the domain occupied by the plugins. Programming languages are still too much of a research area.

In the JVM and CLR ecosystems there's a lot of work on interoperability between language styles. But the work is very coloured by the dominant conventional OO languages that lead the platforms, because secondary languages have to interoperate with libraries written in/for the primary languages. Styles of library are closely coupled to the languages they are written in. (For example, a lot of the popularity of clojure is due to the quality of its libraries.)

There seems to be a lot less interest in interop in the lower-level C world. I think this is because languages that compile down to the CPU choose run-time data structures that benefit the specific language, since interop with C implies sacrificing advanced features (such as GC) that are part of the JVM/CLR platforms.

So despite the OO bias, I think the places to look for programming language convergence are the JVM and the CLR, but they still don't get very near your ideal. A lot of the interop is still based on linkage.

What I have in mind is much more modest. Things like compiling printf() format strings instead of interpreting them. (And Perl's pack() and unpack() functions.) A multi-paradigm compiler would obviously be written using one or two DSLs for lexing and parsing, perhaps a DSL for transforming tree-based intermediate representations, perhaps a template language for emitting bytecode or assembler.
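
For instance, here is the printf idea as a Template Haskell sketch, cut down to %d and %s (a condensed version of the classic example from the Template Haskell paper): the format string is parsed at compile time, so a malformed format is a compile-time error and no parsing is left to run time.

    {-# LANGUAGE TemplateHaskell #-}
    module Printf (printf) where

    import Language.Haskell.TH

    -- Tokens of a format string: %d, %s, or a literal chunk.
    data Fmt = D | S | L String

    parse :: String -> [Fmt]
    parse []           = []
    parse ('%':'d':cs) = D : parse cs
    parse ('%':'s':cs) = S : parse cs
    parse (c:cs)       = case parse cs of
                           L s : fs -> L (c:s) : fs
                           fs       -> L [c] : fs

    -- Generate a curried function, one argument per directive,
    -- accumulating the result string left to right.
    gen :: [Fmt] -> Q Exp -> Q Exp
    gen []         acc = acc
    gen (D   : fs) acc = [| \n -> $(gen fs [| $acc ++ show (n :: Int) |]) |]
    gen (S   : fs) acc = [| \s -> $(gen fs [| $acc ++ s |]) |]
    gen (L l : fs) acc = gen fs [| $acc ++ l |]

    printf :: String -> Q Exp
    printf fmt = gen (parse fmt) [| "" |]

    -- In the caller: $(printf "%s is %d\n") "x" 42  ==  "x is 42\n"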

Most DSLs are very modest, not Turing-equivalent let alone capable of implementing their own compiler. I want modest DSLs to be easier to implement and their implementations to be better.

Res facta quae tamen fingi potuit

from: pauamma
date: 6th Feb 2010 17:10 (UTC)

Hmm, ESQL?

Tony Finch

from: fanf
date: 6th Feb 2010 22:08 (UTC)

Um...

If you mean "Embedded SQL" then one of the things I had in mind was LINQ.

Res facta quae tamen fingi potuit

from: pauamma
date: 7th Feb 2010 15:49 (UTC)

Yes, I meant embedded SQL. It didn't occur to me that IBM would use the acronym with another meaning. (Although, this being IBM, perhaps I should have expected it.)

/me wanders off to read your link.

C++1x

from: anonymous
date: 7th Feb 2010 09:21 (UTC)

The upcoming revision of C++ (which is very exciting) also has something like this. The syntax is C string literal syntax followed by a suffix that specifies the compiler:

"^ *#"re

The compiler can be a function to be called at run time:

regex_t operator "" re(const char *s, size_t len);

or at compile time:

constexpr regex_t operator "" re(const char *s, size_t len);

Combining this with other abuses of user-defined operators allows one to build a nice domain-specific language.

Unfortunately this feature does not integrate well with template metaprogramming: the implementation of a string literal operator cannot use the string it is parsing as a template argument (though a numeric literal operator can).

The standard is filled with examples like a definition for 1.5_km or "étale"_utf8, but the regex example seems more interesting.

Jonathan

References:
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2750.pdf>
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2009/n3000.pdf> page 305 ([over.literal])

Tony Finch

Re: C++1x

from: fanf
date: 7th Feb 2010 11:20 (UTC)

Nice.

One of my goals was to give the literal compiler complete control over escaping syntax, to avoid the "leaning toothpick" problem you get with multiple layers of escaping. See for example regular expressions in Emacs Lisp. (Why on earth does it use BREs not EREs?!) I wanted the programmer to be able to paste in a bit of code in some foreign language without modification.
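
For example, in ordinary Haskell string syntax every backslash in a regex has to be doubled, whereas a quasiquoted literal (assuming a quoter like the toy re sketched above) is passed through verbatim:

    escaped  = "\\d+\\.\\d+"    -- the regex \d+\.\d+, backslashes doubled
    verbatim = [re|\d+\.\d+|]   -- pasted in unmodified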

I expect that as well as the double escaping problem, C++ programmers will also have to bracket each newline in a multi-line literal with \n" and ".

Re: C++1x

from: anonymous
date: 7th Feb 2010 17:18 (UTC)

The draft standard also provides a syntax for that:

R"==[^ *#]=="re

Here, == is a string of non-whitespace, non-bracket characters chosen by the user to avoid escaping problems (kind of like the s,foo,bar, expressions in sed).

Without the user-defined suffixes, you can play around with this in g++-4.5 -std=c++0x or gcc-4.5 -std=gnu99.

In practice, it is probably a more important syntactic extension than the user-defined literal suffixes. Thanks for pointing me to it.

Tony Finch

Re: C++1x

from: fanf
date: 7th Feb 2010 20:30 (UTC)

Sweet :-)

A pity they didn't add underscore separators to numeric constants, so you could write 1_000_000 or 3.141_592_653_589 like Ada, Verilog, Perl, Ruby, Java 7 and maybe some other languages.
