14 Matthew F. Building Languages

Matthew Flatt

Goals

— conventional syntax

On the one hand, we’ve looked at modeling languages in Redex. On the other hand, we’ve started looking at implementing compile-time functions as a way of defining new pieces of a language. As we’ll see, you can use compile-time functions to define a whole new language within Racket. So, what’s the relationship between Redex models and compile-time functions?

Redex and compile-time functions reflect the two main, different ways to implement a language in the realm of Racket. A Redex model gives you an interpreter—a function that maps programs to results. Compile-time functions can define a compiler—a function that maps programs to other programs; to run the resulting program, you will rely on the existing Racket "interpreter". The Racket interpreter itself composes a compiler to machine code with interpretation of that machine code.
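As a rough illustration of the distinction, here is a sketch that implements the same tiny arithmetic language both ways. The names interp and compile-arith and the plus/times language are made up for this example and are not part of the course material.

```racket
#lang racket
(require (for-syntax syntax/parse))

;; Interpreter: a run-time function that maps a program (as data)
;; directly to its result.
(define (interp e)
  (match e
    [(? number? n) n]
    [`(plus ,a ,b) (+ (interp a) (interp b))]
    [`(times ,a ,b) (* (interp a) (interp b))]))

;; Compiler: a compile-time function (a macro) that maps a program
;; (as syntax) to a Racket program; Racket then runs the result.
(define-syntax (compile-arith stx)
  (syntax-parse stx
    #:datum-literals (plus times)
    [(_ n:number) #'n]
    [(_ (plus a b)) #'(+ (compile-arith a) (compile-arith b))]
    [(_ (times a b)) #'(* (compile-arith a) (compile-arith b))]))

(interp '(plus 1 (times 2 3)))        ; the program is a quoted datum
(compile-arith (plus 1 (times 2 3)))  ; the program is part of the module
```

Both expressions produce 7. The interpreter walks the datum each time it is called, while the macro use is rewritten to (+ 1 (* 2 3)) before the module runs.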

Whether an interpreter or a compiler is better depends on your goal. You may well want both: you may want to take a model as an interpreter and compile programs to a call to your interpreter, which gives you some of the benefits of both. We’ll see how to do that tomorrow morning.

14.1 Extending or Defining a Language with Macros

Up to this point, we’ve written “compile-time function,” but we now refine the terminology to macro, to reflect that we mean a particular kind of compile-time function.

Racket macros implement syntactic extensions, by which we mean that you have to learn specific rules for each macro that you might use, in a way that’s qualitatively different from having to learn the specific behavior of each library function that you might call. When you use a run-time function, you can know that the rest of the program will run independent of the function as long as you don’t reach the call. More importantly, you know how argument expressions in the function call will behave. With a macro, you don’t know whether your program will even compile if you don’t know anything about the macro (i.e., you may not have the option of running the rest of the program), and there are no subexpressions within the macro use that have a meaning independent of the macro.
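A small sketch of that difference, using two hypothetical definitions, choose-f and choose-m, that look alike at the use site. With the function, every argument expression is evaluated before the call. With the macro, the subforms are only syntax until the macro decides what to do with them.

```racket
#lang racket
(require (for-syntax syntax/parse))

;; A run-time function: all three argument expressions are evaluated
;; before the body of choose-f runs.
(define (choose-f test yes no)
  (if test yes no))

;; A macro: the subforms are rearranged at compile time, so `no` is
;; evaluated only if the expansion's `if` selects it.
(define-syntax (choose-m stx)
  (syntax-parse stx
    [(_ test:expr yes:expr no:expr)
     #'(if test yes no)]))

;; The error expression is never evaluated here:
(choose-m #t 'ok (error "never evaluated"))

;; By contrast, (choose-f #t 'ok (error "boom")) would raise the error,
;; because a function call evaluates its arguments first.
```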

We’ve seen examples all week of how you have to learn special rules for the syntactic forms provided by Redex. Hopefully, it has also been clear why learning and using those special rules is worthwhile to more succinctly express program models. If you’re defining a language, then the concern of having to specify a form’s interaction with the rest of the language is the point, anyway.

While both macros and the implementation of a conventional compiler use compile-time functions (i.e., the compiler obviously runs at compile time), macros have the additional feature of being able to plug into different contexts and to cooperate with other, unknown language extensions. Along those lines, Racket macros offer a smooth path from simple syntactic abstractions to language extensions to whole language implementations.

To get a sense of why it’s possible to implement whole new languages with Racket macros, try running this program:

#lang racket
(require (for-syntax syntax/parse))

(define-syntax (lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(cons 'x e)]))

(lambda (x) 10)

This example illustrates that there are no identifiers in Racket that are special as keywords that cannot be redefined. Instead, seemingly core parts of the language, including lambda, can be defined as macros.

Stop! What happens if you add (define (f x) 11) and (f 10) to the program?

Exercise 30. Racket’s define forms can appear in (let () ....) to make the definition local to the let form. Given that fact, define lambda without referring to the built-in lambda form.

14.2 Macros and Identifiers

When we define lambda as above, then the original lambda becomes inaccessible. Sometimes that’s fine, but if the intent of a new lambda is to extend the existing one—perhaps to add logging on each entry to the function—then we’d like to define a new lambda in terms of the original.

One approach is to define lambda as the new form, but import the original lambda under a different name so that we can still refer to it:

#lang racket
(require (for-syntax syntax/parse)
         (only-in racket
                  [lambda original-lambda]))

(define-syntax (lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(original-lambda (x)
         (printf "arg: ~s\n" x)
         e)]))

(define f (lambda (x) 10))
(f 2)

Importing the original lambda as original-lambda allows the new implementation of lambda to use it, but it also allows the rest of the module to use original-lambda. If we want to write programs that only have access to the new lambda, the best organization is to put the implementation of lambda in a module separate from the program that uses it.

"noisy-lambda.rkt"
#lang racket
(require (for-syntax syntax/parse)
         (only-in racket
                  [lambda original-lambda]))

(provide lambda)

(define-syntax (lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(original-lambda (x)
         (printf "arg: ~s\n" x)
         e)]))

"program.rkt"
#lang racket
(require "noisy-lambda.rkt")

(define f (lambda (x) 10))
(f 2)
; original-lambda isn’t bound here

Since we may want to use the original lambda in many ways to implement a language, and since that language implementation typically doesn’t want to use the new form directly, we usually rename on provide instead of on require:

"noisy-lambda.rkt"
#lang racket
(require (for-syntax syntax/parse))

(provide (rename-out [new-lambda lambda]))

(define-syntax (new-lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(lambda (x)
         (printf "arg: ~s\n" x)
         e)]))

Exercise 31. Add a match clause (or several) to the new-lambda macro so that lambda shapes (trees) other than
(lambda (x:id) e:expr)
behave as before. Note: If you know more than basic Racket but not the whole language, just get some shapes to work, not all of lambda.

Exercise 32. Adjust "noisy-lambda.rkt" to make define create noisy functions, too, when it’s used in function-shorthand mode—like (define (f x) 11), as opposed to (define x 8) or (define f (lambda (x) 11)).

14.3 Controlling the Whole Language

Although "noisy-lambda.rkt" provides a lambda to shadow the one initially provided by the racket language, we rely on a client program to require it within a #lang racket without renaming the new lambda to something else and without requiring any other modules that provide a variant of lambda. To take control more reliably, we’d like a plain #lang line that gives the program the new lambda directly.

The language name listed after #lang is almost the same as a module name listed in require. Two extra features of #lang prevent us from using "noisy-lambda.rkt" after #lang in place of racket:

A language name after #lang is responsible not only for providing a set of identifier bindings, but also for declaring how to parse the rest of the characters after #lang, and "noisy-lambda.rkt" does not yet do that.
A language name after #lang has to be just alphanumeric characters plus _ and -. It cannot have quote marks, like "noisy-lambda.rkt".

We can defer both of these constraints to an existing language, s-exp, which declares that the module content is parsed using parentheses, and which looks for a module name to provide initial bindings (using normal Racket string syntax) right after s-exp. Our first attempt will not work, however:

"program.rkt"
#lang s-exp "noisy-lambda.rkt"

(define f (lambda (x) 10))
(f 2)

The error is

module: no #%module-begin binding in the module’s language

We’ll need to tell you a little more to say why the error complains about #%module-begin, but the overall problem is that the module after s-exp is responsible for providing all bindings to be used in the module body, and not just things that differ from racket. Our example program needs, in addition to lambda, the define form, number constants, function application, and module-body sequencing. Let’s define "noisy-racket.rkt" to provide our new lambda plus all the non-lambda bindings of racket.

"noisy-racket.rkt"
#lang racket
(require (for-syntax syntax/parse))

(provide (rename-out [new-lambda lambda])
         (except-out (all-from-out racket)
                     lambda))

(define-syntax (new-lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(lambda (x)
         (printf "arg: ~s\n" x)
         e)]))

Then we can use it as

"program.rkt"
#lang s-exp "noisy-racket.rkt"

(define f (lambda (x) 10))
(f 2)

Exercise 33. To avoid the possibility of divide-by-zero errors, adjust "noisy-racket.rkt" to not export /, modulo, or remainder.

Exercise 34. To allow division without the possibility of divide-by-zero errors, adjust "noisy-racket.rkt" to export variants of /, modulo, and remainder that return +inf.0 when the second argument is 0.

14.4 Implicit Syntactic Forms

Triggering syntactic extensions by name allows different extensions to be composed in a natural way, since each has its own trigger. Still, Racket has several forms where you don’t use a name. For example, 5 by itself is normally treated as a literal number, instead of requiring the programmer to write (quote 5). Similarly, assuming that f has a variable binding, (f 1 2) is a function call without something before the f to say “this is a function call.” In many of these places, you might want to extend or customize a language, even though there’s no apparent identifier to bind.

To support extension and replacement, Racket’s macro expander treats several kinds of forms as having an implicit use of a particular identifier:

5   =   (#%datum . 5)

(f 1 2)   =   (#%app f 1 2)

#lang racket/base
(define (f x) x)
(f 5)
   =
(module name racket/base
  (#%module-begin
    (define (f x) x)
    (f 5)))

Why does #lang correspond to two implicit names? Because the module one can’t be configured. The second one, #%module-begin, applies after the first one has imported the #%module-begin binding, so its meaning can be configured.

We couldn’t use "noisy-lambda.rkt" as a module-language module, because it doesn’t export #%module-begin. By exporting everything from racket except lambda, "noisy-racket.rkt" provides #%module-begin, #%app, and #%datum, all of which are used implicitly in "program.rkt".
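To see an implicit form being replaced, here is a single-file sketch that uses submodules in place of separate files. The module names traced and program and the macro traced-app are hypothetical. The language module exports everything from racket except #%app, and supplies a replacement that announces each call before delegating to Racket’s #%app.

```racket
#lang racket

;; A module language, written as a submodule so that the experiment
;; fits in one file.
(module traced racket
  (require (for-syntax syntax/parse))
  (provide (rename-out [traced-app #%app])
           (except-out (all-from-out racket) #%app))
  ;; Inside this module, #%app still refers to Racket's #%app.
  (define-syntax (traced-app stx)
    (syntax-parse stx
      [(_ f arg ...)
       #'(begin
           (printf "calling ~s\n" 'f)
           (#%app f arg ...))])))

;; A client module whose language is the submodule above. Every
;; parenthesized function call here implicitly uses traced-app.
(module program (submod ".." traced)
  (provide result)
  (define (double n) (* 2 n))
  (define result (double (double 5))))

(require 'program)
result
```

Running the file should print a "calling ..." line for each function call made in program before showing 20. Note that program never mentions #%app; the expander adds it to each application form, and the module language decides what it means.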

Exercise 35. Racket’s #%app implements left-to-right evaluation of function-call arguments. Change "noisy-racket.rkt" so that it implements right-to-left evaluation of arguments to a function call. You’ll need to use Racket’s #%app to implement your new #%app.

14.5 Macro-Definition Shorthands

The pattern

(define-syntax (macro-id stx)
  (syntax-parse stx
    [(_ pattern ....) #'template]))

is common enough that it would be nice to have a shorter way of writing it. Fortunately, we’re in a language that’s easy to extend with a shorthand like define-syntax-rule, which lets you write the above form equivalently as

(define-syntax-rule
  (macro-id pattern ....)
  template)

For historical reasons, the allowed pattern forms are restricted in that they cannot include identifiers that have : followed by a syntax-class name, as in x:id. Also, the error messages are worse, so define-syntax-rule is normally used only for temporary or internal extensions.

There’s also an intermediate point, which avoids writing an explicit lambda but allows multiple patterns:

(define-syntax macro-id
  (syntax-rules ()
    [(_ pattern ....) template]))

Finally, you may see syntax-case, which is almost the same as syntax-parse, but it has the pattern-language restrictions of define-syntax-rule and syntax-rules. There’s little reason to use syntax-case over syntax-parse, other than the minor convenience of having it included in racket (again, for historical reasons).
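For comparison, here is one small macro written three ways. The swap macros are an illustrative example rather than something from the notes; only the syntax-parse version can annotate its pattern variables with syntax classes such as id.

```racket
#lang racket
(require (for-syntax syntax/parse))

;; Full form: define-syntax with syntax-parse. The :id annotations
;; let syntax-parse reject non-identifiers with a targeted message.
(define-syntax (swap-a! stx)
  (syntax-parse stx
    [(_ x:id y:id)
     #'(let ([tmp x]) (set! x y) (set! y tmp))]))

;; Shortest form: one pattern, no syntax classes.
(define-syntax-rule (swap-b! x y)
  (let ([tmp x]) (set! x y) (set! y tmp)))

;; Intermediate form: no explicit function, but room for more clauses.
(define-syntax swap-c!
  (syntax-rules ()
    [(_ x y) (let ([tmp x]) (set! x y) (set! y tmp))]))

(define p 1)
(define q 2)
(swap-a! p q)
(swap-b! p q)
(swap-c! p q)
(list p q) ; three swaps in total, so p and q end up exchanged
```

A misuse such as (swap-b! 1 2) is reported by set! in the expansion rather than by swap-b! itself, which is one way the error messages of the shorthands are worse.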

14.6 Aside: Scope and Macro Expansion

In Redex, define-language lets you specify binding structure. The define-syntax-rule form doesn’t include any such specification. And yet...

#lang racket

(define-syntax-rule
  (noisy-begin e ... last-e)
  (begin
   (printf "~s\n" e)
   ...
   (let ([result last-e])
     (printf "~s\n" result)
     result)))

(let ([result 1])
  (noisy-begin
   result
   2))

Racket’s macro system can infer binding structure for macros based on the way that macros ultimately expand. Specifically, the example macro above expands to let, and the expander knows the binding structure of let, so it can effectively infer a binding rule for the example macro. But you know that the define-syntax-rule form is just a shorthand for a compile-time function, which can do arbitrary things... mumble mumble halting problem mumble... so this inference is not as straightforward as, say, type inference. In fact, the inference works dynamically (at compile time). The details are beyond the scope (pun intended) of this summer school, but see these notes if you’re interested.
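A second, smaller sketch makes the consequence easier to see. The macro my-or is a standard illustration and is not taken from the notes. It introduces a temporary binding named tmp, and the use site happens to have its own tmp.

```racket
#lang racket

;; my-or binds tmp internally so that `a` is evaluated only once.
(define-syntax-rule (my-or a b)
  (let ([tmp a])
    (if tmp tmp b)))

;; The tmp written here belongs to the enclosing let, not to the macro.
;; A naive textual substitution would produce
;;   (let ([tmp #f]) (if tmp tmp tmp))
;; and return #f; the expander keeps the two tmps apart instead.
(let ([tmp 5])
  (my-or #f tmp))
```

The result is 5: the macro’s tmp and the user’s tmp are treated as different bindings, even though the macro definition says nothing explicit about binding structure.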

14.7 Interactive Evaluation

When you run a program in DrRacket, you get to interact with the program after it runs. The interactive prompt is sometimes called the top level, because you have access to all the bindings that are at the outer scope of your module, while nested bindings are inaccessible. Interactive evaluation is similar to adding additional definitions and expressions to the end of your program, but it’s not exactly the same, because interactive evaluation cannot generally reflect the same constraints and behaviors of in-module forms; the top level is hopeless.

Since making interactive evaluation sensible with respect to a module’s content depends on the module’s language, a #%top-interaction form is implicitly used for each interaction. A replacement #%top-interaction might disallow definitions, or it might combine an expression’s processing with information (such as types) that is recorded from the module body.

The #%top-interaction form is unusual in that it’s paired with its argument form using ., as opposed to putting #%top-interaction and its argument form in a syntactic list:

#lang racket

(define-syntax-rule
  (#%top-interaction . e)
  '("So, you want to evaluate..." e "?"))

Exercise 36. Make a language module (to be used after #lang s-exp) that is like racket but adjusts #%top-interaction to wrap time around each form to show how long it takes to evaluate.

14.8 #lang and Installed Languages

We mentioned in Controlling the Whole Language that the language named after #lang must have two properties: it must take responsibility for parsing the rest of the characters in the module, and it must be accessible by a name that doesn’t involve quote marks.

To make the module accessible without quote marks, it needs to reside in a directory that is registered with Racket as a collection. More specifically, we normally register the directory as a package, and the default treatment of a package (unless the package says otherwise) is to use its directory as a collection.

Create a directory named "noisy" somewhere on your filesystem. (Make the name "noisy" so that it matches our examples.) Then choose Package Manager... from DrRacket’s File menu, click Browse... near the top left of the resulting window, answer Directory, and pick your "noisy" directory. Finally, click Install.

You can also use the command line by cd-ing to the parent of the "noisy" directory and running
raco pkg install noisy/
Don’t omit the final /, which makes it a directory reference instead of a request to consult the package server.

Now, create a "main.rkt" file in your "noisy" directory. (The name "main.rkt" is special.) Put the content of "noisy-racket.rkt" in "main.rkt".

It still won’t work if you now try

#lang noisy

because we’ve only addressed one of the problems—accessing the module by a name without quotes. We’re now ready to supply the parsing half. Change your "main.rkt" file to add the nested module

(module reader syntax/module-reader
  noisy)

This declaration creates a reader submodule in the "main.rkt" module, and #lang noisy looks for a submodule by that name in the "main.rkt" module of the "noisy" collection.

This reader submodule is implemented using the language syntax/module-reader, which is a language specifically for making module parsers. The #%module-begin form of the syntax/module-reader module looks for a single identifier to be injected as the language of the parsed module; in this case, we use noisy to refer back to the "main.rkt" module of the "noisy" collection, which is back to the enclosing module.
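Putting the pieces together, the complete "main.rkt" might look like the following sketch. It simply combines the content of "noisy-racket.rkt" with the reader submodule, and it assumes that the "noisy" directory has been installed as a package as described above.

```racket
#lang racket
;; "noisy/main.rkt"
(require (for-syntax syntax/parse))

;; Bindings for modules written in the language: everything from
;; racket, except that lambda is the noisy variant.
(provide (rename-out [new-lambda lambda])
         (except-out (all-from-out racket)
                     lambda))

(define-syntax (new-lambda stx)
  (syntax-parse stx
    [(_ (x:id) e:expr)
     #'(lambda (x)
         (printf "arg: ~s\n" x)
         e)]))

;; Parsing for modules written in the language: the default reader,
;; with noisy injected as the language of the parsed module.
(module reader syntax/module-reader
  noisy)
```

The two halves play different roles. The provide form answers "which bindings does the module body see?", while the reader submodule answers "how are the characters after #lang noisy turned into a module?".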

Since the syntax/module-reader language implements a default reader that is the same as the s-exp parser,

#lang noisy
(+ 2 3)

will run and print 5. It happens that

#lang s-exp noisy
(+ 2 3)

would run and print the same way, just using the parser via s-exp instead of the reader submodule.

14.9 #lang and Parsing

If the point of creating and installing "noisy/main.rkt" is that we can use the short reference #lang noisy, then we’re done. If the point is to change parsing, then we need to override the default parser provided by syntax/module-reader.

A parser comes in two flavors: read-syntax and read. The read flavor is essentially legacy, but a parser submodule must provide it, anyway, even if just by using read-syntax and stripping away “syntax” information to get a “datum.” The read flavor takes an input stream, while the read-syntax flavor takes a source-file description (usually a path) plus an input stream.
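The two flavors can be tried directly on a string port. In this sketch the source name "example.rkt" is just a made-up label; read-syntax accepts any value as a source description and records it in the resulting syntax object.

```racket
#lang racket

;; read: takes an input stream and produces a plain datum.
(define datum
  (read (open-input-string "(+ 2 3.1)")))

;; read-syntax: takes a source description plus an input stream and
;; produces a syntax object, which wraps the datum with source
;; information.
(define stx
  (read-syntax "example.rkt" (open-input-string "(+ 2 3.1)")))

datum                 ; a list
(syntax? stx)         ; #t
(syntax-source stx)   ; the source description given above
(syntax->datum stx)   ; strips the syntax information again
```

The last line is the same trick that a reader submodule can use to derive its read from its read-syntax.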

Instead of writing a parser from scratch, which can be tedious, let’s use the built-in read-syntax and just configure it to read decimal numbers as exact rationals instead of inexact floating-point numbers:

(module reader syntax/module-reader
  noisy
  #:read-syntax my-read-syntax
  #:read (lambda (in)
           (syntax->datum (my-read-syntax #f in)))

  (define (my-read-syntax src in)
    (parameterize ([read-decimal-as-inexact #f])
      (read-syntax src in))))

With that change,

#lang noisy
(+ 2 3.1)

will show an exact result instead of a floating-point approximation.
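The effect of read-decimal-as-inexact can also be checked in isolation, without installing a language. The helper read-from-string and its #:exact keyword below are hypothetical names for this sketch.

```racket
#lang racket

;; Reads one datum from str, with decimal notation read as exact
;; rationals when as-exact? is true.
(define (read-from-string str #:exact as-exact?)
  (parameterize ([read-decimal-as-inexact (not as-exact?)])
    (read (open-input-string str))))

(read-from-string "3.1" #:exact #f)        ; an inexact floating-point number
(read-from-string "3.1" #:exact #t)        ; the exact rational 31/10
(+ 2 (read-from-string "3.1" #:exact #t))  ; exact arithmetic: 51/10
```

Because the parameter affects only how characters are turned into numbers, the arithmetic itself is unchanged; + simply receives an exact argument.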

Exercise 37. Some users of #lang noisy may miss DOS-style comments using REM. Adjust the reader so that it detects and discards an REM result, discarding the rest of the line as well, and then tries reading again. Use syntax? to detect a non-EOF result from read-syntax, and use read-line to consume (the rest of) a line from an input stream.

14.10 Extended Example

See QL.

14.11 Resources

If you want to construct languages, take a look at Matthew Butterick’s book on building languages, Beautiful Racket.

Matthew Butterick and Alex Knauth constructed a "meta-language"—like s-exp and at-exp—for debugging.

If you would like to read some papers on constructing DSLs, consider:

Dan Feltey et al. describe how to re-create a mini version of Java, including an IDE in the Racket world
Vincent St-Amour et al. invent and implement a language for describing Lindenmayer fractals, a paper with lots of amazing pictures, some code, and even less text
Leif Andersen et al. illustrate the language-oriented programming idea with a small, yet reasonably complex example involving eight embedded DSLs
