3 Parsing Syntax, Syntax Classes

7.0.0.6

3 Parsing Syntax, Syntax Classes

Goals

— syntax-parse’s pattern language and directives

— syntax classes

3.1 More on Parsing Syntax

It is time to get to know some more of the language ofsyntax-parse, which has much to offer to the developers of compile-time functions. To motivate this idea a bit, we will work through a concrete problem and two natural solutions, both of which come with syntactic patterns.

Problem MP3 files (v3.1) ends in a 128-bytes description of some basic information: a title (30 bytes), an artist (also 30), an album (30), and a year (five). It all starts with "tag":
  0.2|3...32|31...62|63...92|93...97|...
  tag|title |artist |album  |year   |...
The actual tag bytes contain more information but this is good enough.
Design a function that extracts the title, artist, and album as strings and the year as a (natural) number if possible (#f otherwise). A module developer would turn this informal description into the following provide specification:
(provide
(contract-out
  (mp3-tags.v0
   ; given "MP3 bytes", extract the title, artist, album, and year
   (-> (and/c bytes? (compose (>=/c TAGS) bytes-length))
       (values string? string? string? (or/c #f number?))))))

(define TAGS 128)

The essential byte-manipulation function is this:

; Bytes N N -> String
; extract specified bytes and trim UTF-8 string from right
(define (extract tags low high)
  (define s       (bytes->string/utf-8 (subbytes tags low high)))
  (define s-nulls (string-trim s #rx"\u0000+" #:left? #f))
  (define s-space (string-trim s-nulls #rx" +" #:left? #f))
  s-space)

Here we focus on the sequencing of the byte extraction and ignore these details.

For the first solution we simply write down the relevant numbers:

(define (mp3-tags.v0 file:bytes)
  (define tags      (subbytes file:bytes (- (bytes-length file:bytes) 128)))
  (define the-tag   (extract tags  0  2))
  (define title     (extract tags  3 32))
  (define artist    (extract tags 33 62))
  (define album     (extract tags 63 92))
  (define year:str  (extract tags 93 98))
  (define year:num  (string->number year:str))
  (values title artist album year:num))

This solution is naive. It contains way too many magic constants, but the real harm is that it does not convey how these numbers come about. There is a pattern there, and the written code does not convey it. If anybody were to ever change the MP3 format—now who do that?—the maintainer of this code would have to spend quite some time on figuring out what is going on here.

For the second solution, we create a function that extracts a hash table:

(define (mp3-tags.v1 file:bytes)
  (define tags (subbytes file:bytes (- (bytes-length file:bytes) 128)))
  (define hmap
    (split-rt tags 3 128
      '[title 30] '[artist 30] '[album 30] `[year 5 ,string->number]))
  (values
    (=> hmap 'title) (=> hmap 'artist) (=> hmap 'album) (=> hmap 'year)))

; Bytes N [Listof (U [List X N] [List X N [String -> Y]])]
; ->
; [Hashof X (U String Y)]
(define (split-rt tags base end . assoc)
  (define-values (results _)
    (for/fold ([results '()] [base base]) ([a assoc])
      (match-define (list* name step convert+) a)
      ; ———————————
      ; the dynamic error checking
      (define up2 (+ base step -1))
      (unless (< step end)
        (error 'extract "index out of range"))
      ; ———————————
      (define raw (extract tags base up2))
      (define res (if (empty? convert+) raw ((first convert+) raw)))
      (values (cons (cons name res) ress) (+ base step))))
  (make-hash results))

(define => hash-ref)

This is really the point of this exercise, to spot syntactic patterns where others would say “this is the best we can do.” This is the best a Haskellian can do. The good news is that the extract function consumes an actual description of the layout of byte string and an optional function for converting the year string to a number. We may actually wish to specify other conversions, so this is really good. The bad news is that this approach turns one syntactic patterns (the numbers) into one of list, quote (in Haskell, strings) and lists of specific length—a protocol that extract enforces at run-time. It will also check at run time that the layout doesn’t specify too many bytes—which is also something we know when we write down the code.

Clearly, a Racketeer wants to write this function like this:

(define (mp3-tags.v2 file:bytes)
(define tags (subbytes file:bytes (- (bytes-length file:bytes) 128)))
(split-ct tags 3 128 [title 30] [artist 30] [album 30] [year 5 string->number]))

The clauses of split-ct look similar to those of the extract function from the second solution, except that all pattern elements are stripped: the pieces simply combine a name with the number of bytes that represent the desired information.

So let’s develop split-ct. Here is a first draft:

; SYNTAX
; (split-ct tags start end [name:id step (~optional convert)] ...)
; computes the values of the fields name... by successively extracting
; bytes from tags, beginning at start to maximally end
(define-syntax (split-ct stx)
  (syntax-parse stx
    [(_ tags start end [name:id step (~optional convert)] ...)
     #`(let ([i start])
         (let*-values ([(i name) (values (+ i step) (extract tags i (+ i step -1)))]
                       ...)
           (values ((~? convert values) name) ...)))]))

The essence is sequential let*-values expression that keeps track of how far the indexing has gone (i) and uses name ... to name the values extracted from tags. Two features stand out:

The first is (~optional convert), a new pattern element. Not surprisingly this pattern element matches its input only in some cases, namely, when a field comes with the specification of a conversion function.
The language of patterns comes with many such keywords. Indeed, x:id is really short for (~var x id), which says bind the corresponding piece in the input to x but only if it is an identifier; otherwise, the match fails.
The second is a template pattern, namely, (~? convert values). Since the pattern variable inside of an ~optional might not have a value, we need a way of testing this during the code generation phase. One easy way is (~? p1 p2), which returns the first pattern that is not missing any values from optional pattern variables.

Since we want split-ct to express layouts clearly, we may also wish to say that some of its pieces must represent literal constants. To start with, tags is not one of those because we might wish to place an entire expression there or an identifier or an actual byte string. But start, name, step, and end are obviously candidates for literal constants—except that the language of syntax-parse does not provide a classifier, also known as syntax classes.

Developers may define new syntax classes:

(begin-for-syntax
(define-syntax-class byte
(pattern b:nat #:fail-unless (< (syntax-e #'b) 256) "not a byte")))

The trick is to define them at the correct phase. The classifier on a pattern variable is a syntactic element that must exist when a compile-time function refers to it. To achieve this, the syntax class must be defined at the compiile-time of compile time—and that is achieved with begin-for-syntax.

Note A syntax-class can implement many more things than basic checks. For a quick look, read the introduction to syntax-parse.

Equipped with the new classifier, we can easily refine the pattern of split-ct to force developers into specifying a literal layout:

(define-syntax (split-ct stx)
  (syntax-parse stx
    [(_ tags start:byte end:byte [name step:byte (~optional convert)] ...)
     ***]))

The final improvement concerns the range checking that split-rt performs at run time. As promised, split-ct can do so at compile time. In principle, this is also a task for the pattern language but it is simpler to do so with directives, specifically two of them:

#:do allows the addition of definitions, which are then visible in the rest of the clause.
#:fail-unless is for checking a condition, and if it fails, for aborting the compilation process with a syntax error.

Again, there is more to directives than we can explain here; it is worth perusing the sub-sub-language of directives and patterns to find out more.

That said, here is the version of split-ct that statically checks the indexing:

(define-syntax (split-ct stx)
  (syntax-parse stx
    [(_ tags start:integer end:byte [name step:byte (~optional convert)] ...)
     ; ———————————
     ; the static error checking
     #:do [(define end-int  (syntax-e #'end))
           (define step-int (sum #'(step ...)))]
     #:fail-unless (< step-int end-int) "index out of range"
     ; ———————————
     ***]))

The #:do defines two variables here: end-int, which is the numeric value of the given compile-time constant, and step-int, which is the sum of the steps specified in the clauses. While we could extract the constants from the syntax trees and sum them up in-line, it is better to define an auxiliary function:

(begin-for-syntax
  ; [Listof [Syntax Number]] -> Number
  ; compute the sum of the numbers hidden in syntax
  (define (sum list-of-syntax-numbers)
    (apply + (map syntax-e (syntax->list list-of-syntax-numbers)))))

Since split-ct refers to this function during compile time, so we have to define it at the appropriate phase, just like the syntax class above. Because this situation is so common, we do have a syntactic abbreviation for this of course:The for/sum variant is modern Racket and probably more readable (for newcomers) than the combinator-based variant.

; [Listof [Syntax Number]] -> Number
; compute the sum of the numbers hidden in syntax
(define-for-syntax (sum list-of-syntax-numbers)
(for/sum ([step (syntax->list list-of-syntax-numbers)])
(syntax-e step)))

3.2 Communicating Macros

Tomorrow we will see how to make languages by adding features, removing features, and re-interpreting features from an existing languages. Specifically, we will turn this fragment into a "real" Racket language, and on Wednesday we will add types. On Thursday we will decorate it with sprinkles.

Building languages with Racket means primarily re-using elements from the host language, adding new new linguistic constructs via macros, and communicating between these distinct facilities. This incremental approach makes Racketeers much more productive language developers than people who must create complete languages from scratch.

Thus far, we have focused on adding individual constructs, but when it comes to building (sub-)languages, we often need several linguistic extensions and those need to communicate. For concreteness, let’s look at a (small aspect) of the problem of creating the teaching languages (with which we run courses for beginners).

Problem Suppose we need a first-order, (non-recursive) function language. The language should come with numneric and symbolic data, primitive functions on those, a decision construct, first-order function definitions, and first-order function application. While the language is untyped, users would still welcome some minimal checks, for example, arity checks.
Clearly Racket provides most of these features, though its function definition and function application constructs are too unconstrained. So, we re-use what we can and use distinct keywords to indicate which constructs must be added:
Definition = (define-function (Variable Variable1 ...) Expression)

Expression = (function-application Variable Expression ...)
| (if Expression Expression Expression)
| (+ Expression Expression)
| Variable
| Number
| String
(This set-up is a simplified to fit the needs of this summer school. The teaching languages actually introduce recursive functions and many more primitive operations.)

For this concrete problem, function definition and function application must communicate and compare the number of parameters and the number of arguments. The former (parameters) becomes known when a function is defined, the latter (arguments) when the function is applied. In terms of re-using Racket, define-function must expand into more than a regular define, because this checking is supposed to happen at compile time. It turns out that this just means the generated definition must be like a macro, in that it uses define-syntax to bind a name to a compile-time value:

(define-function (f:id x:id ...) body:expr)
; == rewrites to ==>
(define-syntax ***)

Insight 1 Macros can generate more compile-time definitions.

Furthermore, the definition must record the number of parameters as well as the function itself. Pairing two pieces of information sounds like a task for cons. Together with the idea of generating a define-syntax form, we can see the essential idea now:

(define-function (f:id x:id ...) body:expr)
; == rewrites to ==>
(define-syntax f (cons arity function))

Insight 2 A define-syntax form can associate identifiers with other values than functions. We don’t call those kinds of definitions “macros,” because the compile-time value is not a function and the name cannot be “applied” like a macro.

Let’s try this out:

; ;; SYNTAX
; ;; (define-function (f x ...) e)
; ;; binds f to a syntax tranformer of shape (cons n s)
; ;; where n is the arity |x ...| of f
; ;; and s is syntax for (λ (x ...) e)

(define-syntax (define-function stx)
  (syntax-parse stx
    [(_ (f:id parameter:id ...) body:expr)
     (define arity (length (syntax->list #'(parameter ...))))
     #`(define-syntax f (cons #,arity #'(lambda (parameter ...) body)))]))

The arity is computed in a straightforward way, by determining how long the list of parameters is. Using lambda to denote the function is permissible, as long as the rest of the program cannot use it as a higher-order function.

The basic idea for function-app is quite obvious, too:

; ;; SYNTAX
; ;; (function-app f e1 ... eN)
; ;; applies f to the values of e1 ... IF f is defined and f's arity is N

(define-syntax (function-app stx)
  (syntax-parse stx
    [(_ f:id arg:expr ...)
     (define n-args (length (syntax->list #'(arg ...))))
     (define-values (arity the-function) (lookup #'f stx))
     (cond
       [(= arity n-args) #`(#,the-function arg ...)]
       [else
        (define msg (format "wrong number of arguments for ~a" (syntax-e #'f)))
        (raise-syntax-error #f msg stx)])]))

This compile-time function computes the number of arguments, retrieves the information about the function name, and then proceeds to generate code or signal a compile-time error. The retrieval functionality is delegated to an auxiliary function—lookup—which consumes the name of the function and the entire syntax tree (for error reporting).

Communication between compile-time functions can take several different forms:

via syntax-local-value, which allows a compile-time function to retrieve the current value of a define-syntax binding;
via syntax-property, which is about decorating code with properties and retrieving those properties when needed;
and some protocols between modules.

For communicating between define-function and function-app, we use the first facility in the lookup function. Roughly speaking, syntax-local-value consumes an identifier (syntax) and an optional failure thunk th. If it finds a binding of the given identifier with a compile-time value, it returns its value; otherwise it runs (th).

For lookup, we give syntax-local-value a thunk that aborts the computation to signal that the function is not defined:

; Identifier Syntax -> (values N Id)
; EFFECT raises an exception if id is not available
(define-for-syntax (lookup id stx)
  ; -> Empty
  ; EFFECT abort process with syntax error
  (define (failure)
    (define msg (format "undefined function: ~a" (syntax-e id)))
    (raise-syntax-error #f msg stx))
  (define result (syntax-local-value id failure))
  (values (car result) (cdr result)))

If the given identifier is bound to a compile-time value, we know from above that it is a pair of data: the number of parameters and the function itself. Taking apart this pair yields the desired result for lookup.

Now you may wonder how we know that the result is always a pair. The answer is that this is a protocol that someone could accidentally break, especially a student. Hence we encapsulate such protocols within a module and export only the key functionality. But this is an idea for tomorrow.

About define-for-syntax versus define-syntax: You may wonder about the difference between define-for-syntax and define-syntax, since they both bind an identifier to a compile-time value.
The define-syntax forms binds an identifier that is meant to be used in run-time expressions, which is why we use it to bind macros. It’s a kind of bridge between run time and compile time. That’s also why we generate a define-syntax form from define-function, since the defined function is meant to be used in a run-time function-application expression.
In contrast, the define-for-syntax form binds an identifier that can be used directly in compile-time code. For example, use define-for-syntax for a helper function that is meant to be called from a macro’s implementation, since a macro is a compile-time function. A run-time expression can’t refer at all to an identifier that is defined with define-for-syntax.
The syntax-local-value function won’t find a value for an identifier that is defined with define-for-syntax, because syntax-local-value is meant to be used on a part of a syntax tree for a run-time expression. That is, it’s meant to be used on bridge identifiers.

Now, it’s off to Lab ... and Yet More Syntax to practice some more.

← prev up next →

1	Racket and Language-oriented Programming
2	Lab Syntax, More Syntax, ...
3	Parsing Syntax, Syntax Classes
4	Lab ... and Yet More Syntax
5	Building a Language
6	Lab My First Language
7	Completing the Language
8	Lab My First Real Language
9	Lexing and Parsing
10	Lab Parsing Modules
11	Building an Ugly Language
12	Lab My First Ugly Language
13	Types and Type Checking
14	Lab My First Typed Language
15	Building a Typed Language with Macros and Turnstile
16	Lab My Second Typed Language
17	Language Gems I
18
19	Language Gems II
20	Good Bye

Definition	=	(define-function (Variable Variable1 ...) Expression)

Expression	=	(function-application Variable Expression ...)
	\|	(if Expression Expression Expression)
	\|	(+ Expression Expression)
	\|	Variable
	\|	Number
	\|	String