3 Parsing Syntax, Syntax Classes
Goals |
— |
— |
— |
3.1 More on Parsing Syntax
It is time to get to know some more of the language ofsyntax-parse, which has much to offer to the developers of compile-time functions. To motivate this idea a bit, we will work through a concrete problem and two natural solutions, both of which come with syntactic patterns.
Problem MP3 files (v3.1) ends in a 128-bytes description of some basic information: a title (30 bytes), an artist (also 30), an album (30), and a year (five). It all starts with "tag":
0.2|3...32|31...62|63...92|93...97|...
tag|title |artist |album |year |...
The actual tag bytes contain more information but this is good enough.Design a function that extracts the title, artist, and album as strings and the year as a (natural) number if possible (#f otherwise). A module developer would turn this informal description into the following provide specification:
; Bytes N N -> String ; extract specified bytes and trim UTF-8 string from right (define (extract tags low high) (define s (bytes->string/utf-8 (subbytes tags low high))) (define s-nulls (string-trim s #rx"\u0000+" #:left? #f)) (define s-space (string-trim s-nulls #rx" +" #:left? #f)) s-space)
(define (mp3-tags.v0 file:bytes) (define tags (subbytes file:bytes (- (bytes-length file:bytes) 128))) (define the-tag (extract tags 0 2)) (define title (extract tags 3 32)) (define artist (extract tags 33 62)) (define album (extract tags 63 92)) (define year:str (extract tags 93 98)) (define year:num (string->number year:str)) (values title artist album year:num))
(define (mp3-tags.v1 file:bytes) (define tags (subbytes file:bytes (- (bytes-length file:bytes) 128))) (define hmap (split-rt tags 3 128 '[title 30] '[artist 30] '[album 30] `[year 5 ,string->number])) (values (=> hmap 'title) (=> hmap 'artist) (=> hmap 'album) (=> hmap 'year))) ; Bytes N [Listof (U [List X N] [List X N [String -> Y]])] ; -> ; [Hashof X (U String Y)] (define (split-rt tags base end . assoc) (define-values (results _) (for/fold ([results '()] [base base]) ([a assoc]) (match-define (list* name step convert+) a) ; — — — — — — — — — — — ; the dynamic error checking (define up2 (+ base step -1)) (unless (< step end) (error 'extract "index out of range")) ; — — — — — — — — — — — (define raw (extract tags base up2)) (define res (if (empty? convert+) raw ((first convert+) raw))) (values (cons (cons name res) ress) (+ base step)))) (make-hash results)) (define => hash-ref)
(define (mp3-tags.v2 file:bytes) (define tags (subbytes file:bytes (- (bytes-length file:bytes) 128))) (split-ct tags 3 128 [title 30] [artist 30] [album 30] [year 5 string->number]))
; SYNTAX ; (split-ct tags start end [name:id step (~optional convert)] ...) ; computes the values of the fields name... by successively extracting ; bytes from tags, beginning at start to maximally end (define-syntax (split-ct stx) (syntax-parse stx [(_ tags start end [name:id step (~optional convert)] ...) #`(let ([i start]) (let*-values ([(i name) (values (+ i step) (extract tags i (+ i step -1)))] ...) (values ((~? convert values) name) ...)))]))
The first is (~optional convert), a new pattern element. Not surprisingly this pattern element matches its input only in some cases, namely, when a field comes with the specification of a conversion function.
The language of patterns comes with many such keywords. Indeed, x:id is really short for (~var x id), which says bind the corresponding piece in the input to x but only if it is an identifier; otherwise, the match fails.
The second is a template pattern, namely, (~? convert values). Since the pattern variable inside of an ~optional might not have a value, we need a way of testing this during the code generation phase. One easy way is (~? p1 p2), which returns the first pattern that is not missing any values from optional pattern variables.
Since we want split-ct to express layouts clearly, we may also
wish to say that some of its pieces must represent literal constants. To
start with, tags is not one of those because we might wish to
place an entire expression there or an identifier or an actual byte
string. But start, name, step, and end
are obviously candidates for literal constants—
(begin-for-syntax (define-syntax-class byte (pattern b:nat #:fail-unless (< (syntax-e #'b) 256) "not a byte")))
Note A syntax-class can implement many more things than basic checks. For a quick look, read the introduction to syntax-parse.
(define-syntax (split-ct stx) (syntax-parse stx [(_ tags start:byte end:byte [name step:byte (~optional convert)] ...) ***]))
#:do allows the addition of definitions, which are then visible in the rest of the clause.
#:fail-unless is for checking a condition, and if it fails, for aborting the compilation process with a syntax error.
(define-syntax (split-ct stx) (syntax-parse stx [(_ tags start:integer end:byte [name step:byte (~optional convert)] ...) ; — — — — — — — — — — — ; the static error checking #:do [(define end-int (syntax-e #'end)) (define step-int (sum #'(step ...)))] #:fail-unless (< step-int end-int) "index out of range" ; — — — — — — — — — — — ***]))
(begin-for-syntax ; [Listof [Syntax Number]] -> Number ; compute the sum of the numbers hidden in syntax (define (sum list-of-syntax-numbers) (apply + (map syntax-e (syntax->list list-of-syntax-numbers)))))
; [Listof [Syntax Number]] -> Number ; compute the sum of the numbers hidden in syntax (define-for-syntax (sum list-of-syntax-numbers) (for/sum ([step (syntax->list list-of-syntax-numbers)]) (syntax-e step)))
3.2 Communicating Macros
Tomorrow we will see how to make languages by adding features, removing features, and re-interpreting features from an existing languages. Specifically, we will turn this fragment into a "real" Racket language, and on Wednesday we will add types. On Thursday we will decorate it with sprinkles.
Building languages with Racket means primarily re-using elements from the host language, adding new new linguistic constructs via macros, and communicating between these distinct facilities. This incremental approach makes Racketeers much more productive language developers than people who must create complete languages from scratch.
Thus far, we have focused on adding individual constructs, but when it comes to building (sub-)languages, we often need several linguistic extensions and those need to communicate. For concreteness, let’s look at a (small aspect) of the problem of creating the teaching languages (with which we run courses for beginners).
Problem Suppose we need a first-order, (non-recursive) function language. The language should come with numneric and symbolic data, primitive functions on those, a decision construct, first-order function definitions, and first-order function application. While the language is untyped, users would still welcome some minimal checks, for example, arity checks.
Clearly Racket provides most of these features, though its function definition and function application constructs are too unconstrained. So, we re-use what we can and use distinct keywords to indicate which constructs must be added:
Definition = (define-function (Variable Variable1 ...) Expression) Expression = (function-application Variable Expression ...) | (if Expression Expression Expression) | (+ Expression Expression) | Variable | Number | String (This set-up is a simplified to fit the needs of this summer school. The teaching languages actually introduce recursive functions and many more primitive operations.)
(define-function (f:id x:id ...) body:expr) ; == rewrites to ==> (define-syntax ***)
Insight 1 Macros can generate more compile-time definitions.
(define-function (f:id x:id ...) body:expr) ; == rewrites to ==> (define-syntax f (cons arity function))
Insight 2 A define-syntax form can associate identifiers with other values than functions. We don’t call those kinds of definitions “macros,” because the compile-time value is not a function and the name cannot be “applied” like a macro.
; ;; SYNTAX ; ;; (define-function (f x ...) e) ; ;; binds f to a syntax tranformer of shape (cons n s) ; ;; where n is the arity |x ...| of f ; ;; and s is syntax for (λ (x ...) e) (define-syntax (define-function stx) (syntax-parse stx [(_ (f:id parameter:id ...) body:expr) (define arity (length (syntax->list #'(parameter ...)))) #`(define-syntax f (cons #,arity #'(lambda (parameter ...) body)))]))
; ;; SYNTAX ; ;; (function-app f e1 ... eN) ; ;; applies f to the values of e1 ... IF f is defined and f's arity is N (define-syntax (function-app stx) (syntax-parse stx [(_ f:id arg:expr ...) (define n-args (length (syntax->list #'(arg ...)))) (define-values (arity the-function) (lookup #'f stx)) (cond [(= arity n-args) #`(#,the-function arg ...)] [else (define msg (format "wrong number of arguments for ~a" (syntax-e #'f))) (raise-syntax-error #f msg stx)])]))
via syntax-local-value, which allows a compile-time function to retrieve the current value of a define-syntax binding;
via syntax-property, which is about decorating code with properties and retrieving those properties when needed;
and some protocols between modules.
; Identifier Syntax -> (values N Id) ; EFFECT raises an exception if id is not available (define-for-syntax (lookup id stx) ; -> Empty ; EFFECT abort process with syntax error (define (failure) (define msg (format "undefined function: ~a" (syntax-e id))) (raise-syntax-error #f msg stx)) (define result (syntax-local-value id failure)) (values (car result) (cdr result)))
Now you may wonder how we know that the result is always a pair. The answer is that this is a protocol that someone could accidentally break, especially a student. Hence we encapsulate such protocols within a module and export only the key functionality. But this is an idea for tomorrow.
About define-for-syntax versus define-syntax: You may wonder about the difference between define-for-syntax and define-syntax, since they both bind an identifier to a compile-time value.
The define-syntax forms binds an identifier that is meant to be used in run-time expressions, which is why we use it to bind macros. It’s a kind of bridge between run time and compile time. That’s also why we generate a define-syntax form from define-function, since the defined function is meant to be used in a run-time function-application expression.
In contrast, the define-for-syntax form binds an identifier that can be used directly in compile-time code. For example, use define-for-syntax for a helper function that is meant to be called from a macro’s implementation, since a macro is a compile-time function. A run-time expression can’t refer at all to an identifier that is defined with define-for-syntax.
The syntax-local-value function won’t find a value for an identifier that is defined with define-for-syntax, because syntax-local-value is meant to be used on a part of a syntax tree for a run-time expression. That is, it’s meant to be used on bridge identifiers.
Now, it’s off to Lab ... and Yet More Syntax to practice some more.