Chapter 5. Working on Grammar

This is a chapter from Creating a compiler with Raku.

In this chapter, we’ll review the grammar that was created so far and will try to make some changes to make the grammar and the actions more compact, more readable and more user-friendly. The bigger the language becomes, the more important it is to keep its code maintainable.

Executable

The first thing to do in this chapter is to make the current interpreter (on its way to becoming a compiler) a proper executable program, so that we can easily call it from the command line and pass it the name of a file containing a Lingua program:

./lingua my-prog.lng

The lingua executable checks whether a filename was passed and whether the file exists. Then, it parses and executes the input program. Here is the whole code:

#!/usr/bin/env raku

use lib '.';
use Lingua;
use LinguaActions;

error('Usage: ./lingua <filename>') unless @*ARGS.elems;

my $filename = @*ARGS[0];
error("Error: File $filename not found") unless $filename.IO.f;

my $code = $filename.IO.slurp();
my $result = Lingua.parse($code, :actions(LinguaActions));
say $result ?? 'OK' !! error('Error: parse failed');

sub error($message) {
    note $message;
    exit 1; # a non-zero exit code tells the shell that the run failed
}

The error function prints an error message and terminates the program. It uses the note built-in function, which behaves like say but sends the output to the standard error stream. The die routine is not used here because it also prints information about the location of the error, which is not really needed in this case. Suppressing the extra output of die takes roughly as many lines as introducing a new function.

Composing and inheriting grammars

In the Lingua language, we allow one-line comments starting with the # character, and inline and multi-line comments between /* and */. Such comments are used in other programming languages too, so it may be useful to extract the rules for the comments from the language grammar and put them in a separate class. This also makes the main language grammar smaller and more transparent.

Let’s recap the fragments of the existing Lingua grammar that handle comments:

grammar Lingua {
    rule TOP {
        [
            | <comment>
            | <statement> ';'
        ]*
    }

    rule comment {
        '#' \N*
    }

    regex ws {
        <!ww> [
            | \s*
            | \s* '/*' \s* .*? \s* '*/' \s*
        ]
    }

    . . .
}

Most of these can go to a separate grammar class. It is also wise to give the rules more verbose names to clearly distinguish between the two types of comments.

grammar CommentableLanguage {
    regex ws {
        <!ww> [
            | \s*
            | \s* <inline-comment> \s*
        ]
    }

    regex inline-comment {
        '/*' \s* .*? \s* '*/'
    }

    rule one-line-comment {
        '#' \N*
    }
}

The CommentableLanguage grammar only knows about what to do with comments, but as it now resides in a separate class, it can be the base for another language definition. In our case, Lingua can be derived from it:

use CommentableLanguage;

grammar Lingua is CommentableLanguage {
    . . .
}

The use statement is required if you placed the CommentableLanguage class in a separate file.

In Lingua, the only change required now is using a proper name for the one-line comment in the main code:

rule TOP {
    [
        | <one-line-comment>
        | <statement> ';'
    ]*
}

All the rest is done automatically. For instance, the default ws regex from Raku’s Grammar class is now replaced with ws from CommentableLanguage.
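The inheritance can be tried in isolation before touching Lingua. The following is a minimal sketch: CommentableLanguage is copied from the listing above, while the Tiny grammar and its word token are hypothetical names introduced only for this illustration.

```raku
grammar CommentableLanguage {
    regex ws {
        <!ww> [
            | \s*
            | \s* <inline-comment> \s*
        ]
    }

    regex inline-comment {
        '/*' \s* .*? \s* '*/'
    }

    rule one-line-comment {
        '#' \N*
    }
}

# Hypothetical toy language: words separated by semicolons.
# It inherits both kinds of comments from the base grammar.
grammar Tiny is CommentableLanguage {
    rule TOP {
        [
            | <one-line-comment>
            | <word> ';'
        ]*
    }

    token word {
        \w+
    }
}

my $with-comments = so Tiny.parse("# a comment\nfoo /* inline */ ;\nbar;\n");
my $unterminated  = so Tiny.parse("foo /* never closed ;\n");

say 'Program with comments parsed: ', $with-comments;
say 'Unterminated comment parsed: ', $unterminated;
```

Tiny does not define ws itself, so the whitespace between its tokens is matched by the inherited regex, which is what lets an inline comment appear there.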

We can make another simplification of the main grammar by extracting the part which is responsible for parsing numbers. As with comments, that part can be placed in a separate class too. In this case, though, it is better to make it a role and also save it in a separate file.

role Number {
    token number {
        <sign>? [
            | <integer>
            | <floating-point>
        ] <exponent>?
    }

    token sign {
        <[+-]>
    }

    token exp {
        <[eE]>
    }

    token integer {
        \d+
    }

    token floating-point {
        <integer>? ['.' <integer>]
    }

    token exponent {
        <exp> <sign>? <integer>
    }
}

Later, if needed, you can easily modify the Number role to allow other types of numbers in the program. To append it to the Lingua grammar, use the does keyword:

use CommentableLanguage;
use Number;

grammar Lingua is CommentableLanguage does Number {
    . . .
}
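To see what the role accepts on its own, it can be composed into a throwaway grammar. This is only a sketch: the Number role is copied from the listing above, and the NumberOnly grammar is a hypothetical wrapper that is not part of Lingua.

```raku
role Number {
    token number {
        <sign>? [
            | <integer>
            | <floating-point>
        ] <exponent>?
    }

    token sign {
        <[+-]>
    }

    token exp {
        <[eE]>
    }

    token integer {
        \d+
    }

    token floating-point {
        <integer>? ['.' <integer>]
    }

    token exponent {
        <exp> <sign>? <integer>
    }
}

# Hypothetical wrapper grammar used only to exercise the role.
grammar NumberOnly does Number {
    token TOP {
        <number>
    }
}

# Report which of the sample strings are accepted as numbers.
for '42', '-3.14', '.5', '1e10', '2.5E-3', 'abc' -> $text {
    say $text, ' => ', so NumberOnly.parse($text);
}
```

Only the last sample, abc, is expected to be rejected; the others follow the integer, floating-point, and exponent forms described by the role.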

Reviewing the calculator

The part of the grammar that came from the calculator includes a few parts which resemble each other.

rule expression {
    <term>* %% <op1>
}

rule term {
    <factor>* %% <op2>
}

rule factor {
    <value>* %% <op3>
}

But first, let us think about the quantifiers used there. The star allows any number of repetitions of term, factor, or value, including zero. What if a program contains none, as in the fragment shown below?

my x;
x = ;
say x;

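This is obviously wrong, but parsing with the Lingua grammar does not simply return Nil. Because the star also accepts an empty expression, the actions are called with nothing to compute, and the program dies with a messy error message from Raku: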

Cannot shift from an empty Array
  in sub process at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 52
  in method factor at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 46
  in regex factor at /Users/ash/lingua/Lingua.rakumod (Lingua) line 48
  in regex term at /Users/ash/lingua/Lingua.rakumod (Lingua) line 44
  in regex expression at /Users/ash/lingua/Lingua.rakumod (Lingua) line 40
  in regex assignment at /Users/ash/lingua/Lingua.rakumod (Lingua) line 23
  in regex statement at /Users/ash/lingua/Lingua.rakumod (Lingua) line 13
  in regex TOP at /Users/ash/lingua/Lingua.rakumod (Lingua) line 6
  in block <unit> at ./lingua line 13

Actually thrown at:
  in method function-call at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 13
  in regex function-call at /Users/ash/lingua/Lingua.rakumod (Lingua) line 27
  in regex statement at /Users/ash/lingua/Lingua.rakumod (Lingua) line 13
  in regex TOP at /Users/ash/lingua/Lingua.rakumod (Lingua) line 6
  in block <unit> at ./lingua line 13

That’s not what a user wants to see. The compiler broke instead of generating an error message. We have to change the grammar and demand at least one value at the place where an expression is expected. The simplest modification is to replace * with +:

rule expression {
    <term>+ %% <op1>
}

rule term {
    <factor>+ %% <op2>
}

rule factor {
    <value>+ %% <op3>
}

Now, we control the error message ourselves:

Error: parse failed
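The effect of the quantifier can also be checked in isolation. The following is a minimal sketch with two hypothetical grammars, Star and Plus, which are not part of Lingua and differ only in the quantifier.

```raku
# Hypothetical grammars for illustration only.
grammar Star {
    rule TOP {
        <value>* %% '+'
    }

    token value {
        \d+
    }
}

grammar Plus {
    rule TOP {
        <value>+ %% '+'
    }

    token value {
        \d+
    }
}

# The same empty input is given to both grammars.
my $star-empty = so Star.parse('');
my $plus-empty = so Plus.parse('');
my $plus-sum   = so Plus.parse('1 + 2');

say 'Star accepts an empty input: ', $star-empty;
say 'Plus accepts an empty input: ', $plus-empty;
say 'Plus accepts 1 + 2: ', $plus-sum;
```

With the star, the empty input is a valid match, so the actions have to cope with an empty list of operands. With the plus, the parse itself fails and the caller can report the error.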

Using multi-rules

The three rules, expression, term, and factor, all share the same pattern: an item that repeats one or more times with an operator in between. We can unify them by using multi-methods, which Raku offers for classes (and thus for grammars). Instead of three different tokens, op1, op2, and op3, let us create a single name with three candidates that are distinguished by the value of an integer argument.

multi token op(1) {
    '+' | '-'
}

multi token op(2) {
    '*' | '/'
}

multi token op(3) {
    '**'
}

The values 1 to 3 are not important for the grammar itself; for us, they indicate the operator’s precedence: the bigger the number, the higher the precedence.

We also have to update the above-mentioned rules using these operator tokens:

rule expression {
    <term>+ %% <op(1)>
}

rule term {
    <factor>+ %% <op(2)>
}

rule factor {
    <value>+ %% <op(3)>
}

In the actions, the argument value is not visible, and all the operator captures are available under the plain name op:

method expression($/) {
    $/.make(process($<term>, $<op>));
}

method term($/) {
    $/.make(process($<factor>, $<op>));
}

method factor($/) {
    $/.make(process($<value>, $<op>));
}

The action methods are clearly almost identical, so we can reduce the code further. But first, run a test program to confirm that the first part of the transformation works.

Let’s continue and collapse the three rules and the three methods into a single rule and its corresponding generic method, again using multi-methods.

rule expression {
    <expr(1)>
}

multi rule expr(1) {
    <expr(2)>+ %% <op(1)>
}

multi rule expr(2) {
    <expr(3)>+ %% <op(2)>
}

multi rule expr(3) {
    <expr(4)>+ %% <op(3)>
}

multi rule expr(4) {
    | <number>
    | <variable-name>
    | '(' <expression> ')'
}

This time, the change is a bit larger. We introduced a new multi-rule, expr, that replaces both term and factor. To make the expr rule uniform, the value rule is replaced with expr(4), so that the former factor, which became expr(3), can refer to the former value as expr(4).

After that, the first three expr candidates, which take the arguments 1, 2, and 3, can be replaced with a single generic rule that uses the simple expression $n + 1 to move to the next level.

multi rule expr($n) {
    <expr($n + 1)>+ %% <op($n)>
}

Now, the grammar includes two candidates: expr($n) and expr(4). At the third level, $n + 1 equals 4, so the parser chooses the more specific expr(4) candidate, which stops the recursion.

In the actions class, the following two methods remain; they replace the methods expression, term, factor, and value:

method expression($/) {
    $/.make($<expr>.made);
}

method expr($/) {
    if $<number> {
        $/.make($<number>.made);
    }
    elsif $<variable-name> {
        $/.make(%var{$<variable-name>});
    }
    elsif $<expr> {
        $/.make(process($<expr>, $<op>));
    }
    else {
        $/.make($<expression>.made);
    }
}

At first, it may seem that we made the grammar and the actions less transparent, but if you need to introduce more operators, you only have to add a new op(n) token to the grammar and the corresponding sub to the actions class.
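The pieces of this section can be put together in a small stand-alone calculator. This is a sketch under several assumptions rather than the actual Lingua code: the Calc and CalcActions names are hypothetical, numbers are reduced to plain integers, variables are left out, and the process sub is a guess at a helper that folds the operands from left to right (its real implementation is not shown in this chapter).

```raku
# Hypothetical helper: combines the operands from left to right.
# $operators may be empty when there is only one operand.
sub process($operands, $operators) {
    my @values = $operands.map(*.made);
    my @ops    = ($operators // ()).map(*.Str);
    my $result = @values.shift;

    for @ops -> $op {
        last unless @values;   # %% also allows a trailing operator
        my $value = @values.shift;
        given $op {
            when '+'  { $result += $value }
            when '-'  { $result -= $value }
            when '*'  { $result *= $value }
            when '/'  { $result /= $value }
            when '**' { $result **= $value }
        }
    }

    return $result;
}

grammar Calc {
    rule TOP        { <expression> }
    rule expression { <expr(1)> }

    multi token op(1) { '+' | '-' }
    multi token op(2) { '*' | '/' }
    multi token op(3) { '**' }

    # The generic level: items of the next level separated by operators.
    multi rule expr($n) {
        <expr($n + 1)>+ %% <op($n)>
    }

    # The most specific candidate stops the recursion.
    multi rule expr(4) {
        | <number>
        | '(' <expression> ')'
    }

    token number { \d+ }   # simplified: integers only
}

class CalcActions {
    method TOP($/)        { $/.make($<expression>.made) }
    method expression($/) { $/.make($<expr>.made) }
    method number($/)     { $/.make(+$/) }

    method expr($/) {
        if $<number> {
            $/.make($<number>.made);
        }
        elsif $<expr> {
            $/.make(process($<expr>, $<op>));
        }
        else {
            $/.make($<expression>.made);
        }
    }
}

for '2 + 3 * 4', '(2 + 3) * 4', '2 ** 3' -> $source {
    my $match = Calc.parse($source, :actions(CalcActions));
    say $source, ' = ', $match ?? $match.made !! 'parse failed';
}
```

If the sketch behaves as intended, the three lines should report 14, 20, and 8. Note that this version of process evaluates every level from left to right, including the power operator.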

Get rid of globals

For storing variable values, we are using a global hash %var. Let’s make the program more elegant and move the storage to the actions class as a data member.

class LinguaActions {
    has %!var;

    . . .
}

Of course, you should now replace all occurrences of %var with %!var, for example, in the assignment action (there are three more such places in the LinguaActions class):

method assignment($/) {
    %!var{~$<variable-name>} = $<expression>.made;
}

Finally, since the hash is now stored in an object, you need to instantiate the actions class before calling the parse method:

my $result = Lingua.parse($code,
                          :actions(LinguaActions.new));
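A side effect of this change is that every parse can get its own variable storage. The sketch below illustrates this with a hypothetical Assignments grammar and AssignmentActions class; they are simplified stand-ins and not the real Lingua code.

```raku
# Hypothetical mini-language: a sequence of "name = number;" statements.
grammar Assignments {
    rule TOP {
        [ <assignment> ';' ]*
    }

    rule assignment {
        <variable-name> '=' <number>
    }

    token variable-name {
        <[a..z]>+
    }

    token number {
        \d+
    }
}

class AssignmentActions {
    has %!var;   # the storage belongs to one actions object

    method assignment($/) {
        %!var{~$<variable-name>} = +$<number>;
    }

    method TOP($/) {
        $/.make(%!var);
    }
}

# Each parse gets a fresh actions object and thus a fresh hash.
my $first  = Assignments.parse('a = 1; b = 2;',
                               :actions(AssignmentActions.new));
my $second = Assignments.parse('c = 3;',
                               :actions(AssignmentActions.new));

say $first.made.keys.sort;
say $second.made.keys.sort;
```

The second parse does not see the variables a and b, because they live in the first actions object.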

Better variable names

Before the end of this chapter, let us make another couple of small but very fruitful additions. Earlier, we made an ad-hoc token for parsing variable names:

token variable-name {
    \w+
}

This token matched the so-called word characters, which include letters, digits and the underscore character. The downside of this simple solution is that it allows a digit as the first character of the variable name, and the following code is formally grammatically correct:

my 4 = 3;
say 4;

To fix the situation, let us use the pre-defined token that matches letters:

token variable-name {
    [<:alpha> | '_'] \w*
}

Now, variable names can only start with a letter or the underscore character, followed by an optional part consisting of any word characters. For example, the previous incorrect program can be rewritten this way:

my var_4 = 3;
say var_4;

Functions take expressions

Another ad-hoc solution that still persists in the grammar is the function call, which can only take a variable name as its argument. We’ll dedicate a separate chapter to functions, but for now, let us allow the following calls:

say 42;
say 100 + 300 / 3 ** (7 - 5);

Instead of a variable, an expression is passed to a function. So, update the function-call rule:

rule function-call {
    <function-name> <expression>
}

The action also requires an update. And a great thing is that by switching to expressions, we made the action simpler. This is how it looked before:

method function-call($/) {
    say %!var{$<variable-name>} if $<function-name> eq 'say';
}

And this is how it looks now:

method function-call($/) {
    say $<expression>.made;
}

A function just uses the value computed somewhere else and does not do any variable checks.

In this chapter, we made a lot of transformations of the grammar and its associated code. They made the grammar more transparent and even allowed us to add some nice extensions. Consult the repository to make sure we are on the same page.

Next: Chapter 6. Working with Strings