This is a chapter from
Creating a compiler with Raku
In this chapter, we’ll review the grammar that was created so far and will try to make some changes to make the grammar and the actions more compact, more readable and more user-friendly. The bigger the language becomes, the more important it is to keep its code maintainable.
Executable
The first thing to do in this chapter is to make the current interpreter (on its way to be a compiler) a proper executable program so that we can easily call it from the command line and pass the filename containing a Lingua program to it:
./lingua my-prog.lng
The lingua
executable has to check whether you passed the filename and if the file exists. Then, it parses and executes the input program. Here is the whole code:
#!/usr/bin/env raku use lib '.'; use Lingua; use LinguaActions; error('Usage: ./lingua <filename>') unless @*ARGS.elems; my $filename = @*ARGS[0]; error("Error: File $filename not found") unless $filename.IO.f; my $code = $filename.IO.slurp(); my $result = Lingua.parse($code, :actions(LinguaActions)); say $result ?? 'OK' !! error('Error: parse failed'); sub error($message) { note $message; exit; }
The error
function prints an error message and terminates the program. It uses the note
built-in function, which behaves like say
but sends the output to the standard error stream. The die
routine is not used here as it prints additional information about the location of the error, which is not really needed in this case. Suppressing extra output of die
needs roughly the same number of lines as introducing a new function.
Composing and inheriting grammars
In the Lingua language, we allow one-line comments starting with the #
character, and inline and multi-line comments between /*
and */
. Such comments are used in other programming languages too, and it may be useful to extract the rules for the comments out of the language grammar and put it in a separate class. This also makes the main language grammar smaller and more transparent.
Let’s recap the fragments of the existing Lingua grammar that handle comments:
grammar Lingua {
rule TOP {
[
| <comment>
| <statement> ';'
]*
}
rule comment {
'#' \N*
}
regex ws {
<!ww> [
| \s*
| \s* '/*' \s* .*? \s* '*/' \s*
]
}
. . .
}
Most of these can go to a separate grammar class. It is also wise to make it a bit wordy to clearly distinguish between the two types of comments.
grammar CommentableLanguage { regex ws { <!ww> [ | \s* | \s* <inline-comment> \s* ] } regex inline-comment { '/*' \s* .*? \s* '*/' } rule one-line-comment { '#' \N* } }
The CommentableLanguage
grammar only knows about what to do with comments, but as it now resides in a separate class, it can be the base for another language definition. In our case, Lingua can be derived from it:
use CommentableLanguage;
grammar Lingua is CommentableLanguage {
. . .
}
The use
statement is required if you placed the CommentableLanguage
class in a separate file.
In Lingua, the only change required now is using a proper name for the one-line comment in the main code:
rule TOP {
[
| <one-line-comment>
| <statement> ';'
]*
}
All the rest is done automatically. For instance, the default ws
regex from the Raku’s Grammar
class is now replaced with ws
from CommentableLanguage
.
We can make another simplification of the main grammar by extracting the part which is responsible for parsing numbers. As with comments, that part can be placed in a separate class too. In this case, though, it is better to make it a role and also save it in a separate file.
role Number { token number { <sign>? [ | <integer> | <floating-point> ] <exponent>? } token sign { <[+-]> } token exp { <[eE]> } token integer { \d+ } token floating-point { <integer>? ['.' <integer>] } token exponent { <exp> <sign>? <integer> } }
Later, if needed, you can easily modify the Number
role to allow other types of numbers in the program. To append it to the Lingua grammar, use the does
keyword:
use CommentableLanguage;
use Number;
grammar Lingua is CommentableLanguage does Number {
. . .
}
Reviewing the calculator
The part of the grammar that came from the calculator includes a few parts which resemble each other.
rule expression {
<term>* %% <op1>
}
rule term {
<factor>* %% <op2>
}
rule factor {
<value>* %% <op3>
}
But first, let us think of the quantifiers in there. The star allows any number of repetitions of either term
, or factor
, or value
. What if a program contains none, say, as in a fragment shown below:
my x;
x = ;
say x;
This is obviously wrong, but the Lingua grammar does not return Nil
. It fails earlier, producing an undesired messy output from Raku:
Cannot shift from an empty Array in sub process at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 52 in method factor at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 46 in regex factor at /Users/ash/lingua/Lingua.rakumod (Lingua) line 48 in regex term at /Users/ash/lingua/Lingua.rakumod (Lingua) line 44 in regex expression at /Users/ash/lingua/Lingua.rakumod (Lingua) line 40 in regex assignment at /Users/ash/lingua/Lingua.rakumod (Lingua) line 23 in regex statement at /Users/ash/lingua/Lingua.rakumod (Lingua) line 13 in regex TOP at /Users/ash/lingua/Lingua.rakumod (Lingua) line 6 in block <unit> at ./lingua line 13 Actually thrown at: in method function-call at /Users/ash/lingua/LinguaActions.rakumod (LinguaActions) line 13 in regex function-call at /Users/ash/lingua/Lingua.rakumod (Lingua) line 27 in regex statement at /Users/ash/lingua/Lingua.rakumod (Lingua) line 13 in regex TOP at /Users/ash/lingua/Lingua.rakumod (Lingua) line 6 in block <unit> at ./lingua line 13
That’s not what a user wants to see. The compiler broke instead of generating an error message. We have to change the grammar and demand at least one value
at the place where an expression
is expected. The simplest modification is to replace *
with +
:
rule expression {
<term>+ %% <op1>
}
rule term {
<factor>+ %% <op2>
}
rule factor {
<value>+ %% <op3>
}
Now, we control the error message ourselves:
Error: parse failed
Using multi-rules
The three rules, expression
, term
, and factor
all share the same pattern: a rule that repeats more than once with an operator in-between. We can unify them by using multi-methods which are offered by Raku for classes (and thus, for grammars). Instead of three different tokens op1
, op2
, and op3
, let us create a single name and three alternatives by specifying an integer argument and its value.
multi token op(1) { '+' | '-' } multi token op(2) { '*' | '/' } multi token op(3) { '**' }
The values 1 to 3 are not important for the grammar itself; for us, it indicates the operator’s precedence: the bigger the number the higher its precedence.
We also have to update the above-mentioned rules using these operator tokens:
rule expression { <term>+ %% <op(1)> } rule term { <factor>+ %% <op(2)> } rule factor { <value>+ %% <op(3)> }
In the actions, we will not see the argument value, and all the names with a simple op
:
method expression($/) {
$/.make(process($<term>, $<op>));
}
method term($/) {
$/.make(process($<factor>, $<op>));
}
method factor($/) {
$/.make(process($<value>, $<op>));
}
It is clearly seen here that the action methods are the same, so we can reduce the code further, but first, let’s try running a test program to confirm that the first part of the transformation works.
Let’s continue and collapse the three rules and the three methods to a single rule and its corresponding generic method. Again, using multi-methods.
rule expression {
<expr(1)>
}
multi rule expr(1) {
<expr(2)>+ %% <op(1)>
}
multi rule expr(2) {
<expr(3)>+ %% <op(2)>
}
multi rule expr(3) {
<expr(4)>+ %% <op(3)>
}
multi rule expr(4) {
| <number>
| <variable-name>
| '(' <expression> ')'
}
This time, the change is a bit larger. We introduced the new multi-rule expr
that replaced both term
and factor
. To make the expr
method uniform, the value
method is replaced with expr(4)
. This is done to be able to access the former value
as expr(4)
from the former factor
, which became expr(3)
.
After that, the first three expr
alternatives that takes arguments 1, 2, and 3 can be replaced with a single generic rule with a simple math operation of $n + 1
that brings us to the next level.
multi rule expr($n) { <expr($n + 1)>+ %% <op($n)> }
Now, the grammar includes two alternatives: expr($n)
and expr(4)
. When the parser reaches the third level, it will choose a more specific expr(4)
alternative next, which stops the recursion.
In the actions class, the following two methods remain; they replace the methods expression
, term
, factor
, and value
:
method expression($/) {
$/.make($<expr>.made);
}
method expr($/) {
if $<number> {
$/.make($<number>.made);
}
elsif $<variable-name> {
$/.make(%var{$<variable-name>});
}
elsif $<expr> {
$/.make(process($<expr>, $<op>));
}
else {
$/.make($<expression>.made);
}
}
At first, it may seem that we made the grammar and actions less transparent, but if you will need to introduce more operators, you will only have to add the new op(n)
rule in the grammar and their corresponding sub
in the actions class.
Get rid of globals
For storing variable values, we are using a global hash %var
. Let’s make the program more elegant and move the storage to the actions class as a data member.
class LinguaActions { has %!var; . . . }
Of course, you should replace all occurrences of %var
to %!var
now, for example, in the assignment
action (there are three more such places in the LinguaActions
class):
method assignment($/) { %!var{~$<variable-name>} = $<expression>.made; }
And finally, as we need a place for the hash in memory, thus you need to instantiate the actions class before calling the parse
method:
my $result = Lingua.parse($code,
:actions(LinguaActions.new));
Better variable names
Before the end of this chapter, let us make another couple of small but very fruitful additions. Earlier, we made an ad-hoc token for parsing variable names:
token variable-name {
\w+
}
This token matched the so-called word characters, which include letters, digits and the underscore character. The downside of this simple solution is that it allows a digit as the first character of the variable name, and the following code is formally grammatically correct:
my 4 = 3;
say 4;
To fix the situation, let us use the pre-defined token that matches letters:
token variable-name {
[<:alpha> | '_'] \w*
}
Now, variable names can only start with a letter or the underscore character, followed by an optional part consisting on any word characters. For example, the previous wrong program can be converted this way:
my var_4 = 3;
say var_4;
Functions take expressions
Another ad-hoc solution that still persists in the grammar is function call. It can only take a variable name as its argument. We’ll dedicate a separate chapter to functions but for now, let us allow the following calls:
say 42;
say 100 + 300 / 3 ** (7 - 5);
Instead of a variable, an expression is passed to a function. So, update the function-call rule:
rule function-call { <function-name> <expression> }
The action also requires an update. And a great thing is that by switching to expressions, we made the action simpler. This is how it looked before:
method function-call($/) {
say %!var{$<variable-name>} if $<function-name> eq 'say';
}
And this is how it looks now:
method function-call($/) {
say $<expression>.made;
}
A function just uses the value computed somewhere else and does not do any variable checks.
In this chapter, a lot of transformations of the grammar and its associating code were made. That changed the grammar to be more transparent and even allowed us to add some nice extensions. Consult the repository to make sure we are on the same page.
Next: Chapter 6. Working with Strings
One thought on “Chapter 5. Working on Grammar”