Chapter 1. Creating a Simple Interpreter

This is a chapter from
Creating a compiler with Raku

Let’s start exploring the power or Raku’s grammars and regexes from a simple interpreter program that parses and executes the following tiny program. I will call this language Lingua.

my x;
x = 42;
say x;

You should not experience any problems with understanding what this code means, as the syntax is deliberately chosen to resemble the syntax of Raku itself, just no sigils in front of the variable names.

The program declares a variable named x, assigns an integer value to it and then prints the value to the console.

Suppose you saved the code in a file, say, test.lng. Let us now read it with Raku.

my $code = 'test.lng'.IO.slurp();
say $code;

Save this Raku program in another file, lingua.raku, and run it:

$ raku lingua.raku

If you have Raku installed, you’ll get the content of our test program printed.

Grammar

Now it is time to involve Raku grammars and parse the program. Syntactically, grammars are classes, but they use regexes to describe the behaviour of their methods, which are in its turn called rules and tokens. The first applied rule is usually called TOP; it is chosen by Raku as the default starting rule. Grammars are meant to parse some text, so just call the parse method on the defined grammar and pass the text to it. All that is demonstrated in our first program:

grammar Lingua {
    rule TOP {
        .*
    }
}

my $code = 'test.lng'.IO.slurp();
my $result = Lingua.parse($code);
say $result;

Here, the grammar Lingua is defined. It will describe our target language, and currently it only has a single rule, which is the first and the last that is applied to the Lingua code contained in $code.

The body of the TOP rule is a regex that matches any line: the dot matches any character, and a star quantifier allows any number of repetitions of it. The TOP method silently anchors the regex to the beginning and the end of the string, so it actually is equivalent to ^ .* $.

Run lingua.raku again and you’ll see that the parser managed to read the whole program. This is what is printed in the console:

｢my x;
x = 42;
say x;
｣

Those square angle brackets indicate that this is not a regular string that was printed. In our case, the $result variable contains an object of the Lingua class.

As you may see from the program in Lingua, its statements are separated by semicolon. To be precise at this point, you have to decide whether the statements are separated by semicolon or they should end with it. The difference is that in the first case, you don’t have to put semicolon after the last statement, for example, at the end of the program. Raku grammars allow to implement both options.

So, the program consists of a number of statements separated by semicolon. In a Raku grammar, you express this in the following way:

rule TOP {
    <statement>* %% ';'
}

Now, the TOP rule is described using another entity, statement. It can be repeated any number of times (including none) and if there are more than one statement, they have to be separated by the ';' character. If you change the definition and type % instead of %%, the rule will demand to have semicolons after each statement.

A statement itself is another rule, which we can first define to be really vague:

rule statement {
    <-[;]>*
}

It matches everything except the semicolon character. Empty statements pass this filter too.

With this change, the interpreter will now split the statements and emit the following output:

｢my x;
x = 42;
say x;
｣
 statement => ｢my x｣
 statement => ｢x = 42｣
 statement => ｢say x｣
 statement => ｢｣

The first block displays the whole text of the input program (it was all matched and consumed) and then the four separate statements. The last statement is empty because the original file contained a newline character after the last non-empty character, and the statement rule allows empty statements.

There are three different types of statements in our code. They are variable declaration, assignment, and calling a built-in function. It can be easily expressed by the grammar:

rule statement {
    | <variable-declaration>
    | <assignment>
    | <function-call>
}

Vertical bars delimit alternative branches of the rule. Formally, only two bars are required to list three options, but for the sake of making the code beautiful, you may add one more bar so that they form a long vertical line in the code outline. If you do that, the grammar will not add an empty match as the first alternative. By the way, another good side of Raku is that it allows hyphens in the names of the variables and methods, and I prefer variable-declaration over variable_declaration.

Looking at the test program residing in test.lng, we can define the new rules to the grammar:

rule variable-declaration {
    'my' <variable-name>
}

rule assignment {
    <variable-name> '=' <value>
}

rule function-call {
    <function-name> <variable-name>
}

Quoted strings such as 'my' and '=' match literally with the corresponding syntax items in the target language. The names in angle brackets are references to other rules or tokens that we need to define to complete the grammar. And here we come to our first token:

token variable-name {
    \w+
}

It matches a sequence of characters that can form a word (so it includes at least letters, digits and underscores). Also notice that there should be at least one character in the name of a variable.

The main difference between rules and tokens is how Raku deals with whitespaces. Look, for example, at the body of the variable-declaration rule:

'my' <variable-name>

It makes both of the following texts legal:

my x
my    x

Would you create a token instead of a rule, only myx can be matched. As you see, tokens in the grammar are perfect for terminals such as variable names or keywords.

Here’s another example of a token:

token value {
    \d+
}

At our first approach, we limit the possible values with non-negative integer numbers only. Later, we will extend the token to include the numbers of other types.

Finally, the token for the function name. There is only one built-in function so far, so the rule (it can be a token in this case) is straightforward:

rule function-name {
    'say'
}

That’s it. Run the program, and here is what it finds (let me omit the first part of the output that duplicates the whole text of the program):

statement => ｢my x｣
  variable-declaration => ｢my x｣
   variable-name => ｢x｣
 statement => ｢x = 42｣
  assignment => ｢x = 42｣
   variable-name => ｢x｣
   value => ｢42｣
 statement => ｢say x｣
  function-call => ｢say x｣
   function-name => ｢say ｣
   variable-name => ｢x｣

The output reflects the structure of the parsed program as the grammar understands it. The indentation helps to see the nested structure of the program and its elements. The right side of each line reveals what part of the source code matched.

For example, the first line of the program, my x;, is a statement that contains a variable declaration my x and the variable name x. The semicolon was consumed by the separator of the TOP rule and did not go to the output tree. Similarly, the second statement, x = 42, is an assignment of the value 42 to a variable name x.

If you examine the output generated for the third line, say x;, you’ll see that the function name contains an extra space after the function name: ｢say ｣. This is easy to fix by making the rule a token:

token function-name {
    'say'
}

After this change, the result is cleaner:

statement => ｢say x｣
  function-call => ｢say x｣
   function-name => ｢say｣
   variable-name => ｢x｣

Actions

The target file is now completely parsed. We can split it into separate statements, and we can understand all parts of it. The only missing element is making all those parts working together to produce the result. This is what actions do in Raku.

Return to the test program in Lingua:

my x;
x = 42;
say x;

To see 42 in the console, we have to make sure there is a place where this value is stored and can be referenced by its name, x. In other words, we need a storage. The most obvious choice is to use a hash. Let us first make it a global variable:

my %var;

grammar Lingua {
    . . .
}

After a rule or a token is successfully matched, you can ask Raku to make something for you, namely, you can add a block of code, called action, which will be executed after the match. Inside it, you have access to the data that was just extracted.

Our first action is to create a variable when we see its declaration. Here is how you do it:

rule variable-declaration {
    'my' <variable-name> {
        %var{$<variable-name>} = 0;
    }
}

The action is placed immediately after the regex in a pair of curly braces. We know that the rule finds the substring my x by matching the literal 'my' followed by a variable name, which is a named token variable-name. We can use this name to access the content: $<variable-name>. Actually, this is an object of the Lingua class, but we use it as a key of a hash, so it is converted to a string, and we populate the hash with a new pair x => 0. Thus, a variable has been created and it is initialised with a zero.

Similarly, let us create an action for variable assignments.

rule assignment {
    <variable-name> '=' <value> {
        %var{~$<variable-name>} = +$<value>;
    }
}

Here, just to show how you could also do that, the $<variable-name> object is explicitly converted to a value of the Str data type by the ~ prefix operator. On the right side of the equals sign, another type conversion is done: the + operator converts $<value> to a number. This time, it is really important to cast the value, as if you do not do that, you will save a Lingua object instead of a number.

And now moving on to the function call.

rule function-call {
    <function-name> <variable-name> {
        say %var{$<variable-name>}
            if $<function-name> eq 'say';
    }
}

The implementation of the say function is embedded in the grammar’s action. As we only have one built-in function now, the if clause is not really needed, but let us keep it to make the code more transparent.

How it is done so far, the three action blocks are inline. They are parts of the rule definitions. We can run the interpreter and see what it does. Change the main code to the following to avoid massive output:

my $code = 'test.lng'.IO.slurp();
my $result = Lingua.parse($code);
#say $result;

The code prints the following line:

$ raku lingua.raku
42

Congratulations! The program was not only parsed but also executed. As you see, it printed the content of the variable x, and it is exactly what we put in it. If you dump the %var container (by adding say %var;), you’ll get {x => 42}.

This is our first real achievement. We managed to create an interpreter of a subset of our future Lingua language. The most exciting part here is that it is not bound to a single test program that we used before. You can create as many variables as you want, you can assign it to different values and re-assign them again. The names of the variables can be longer than a single letter. And all that magically works! Try it yourself, here’s an example of what I did:

my alpha;
my beta;
alpha = 100;
beta = 200;

say alpha;
say beta;

my gamma;
gamma = 33;
say gamma;

gamma = 44;
say gamma;

After being executed, the program printed the correct result:

You can also try assigning to value more than once:

my value;

value = 100;
say value;

value = 200;
say value;

This time, the program uses the same variable twice and stores different values in it, which you can easily confirm by running the program:

100
200

Modules

The first simple interpreter is ready but let us spend a few more minutes to make its code a bit more structured and a bit faster.

First of all, the inline actions can be collected in a separate class. In our current implementation, all the actions are the one-liners, but in more advanced compilers it will not be like that. In Raku, it is very easy to express the relations between the actions and the rules of the grammar: just use the same names when creating the methods of the actions class. Look at the following code and you will understand it instantly.

class LinguaActions {
    method variable-declaration($/) {
        %var{$<variable-name>} = 0;
    }

    method assignment($/) {
        %var{~$<variable-name>} = +$<value>;
    }

    method function-call($/) {
        say %var{$<variable-name>}
            if $<function-name> eq 'say';
    }   
}

All these methods take a single argument. Its name can be whatever you want but for your convenience it is wise to call it $/, as in this case you can use shortcuts like $<value> instead of $/<value>. In the case you name it, say, $arg, you’ll have to type more characters to access its parts: $arg<value>. Also, don’t forget to remove the code blocks from the grammar class together with their surrounding curly braces.

To use the actions class with the grammar, pass it as a named argument to the parse method:

Lingua.parse($code, :actions(LinguaActions));

It is also a good practice to extract the classes and save them in separate files. For example, the grammar goes to Lingua.rakumod, and actions (together with the %var hash, which is currently a module’s global variable) to LinguaActions.rakumod. The whole interpreter code is shortened then to the following:

use Lingua;
use LinguaActions;

my $code = 'test.lng'.IO.slurp();
Lingua.parse($code, :actions(LinguaActions));

This step not only helps to logically organise the code but also increments the speed of interpretation. If the Raku compiler is able to cache the compiled modules, then you will only need to compile them once. Every next run is faster, because a pre-compiled version of the module is used.

Next: Chapter 2. Parsing a Number

4 thoughts on “Chapter 1. Creating a Simple Interpreter”

chenyf says:

February 24, 2020 at 1:42 pm

Great article. Maybe there is a typo, starting, in the following sentence:
「it is chosen by Raku as the default staring rule」

tinmarino says:

March 6, 2020 at 3:54 am

Great !

Tipo : ‘and here is what if finds’: if -> it

It’s realy nice to share that for free. You got it in html, pdf. Maybe you could also release epubs (for ebooks).

1. Andrew Shitov says:
  
  March 6, 2020 at 11:31 am
  
  It will be both epub and PDF in the future.
  
Alain Barbason says:

June 1, 2020 at 8:06 am

Very didactic

I had to add this line
use lib “.”;

Thanks

Grammar

Actions

Modules

4 thoughts on “Chapter 1. Creating a Simple Interpreter”

Leave a Reply to Alain Barbason Cancel reply