Chapter 8. Building AST. Part 1

In this chapter, we will be working on implementing AST, the abstract syntax tree, that represents the program in the form of connected nodes, which are responsible for atomic actions such as variable declaration, or computing a sum of two values, or calling a function. This is probably the most difficult and brain-puzzling chapters in the book.

This is a chapter from
Creating a compiler with Raku

In this chapter, we will be working on implementing AST, the abstract syntax tree, that represents the program in the form of connected nodes, which are responsible for atomic actions such as variable declaration, or computing a sum of two values, or calling a function. This is probably the most difficult and brain-puzzling chapters in the book.

But first, let me demonstrate why adding another layer of abstraction can be beneficial.

Thinking of the if keyword

In the previous chapters, we implemented a lot of features that can be executed in a linear order. It is time to think about conditional statements and user-defined functions. The next small example demonstrates that we have to make a big shift before we can continue.

Let us imagine we want to implement the if clause that affects function calls. So, you will be able to disable the action if the condition is false:

my condition = 0;
if condition say "Passed";

Update the grammar and include an optional condition part:

rule function-call {
    ['if' <variable-name>]? <function-name> <value>
}

In the action, check the value of the variable and decide if you proceed with the function call:

method function-call($/) {
    if $<variable-name> {
        my $condition = %!var{$<variable-name>};
        return unless $condition;
    }
    . . .
}

Run the program with different values of the condition variable and confirm that it behaves as expected: when the variable is zero, the Passed string is not printed. Set the variable to any non-zero value, and the string is printed.

It seems to be a working solution even if you change the syntax and make the if keyword to be in a postfix clause:

rule function-call {
    <function-name> <value> ['if' <variable-name>]?
}

There are no changes in the action needed. The test program looks like this now:

my condition = 0;
say "Passed" if condition;

Let us undo the last change and try to generalise the if keyword and allow it in variable assignments so that you can write:

my condition = 0;
if condition say "Passed";

my x = 8;
x = 9;
if condition x = 10;
say x;

Of course, it is possible to embed the condition to the assignment rule:

rule assignment {
    ['if' <variable-name>]?
    <variable-name> <index>? '='
        [
            | [ <string> ':' <value> ]+ %% ','
            |                <value>+   %% ','
        ]
}

There are two difficulties here.

First, we now have two variable-name keys in the match object, and we have to refer to them as $<variable-name>[0] and $<variable-name>[1]. In one of the assignments in the example above, x = 9, the variable name lands in the 0th element, while in the second, if condition x = 10, it goes to the 1st one. This can easily be solved by using a named capture, for example, or by just looking at the length of the $<variable-name> array.

The second problem is that we have more than one variant of the assignment multi-method, and thus we have to update all of them. This is a vast duplication of code, and we have to do that for other possible statements where we want to allow an if clause.

Let us try to work on a closer-to-the-TOP-rule level in the grammar:

rule statement {
    | <variable-declaration>
    | [ 'if' <variable-name> ]? <assignment>
    | [ 'if' <variable-name> ]? <function-call>
}

The repeated part can be converted to a rule:

rule statement {
    | <variable-declaration>
    | <condition>? <assignment>
    | <condition>? <function-call>
}

rule condition {
    'if' <variable-name>
}

All the variants of different assignments (e.g., assigning to a scalar variable, or to an element of an array, or setting a value in a hash) are handled at a single point.

Handling the condition can be also done in a single place, in the new statement action:

method condition($/) {
    $/.make($<variable-name>.made ?? 1 !! 0);
}

method statement($/) {
    if $<condition> {
        my $condition = $<condition>.made;
        fail unless $condition;
    }
}

Looks good but it does not work. The program ignores the condition regardless its value:

Passed
10
OK

Let us debug this behaviour and embed a few printing instructions to the methods:

method statement($/) {
    if $<condition> {
        my $condition = $<condition>.made;
        say "test condition = $condition";
        fail unless $condition;
    }
}

multi method assignment($/ where !$<index>) {
    . . .
    else {
        say "assign $<variable-name> to " ~ $<value>[0].made;
        %!var{$<variable-name>} = $<value>[0].made;
    }
}

In the output, you will see the sequence in which the methods are called:

assign x to 9
assign x to 10
test condition = 0
10
OK

All assignments happen before the condition check. We already faced a similar situation in the section Skipping comments of Chapter 4 in the example of a program that did not have a semicolon after the last statement. The whole program did not pass the syntax check, but its last statement was executed.

Here is one of the two problematic lines of code:

if condition x = 10;

It is a statement in our language definition. To satisfy the statement rule, the parser has to check its alternatives, in this case, the statement with the assignment rule:

rule statement {
    . . .
    | <condition>? <assignment>
    . . .
}

Thus, both condition and assignment must match to make the whole statement rule successful. As soon as the assignment rule matches, its action method is triggered, and an assignment happens.

Similarly, you can track the calls for the third alternative of the statement rule and see that the printing happens before the parser can make use of the condition variable, so the action of the function-call rule takes place before we understand that we should not call it. Thus, the following line behaves incorrectly:

if condition say "Passed";

You can try moving all the code from the actions to the statement rule, where the condition is checked, or set some global variables that keep the results of the condition checks, but all those methods are error-prone and make the compiler code spaghetti. There is a better alternative, the AST.

In the next sections, we’ll be working on a set of classes to implement Lingua’s AST support. All the AST::* definitions will be kept in a separate file, LinguaAST.rakumod, and thus you have to add use LinguaAST; to LinguaActions.rakumod.

The AST blocks

We start with the smallest program to make the whole process easy. The program contains a single statement that declares a variable:

my a;

This program can be presented by a single AST node. Let us call it AST::ScalarDeclaration and give it two attributes describing the variable: its name and its value. In the given program, the initial value is not set by the user, so we will initialise it with zero.

The goal has been set, so let us implement it in code. As you may guess, we are going to create a Raku class. It may be helpful from the very beginning to inherit it from a common base class, which can be empty at this moment.

class ASTNode {    
}

class AST::ScalarDeclaration is ASTNode {
    has Str $.variable-name;
    has $.value;
}

The AST::ScalarDeclaration node represents the whole program, but there should exist some common start node that is the same for every possible program. Let us create another class to represent the TOP rule. As you remember from the previous chapters, our TOP rule matches either statements or multi-line comments:

rule TOP {
    [
        | <one-line-comment>
        | <statement> ';'
    ]*
}

There is no need to create AST nodes for comments. Comments are well recognised by the parser, but they do not emit any executable code. So we will just omit them in the AST of the program. The top-level AST node’s attribute is an array of possible statements, which are also ASTNodes:

class AST::TOP is ASTNode {
    has ASTNode @.statements;
}

That is enough for our program. Here is how the resulting tree looks like:

Later, we can walk along its branches and execute the instructions. But first, let us change the grammar actions so that they produce the AST nodes. 

The scalar declaration action can be executed either with an assignment or without it. In our program, we have a declaration only, so let us modify only one of the two multi-methods:

multi method scalar-declaration($/ where !$<value>) {
    $/.make(AST::ScalarDeclaration.new(
        variable-name => ~$<variable-name>,
        value => 0,
    ));
}

Here, we are creating the AST::ScalarDeclaration node. The variable-name attribute takes the name of the variable, the value attribute is set to zero.

For this, we not only have to change the existing actions, but also have to define the methods above the scalar-declaration, such as TOPstatement and variable-declaration.

The variable-declaration method is now just a proxy method that passes the made value to the next level.

method variable-declaration($/ where $<scalar-declaration>) {
    $/.make($<scalar-declaration>.made);
}

There are three alternative branches in the statement rule, but we need only one now. Again, we can simply pass the AST node further.

method statement($/ where $<variable-declaration>) {
    $/.make($<variable-declaration>.made);
}

Finally, at the TOP level, create the AST::TOP node and fill its @.statements with the below nodes.

method TOP($/) {
    my $top = AST::TOP.new;
    for $<statement> -> $statement {
        $top.statements.push($statement.made);
    }

    dd $top;
}

To see the structure of the generated AST tree, some debug output is added: dd $top.

Assemble the program fragments together and run the compiler against the program from the beginning of this section. Its top node should contain the following data:

AST::TOP $top = AST::TOP.new(statements => Array[ASTNode].new(AST::ScalarDeclaration.new(variable-name => "a", value => 0)))

You can see an array of statements hosting a single statement which is a scalar variable declaration. The variable is called a and its value is set to 0.

Let us add another line to the program and initialise another variable:

my a;
my b = 2;

The desired AST should contain two statement nodes, where the second one declares the second variable b with its initial value 2:

This time, it is better not to use multi-methods, as the change in the scalar-declaration method is tiny comparing to the size of the method itself. The check of the $<value> is done inline:

method scalar-declaration($/) {
    $/.make(AST::ScalarDeclaration.new(
        variable-name => ~$<variable-name>,
        value => $<value> ?? $<value>.made !! 0,
    ));
}

The rest of the chapter is dedicated to covering the existing grammar and generating AST nodes for other rules and tokens.

Value nodes

Consider a bit bigger program:

my c = 2 + 3;

Here, we are expecting that the compiler generates the code to add up two numbers. We are not considering any optimisations at this point, so we have to generate the AST nodes for every arithmetical operation. Comparing with the last AST tree, we can see that the value attribute of the AST::ScalarDeclaration node should not be a bare number but some other kind of node representing an addition. To keep the design simple, it is wise to introduce AST notes whose only work is to keep values, either integers, or floating point, or strings.

class AST::NumberValue is ASTNode {
    has $.value;
}

If there is no value (as in a minimal variable declaration without an initialiser), we can assign either 0 or an empty string, but the better solution is to define the null value and its wrapper AST node:

class AST::Null is ASTNode {
}

The AST::ScalarDeclaration node must point to an AST node, so let us restrict the type of its value attribute:

class AST::ScalarDeclaration is ASTNode {
    has Str $.variable-name;    
    has ASTNode $.value;
}

We are adding two more AST nodes to the tree, as shown in the diagram below.

The value node must be generated by the number action of the grammar:

method number($/) {
    $/.make(AST::NumberValue.new(value => +$/));
}

This node is taken as is in the scalar-declaration method, but if there is no value, then a copy of AST::Null is created:

method scalar-declaration($/) {
    $/.make(AST::ScalarDeclaration.new(
        variable-name => ~$<variable-name>,
        value => $<value> ?? 
                 $<value>.made !!
                 AST::Null.new(),
    ));
}

Extend the program with another variable declaration:

my c = b;

This time, we’ve got a variable instead of a value. But as soon as the type of the $.value attribute in AST::ScalarDeclaration is ASTNode, we can make a new class, AST::Variable, and save an instance of it. The AST::Variable object has a string attribute $.variable-name.

class AST::Variable is ASTNode {
    has Str $.variable-name;
}

Integrating the new node type is easy, as most of our action methods have descriptive signatures: 

multi method expr($/ where $<variable-name> && !$<index>) {
    $/.make(AST::Variable.new(
        variable-name => ~$<variable-name>));
}

Here is the resulting AST tree:

This tree corresponds to the following program:

my a;
my b = 2;
my c = b;

This is the end of the first part of this chapter. See you next Sunday for more of this!

Next: Chapter 8. Building AST. Part 2

One thought on “Chapter 8. Building AST. Part 1”

Leave a Reply

Your email address will not be published. Required fields are marked *