Chapter 6. Working with Strings

Until now, the language only worked with numbers: integers and floating-point values, in both regular and scientific notation. In the following two chapters, we’ll update the language and integrate other data types into it: strings, arrays, and hashes. Let us start with strings.

This is a chapter from
Creating a compiler with Raku

Strings

The first goal is to allow double-quoted strings in variable assignments (or initialisations) and in function calls:

my str = "Hello, World";
say str;

say "Hello indeed";

In both cases, the grammar expects an expression:

rule variable-declaration {
    'my' <variable-name> [ '=' <expression> ]?
}

rule function-call {
    <function-name> <expression>
}

The simplest solution is to add a string as a variant of an expression at the same level where a number may appear:

multi rule expr(4) {
    | <number>
    | <string>
    | <variable-name>
    | '(' <expression> ')'
}

Add a new branch to the action, too:

method expr($/) {
    if $<number> {
        $/.make($<number>.made);
    }
    elsif $<string> {
        $/.make($<string>.made);
    }

    . . .
}

The string itself is defined as a character sequence between double quotes:

rule string {
    '"' .*? '"'
}

Its AST attribute is simply a stringified match object:

method string($/) {
    $/.make(~$/)
}

That was low-hanging fruit, but there are two problems with this approach. First, the quotes become a part of the string. Second, if the user puts a string in an arithmetic expression (e.g., "Hello" + "World"), the compiler fails with an exception.

It is easy to resolve the first issue: make a capturing group and access it as [0] in the action:

rule string {
    '"' ( <-["]>* ) '"'
}

. . .

method string($/) {
    $/.make(~$/[0]);
}

By the way, notice that we have also started changing the regex for the string. It can now contain any character except a double quote. We will need more changes of that kind soon.
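The effect of the capturing group can be checked in isolation. The snippet below is a minimal sketch, not a part of Lingua: the names QuotedString and QuotedStringActions, as well as the TOP rule, are made up for the illustration.

```raku
# Hypothetical stand-alone grammar that contains only the string rule.
grammar QuotedString {
    rule TOP    { <string> }
    rule string { '"' ( <-["]>* ) '"' }
}

class QuotedStringActions {
    method TOP($/)    { $/.make($<string>.made) }
    method string($/) { $/.make(~$/[0]) }   # only the captured group
}

my $match = QuotedString.parse('"Hello, World"',
                               actions => QuotedStringActions.new);

say $match.made;       # the AST attribute: text between the quotes
say ~$match<string>;   # the whole match still includes the quotes
```

The first line of the output is the string without the quotes, while the second one shows that the match object itself still covers them.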

The second problem is also solvable, but with a more significant rebuilding of the grammar. Let us re-introduce the value rule, but this time, it can be either an arithmetic expression or a string:

grammar Lingua is CommentableLanguage does Number {
    . . .

    rule variable-declaration {
        'my' <variable-name> [ '=' <value> ]?
    }

    rule assignment {
        <variable-name> '=' <value>
    }

    rule function-call {
        <function-name> <value>
    }

    rule value {
        | <expression>
        | <string>
    }

    . . .
}

So, a string (as a kind of value) is allowed in assignments and in function calls.

In the actions, similar changes should be made. We work with values instead of expressions:

class LinguaActions {
    has %!var;

    method variable-declaration($/) {
        %!var{~$<variable-name>} =
            $<value> ?? $<value>.made !! 0;
    }

    method assignment($/) {
        %!var{~$<variable-name>} = $<value>.made;
    }

    method function-call($/) {
        say $<value>.made;
    }

    . . .

    method value($/) {
        if $<expression> {
            $/.make($<expression>.made);
        }
        elsif $<string> {
            $/.make($<string>.made);
        }
    }

    . . .
}

The result is exactly what was needed. Regular strings are parsed and understood, while an attempt to use them with numbers or in an expression leads to a parse error.
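To see this behaviour without the rest of the compiler, here is a reduced sketch. It assumes a toy expression rule that only knows integer addition; the real Lingua expression is much richer, and the names MiniValue and term are invented for this example.

```raku
# Hypothetical reduced grammar: a value is either a sum of integers
# or a double-quoted string, but never a mix of the two.
grammar MiniValue {
    rule TOP { ^ <value> $ }

    rule value {
        | <expression>
        | <string>
    }

    rule  expression { <term> [ '+' <term> ]* }   # simplified stand-in
    token term       { \d+ }
    rule  string     { '"' ( <-["]>* ) '"' }
}

my $plain-string = MiniValue.parse('"Hello"');
my $arithmetic   = MiniValue.parse('1 + 2');
my $mixed        = MiniValue.parse('"Hello" + "World"');

say so $plain-string;   # a string alone is a value
say so $arithmetic;     # an arithmetic expression is a value
say so $mixed;          # a string inside an expression is rejected
```

The last call fails because, after the first string is matched as a value, the rest of the input (the plus sign) is not expected by the grammar.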

Escaping quotes

It is quite obvious that the above-defined string cannot contain double quotes inside itself. Let us keep " as the only possible quoting symbol, but allow escaping the quote characters inside a string:

say "Hello, \"World\"!";

Currently, the parser stops at the second quote and terminates with an error:

Hello, \
Error: parse failed

To allow an escaped quote, the string rule must consume all non-quote and non-backslash characters, and if it sees a backslash, then only the sequence of \" is considered valid:

token string {
    '"' ( [
          | <-["\\]>+
          | '\\"'
          ]* )
    '"'
}

We prefer a token over a rule from this moment to avoid skipping spaces after the opening quote.

In Raku, you can format regexes freely enough to form a kind of ASCII graphics that helps to grasp a regex quicker. As we did earlier, an additional leading vertical bar is added to visualise the alternatives.

An escaped quote character is now allowed, but the backslash itself has become an illegal character in all other cases. Let’s add another escape sequence, \\, to express a single backslash in a string:

token string {
    '"' ( [
          | <-["\\]>+
          | '\\"'
          | '\\\\'
          ]* )
    '"'
}

All other sequences are not allowed, e.g., "\W" is an error. Below are a few examples of valid strings with backslashes and quotes. A string may contain a newline or can be empty:

say "\\";
say "\"";
say "\\\\";
say "\"\"";
say "multi-
     line";
say "";
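The following sketch runs the token against several string literals to check which of them are accepted. The wrapper grammar EscapedString and its TOP token are assumptions made for the test only. The Q[...] quoting form is used so that the backslashes reach the parser exactly as typed.

```raku
# Hypothetical wrapper around the string token from this section.
grammar EscapedString {
    token TOP { <string> }

    token string {
        '"' ( [
              | <-["\\]>+
              | '\\"'
              | '\\\\'
              ]* )
        '"'
    }
}

# Q[...] keeps every character literally, including backslashes.
my @samples = Q["\\"], Q["\""], Q["\\\\"], Q["\"\""], Q[""], Q["\W"];

my %verdict;
for @samples -> $src {
    %verdict{$src} = EscapedString.parse($src) ?? 'valid' !! 'invalid';
    say $src, ' is ', %verdict{$src};
}
```

Only the last sample, "\W", is reported as invalid, because a backslash may be followed only by a quote or by another backslash.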

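As we do not need the escaping backslashes in the output, they should be removed from the string before passing it further. The method names its parameter $a instead of $/ because the s/// operator stores its own match in $/, which is read-only when it is declared as a parameter: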

method string($a) {
    my $s = ~$a[0];
    $s ~~ s:g/\\\"/"/;
    $s ~~ s:g/\\\\/\\/;
    $a.make($s);
}
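Here is a small sketch that connects the token and the action to confirm that the backslashes disappear from the result. The names EscapedString and EscapedStringActions and the TOP token are hypothetical and exist only to make the example runnable on its own.

```raku
# Hypothetical wrapper grammar with the string token from above.
grammar EscapedString {
    token TOP { <string> }

    token string {
        '"' ( [
              | <-["\\]>+
              | '\\"'
              | '\\\\'
              ]* )
        '"'
    }
}

class EscapedStringActions {
    method TOP($/) { $/.make($<string>.made) }

    method string($a) {
        my $s = ~$a[0];
        $s ~~ s:g/\\\"/"/;    # \" becomes "
        $s ~~ s:g/\\\\/\\/;   # \\ becomes \
        $a.make($s);
    }
}

my $actions = EscapedStringActions.new;

my $greeting    = EscapedString.parse(Q["Hello, \"World\"!"], :$actions).made;
my $backslashes = EscapedString.parse(Q["\\\\"], :$actions).made;

say $greeting;      # the quotes are kept, the backslashes are gone
say $backslashes;   # four source backslashes give two in the value
```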

Variable interpolation

As in Raku itself, we want variables to be interpolated if they appear in a string in Lingua. Let us use the following syntax: to interpolate a variable, use its name preceded by a dollar sign, as shown in the example below:

my name = "John";
say "Hello, $name!";

If so, then a literal dollar has to be escaped if needed:

say "5\$";

Let us introduce a new escape sequence in the string token together with a new alternative starting with a $:

token string {
    . . .
            | '\\$'
            | '$' <variable-name>
    . . .
}

The variable name itself is parsed by another token, variable-name.

When the parser sees an interpolation candidate in a string, it creates an array in the match object with the list of all variables (even if there is only one). These names can be read from $a[0]<variable-name>. The next step is to replace all such occurrences with the content of the variables.

method string($a) {
    my $s = ~$a[0];

    for $a[0]<variable-name>.reverse -> $var {
        $s.substr-rw($var.from - $a.from - 2,
                     $var.pos - $var.from + 1) = %!var{$var};
    }

    . . .
}

The loop needs some comments. The $a container hosts a Match object (Match is a built-in class in Raku for keeping the results of regex matching) that keeps a reference to the whole input string: not just the part saved in $s but the whole program being parsed. Its from and pos attributes show the edges of the string at hand (Hello, $name! in our example).

The $var variable is another Match object that keeps the position of a variable name in the source text. An in-place replacement using the substr-rw method replaces the variable name together with the preceding dollar character with the value of the variable. The offset is reduced by 2 to skip the opening quote, which is not a part of $s, and to step back onto the dollar sign; the length is increased by 1 to cover that dollar sign.

To simplify the calculation of the starting and ending positions of the interpolated variable names, substitutions are executed from the end of the string towards its beginning (note the reverse method in the loop).

It is therefore possible to have more than one interpolated variable in a string:

my name = "John";
my another_name = "Carla";
say "Hello, $name and $another_name!";
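The offset arithmetic is easier to follow on a complete, if reduced, example. The sketch below rests on several assumptions that are not taken from the listings above: the dollar sign is excluded from the character class so that it can start an interpolation, variable-name is defined as a simple identifier, the TOP token imitates a say statement, and the variables are passed to the actions through a public %.var attribute. The removal of escaping backslashes is left out.

```raku
grammar Interp {
    # Hypothetical wrapper: the string does not start at position 0,
    # so the subtraction of $a.from really matters.
    token TOP { 'say ' <string> ';' }

    token string {
        '"' ( [
              | <-["\\$]>+            # assumption: $ is excluded here
              | '\\"'
              | '\\\\'
              | '\\$'
              | '$' <variable-name>
              ]* )
        '"'
    }

    token variable-name { <[a..zA..Z_]> \w* }   # assumed definition
}

class InterpActions {
    has %.var;   # variable storage, filled from outside in this sketch

    method TOP($/) { $/.make($<string>.made) }

    method string($a) {
        my $s = ~$a[0];

        # Replace from right to left so that earlier offsets stay valid.
        for $a[0]<variable-name>.reverse -> $var {
            $s.substr-rw($var.from - $a.from - 2,
                         $var.pos - $var.from + 1) = %!var{~$var};
        }

        $a.make($s);
    }
}

my $actions = InterpActions.new(
    var => %(name => 'John', another_name => 'Carla'));

my $result = Interp.parse('say "Hello, $name and $another_name!";',
                          :$actions).made;
say $result;
```

With these values, both names are substituted and the program prints the greeting for John and Carla.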

Indexing a string

The ability to get a character at a given position would be a good addition to the string type in our language. Let us allow the standard syntax like this:

my s = "abcdef";
say s[3]; # prints d

We only allow indexing if the string is stored in a variable, thus let’s update the corresponding rule to have an optional integer index in square brackets:

multi rule expr(4) {
    | <number>
    | <variable-name> [ '[' <integer> ']' ]?
    | '(' <expression> ')'
}

In the action, check if the integer attribute is present, and return a requested character:

method expr($/) {
    . . .
    
    elsif $<string> {
        $/.make($<string>.made);
    }
    elsif $<variable-name> {
        if $<integer> {
            $/.make(
                %!var{$<variable-name>}.substr(
                +$<integer>, 1));
        }
        else {
            $/.make(%!var{$<variable-name>});
        }
    }

    . . .
}
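The indexing branch can be tried separately from the expression machinery. In the sketch below, the grammar Indexing, the token definitions and the %.var attribute are assumptions made for the demonstration; only the substr call mirrors the action shown above.

```raku
# Hypothetical grammar that accepts a variable with an optional index.
grammar Indexing {
    rule  TOP           { <variable-name> [ '[' <integer> ']' ]? }
    token variable-name { <[a..zA..Z_]> \w* }   # assumed definition
    token integer       { \d+ }                 # assumed definition
}

class IndexingActions {
    has %.var;   # variable storage, filled from outside in this sketch

    method TOP($/) {
        if $<integer> {
            # A one-character substring at the zero-based position.
            $/.make(%!var{~$<variable-name>}.substr(+$<integer>, 1));
        }
        else {
            $/.make(%!var{~$<variable-name>});
        }
    }
}

my $actions = IndexingActions.new(var => %(s => 'abcdef'));

my $whole  = Indexing.parse('s',    :$actions).made;
my $fourth = Indexing.parse('s[3]', :$actions).made;
my $first  = Indexing.parse('s[0]', :$actions).made;

say $whole;    # the full string
say $fourth;   # the character at index 3
say $first;    # index 0 also works: a successful Match is always true
```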

Indexing does not work with string interpolation, but you can always use a temporary variable to achieve the goal:

my ch = s[4];
say "The 4th character is \"$ch\""; # e

As an exercise, try implementing string indices in interpolated strings. Be ready to think of escaping square brackets outside of the index.

And that’s the end of this chapter. In the next chapters, we’ll work on complex data structures: arrays and hashes. You will see how we can re-use the code written for the strings.

Next: Chapter 7. Arrays and Hashes
