# Chapter 2. Parsing a Number

In the following two chapters, we will temporary leave the compiler created in the previous chapter, and will be working on a separate helper project, a calculator. It is an interesting thing on its own, and we will explore it on an isolated example. In the next chapters, it will be integrated to the interpreter.

In the following two chapters, we will temporary leave the compiler created in the previous chapter, and will be working on a separate helper project, a calculator. It is an interesting thing on its own, and we will explore it on an isolated example. In the next chapters, it will be integrated to the interpreter.

## Finding a number

Calculators work with numbers, so the first thing we’ll do will be a parser that parses numbers. In the previous chapter, we only worked with non-negative integers, but there are much more number formats that a good calculator has to understand. It includes, for example, negative numbers and floating-point numbers, which can be also written in scientific notation. We also have to allow the cases when people omit zeros before the decimal point and type `.5` instead of `0.5`.

Let’s iteratively create the parser for different kinds of numbers. As we are going to make a lot of changes in the grammar, a test suite comes to the rescue.

`my @cases = 7, 77, -84;for @cases -> \$number {    say "Test \$number";    say Number.parse(\$number);}`

The `@cases` array contains a list of numbers that will be tested against the `Number` grammar. Here is its first version:

`grammar Number {    rule TOP {        <number>    }    token number {        \d+    }}`

The whole grammar expects a single number, which is a sequence of digits. It works with our current test cases, and the program’s output is quite predictable:

`Test 7｢7｣ number => ｢7｣Test 77｢77｣ number => ｢77｣Test -84Nil`

The first two numbers pass the exam, but the third one does not. For a negative number, the parse method returns `Nil`.

So, we need to extend the `number` token and add an optional minus sign in front of the number. An important thing that it still has to be a token, as you normally do not expect a space between the sign and the number.

`token number {    '-'? \d+}`

This small change makes another half of integer numbers, negative integers, valid. The above test will now pass.

`Test -84｢-84｣ number => ｢-84｣`

But what if you spell a positive number as `+7`

`my @cases = 7, 77, -84, '+7', 0;`

A grammar parser always consumes strings, but we can let Raku convert positive numbers such as `7` or `77` to avoid extra quoting in the list of test cases. It the case of `+7`, we have to preserve the plus sign, so this test case is an explicit string.

This number is not a valid number according to the grammar. As it expects the whole string to be a number, we cannot expect that the plus sign will be simply ignored. The parser needs to know about the possible sign at the beginning, thus, we need to add it to the token body:

`token number {    <[+-]>? \d+}`

Let us also reduce the noise in the output, and do not print the parse tree:

`for @cases -> \$number {    my \$test = Number.parse(\$number);    say (\$test ?? 'OK ' !! 'NOT OK ') ~ \$number;}`

The output now is partially compatible with the TAP (Test Anything Protocol) which is widely used in testing Perl and Raku modules.

`OK 7OK 77OK -84OK +7OK 0`

Add a non-valid number and you’ll immediately see that it fails, for example:

`NOT OK 3.14`

The first attempt to implement a parser for the flowing-point numbers can be as simple as this:

`token number {    <[+-]>? \d+ ['.' \d+]?}`

We just added an optional decimal part. But what if you meet a number with a point but with no integer part? It will not pass the test, but it is easy to solve by changing `\d+` to `\d*` in the left part of the regex:

`token number {    <[+-]>? \d* ['.' \d+]?}`

Unfortunately, this change breaks the token, as it can be applied to a single sign or even to an empty string. All these test cases are now `OK`:

`my @cases =    7, 77, -84, '+7', 0,    3.14, -2.78, 5.0, '.5',    '', '-', '+';`

It is a bit tricky to try to express all the options in a single regex. It is much easier to explicitly list all the alternatives. As we know that the sign can appear in front of any number, let us group the alternatives in square brackets:

`token number {    <[+-]>? [        | \d+         | \d* ['.' \d+]    ]}`

This approach makes the whole token more readable and more extendable. We can add another alternative to match the numbers in scientific notation.

`token number {    <[+-]>? [        | \d+        | \d* ['.' \d+]        | \d+ <[eE]> <[+-]>? \d+    ]}`

More test cases pass the grammar now:

`'3E4', '-33E55', '3E-3', '-1E-2'`

What we are missing are the numbers in scientific notation with non-integer mantissa, e.g., `3.14E2` or `.5E-3`. Another alternative can solve this issue:

```token number {
<[+-]>? [
| \d+
| \d* ['.' \d+]
| \d+ <[eE]> <[+-]>? \d+
| \d* ['.' \d+] <[eE]> <[+-]>? \d+
]
}```

In this form, there are parts that repeat, such as `\d+` or `\d* ['.' \d+]`. In such compact rules that may be fine, but it is also possible to decompose it further and to introduce sub-tokens that are responsible for such repeated parts. The transformed `number` token and its family look like this:

`token number {    <sign>? [        | <integer>        | <floating-point>        | <integer> <exponent>        | <floating-point> <exponent>    ]}token sign {    <[+-]>}token exp {    <[eE]>}token integer {    \d+}token floating-point {    \d* ['.' <integer>]}token exponent {    <exp> <sign>? <integer>}`

While there is much more code comparing to the previous version, each individual token is simpler to understand. Compare, for example, the previous and the current regexes for a number in scientific notation having a floating point number in its mantissa. Earlier, it was:

`\d* ['.' \d+] <[eE]> <[+-]>? \d+`

After the transformation, it became:

`<floating-point> <exponent>`

Looking at the alternatives of the main token,

`| <integer>| <floating-point>| <integer> <exponent>| <floating-point> <exponent>`

you immediately see what they describe. Even more, we can reduce it further:

`token number {    <sign>? [        | <integer>        | <floating-point>    ] <exponent>?}`

In other words, a number is either an integer or a floating-point value, which is preceded by an optional sign and can be followed by an optional exponent part. In this form, the description is so clean and compact. All the details (the regex “noise”) are hidden in the supplementary tokens.

## Getting the value

The number has been parsed but what is its value? For a compiler, we need not only to check the validity of the number format but also to convert it from a string to a number, integer or floating-point. In this section, we will append actions to the `Number` grammar so that we can build the number and finally print it.

Let us start with integers and keep the number in a global variable.

`my \$n = 0;class NumberActions {    method integer(\$/) {        \$n = +\$/;    }}`

Everything looks simple here. An integer value is derived directly from converting the matched piece of the string to a numeric value using the `+` prefix operator. To see how it works, let us change the main test loop so that it prints the parsed value:

`for @cases -> \$number {    my \$test = Number.parse(\$number, :actions(NumberActions));        if (\$test) {        say "OK \$number = \$n";    }    else {        say "NOT OK \$number";    }}`

With non-negative integer numbers, it works well:

`OK 7 = 7OK 77 = 77OK -84 = 84OK +7 = 7OK 0 = 0`

The negative number did not work. It looks like the `+7` string was processed correctly, but actually that’s not fully true as we completely ignored the sign. This time, the task is a bit more complicated. The first idea is to flip the sign if we meet the minus sign:

`method sign(\$/) {    \$n *= -1 if ~\$/ eq '-';}`

That does not work though, as the sign is parsed before the digits are parsed, and negating `\$n` means applying a sign to zero. We can use a separate variable to keep the information about the sign but that’s not the best thing probably. But let’s do that because this will reveal another problem.

`my \$n = 0;my \$sign = 1;class NumberActions {    method integer(\$/) {        \$n = \$sign * +\$/;    }    method sign(\$/) {        \$sign = -1 if ~\$/ eq '-';    }}`

This helps to detect the first negative number, but it breaks all the rest. Of course, you can check if the sign was `'+'` but the problem is that the `\$n` and `\$sign` variables are global, and they must be reset before the next variable is parsed. This is a good time to move them into the actions class.

`class NumberActions {    has \$.n = 0;    has \$!sign = 1;    method integer(\$/) {        \$!n = \$!sign * +\$/;    }    method sign(\$/) {        \$!sign = -1 if ~\$/ eq '-';    }}`

The `\$n` variable is intentionally made a public data member as we have to get the result somehow. You also need to change the test loop.

`for @cases -> \$number {    my \$actions = NumberActions.new();    my \$test = Number.parse(\$number, :actions(\$actions));        if (\$test) {        say "OK \$number = " ~ \$actions.n;    }    else {        say "NOT OK \$number";    }}`

The main change here is that you pass an instance of the `NumberActions` class to the `parse` method of the grammar. Now, in each iteration, the parser creates its own variables to keep the result.

We moved far enough to have all the integers be parsed correctly:

`OK 7 = 7OK 77 = 77OK -84 = -84OK +7 = 7OK 0 = 0`

For the floating-point numbers, it does not work as smoothly:

`OK 3.14 = 14OK -2.78 = -78OK 5.0 = 0OK .5 = 5OK -5.3 = -3OK -.3 = -3OK 3E4 = 4OK -33E55 = -55OK 3E-3 = -3OK 3.14E2 = 2OK .5E-3 = -3`

As you can see, either the decimal or the exponential part wins. In both cases, that was the last integer part that went through the grammar. Indeed, earlier, we transformed the grammar to factor out the repeating parts. The bell ran when we had to introduce the `\$sign` variable but now we are suffering even more. All that needs to be handled differently. And here is how AST can help.

## Using AST

AST, or abstract syntax tree, is a mechanism that allows collecting and keeping data parsed at different stages. If the `integer` token is called twice when reading a floating-point number `3.14`, or if it was called three times for the number with an exponent, e.g., `3.14E2`, all those integers can be kept in the AST and used later to build the value corresponding to the whole string.

There are two methods available for the `\$/` variable inside the methods of the actions class: `make` and `made`. With `make`, you store a value (you assign an attribute to the current node of the parse tree). With `made`, you read the earlier-stored value.

Add the following calls to the token actions:

`method integer(\$/) {    \$/.make(+\$/);}method sign(\$/) {    \$/.make(~\$/ eq '-' ?? -1 !! 1);}`

The values are now saved even if the method is called more than once. To understand how that works, let us look inside the grammar object:

`my \$test = Number.parse(\$number, :actions(\$actions));dd \$test;`

The `dd` routine is a Rakudo-specific tool that shows the inner structure of the object. For the input number `-84`, the following object is built after parsing:

`Number \$test = Match.new(list => (), hash => Map.new((:number(Match.new(list => (), hash => Map.new((:integer(Match.new(list => (), hash => Map.new(()), made => 84, orig => -84, pos => 3, from => 1)),:sign(Match.new(list => (), hash => Map.new(()), made => -1, orig => -84, pos => 1, from => 0)))), made => Any, orig => -84, pos => 3, from => 0)))), made => Any, orig => -84, pos => 3, from => 0)`

It looks messy but you should be able to spot the two places we are mostly interested in:

`made => 84, orig => -84, pos => 3, from => 1`
`made => -1, orig => -84, pos => 1, from => 0`

The `from` and `pos` keys are set to point to the first character and to the character after the last one that matches the regex. So, the first of these two sub-hashes are the results of parsing the digits (from position 1 to position 3 in the string `'-84'`, thus `84`). The second hash corresponds to the minus character (positions 0 to 1 in the same string).

The `made` attribute is set to `84` and `−1` respectively, which confirms that the grammar was able to parse the number and its sign correctly.

These values can now be used to make the result in the parent token.

`method number(\$/) {    my \$n = \$<integer>.made;    \$n *= \$<sign>.made if \$<sign>;    \$/.make(\$n);}`

It accesses the integer value and the sign multiplier via the `made` attributes of the `\$<integer>` and the `\$<sign>` objects. The last line passes the result to the next level, and you can access it from the `TOP` rule:

`method TOP(\$/) {    \$/.make(\$<number>.made);}`

For the current task of parsing numbers, the `TOP` rule and method could be completely replaced with the content of the `number` token and the action method (notice that `TOP` is now a token, not a rule):

`grammar Number {    token TOP {        <sign>? [            | <integer>            | <floating-point>        ] <exponent>?    }    . . .}class NumberActions {    method TOP(\$/) {        my \$n = \$<integer>.made;        \$n *= \$<sign>.made if \$<sign>;        \$/.make(\$n);    }    . . .}`

If you look at the object returned by the `parse` method, you will see that it contains the following fields:

`from => 0, orig => -84, made => -84, pos => 3`

The `made` value contains our desired negative integer. It is accessible from outside of the grammar and from outside of the actions using the same `made` attribute:

`my \$test = Number.parse(\$number, :actions(\$actions));if (\$test) {    say "OK \$number = " ~ \$test.made;}`

The task is complete, and we can move on to the integers with an exponential part, e. g., `3E4`. With such numbers, the `integer` token is triggered twice, but that is not a problem, as both integer numbers reside in the corresponding `made` attributes of different objects.

Create an action to work with the exponential part:

`method exponent(\$/) {    my \$e = \$<integer>;    \$e *= -1 if \$<sign> && ~\$<sign> eq '-';    \$/.make(\$e);}`

And use the value to multiply the number:

```method TOP(\$/) {
\$n *= 10 ** \$<exponent>.made if \$<exponent>;
\$/.make(\$n);
}```

Run the test suite and examine what it produces for the numbers like `3E4` or `−1E−2`:

`OK 3E4 = 30000OK -33E55 =-330000000000000000000000000000000000000000000000000000000OK 3E-3 = 0.003OK -1E-2 = -0.01`

At this moment, the only token that has no associated action is `floating-point`. (The `exp` token does not need any, as its only task is to match with either `e` or `E`). Let’s look at it once again:

`token floating-point {    \d* ['.' <integer>]}`

When we were creating the token, we replaced the `\d+` part with `<integer>`. Actually, an optional sequence `\d*` can be also replaced with it:

`token floating-point {    <integer>? ['.' <integer>]}`

There are two `<integer>` calls within the same token now! What do you write in the action in this case? It’s simple: if the name is mentioned more than once, you get an array and thus you can refer to the first occurrence as `\$<integer>`, and as `\$<integer>` to the second one.

The only problem is that in our case the first integer part is optional. If you parse `3.14`, you’ll get two elements as expected, but if you parse `.14`, then `14` will go to the element with the index 0. One of the possible solutions is just to check the length of the array. Evaluating the value is a relatively simple task.

`method floating-point(\$/) {    my \$int = 0;    my \$frac = 0;    if \$<integer>.elems == 2 {        (\$int, \$frac) = \$<integer>;    }    else {        \$frac = \$<integer>;    }    my \$n = \$int + \$frac / 10 ** \$frac.chars;    \$/.make(\$n);}`

The `TOP` token also has to be updated to get the value of the floating-point number if it was parsed:

`my \$n = \$<integer> ??         \$<integer>.made !! \$<floating-point>.made;`

The task seems to be solved. All the numbers, including those with a decimal point and an exponential part, are successfully handled:

```OK 3.14 = 3.14
OK -2.78 = -2.78
OK 5.0 = 5
OK .5 = 0.5
OK -5.3 = -5.3
OK -.3 = -0.3
OK 3E-3 = 0.003
OK -1E-2 = -0.01
OK 3.14E2 = 314
OK .5E-3 = 0.0005```

## Final notes

In this chapter, we managed to convert strings to numbers, but what kind of numbers are they? To get the numbers, we were using operators that are available in Raku, such as binary `+`, arithmetic operators, and the power operator `**`. The result of the calculations that use all of those is a number that is an instance of one of the numeric types that Raku offers.

You can see the class name by explicitly printing it in the test loop:

```say "OK \$number = " ~ \$test.made ~
' (' ~ \$test.made.^name ~ ')';```

All numbers that did not contain a decimal point became `Int`s, all the rest are `Rat`s:

`OK 7 = 7 (Int)OK 77 = 77 (Int)OK -84 = -84 (Int)OK +7 = 7 (Int)OK 0 = 0 (Int)OK 3.14 = 3.14 (Rat)OK -2.78 = -2.78 (Rat)OK 5.0 = 5 (Rat)OK .5 = 0.5 (Rat)OK -5.3 = -5.3 (Rat)OK -.3 = -0.3 (Rat)OK 3E4 = 30000 (Int)OK -33E55 = -330000000000000000000000000000000000000000000000000000000 (Int)OK 3E-3 = 0.003 (Rat)OK -1E-2 = -0.01 (Rat)OK 3.14E2 = 314 (Rat)OK .5E-3 = 0.0005 (Rat)`

Now, look again at the `floating-point` method in the actions class. Although its algorithm is straightforward and produces correct results, it is quite wordy and needs a few lines of code. Alternatively, you can pass this task to the host language itself! Let Raku parse the floating-point number for you:

`my \$n = +"\$int.\$frac";\$/.make(\$n);`

Wait, what is `"\$int.\$frac"`? It is a string that was matched by the `floating-point` token during the parsing process, and it means that instead of reconstructing the string and converting it to a number, we can access it directly converting the `\$/` object to a number directly:

`method floating-point(\$/) {    \$/.make(+\$/);}`

Does this code resemble something that you’ve already seen? The body of this method is exactly the same as the body of the `integer` method:

`method integer(\$/) {    \$/.make(+\$/);}`

Indeed, we allowed Raku to build the number for us when it contained digits only. We can delegate it again if we met a number with a decimal point, too.

But that’s not all. The numbers that our `Number` grammar allows are all valid Raku numbers, and it is possible to replace all our actions with a single line of code:

`class NumberActions {    method TOP(\$/) {        \$/.make(+\$/);    }}`

After this change, the types of the number differ a bit. Raku treats a number in the scientific notation as `Num`, not `Rat`. You can confirm that by running the test loop again:

`OK 7 = 7 (Int)OK 77 = 77 (Int)OK -84 = -84 (Int)OK +7 = 7 (Int)OK 0 = 0 (Int)OK 3.14 = 3.14 (Rat)OK -2.78 = -2.78 (Rat)OK 5.0 = 5 (Rat)OK .5 = 0.5 (Rat)OK -5.3 = -5.3 (Rat)OK -.3 = -0.3 (Rat)OK 3E4 = 30000 (Num)OK -33E55 = -3.3e+56 (Num)OK 3E-3 = 0.003 (Num)OK -1E-2 = -0.01 (Num)OK 3.14E2 = 314 (Num)OK .5E-3 = 0.0005 (Num)`

The output format here also depends on how Raku prints the numbers of different numerical types.

In this particular task, all our manual labour is replaced by the compiler actions in the host language. Of course, that was possible because we choose standard data formats that many programming languages deal with. Do not be afraid to remove your own code as soon as you discovered a simpler way to solve the task. The techniques of working with AST that were demonstrated in this chapter, are the base for our future adventure in this book. Stay tuned!

P. S. An attentive reader may have noticed that the `Number` grammar does not include numbers such as `4.`, where there is an integer part, a decimal point but no decimal part. These numbers are not allowed in Raku itself, so I did not include it to the grammar.

Next: Chapter 3. Creating a Calculator