oddly accurate

In hoc2, we start by extending our mathematical expressions a bit, adding support for negative numbers. Next, we’ll allow the user to preserve state between calculations, by adding (single-letter) variables. We’ll also add a special @ variable for the last value computed.

To play with the new features, use make run-hoc2; they’re illustrated briefly below:

Variables

a
        0
a = 42
        42

a + 3
        45

a=1; b=2; c = d = e = a
        1
        2
        1

“Last-Value” Shorthand

Error Recovery

a2
./hoc2/hoc: syntax error(on line 7)

a +
./hoc2/hoc: syntax error(on line 9)

1/0
./hoc2/hoc: Division by zero (on line 9)

Allowing Semicolons to Separate Expressions

Generalizing our “expression separator” is the smallest change in hoc2. Rather than separating the expressions in our “lists” with newlines alone, we say that any “terminator” can be used to separate expressions.

  list:           /* nothing */
-         |       list '\n'
-         |       list expr '\n'    { printf("\t%.8g\n", $2); }
+         |       list terminator
+         |       list expr terminator
+                 {
+                     printf("\t%.8g\n", $2);
+                 }
+         ;
+ terminator:    '\n'
+         |      ';'
+         ;

Supporting Negative Numbers

All of the changes for negative numbers are contained within our yacc grammar. A “negative number”, when expressed to hoc, is really just shorthand for -1 * <expr>. This operation is called unary minus. An important detail about this operator is that it binds more tightly than any other operation. Since we already have a semantic meaning for -, we need to define negation separately to distinguish it from subtraction.

This is done by giving it a new name, with a higher precedence. We mark it as %nonassoc, because it makes no sense to have two negation operators next to each other with no joining operator. (That is, -2 -5 doesn’t mean anything).

  %left       '+' '-'      /* left-associative, same precedence */
  %left       '*' '/'      /* left-assoc; higher precedence */
+ %nonassoc   UNARY_MINUS  /* highest precedence */

You’ll notice that we do not make any lexing changes; indeed, our lexer would have a difficult time understanding whether it should return - vs UNARY_MINUS, since it depends on the surrounding symbols. Instead, the UNARY_MINUS token is used only to define a new level of precedence (since it appears last).

Recall that yacc uses the precedence of the rightmost token in a production to determine the precedence of the entire production. Yacc’s combination of tokens with precedence levels means that to create a new precedence level, we must have a new token, even if it’s unused.

So, the only use we have for the UNARY_MINUS token is to explicitly increase the precedence of the '-'expr syntax to the UNARY_MINUS level.

expr:   ...
+    |  '-'expr         %prec UNARY_MINUS
+       {
+           $$ = (-1 * $2);
+       }

The `@` Shorthand

The special @ expression returns the value of the previous evaluated expression. This is relatively simple to add, since we have only a single action where we use the “final result” of our evaluated expressions: the list action.

We update our C preamble to declare a variable for our “previous value”:

double previous_value = 0; // tracks '@'

We’ll write a remember() macro to use when we want to store a value into that previous_value.

// Used for the '@' feature
#define remember(v) (previous_value = (v))
%}

And then we remember() every value we display to the user.

list:   ...
        |   list expr terminator
        {
            printf("\t%.8g\n", remember($2));
        }

All that remains is to add a production for the @ expression:

  expr:      NUMBER          { $$ = $1; }
+     |      '@'             { $$ = previous_value; }

Variables

Variables require more extensive changes than our other features; the language, lexer, and surrounding program logic all coordinate to support them:

We’ll need to store variables somewhere, so we’ll add some state for current variable values.
Our language will need to change to recognize variable tokens, and to add syntax for both fetching the variable’s current value and updating that value during assignment.

Step 1: Variable State

Since our set of “potential” variables is very small (a–z), the most straightforward implementation is to keep a static array for them.

In our C prologue, we can add a declaration for our variables array.

+ /*
+  * Intepreter state
+  */
+ double variables[26]; // current value of each variable

Step 2: Updating our Tokens

Next, we will define the semantics of a variable. Our tokenizer will need to inform the parser that it’s received a variable token, and also produce a value the actions can use to get or set the variable value. For our simple variables array, an index into the array would allow both operations.

This change wouldn’t work by itself, since we need to distinguish integers that are variable indexes from the numbers our users type. For this, we’ll need to let yacc know that our tokenizer and parser understand more than one type of value.

First, we replace our simple YYSTYPE declaration with a union of allowed types. Then, we’ll associate each token with the type that they produce.

- #define YYSTYPE double // data type of yacc's stack
+ %}
+
+ /* We declare our %union of types in the yacc section, not in C code. */
+ %union {
+  double hoc_value; /* some tokens produce doubles */
+  int hoc_index;    /* some tokens produce an array index */
+ }

To associate our tokens with their type, we add the type to their declaration; unsurprisingly, NUMBER tokens produce doubles, and VAR tokens are indexes. So long as we allow assignment expressions to also produce the value assigned (as C does), then we can say that every expr will continue to produce a double as well.

%token  <hoc_value> NUMBER
%token  <hoc_index> VAR

/* After evaluation, all expressions produce doubles */
%type   <hoc_value> expr

The only other token required is =; this token needs its own declaration because it is right-associative (the right-hand side is evaluated before the left). It also has very low precedence; we want all the other operators on the line to be evaluated first.

%right  '='       /* right-associative, much like C */
%left   '+' '-'   /* left-associative, same precedence */
%left   '*' '/'   /* left-assoc; higher precedence */
%left   UNARY_MINUS /* highest precedence */

Step 3: Recognizing Variables

It’s quite simple for yylex to recognize variables; the islower() function is part of ctype.h. The index of any letter, c is simply c-'a'. The only other point of interest is the change in our use of yylval. Once our lexer can return any one of a union of types, we are required to assign to the appropriate union field of yylval.

  if (c == '.' || isdigit(c)) {
      ungetc(c, stdin);
      scanf("%lf", &yylval.hoc_value);
      return NUMBER;
  }

+ if (islower(c)) {
+   yylval.hoc_index = c - 'a';
+   return VAR;
+ }

Step 4: Using Variable Tokens in Parser Actions

Two new rules for expr provide all the support we need for variables. Notice that the compiler allows us to use the VAR values as indexes, since we’ve declared they’ll produce integer values.

  expr: ...
+       |   VAR             { $$ = variables[$1]; }
+       |   VAR '=' expr    { $$ = (variables[$1] = $3); }

Also note that, unlike our lexer, the parser actions do not need to specify which union field they are producing; we continue to assign to $$, not $$.hoc_value. That’s because yacc determines the appropriate field based on the type we’ve declared for the non-terminal we’re producing. Since all expr nonterminals produce the hoc_value type (double), that’s the field that’s updated when we assign to the $$ variable.

Handling Errors

There are two kinds of errors we will handle for hoc2, parsing errors and logic errors (e.g. divide-by-zero). We will need different mechanisms to handle each of the situations, but in either case, the goal is the same: we want to allow hoc2 to keep running.

How `yacc` Handles Parse Errors

By default, as we’ve seen, yacc will simply fail the parsing process on error. It doesn’t know how much of its current stack is acceptable in the presence of the error, and where in our input stream it should resume. If we want the parsing process to continue, we can add productions to our grammar to let yacc accept errors instead.

Yacc includes support for a built-in error token, which can be used wherever it makes the most sense for your language to handle an error.

  list:           /* nothing */
          |       list terminator
          |       list expr terminator
+         |       list error terminator { yyerrok; }

By allowing an error to occur as part of a list, we’re telling yacc that the error is an acceptable part of the list, and it may continue.

Errors and the Parse Stack

We’ve seen that we can have an error occur between expressions, but what if we mistype something in the middle of an expression? For example, we have an extra multiplication operator in this line:

62 * 83 - (43 + * 12) + 12

What does yacc do with:

The existing tokens & nonterminals on the parse stack?
The input characters that follow the error?

Luckily for us, yacc handles these cases exactly as one would hope:

First, it discards tokens and nonterminals from the stack until we are in a scenario that matches the rule including error. In our case, this means it will disard the parsed values of:

62 * 83 - (43 +

Second, it will throw away tokens from its input stream until it finds one that matches the error case; in our case, a terminator. This leaves the parser in a “clean” state to handle the next input.

Error Messages During Parsing Recovery

Yacc knows that, once an error has occurred, it’s quite possible that another will be seen soon after. For general parsing, this makes a lot of sense. Consider the case of recovering from a missing closing brace inside a function definition. How can the parser know when the function has truly ended? So, by default, yacc disables error messages after it’s seen an error, and waits until “enough” valid input has been read to recover.

In our case, the next line presents many new and exciting opportunities for syntax errors, so we want to see every error message. The yyerrok macro tells yacc we’re ready to receive error messages again.

Recovering from Logic Errors

We also need to handle expressions whose syntax is completely valid, but are expressing an invalid operation or logical error. This becomes more important as our language becomes more complex, but in hoc2 we’ll handle only one example: division-by-zero.

If we reach a scenario where we cannot execute the current set of input, it’s safest to completely reset our parsing process, rather than trying to move forward and hoping yacc’s stack will be valid. To that end, we will add a mechanism to report the logic error, and then go back to the beginning of the parsing process.

Our Error API

As hoc becomes more complex, we’ll want to provide more detailed error messages to users. To that end, we’ll want to expose a printf-style API for error messages.

Since we’re going to report errors from multiple locations, we want one centralized error-logging facility: the warning() function. It will allow us to add additional data into our messages.

Since we need additional behavior for execution errors (resetting our parsing state), we’ll also expose an exec_error() function with the same interface.

void warning(const char *msg, ...)
void exec_error(const char *msg, ...)

We declare prototypes for both the logging and error-recovery functions in our preamble, so that we can use them inside our parser actions and yyerror function.

  int yylex(void);
  void yyerror(const char *s);

+ __attribute__((format(printf, 1, 2)))
+ void warning(const char *msg, ...);

+ __attribute__((format(printf, 1, 2)))
+ void exec_error(const char *msg, ...);

 /*
  * Intepreter state
  */

Both of these functions are in the printf style, since they accept a “format string” and zero or more data arguments. We’d ideally like the compiler to tell us if we get the arguments wrong, just as it does for printf. Luckily, Both GCC and clang support adding the format attribute to our function declarations, which will do just that!

The only use of exec_error is inside our division action:

expr:     NUMBER      { $$ = $1; }
      |   expr '/' expr
+         {
+             if ($3 == 0) {
+               exec_error("Division by zero");
+             } else {
+               $$ = $1 / $3;
+             }
+         }
+

And the only warning() user is our yyerror function:

  void yyerror(const char *s) {
-   fprintf(stderr, "%s: %s (on line %d)\n", program_name, s, line_number);
+   warning("%s", s);
  }

Our only use of warninghoc2/hoc2.y

Implementing Error Handling

Formatted Messages

Both of our error-handling functions are variadic. Since there’s no way for one variadic function to call another, we’ll instead use a printf variant that accepts a va_list object. Since they use the same message format, it’d be nice to create a helper function for output, but this causes some compilers to forget that we just verified the input arguments, and output a warning when using -Wnonliteral-format. Instead, we use a couple of macros to keep the format consistent.

  void yyerror(const char *s) {
    warning("%s", s);
  }

+ #define print_error_prefix() fprintf(stderr, "%s: ", program_name)
+ #define print_error_suffix() fprintf(stderr, " (on line %d)\n", line_number)

helper macros for consistent formatting hoc2/hoc2.y

Now, implementing warning is as simple as converting our variadic arguments into a va_list object.


+ void warning(const char *msg, ...) {
+   va_list args;
+   print_error_prefix();
+
+   va_start(args, msg);
+   vfprintf(stderr, (msg), args);
+   va_end(args);
+
+   print_error_suffix();
+ }
+

Resetting Parsing State

In order to competely reset parsing state when an execution error occurs, we want to begin parsing input again, after reporting the error. To do this, we use a setjmp/longjmp pair (from setjmp.h) to place ourselves back at the beginning of our main() routine. (Note that we don’t reset our variables array here, though we could if desired) The jmpbuf state is declared in the C preamble:

  double previous_value = 0; // tracks '@'
+ jmp_buf begin; // used to recover from parsing errors; jump back to REPL.

It’s initialized in main():

int main(int argc, char *argv[]) {
      (void)argc;
      program_name = argv[0];
+
+     setjmp(begin);
      return yyparse();
  }

Now, all that remains is to longjmp() back to that location. Our exec_error function does just that, after warning the user of their mistake.

  void warning(const char *msg, ...) {
    ...
  }
+
+ void exec_error(const char *msg , ...) {
+   va_list args;
+   print_error_prefix();
+
+   va_start(args, msg);
+   vfprintf(stderr, (msg), args);
+   va_end(args);
+
+   print_error_suffix();
+   longjmp(begin, 0);
+ }

What’s Next

hoc2 is a good first example of adding some real functionality to the actions of our parser. In hoc3, we’ll extend this functionality to handle most of what we’d need from a real scientific calculator.

✕

Source Code

Makefile


CFLAGS ?= -std=c2x -Wall -Wextra -pedantic -Wformat -Wformat-extra-args -Wformat-nonliteral
YFLAGS ?= -d

hoc: hoc2
    cp hoc2 hoc

hoc2: hoc2.o

clean:
    rm -f hoc
    rm -f hoc2
    rm -f hoc2.o
    rm -f y.tab.h

hoc2.y


%{
///----------------------------------------------------------------
/// C Preamble
///----------------------------------------------------------------

/*
 * "higher-order calculator" - Version 2
 * From "The UNIX Programming Environment"
 */

#include <stdio.h>
#include <ctype.h>
#include <setjmp.h>
#include <stdarg.h>

int yylex(void);
void yyerror(const char *s);

__attribute__((format(printf, 1, 2)))
void warning(const char *msg, ...);

__attribute__((format(printf, 1, 2)))
void exec_error(const char *msg, ...);

/*
 * Intepreter state
 */
double variables[26]; // current values of each variable
double previous_value = 0; // tracks '@'
jmp_buf begin; // used to recover from parsing errors; jump back to REPL.

// Used for the '@' feature
#define remember(v) (previous_value = (v))
%}


%union {
  double hoc_value;
  int hoc_index;
}

%token  <hoc_value>     NUMBER
%token  <hoc_index>     VAR
%type   <hoc_value>     expr

/*
 * These rules increase in precedence as we move DOWNWARD.
 */
%right                  '='       /* right-associative, much like C */
%left                   '+' '-'   /* left-associative, same precedence */
%left                   '*' '/'   /* left-assoc; higher precedence */
%nonassoc               UNARY_MINUS  /* Even higher precedence
                        than * or /. Can't use '-', as we've used
                        it already for subtraction. */
%%

list:           /* nothing */
        |       list terminator
        |       list expr terminator
                {
                    printf("\t%.8g\n", remember($2));
                }
        |       list error terminator { yyerrok; }
;

terminator:    '\n' | ';'
;

expr:           NUMBER          { $$ = $1; }
        |       '@'             { $$ = previous_value; }
        |       VAR             { $$ = variables[$1]; }
        |       VAR '=' expr    { $$ = (variables[$1] = $3); }
        |       expr '+' expr   { $$ = $1 + $3; }
        |       expr '-' expr   { $$ = $1 - $3; }
        |       expr '*' expr   { $$ = $1 * $3; }
        |       expr '/' expr
                {
                    if ($3 == 0) {
                        exec_error("Division by zero");
                    } else {
                      $$ = $1 / $3;
                    }
                }
        |       '-'expr         %prec UNARY_MINUS { $$ = (-1 * $2); }
        |       '(' expr ')'    { $$ = $2; }
;
%% // end of grammar


/* error tracking state */
char *program_name;
int line_number = 1;

int main(int argc, char *argv[]) {
    (void)argc;
    program_name = argv[0];

    setjmp(begin);
    return yyparse();
}

/* our simple, hand-rolled lexer. */
int yylex(void) {
    int c;
    do {
        c = getchar();
    } while (c == ' ' || c == '\t');

    if (c == EOF) {
        return 0;
    }

    if (c == '.' || isdigit(c)) {
        ungetc(c, stdin);
        scanf("%lf", &yylval.hoc_value);
        return NUMBER;
    }

    if (c == '\n') {
        line_number++;
    }

    if (islower(c)) {
      yylval.hoc_index = c - 'a';
      return VAR;
    }

    return c;
}

void yyerror(const char *s) {
  warning("%s", s);
}

#define print_error_prefix() fprintf(stderr, "%s", program_name)
#define print_error_suffix() fprintf(stderr, "(on line %d)\n", line_number)

void warning(const char *msg, ...) {
  va_list args;
  print_error_prefix();

  va_start(args, msg);
  vfprintf(stderr, (msg), args);
  va_end(args);

  print_error_suffix();
}

void exec_error(const char *msg , ...) {
  va_list args;
  print_error_prefix();

  va_start(args, msg);
  vfprintf(stderr, (msg), args);
  va_end(args);

  print_error_suffix();
  longjmp(begin, 0);
}