In hoc2
, we start by extending our mathematical
expressions a bit, adding support for negative numbers. Next, we’ll
allow the user to preserve state between calculations, by adding
(single-letter) variables. We’ll also add a special
@
variable for the last value
computed.
To play with the new features, use make run-hoc2
; they’re illustrated
briefly below:
a0
= 42
a 42
+ 3
a 45
=1; b=2; c = d = e = a
a1
2
1
3.14159
3.14159
*@
@9.8695877
a2./hoc2/hoc: syntax error(on line 7)
+
a ./hoc2/hoc: syntax error(on line 9)
1/0
./hoc2/hoc: Division by zero (on line 9)
Generalizing our “expression separator” is the smallest change in hoc2. Rather than separating the expressions in our “lists” with newlines alone, we say that any “terminator” can be used to separate expressions.
list: /* nothing */- | list '\n'
- | list expr '\n' { printf("\t%.8g\n", $2); }
+ | list terminator
+ | list expr terminator
+ {
+ printf("\t%.8g\n", $2);
+ }
+ ;
+ terminator: '\n'
+ | ';'
+ ;
All of the changes for negative numbers are contained within our yacc
grammar. A “negative number”, when expressed to hoc
, is
really just shorthand for -1 * <expr>
. This operation
is called unary minus. An important detail about this
operator is that it binds more tightly than any other operation. Since
we already have a semantic meaning for -
,
we need to define negation separately to distinguish it from
subtraction.
This is done by giving it a new name, with a higher
precedence. We mark it as %nonassoc
, because it makes no sense to
have two negation operators next to each other with no joining operator.
(That is, -2 -5
doesn’t mean
anything).
%left '+' '-' /* left-associative, same precedence */
%left '*' '/' /* left-assoc; higher precedence */+ %nonassoc UNARY_MINUS /* highest precedence */
You’ll notice that we do not make any lexing changes; indeed, our
lexer would have a difficult time understanding whether it should return
-
vs UNARY_MINUS
, since it depends on the
surrounding symbols. Instead, the UNARY_MINUS
token is used
only to define a new level of precedence (since it appears
last).
Recall that yacc
uses the precedence of the rightmost
token in a production to determine the precedence of the entire
production. Yacc’s combination of tokens with precedence levels means
that to create a new precedence level, we must have a new token, even if
it’s unused.
So, the only use we have for the UNARY_MINUS
token is to
explicitly increase the precedence of the '-'expr
syntax to the
UNARY_MINUS
level.
expr: ...+ | '-'expr %prec UNARY_MINUS
+ {
+ $$ = (-1 * $2);
+ }
@
ShorthandThe special @
expression
returns the value of the previous evaluated expression.
This is relatively simple to add, since we have only a single action
where we use the “final result” of our evaluated expressions: the
list
action.
We update our C preamble to declare a variable for our “previous value”:
double previous_value = 0; // tracks '@'
We’ll write a remember()
macro to use when we want to
store a value into that previous_value
.
// Used for the '@' feature
#define remember(v) (previous_value = (v))
%}
And then we remember()
every value we display to the
user.
: ...
list| list expr terminator
{
("\t%.8g\n", remember($2));
printf}
All that remains is to add a production for the @
expression:
expr: NUMBER { $$ = $1; }+ | '@' { $$ = previous_value; }
Variables require more extensive changes than our other features; the language, lexer, and surrounding program logic all coordinate to support them:
Since our set of “potential” variables is very small
(a
–z
), the most straightforward implementation
is to keep a static array for them.
In our C prologue, we can add a declaration for our
variables
array.
+ /*
+ * Intepreter state
+ */
+ double variables[26]; // current value of each variable
Next, we will define the semantics of a variable. Our tokenizer will
need to inform the parser that it’s received a variable
token, and also produce a value the actions can use to get or
set the variable value. For our simple variables
array, an index into the array would allow both
operations.
This change wouldn’t work by itself, since we need to distinguish integers that are variable indexes from the numbers our users type. For this, we’ll need to let yacc know that our tokenizer and parser understand more than one type of value.
First, we replace our simple YYSTYPE
declaration with a
union of allowed types. Then, we’ll associate each
token with the type that they
produce.
- #define YYSTYPE double // data type of yacc's stack
+ %}
+
+ /* We declare our %union of types in the yacc section, not in C code. */
+ %union {
+ double hoc_value; /* some tokens produce doubles */
+ int hoc_index; /* some tokens produce an array index */
+ }
To associate our tokens with their type, we add the type to
their declaration; unsurprisingly, NUMBER
tokens produce
double
s, and VAR
tokens are indexes. So long
as we allow assignment expressions to also produce the value
assigned (as C does), then we can say that every expr
will
continue to produce a double
as well.
%token <hoc_value> NUMBER
%token <hoc_index> VAR
/* After evaluation, all expressions produce doubles */
%type <hoc_value> expr
The only other token required is =
; this token needs its own declaration
because it is right-associative (the right-hand side is
evaluated before the left). It also has very low precedence; we
want all the other operators on the line to be evaluated first.
%right '=' /* right-associative, much like C */
%left '+' '-' /* left-associative, same precedence */
%left '*' '/' /* left-assoc; higher precedence */
%left UNARY_MINUS /* highest precedence */
It’s quite simple for yylex
to recognize variables; the
islower()
function is part of ctype.h
. The
index of any letter, c
is simply c-'a'
. The
only other point of interest is the change in our use of
yylval
. Once our lexer can return any one of a union of
types, we are required to assign to the appropriate union field of
yylval
.
if (c == '.' || isdigit(c)) {
ungetc(c, stdin);
scanf("%lf", &yylval.hoc_value);
return NUMBER;
}
+ if (islower(c)) {
+ yylval.hoc_index = c - 'a';
+ return VAR;
+ }
Two new rules for expr
provide all the support we need
for variables. Notice that the compiler allows us to use the
VAR
values as indexes, since we’ve declared they’ll produce
integer values.
expr: ...+ | VAR { $$ = variables[$1]; }
+ | VAR '=' expr { $$ = (variables[$1] = $3); }
Also note that, unlike our lexer, the parser actions
do not need to specify which union field they are producing; we
continue to assign to $$
, not $$.hoc_value
.
That’s because yacc
determines the appropriate field based
on the type we’ve declared for the non-terminal we’re
producing. Since all expr
nonterminals produce the
hoc_value
type (double
), that’s the field
that’s updated when we assign to the $$
variable.
There are two kinds of errors we will handle for hoc2
,
parsing errors and logic errors
(e.g. divide-by-zero). We will need different mechanisms to handle each
of the situations, but in either case, the goal is the same: we want to
allow hoc2
to keep running.
yacc
Handles
Parse ErrorsBy default, as we’ve seen, yacc
will simply
fail the parsing process on error. It doesn’t know how much of
its current stack is acceptable in the presence of the error, and where
in our input stream it should resume. If we want the parsing process to
continue, we can add productions to our grammar to let yacc
accept errors instead.
Yacc includes support for a built-in error
token, which
can be used wherever it makes the most sense for your language to handle
an error.
list: /* nothing */
| list terminator
| list expr terminator+ | list error terminator { yyerrok; }
By allowing an error
to occur
as part of a list
, we’re telling
yacc
that the error is an acceptable part of the list, and
it may continue.
We’ve seen that we can have an error occur between expressions, but what if we mistype something in the middle of an expression? For example, we have an extra multiplication operator in this line:
62 * 83 - (43 + * 12) + 12
What does yacc
do with:
Luckily for us, yacc
handles these cases exactly as one
would hope:
First, it discards tokens and nonterminals from the stack until we
are in a scenario that matches the rule including
error
. In our case, this means it will disard the parsed
values of:
62 * 83 - (43 +
Second, it will throw away tokens from its input stream until it
finds one that matches the error case; in our case, a
terminator
. This leaves the parser in a “clean” state to
handle the next input.
Yacc knows that, once an error has occurred, it’s quite possible that another will be seen soon after. For general parsing, this makes a lot of sense. Consider the case of recovering from a missing closing brace inside a function definition. How can the parser know when the function has truly ended? So, by default, yacc disables error messages after it’s seen an error, and waits until “enough” valid input has been read to recover.
In our case, the next line presents many new and exciting
opportunities for syntax errors, so we want to see every error
message. The yyerrok
macro tells yacc
we’re
ready to receive error messages again.
We also need to handle expressions whose syntax is completely valid,
but are expressing an invalid operation or logical error. This becomes
more important as our language becomes more complex, but in
hoc2
we’ll handle only one example: division-by-zero.
If we reach a scenario where we cannot execute the current set of
input, it’s safest to completely reset our parsing process, rather than
trying to move forward and hoping yacc
’s stack will be
valid. To that end, we will add a mechanism to report the logic error,
and then go back to the beginning of the parsing process.
As hoc
becomes more complex, we’ll want to provide more
detailed error messages to users. To that end, we’ll want to expose a
printf
-style API for error messages.
Since we’re going to report errors from multiple locations, we want
one centralized error-logging facility: the warning()
function. It will allow us to add additional data into our messages.
Since we need additional behavior for execution errors (resetting our
parsing state), we’ll also expose an exec_error()
function
with the same interface.
void warning(const char *msg, ...)
void exec_error(const char *msg, ...)
We declare prototypes for both the logging and error-recovery
functions in our preamble, so that we can use them inside our
parser actions and yyerror
function.
int yylex(void);
void yyerror(const char *s);
+ __attribute__((format(printf, 1, 2)))
+ void warning(const char *msg, ...);
+ __attribute__((format(printf, 1, 2)))
+ void exec_error(const char *msg, ...);
/*
* Intepreter state */
Both of these functions are in the printf
style, since
they accept a “format string” and zero or more data arguments. We’d
ideally like the compiler to tell us if we get the arguments wrong, just
as it does for printf
. Luckily, Both GCC and clang support
adding the format
attribute to our function declarations, which will do just that!
The only use of exec_error
is inside our division
action:
expr: NUMBER { $$ = $1; }
| expr '/' expr+ {
+ if ($3 == 0) {
+ exec_error("Division by zero");
+ } else {
+ $$ = $1 / $3;
+ }
+ }
+
And the only warning()
user is our yyerror
function:
void yyerror(const char *s) {- fprintf(stderr, "%s: %s (on line %d)\n", program_name, s, line_number);
+ warning("%s", s);
}
Our only use of
warning
hoc2/hoc2.y
Both of our error-handling functions are variadic. Since there’s no
way for one variadic function to call another, we’ll instead use a
printf
variant that accepts a va_list
object.
Since they use the same message format, it’d be nice to create a helper
function for output, but this causes some compilers to forget that we
just verified the input arguments, and output a warning when using
-Wnonliteral-format
. Instead, we use a couple of macros to
keep the format consistent.
void yyerror(const char *s) {
warning("%s", s);
}
+ #define print_error_prefix() fprintf(stderr, "%s: ", program_name)
+ #define print_error_suffix() fprintf(stderr, " (on line %d)\n", line_number)
helper macros for consistent formatting hoc2/hoc2.y
Now, implementing warning
is as simple as converting our
variadic arguments into a va_list
object.
+ void warning(const char *msg, ...) {
+ va_list args;
+ print_error_prefix();
+
+ va_start(args, msg);
+ vfprintf(stderr, (msg), args);
+ va_end(args);
+
+ print_error_suffix();
+ }
+
In order to competely reset parsing state when an execution error
occurs, we want to begin parsing input again, after reporting the error.
To do this, we use a setjmp
/longjmp
pair (from
setjmp.h
) to place ourselves back
at the beginning of our main()
routine. (Note that we
don’t reset our variables
array here,
though we could if desired) The jmpbuf
state is
declared in the C preamble:
double previous_value = 0; // tracks '@'+ jmp_buf begin; // used to recover from parsing errors; jump back to REPL.
It’s initialized in main()
:
int main(int argc, char *argv[]) {
(void)argc;
program_name = argv[0];+
+ setjmp(begin);
return yyparse(); }
Now, all that remains is to longjmp()
back to that
location. Our exec_error
function does just that, after
warning the user of their mistake.
void warning(const char *msg, ...) {
...
}+
+ void exec_error(const char *msg , ...) {
+ va_list args;
+ print_error_prefix();
+
+ va_start(args, msg);
+ vfprintf(stderr, (msg), args);
+ va_end(args);
+
+ print_error_suffix();
+ longjmp(begin, 0);
+ }
hoc2
is a good first example of adding some real
functionality to the actions of our parser. In hoc3
, we’ll extend this functionality
to handle most of what we’d need from a real scientific calculator.
?= -std=c2x -Wall -Wextra -pedantic -Wformat -Wformat-extra-args -Wformat-nonliteral
CFLAGS ?= -d
YFLAGS
: hoc2
hoc
cp hoc2 hoc
: hoc2.o
hoc2
:
clean-f hoc
rm -f hoc2
rm -f hoc2.o
rm -f y.tab.h rm
%{
///----------------------------------------------------------------
/// C Preamble
///----------------------------------------------------------------
/*
* "higher-order calculator" - Version 2
* From "The UNIX Programming Environment"
*/
#include <stdio.h>
#include <ctype.h>
#include <setjmp.h>
#include <stdarg.h>
int yylex(void);
void yyerror(const char *s);
((format(printf, 1, 2)))
__attribute__void warning(const char *msg, ...);
((format(printf, 1, 2)))
__attribute__void exec_error(const char *msg, ...);
/*
* Intepreter state
*/
double variables[26]; // current values of each variable
double previous_value = 0; // tracks '@'
; // used to recover from parsing errors; jump back to REPL.
jmp_buf begin
// Used for the '@' feature
#define remember(v) (previous_value = (v))
%}
%union {
double hoc_value;
int hoc_index;
}
%token <hoc_value> NUMBER
%token <hoc_index> VAR
%type <hoc_value> expr
/*
* These rules increase in precedence as we move DOWNWARD.
*/
%right '=' /* right-associative, much like C */
%left '+' '-' /* left-associative, same precedence */
%left '*' '/' /* left-assoc; higher precedence */
%nonassoc UNARY_MINUS /* Even higher precedence
than * or /. Can't use '-', as we've used
it already for subtraction. */
%%
: /* nothing */
list| list terminator
| list expr terminator
{
("\t%.8g\n", remember($2));
printf}
| list error terminator { yyerrok; }
;
: '\n' | ';'
terminator;
: NUMBER { $$ = $1; }
expr| '@' { $$ = previous_value; }
| VAR { $$ = variables[$1]; }
| VAR '=' expr { $$ = (variables[$1] = $3); }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr
{
if ($3 == 0) {
("Division by zero");
exec_error} else {
= $1 / $3;
$$ }
}
| '-'expr %prec UNARY_MINUS { $$ = (-1 * $2); }
| '(' expr ')' { $$ = $2; }
;
%% // end of grammar
/* error tracking state */
char *program_name;
int line_number = 1;
int main(int argc, char *argv[]) {
(void)argc;
= argv[0];
program_name
(begin);
setjmpreturn yyparse();
}
/* our simple, hand-rolled lexer. */
int yylex(void) {
int c;
do {
= getchar();
c } while (c == ' ' || c == '\t');
if (c == EOF) {
return 0;
}
if (c == '.' || isdigit(c)) {
(c, stdin);
ungetc("%lf", &yylval.hoc_value);
scanfreturn NUMBER;
}
if (c == '\n') {
++;
line_number}
if (islower(c)) {
.hoc_index = c - 'a';
yylvalreturn VAR;
}
return c;
}
void yyerror(const char *s) {
("%s", s);
warning}
#define print_error_prefix() fprintf(stderr, "%s", program_name)
#define print_error_suffix() fprintf(stderr, "(on line %d)\n", line_number)
void warning(const char *msg, ...) {
va_list args;
();
print_error_prefix
(args, msg);
va_start(stderr, (msg), args);
vfprintf(args);
va_end
();
print_error_suffix}
void exec_error(const char *msg , ...) {
va_list args;
();
print_error_prefix
(args, msg);
va_start(stderr, (msg), args);
vfprintf(args);
va_end
();
print_error_suffix(begin, 0);
longjmp}