The last version of hoc defined in UPE, hoc6
provides some major features that complete its transition from
calculator to minimally-useful scripting language.
Users can now define their own subroutines to create reusable
behavior. The implementation changes for this feature are probably—with
the exception of our transition from hoc3 to
hoc4—the largest overall.
The print statement is much
improved in hoc6. We can print multiple comma-separated
expressions, and even more importantly, string literals!
hoc6 enables hoc programs to read input data from files
and/or the console. Together with the enhanced print
statement, hoc programs can now prompt the user for input,
read pre-calculated data from files, and engage interactively with
users.
Subroutines are divided into two categories: functions, which are required to return a value, and procedures, which are required not to. You must decide what kind of subroutine you’re writing up-front, because they use different keywords:
The func keyword defines a
function:
func theAnswer() {
return 42
}
A simple hoc
function
Whereas the proc keyword is
used for procedures:
proc tellMeTheAnswer() {
print 42
}
A simple hoc
procedure
Procedures are still allowed to use the return keyword in order to exit early,
but it must be a “bare return”, with no value.
Subroutines may call other subroutines or themselves; they can also use variables freely. Note, however, that the variables defined inside a subroutine are not scoped; they’ll still be available outside of the routine!
The following example returns the factorial of the global variable
x. If x isn’t defined, we’ll get an error when
we call it.
func factorialOfX() {
if (x <= 0) {
return 1
}
f = x
result = 1
while (f > 1) {
result = result * f
f = f - 1
}
return result
}
A hoc function that uses
variables
You can see that all of the variables used inside the function are still available after it returns.
factorialOfX()
./hoc6/hoc: Undefined variable x (on line 15)
x = 5
factorialOfX()
120
x
5
f
1
result
120
When defining a subroutine in hoc6, there is no list of
formal parameters. Instead, the function/procedure body can freely use
the syntax $1, $2, etc. to refer to the first, second,
etc. argument. We can use this to make a much nicer factorial
function:
func fac() {
if ($1 <= 0) {
return 1;
} else {
return $1 * fac($1-1)
}
}fac(k) computes k!
When executing your subroutine, hoc will only verify
that such an argument was actually passed when you try to
use it.
fac(11)
39916800
fac()
./hoc6/hoc: Not enough arguments passed to subroutine: fac (on line 10)
The arguments to a subroutine are passed by value; we are allowed to
modify their values without changing them in the caller. Notice
how we change $1 at the beginning
of compare() below.
func compare() {
$1 = $1 - $2
if ($1 < 0) {
return -1;
} else if ($1 == 0) {
return 0;
}
return 1;
}compare(x, y) returns
-1, 0, or 1, for x < y, x = y, or x > y.
hoc6 has a much-improved print statement. Not only can
multiple expressions be printed, e.g:
print PI, sin(PI), cos(PI)
We can also now print string literals, mixed with arbitrary expressions.
print "The sine value of ", PI, " is ", int(sin(PI)), "."Importantly, this is the only functionality supported for strings. They may not be:
In hoc6, we add a new builtin:
read(varName)
This allows our hoc programs to read user input into a
variable and check that the input was valid in a single
call.
First, when starting hoc, we can specify a set of one or more input
files, and/or stdin. The read(varName)
function will read from one file at a time; when a file is exhausted,
hoc transparently moves to the next file.
We’d use the command ./hoc6 datafile to read input from
datafile, and ./hoc6 - to read from stdin
(presumably, the console).
This can be combined arbitrarily. The following shippet first reads
variables from two static files, and once they’re exhausted, will read
any remaining data from stdin.
./hoc6 input-file1 input-file2 -
hoc6To read a double from our current input sources, you can use the
read(myVar) statement. hoc responds to the read request as
follows:
hoc will attempt to read a double from
the current input source
If the data is valid, it will be stored into the variable named
myVar, and read(myVar) will return
1. If we are out of data,
myVar is set to 0 and read(myVar)
returns 0. If the data was invalid, we get
an exec_error.
This allows us to write programs that interactively use
stdin during execution, such as:
> ./hoc -
{
print "Enter some numbers; press EOF when finished: \n"
sum = 0
while (read(x)) {
sum = sum + x
}
print "The sum is: ", sum, "\n"
}
Enter some numbers; press EOF when finished:
1 2 3 4 5 6 7 8
<Ctrl-D pressed>
The sum is: 36
<We can continue using hoc here>
We can think of a user-defined subroutine as a named set of machine
instructions that are permanently installed in the
hoc virtual machine. When we define a subroutine, we
install the instructions, but we don’t execute them. This is a
significant change; until now, hoc has executed all
top-level statements as soon as they’re recognized.
We’ll break down the changes into meaningful areas, as described below. step-by-step. Since procedures and functions only differ in their return values, our discussion applies to both.
prog area.prog space - Previously, our
hoc machine overwrote whatever code had been installed in
prog as soon as it was executed. We need a way to mark
subroutine instructions as “reserved” so that they are not overwritten
by new statements.First, let’s add the ability to recognize a subroutine definition. We have two kinds of subroutines, so we should examine their components:
func subrName() {
<statements>
}
Some components of this syntax we already have; the block of
statements can nicely be handled by stmtlist, for
example.
However, we’ll need to add the func & proc keywords, as well as adding parser
actions to associate the subroutine’s name Symbol with
its instructions.
Like loops and conditionals, it’ll be helpful to have new
tokens—PROC_KW and FUNC_KW—for our “introductory” keywords
proc and func. This will let us take action
before any other components of the declaration are parsed.
-%token <hoc_symbol> IF_KW ELSE_KW PRINT_KW WHILE_KW
+%token <hoc_symbol> IF_KW ELSE_KW PRINT_KW WHILE_KW FUNC_KW PROC_KWNew keywords for subroutine declarations hoc6/hoc.y
The keywords must also be added to our language builtins.
static struct hoc_keyword {
const char *name;
int token_type;
} keywords[] = {
{"if", IF_KEYWORD},
...
+ {"func", FUNC_KW},
+ {"proc", PROC_KW},
{NULL, 0},
};New keywords for subroutines hoc6/builtins.c
Once we’ve found a func or
proc keyword, we want to parse the
subroutine’s name, and create an associated Symbol
object. To that end, we’ll need a new Symbol type; in fact,
we’ll define two, so that we can handle the differences in
return-value behavior. For lack of a better alternative, we’ll name them
UPROC and UFUNC, for procedures and functions
respectively.
%token <hoc_value> NUMBER
- %token <hoc_symbol> VAR CONST BUILTIN UNDEF
+ %token <hoc_symbol> VAR CONST BUILTIN UNDEF UFUNC UPROC
%type <hoc_symbol> assignableNew token types for subroutine declarations hoc6/hoc.y
The rules for naming a function will be the same as those for naming
any other symbol, so we can reuse the VAR token for parsing
a subroutine’s name. However, when we eventually call a
subroutine, we’ll need to be able to differentiate them from variables
and builtin functions, so we create a small wrapper, subr_decl, to override the installed
Symbol’s type. This is the same pattern we used for
assignable before, to handle VAR vs
CONST symbols.
First, we’ll declare that, like our assignable wrapper,
subr_decl procuces a Symbol.
%token <hoc_value> NUMBER
- %type <hoc_symbol> assignable
+ %type <hoc_symbol> assignable subr_declsubr_decl produces a
UFUNC or UPROC hoc6/hoc.y
The production for a subr_decl is easy, since we have
our introductory keywords letting us know when a declaration has begun.
Our only job is to override the Symbol.type installed by
the lexer.
+ subr_decl: PROC_KW VAR
+ {
+ $2->type = UPROC;
+ $$ = $2;
+ }
+ | FUNC_KW VAR
+ {
+ $2->type = UFUNC;
+ $$ = $2;
+ }
+ ;The subr_decl production
updates Symbol types hoc6/hoc.y
$2->type before
this production runs?
yylex() function installs it as
UNDEF.
We can complete our language support for subroutine definnitions by recognizing their overall shape, including the body.
Modeling ourselves after our if and while
constructs, we note that these declarations should be
statements, and we can use the stmt
production to recognize their bodies.
Like all statements, procedure declarations return the address at
which they begin. These declarations are entirely defined by their
body, so we’ll return the address of their body stmt.
stmt: expr { install_instruction(inst_pop); }
| '{' stmtlist '}' { $$ = $2; }
+ | subr_decl '(' ')' stmt
+ {
+ $$ = $4;
+ }A subroutine declaration is a new kind of
stmt hoc6/hoc.y
Since user-defined subroutines are really just a set of instructions,
a Symbol referring to a subroutine is really holding the
address of the associated instructions.
It seems like the time to add an Addr member to the
Symbol.data union, but unfortunately, this causes a new
problem. Remember that the MachineCode struct already
contains a Symbol pointer, because that’s how we represent
symbol-table references. This causes cyclic dependencies between the
Symbol and MachineCode types.
We can solve this by moving our forward declaration of
MachineCode & Addr above our
Symbol definition in hoc.h, and let the
compiler resolve the full struct layout.
+ ///----------------------------------------------------------------
+ /// Addresses (Shared by multiple types)
+ ///----------------------------------------------------------------
+ typedef struct MachineCode MachineCode;
+ typedef MachineCode *Addr;
...
struct Symbol {
char *name;
short type; // Generated from our token types
Symbol *next;
union {
double val;
+ Addr addr;
struct BuiltinFunc func;
} data;
};
+ ///----------------------------------------------------------------
+ /// Machine Code
+ ///----------------------------------------------------------------
...
- typedef struct MachineCode MachineCode;
- typedef MachineCode *Addr;We need a way to “install” routines into our machine permanently, so
that we can execute them multiple times. The biggest obstacle to this
process is our machine_reset_program() function, which
resets progp every time we execute a parsed statement. We
need a way to prevent our subroutine bodies from being overwritten.
One workable method is to mark all space from prog to
the end of the procedure as “reserved”. Instead of resetting
progp back to the beginning of program memory, we’ll use a
new variable, prog_start, which indicates the first
non-reserved word that we can use.
MachineCode prog[PROGSIZE];
+ Addr prog_start = prog; // No installation above this address
Addr progp; // Next free location for code installation
static Addr pc; // the current location in the executing programThe prog* variables have the following relationship:
Now, when we define a routine, we’ll preserve its code by shifting
prog_start to just after its body. This is
handled by reserve_subr, a new machine function. That
function also updates the given Symbol to point to the
location where the subroutine begins.
/**
* Reserves all currently-installed code as a subroutine, under Symbol
* `subr_sym`. The value of `subr_sym` will be updated to the address
* of the installed routine. Ensures that the code will not be
* overwritten by future installations, even after a machine reset.
*/
void reserve_subr(Symbol *subr_sym); Addr install_code(MachineCode mc) {
*progp = mc;
Addr result = progp++; // return location of THIS instruction
return result;
}
+ void reserve_subr(Symbol *subr_sym) {
+ subr_sym->data.addr = prog_start;
+ prog_start = progp;
+ }Then, we make machine_reset_program() aware of
prog_start, so that we never again overwrite any code
between prog and prog_start.
void machine_reset_program(void) {
stackp = stack;
- progp = progp;
+ progp = prog_start;Finally, we update our parser action for subroutine-declaration
statements to call our new reservation function and populate the
Symbol value. Now, writing a subroutine declaration will
install code, but won’t execute any!
stmt: expr { install_instruction(inst_pop); }
| '{' stmtlist '}' { $$ = $2; }
| subr_decl '(' ')' stmt
{
+ reserve_subr($1);
$$ = $4;
}hoc.y: Reserving space for
procedures hoc6/hoc.y
stmt, and where do they come from?
STOP. It’s installed by the list: list stmt production, and since
prog_start is updated before the
list production runs, it will be evaluated and then
complete.
If you are following along, now is a good time to test your
implementation. We can do that by adding a temporary expr
production that understands how to produce a value for an installed
declaration.
%%
expr: ...
... | UPROC
{
$$ = install_instruction(inst_pushlit);
install_literal($1->data.addr - prog);
}
| UFUNC
{
$$ = install_instruction(inst_pushlit);
install_literal($1->data.addr - prog);
}Now, using the function/procedure name as a variable will evaluate to
its address, relative to prog. As expected, a more complex
body (like calling a builtin), corresponds to a larger block of
instructions.
./hoc6/hoc
proc f1() { x = 123 + 345 }
proc f2() { print sin(cos(x+42)) }
proc f3() { y = 10 }
f1
0
f2
10
f3
24
hoc now correctly installs our subroutines into
permanent storage. We now focus on calling those subroutines,
which requires a coordinated dance between the caller and called
code.
The basic idea of executing subroutines in our hoc VM involves a combination of two cooperating instructions:
subrexec, to enter
a named subroutine, saving our current location
uprocret, to
return from a procedure to the saved location. (There’s also a
ufuncret instruction, covered when
we implement return values.)
We’ll add this functionality in stages:
Representing Subroutine Calls & Callers - Whenever we call a procedure, we must keep track of where we will resume execution afterwards. Since subroutines may call other subroutines (or themselves), we need to track an arbitrary stack of callers.
Language Changes
Executing Subroutines
Returning to Callers
The bare minimum required for executing a reserved block of code
—ignoring, for now, arguments and return values—is the ability to jump
to the subroutine’s address, then jump back the caller
after executing. The origin / destination data for a call is encoded
into a call frame. Unlike most of our objects, the
Frame structure has no use outside of the machine module,
and so we can define it inside machine.c.
typedef struct Frame {
Symbol *subr_called; // the UFUNC/UPROC we have jumped to
Addr ret_pc; // return location
} Frame;Below is an example of how relationship between the subrexec instruction, the Frame it creates, and the uprocret instruction, which uses the
information to return to the caller.
Because subroutines may call other subroutines (or themselves), a
single Frame of call data is not
sufficient. Instead, this information is often organized into a
stack of per-call data, imaginatively named the call
stack. Since a Frame is pushed onto the stack
every time a subroutine is invoked, and popped when a routine returns to
its caller, the currently-executing routine always has access to its own
data at the top of the stack.
#define FRAMELIMIT 100 // Recursion depth limit
static Frame frames[FRAMELIMIT];
static const Frame *OVERFLOW_FRAME = frames + FRAMELIMIT;
static Frame *fp = frames; // Next frame to useWe also add some simple functions for manipulating the frame stack.
/**
* Push a new frame for a subroutine call onto the frame stack, or fail on
* overflow.
*
* The frame is initialized as a call to Symbol `s`, returning to address
* `caller_pc`. Caller must initialize all other frame data.
*/
static Frame *frame_push(Symbol *s, Addr caller_pc) {
if (fp == OVERFLOW_FRAME) {
exec_error("Call-Stack Overflow while calling '%s'", s->name);
}
fp->subr_called = s;
fp->ret_pc = caller_pc;
return fp++;
}
/** Pop most recent Frame from the frame stack, or fail on underflow */
static Frame *frame_pop(void) {
if (fp == frames) {
exec_error("Call-Stack Underflow while attempting to return");
}
return --fp;
}
/** Access top of frame stack, or fail if no Frame exists */
static Frame *frame_peek(void) {
if (fp == frames) {
exec_error("Call-Stack Underflow while attempting to peek at top frame");
}
return fp - 1;
}subrexec
InstructionWith our new call stack, we can implement the subrexec instruction, which performs the
work to actually change our PC location and push a new Frame onto our stack. This instruction
is always two words long, since we also need the Symbol
which points to our procedure.
To actually run the subroutine, we use a recursive
execute() call, using Symbol->data.addr as
the pc value.
int inst_subrexec(void) {
Symbol *sym = (pc++)->symbol;
Frame *f = frame_push(sym, pc);
execute(sym->data.addr);
return 0;
}While it may seem as though we could simply change the value of the
PC in order to “jump” to the subroutine, a nested execute()
call is more compatible with with our implementation of
ifcode and whilecode. Since those instructions
use recursion, a direct-jump subroutine implementation without
using recursion would lead to executing code in C stack frames we would
not expect.
As an example, consider the following hoc program.
subrexec and the uprocret instructions, what would this
program print?
Most likely, something like:
1
2
4
5
3
Because our ifcode instructions recurse, we’re inside 4
execute() functions by the time we reach
print 1:
execute() from our REPLexecute() inside inst_ifcode for
if (1).execute() inside inst_ifcode for
if (2).execute() inside inst_ifcode for
if (3).Now, after we finish print 1, we return,
and we assume that merely jumps to the instructions for
print 2. Since that’s the end of a stmtlist
block, the following instruction will be a STOP.
Now things get odd. This will exit the fourth layer
of the execute() function, bringing us back to the
inst_ifcode we were running for if (3)…
inside our procedure! And that will begin running
print 4! There are many problematic situations we can
create for ourselves in this manner.
To install the subrexec instruction, we must add the
language rules for calling a function or procedure. For now, we’ll
disallow specifying any arguments, so the address we produce for the
parser will just be the instruction for calling the procedure /
function. Notice that procedure calls, which don’t return a
value, must be statements, as they don’t leave a value
on the stack.
%%
stmt: ...
| UPROC '(' ')'
{
$$ = install_instruction(inst_subrexec);
install_ref($1);
}hoc.y: Calling
procedures hoc6/hoc.y
Function calls are expressions, and so they can be used
inside conditions, etc. (The job of placing a value onto the data stack
will be handled by the return statement of the function;
trying to call a function at this point will cause errors!)
%%
expr: ...
| UFUNC '(' ')'
{
$$ = install_instruction(inst_subrexec);
install_ref($1);
}hoc.y: Calling
functions hoc6/hoc.y
Our procedures must, at some point, return to their caller. We need a
facility to “jump back” to our calling code. Since we’re executing
recursively, this means we need to add a way to signal that it’s time to
return from our current execute() call, and update the
pc value to match our return
address.
For hoc6, we will use a simple implementation, adding an
is_returning flag that is examined by
execute(). Whenever that flag is true, the VM knows it
should just be unwinding nested execute calls, so the
execute() function returns immediately, without
incrementing the pc value again or executing anything.
That flag should stay true until we find ourselves back int the
inst_subrexec function, since we’ve returned from the
called subroutine. At that point, we can safely set the flag to false,
knowing that the return instruction has already updated the
PC.
You can see a visual representation of this flow below:
subrexec runs within some execute() call
(dotted). It finds the procedure address, and starts its own
execute() flow.uprocret instructionuprocret sets pc = frame.ret_pc, and sets the is_returning flagexecute() sees the flag is enabled, and returns to
subrexecsubrexec disables is_returning, and
returns to its enclosing execute() call. the
pc value is still set to the value from
uprocret and not updated - execution continues
correctly.
First, we must add the flag for is_returning:
#define FRAMELIMIT 100 // Recursion depth limit
#define OVERFLOW_FRAME (frames + FRAMELIMIT)
static Frame frames[FRAMELIMIT];
static Frame *fp = frames; // Next frame to use
+ static bool is_returning; // true when returning from a user subroutineNext, we’ll add the uprocret instruction; it must:
Frame from our frame stack, so we know
our return addresspc to that return address, so that execution
can resume in our caller.is_returning true, so that the next
execute() iteration returns early, and we may return to
subrexec.int inst_uprocret(void) {
Frame *f = frame_pop();
is_returning = true;
pc = f->ret_pc;
return 0;
}A “return” instruction hoc6/machine.c
Finally, we update execute to respect this state.
void execute(Addr start_addr) {
pc = start_addr;
+ is_returning = false;
- while (!is_stop_inst(pc)) {
+ while (!is_stop_inst(pc) && !is_returning) {There’s one other aspect of our execution flow that must be considered before we’re done. To see it, consider the following hoc program, and decide what output will result when running it.
tester = 0
proc earlyReturnTest() {
if (tester > 0) {
print 2
return
}
print 1
}
The reason for this bug lies in our current implementation of the
ifcode and whilecode instructions, and the
nested execution flow we’re using. Because ifcode and
whilecode also overwrite the
pc value after their nested execute() calls,
they may override the pc value just set by
uprocret!
Consider the following flow:
Looking back to our example, the return keyword inside
our function sets pc and is_returning. We
immediately return to the ifcode function, whose nested
execute just ended. But the inst_ifcode function
also sets the PC value, overriding our return
location.
The relevant parts of hoc5’s inst_ifcode()
are commented below:
int inst_ifcode(void) {
Addr ifbody_addr = pc[0].addr;
...
Addr end_addr = pc[2].addr;
Addr cond_addr = pc + 3;
execute(cond_addr);
Datum d = stack_pop();
if (d.value) {
execute(ifbody_addr); // uprocret changes our PC inside here
}
pc = end_addr; // and it's overwritten here... :(
return 0;
}In hoc5,
inst_ifcode unconditionally overwrites pc to
end_addr
How can we make our return
keyword force its way past any enclosing if and
while blocks, until it reaches its parent
subrexec instruction?
is_returning ChecksAs the UPE authors suggest, a simple (if hacky) fix, is to make these
instructions aware of the our is_returning state, since
they are part of our ‘pc-change’ flow. Before making any manual changes
to the pc value, we check whether we’re actually returning
(and so should stop executing). This lets us “bubble up” from a
return statement, through any enclosing ifs or
whiles, to the subroutine we’re returning from! Here’s our
fixed ifcode flow:
The actual changes are minor:
inst_ifcode int inst_ifcode(void) {
...
execute(cond_addr);
Datum d = stack_pop(); // must be value
if (d.value) {
execute(ifbody_addr);
} else if (maybe_elsebody.type == CT_ADDR) {
execute(maybe_elsebody.addr);
}
}
- pc = end_addr;
+ if (!is_returning) {
+ pc = end_addr;
+ }
return 0;
}inst_whilecode int inst_whilecode(void) {
Addr body_addr = (pc++)->addr;
Addr end_addr = (pc++)->addr;
Addr cond_addr = pc;
while (true) {
+ if (is_returning) {
+ return 0; // PC is already set to the appropriate address
+ }
execute(cond_addr);
if (stack_pop().value) {
execute(body_addr);
} else {
break;
}
}
// condition false
pc = end_addr;
return 0;
}return
StatementsTo allow users to trigger the return, we add a return
keyword to our token declarations and language
builtins:
- %token <hoc_symbol> IF_KW ... FUNC_KW PROC_KW
+ %token <hoc_symbol> IF_KW ... FUNC_KW PROC_KW RETURN_KWstatic struct {
const char *name;
int token_type;
} keywords[] = {
{"if", IF_KW},
{"else", ELSE_KW},
{"func", FUNC_KW},
{"proc", PROC_KW},
{"print", PRINT_KW},
{"while", WHILE_KW},
+ {"return", RETURN_KW},
{NULL, 0},
};And a return statement to install the
uprocret machine instruction. The return statement has an
interesting new feature: it should only be used while
defining a subroutine.
Adding the keyword itself is straightforward; we add the
RET_KEYWORD to hoc.y and the associated
"return" string to builtins.c.
But in order to prevent users from writing a return
statement outside of a subroutine, we’ll need to know, while executing
the parser action, whether we’re currently defining a
subroutine or not.
To do this, we’ll add a new piece of parser state:
current_subr.
/** Tracks the symbol for the subroutine currently being parsed (if any). */
Symbol *current_subr = NULL;While our parser is parsing a subroutine, we want this variable to
track the Symbol of that subroutine. Once we’ve finished
parsing the routine, we must reset it to NULL. This is most
easily done inside our subr_decl action:
subr_decl: PROC_KW VAR
{
$2->type = UPROC;
+ current_subr = $2;
$$ = $2;
}
| FUNC_KW VAR
{
$2->type = UFUNC;
+ current_subr = $2;
$$ = $2;
}
;And then removing the definition after parsing the body
stmt: ...
| subr_decl '(' ')' stmt
{
reserve_subr($1);
+ current_subr = NULL;
$$ = $4;
}Once we have access to the current subroutine, enforcing that return statements can’t happen outside of them is trivial.
stmt: expr { install_instruction(inst_pop); }
...
+ | RETURN_KW
+ {
+ if (current_subr == NULL) {
+ exec_error("Syntax error - "
+ "`return` keyword used outside of a subroutine");
+ }
+ if (current_subr->type != UPROC) {
+ exec_error("Syntax error - "
+ "bare return used outside of a procedure");
+ }
+ $$ = install_instruction(inst_uprocret);
+ }It’s all well and good to allow users to return from their
procedures, but what if they forget, or their procedure simply runs
until it ends? We need to make absolutely sure that our
procedures will return to their caller’s location. To do this, we can
simply force every procedure to end with a
uprocret instruction, during our parser action for
procedure declarations. At this point, the procedure body has been
installed, so we can safely add one more instruction to the end of the
procedure’s body before reserving space for it in our machine.
stmt: ...
| subr_name '(' ')' stmt
{
+ /*
+ * Note: the order of these two lines matters!
+ * `uprocret` should be part of the subr body,
+ * so reserve space *after* installing it.
+ */
+ install_instruction(inst_uprocret);
reserve_subr($1);
current_subr = NULL;
$$ = $4;
}Forcing returns from procedures hoc6/hoc.y
This is another good place to test your implementation; no additional code should be required for defining & calling procedures.
At this point, we can declare procedures and then call them. Next, we will implement passing data into our subroutines, via arguments, and then accessing that data inside our subroutine bodies.
Arguments to our subroutines will be managed on our data stack, must be sure to add and remove them on entrance and exit.
First, we must decide how to encode arguments for our machine to find
when executing subrexec. We’ve seen one way of passing
arguments already in hoc3, when using BUILTIN
functions. Those functions had a fixed number of arguments; those
values were stored on the data stack, and each inst_callN
variant knew how many to remove.
Our subroutines, on the other hand, allow an arbitrary number of arguments. During a call, arguments will be placed onto the (data) stack, and stay there for the entire duration of the call. In order for the subroutines to use the arguments, we must be able to locate them on the stack; furthermore, to avoid invalid memory accesses, we should check that an argument exists before trying to access it.
We’ll encode the number of arguments and their stack
location inside our Frame during our
subrexec call. The number of arguments will be generated as
part of the subrexec instruction, to ensure that we only
access paramters that actually exist. The stack location is
only known at call time, so we’ll update our subrexec
instruction to populate that value.
Then, our subroutine bodies will be able to generate instructions
that can lookup argument values, using this data from the current
Frame.
First, we’ll update our Frame struct with fields to
represent arguments passed.
typedef struct Frame {
Symbol *subr_called; // the UFUNC/UPROC we have jumped to
Addr ret_pc; // return location
+ Datum *argp; // pointer to first argument
+ int arg_count; // number of arguments
} Frame;Arguments add two additional responsibilities to our parser. The first is some additional complexity around the addresses of our procedure-call statements. Consider the hoc statement:
callMyProcedure()
Previously, this stmt generated a single, two-word
instruction, subrexec, and the stmt production
for the call produced the address of the first word.
%%
stmt: ...
| UPROC '(' ')'
{
$$ = install_instruction(inst_subrexec);
install_ref($1);
}Now, if we instead call a subroutine that requires arguments, like:
add3(6, 14, 7)
Then the address of the subrexec instruction is
not sufficient. Like BUILTINs, we require the
address where our argument expressions begin to be
generated.
The second need is our requirement that the parser should tell us how many arguments were passed in a given call, so that it can generate our new version of subrexec, which includes the number of arguments.
You can see the required information in the diagram below:
We need a new piece of syntax for tracking the start address and number of argument expressions in a call.
To track a “count” value, we’ll add yet another new value type to our
parser – integer. This’ll be
useful for cases where we want to refer to an “index” or “count”, rather
than trying to use a double to specify a whole number
value, such as “3 arguments passed” or “get argument for parameter
2”.
%union {
double hoc_value;
Symbol *hoc_symbol;
Addr hoc_address;
+ int hoc_integer;
}
%type <hoc_symbol> assignable subr_nameSince we need two new pieces of information, we’ll create two new non-terminals to represent them inside our subroutine calls.
First, we’ll create a non-terminal, arglist, that accepts a list of
expressions, and produces the number of expressions in
that list.
%type <hoc_symbol> assignable subr_decl
+ %type <hoc_integer> arglistFor our arglist production, the expressions themselves
are still installed by the expr nonterminal; the
arglist action only cares about how many expressions it has
seen. It counts them by using itself recursively; we also include a
special case for a single argument.
%%
arglist: /* nothing */ { $$ = 0; }
| expr { $$ = 1; }
| arglist ',' expr { $$ = $1 + 1; }Next, we need to track the address where our argument expressions
begin to be installed. We’ve done this kind of thing before, with the
end production in our if and
while statements. We create a new args_start
production to to “mark our place” when we start parsing an argument
list.
It produces an address:
- %type <hoc_address> assign call end expr if stmt stmtlist while
- %type <hoc_address> args_start assign call end expr if stmt stmtlist whileAnd it operates almost like the end production; however,
it doesn’t install anything.
%%
args_start: /* nothing */ { $$ = progp; }
;Finally, we can add these nonterminals into our procedure and function call syntax.
Procedure Calls
stmt: expr { install_instruction(inst_pop); }
| '{' stmtlist '}' { $$ = $2; }
- | UPROC '(' ')'
+ | UPROC '(' args_start arglist ')'
{
- $$ = install_instruction(inst_subrexec);
+ install_instruction(inst_subrexec);
install_ref($1);
+ install_literal($4);
+ $$ = $3;
}Function Calls
expr: NUMBER
...
- | UFUNC '(' ')'
+ | UFUNC '(' args_start arglist ')'
{
- $$ = install_instruction(inst_subrexec);
+ install_instruction(inst_subrexec);
install_ref($1);
+ install_literal($4);
+ $$ = $3;
}Then we’ll need to update inst_subrexec to handle the
new information in our instruction format; upon reaching an
subrexec call, we’ve got the arguments for the call at the
top of the stack, so we set our argp value to
point to the first argument’s Datum. Since we’ve
got a stack of arguments, the first argument is exactly
arg_count indices behind our stackp
pointer.
int inst_subrexec(void) {
Symbol *sym = (pc++)->symbol;
+ double arg_count = (pc++)->literal;
Frame *f = frame_push(sym, pc);
+ f->arg_count = arg_count;
+ f->argp = stackp - f->arg_count;
execute(sym->data.addr);
is_returning = false;
return 0;
}In order to actually use our arguments within the body of a
procedure or function, we need to transform references like $1, $2, etc. into stack
accesses, which either fetch or
overwrite a value on the stack.
arggetThe basic idea will be to parse a parameter
reference $k into a new
argget instruction. This
instruction is two words long; the second word, k, specifies the parameter number we
want to access. The machine will:
arg_countargp
argsetAssigning a value to an argument will work similarly; the value at
the top of the stack replaces the Datum value for argument
k. It’s then put back
at the top of the stack, since assignments are expressions.
First, we’ll need a way to recognize parameter references, and turn
the reference $k into the literal
integer k.
Lucky for us, we just added a new “integer” value-type into our
language, so we can associate a new PARAM_NUMBER token with
that type. (Remember: It’s capitalized because it’s a terminal.)
%type <hoc_symbol> assignable udef_name
%type <hoc_integer> arglist
+ %token <hoc_integer> PARAM_NUMBERThen, we can update our yylex()
function to handle parameters as well. A $ with no prefix or other surrounding
characters is only valid as a parameter reference, so we can place this
conditional logic anywhere before our final switch
statement.
If scanf fails to read an integer, or we get a
non-positive integer value, we consider it a lexing failure. This allows
more informative error messages, but whether to perform this much
semantic validation inside your lexer is a matter of opinion.
int yylex(void) {
int c;
...
if (c == '\n') {
line_number++;
}
+ // Parse parameters
+ if (c == '$') {
+ if (scanf("%d", &yylval.hoc_integer) != 1) {
+ exec_error("Invalid parameter specification", "");
+ } else if (yylval.hoc_integer <= 0) {
+ exec_error("Parameters must be nonnegative", "");
+ }
+ return PARAM_NUMBER;
+ }We can use our PARAM_NUMBER token to write new
expr productions for evaluating or assigning a parameter.
Like the productions for return statements, these should
only be allowed while defining a procedure.
expr:
...
+ | PARAM_NUMBER
+ {
+ if (current_subr == NULL) {
+ exec_error("Syntax error - "
+ "Parameter '$%d' not allowed here", $1);
+ }
+ $$ = install_instruction(inst_argget);
+ install_literal($1);
+ }
+ | PARAM_NUMBER '=' expr
+ {
+ if (current_subr == NULL) {
+ exec_error("Syntax error - "
+ "Parameter '$%d' not allowed here", $1);
+ }
+ $$ = install_instruction(inst_argset);
+ install_literal($1);
+ }
| expr '+' expr { install_instruction(inst_add); }argget and argsetThe inst_argget and inst_argset
instructions use the Frame.argp pointer to access the
correct location in the stack. The only arithmetic required is
converting the parameter numbers into zero-indexed offsets.
int inst_argget(void) {
int arg_number = (pc++)->literal;
Frame *current_frame = frame_peek();
if (current_frame->arg_count < arg_number) {
exec_error("Subroutine '%s' uses parameter %d, but only got %d.",
current_frame->subr_called->name,
arg_number,
current_frame->arg_count);
}
stack_push(current_frame->argp[arg_number - 1]);
return 0;
}
int inst_argset(void) {
int arg_number = (pc++)->literal;
Frame *current_frame = frame_peek();
if (current_frame->arg_count < arg_number) {
exec_error("Subroutine '%s' uses parameter %d, but only got %d.",
current_frame->subr_called->name,
arg_number,
current_frame->arg_count);
}
Datum d = stack_pop();
current_frame->argp[arg_number - 1] = d;
stack_push(d);
return 0;
}There’s one last detail to handle related to argument values. We need
to remove them when we return from our subroutines. To that
end, we add a small helper function to the machine module,
so that both procedures and functions can remove their arguments once
complete.
/**
* Reset any machine state indicated by the given frame. This includes:
* - Removing any arguments from the stack for the call.
* - Resetting the program counter
* - Setting global "returning" flag
*/
static void return_from_frame(Frame *f) {
// remove args from stack
while (f->arg_count-- > 0) {
stack_pop();
}
is_returning = true;
pc = f->ret_pc;
}Our uprocret instruction can then be made a little
simpler.
int inst_uprocret(void) {
Frame *f = frame_pop();
- is_returning = true;
- pc = f->ret_pc;
+ undo_frame_state(f);
return 0;
+ }Finally, we can make the finishing touches that allow us to make
functions work. Since calling a function produces a
value, a function call must be an expression rather than a statement.
Furthermore, a function must include the
return keyword (with a value), which we’ll need to check at
runtime rather than parse-time.
The ufuncret instruction is used for returning from a
function, it should be used just after our return value has
been produced and pushed onto the stack. There’s just one problem:
there’s a bunch of arguments underneath it, and they need to
go! So, the ufuncret instruction removes the top stack
value, saves it, then calls return_from_frame to remove the
arguments from the stack. After that’s complete, we can re-push
the expected return value onto the top of the stack, which will be
available to the caller once control returns to them.
int inst_ufuncret(void) {
Frame *f = frame_pop();
Datum result = stack_pop();
undo_frame_state(f);
stack_push(result);
return 0;
}The format of a value-return is simply
return <expr>, which means that the expression will
already have pushed the value onto the data stack by the time we install
the return instruction. This makes the action for value-return
statements simple; it’s only real job is verifying that we’re actually
inside a function; otherwise, we disallow the installation of a
ufuncret instruction.
| RETURN_KW expr
+ {
+ if (current_subr == NULL) {
+ exec_error("Syntax error - "
+ "return keyword used outside of a subroutine");
+ }
+ if (current_subr->type != UFUNC) {
+ exec_error("Syntax error: -"
+ "`return <expr>` used outside of a function");
+ }
+ install_instruction(inst_ufuncret);
+ $$ = $2;
+ }
+When a function is installed, our parser will still end it
with the uprocret instruction, since we use the same code
to install both types of subroutine. Unlike procedures, however,
functions should never reach that instruction. We should always
hit a ufuncret instruction first.
To make sure users understand what they did wrong, we will update the
uprocret instruction to raise an error at runtime
if it detects that we’re inside a function, since that means we
forgot to return a value
int inst_uprocret(void) {
Frame *f = frame_pop();
+ if (f->def_called->type != UPROC) {
+ exec_error("Function '%s' does not return a value", f->subr_called->name);
+ }
undo_frame_state(f);
return 0;
}And with that, our function implementation should be complete!
We add a (very) minimal implementation of strings for use in
print statements. This is their only allowed use; in
particular, they cannot be used as variables. In this version of
hoc, they also leak memory.
The implementation adds a new string-printing instruction, which has a two-word layout:
This only requires support for parsing and installing strings into our machine’s memory.
We begin by writing a (very) small dynamic-string module, hoc6/str.c, based on the common
len + capacity implementation. For the sake of
brevity (hah), we don’t discuss the implementation here,
but we’ll come back to strings in ehoc. Its interface is
all we’re interested in.
typedef struct str str;
/** Allocate a string; runtime error on allocation failure */
str *str_new(void);
/** Append a character to the end of a string */
bool str_append(str *s, char c);
/**
* Get a "view" of string `s` for use as a valid, NUL-terminated C
* string. The pointer returned is non-owning, and will be
* invalidated by any manipulations made to the string.
*
* In particular, if the caller wishes to store the resulting char
* buffer for other uses, it should be copied.
*/
const char *str_view(str *s);
/** Clear the string's contents */
void str_clear(str *s);
/** Deallocate `s`, freeing its memory. */
void str_free(str *s);The dynamic string interfacehoc6/hoc.h
We’ll use this library to allocate storage for strings as they’re parsed.
The biggest change required for string parsing is recognizing string literals and allocating strings for them during the lexing process.
First we add the STRING token
%union {
double hoc_value;
Symbol *hoc_symbol;
Addr hoc_address;
int hoc_integer;
+ str *hoc_string;
}
%token <hoc_value> NUMBER
%token <hoc_symbol> VAR CONST BUILTIN UNDEF UFUNC UPROC COMMAND
/* These keyword symbols are efectively singletons */
%token <hoc_symbol> IF_KW ELSE_KW PRINT_KW WHILE_KW FUNC_KW PROC_KW RETURN_KW READ_KW
%type <hoc_symbol> assignable udef_name
%type <hoc_integer> arglist
%token <hoc_integer> PARAM_NUMBER
+ %token <hoc_string> STRINGWe’ll force all strings to be double-quoted (though they may contain escaped quotes), and so once we see a double-quote, we can collect characters until we reach the next one.
int yylex(void) {
...
// Parse parameters
...
+ // Parse strings
+ if (c == '"') {
+ str *s = str_new();
+ while (true) {
+ c = getchar();
+ if (c == '"') {
+ break;
+ }
+ if (c == EOF || c == '\n') {
+ exec_error("String wasn't closed: '%s'", str_view(s));
+ }
+ str_append(s, escaped(c));
+ }
+ yylval.hoc_string = s;
+ return STRING;
+ }
+
// Get numbers
...The only interesting detail is how to escape special characters; the
escaped function allows us to handle this
transparently.
/**
* If c is '\': Returns next input character, string-escaped.
* Otherwise: returns c
*/
int escaped(char c) {
// Each escaped character is followed by its output char
static char translations[] = "b\b"
"f\f"
"n\n"
"t\t";
if (c != '\\') {
return c;
}
c = getchar();
char *cp = strchr(translations, c);
// islower() used so that we don't match the output chars or NUL
if (islower(c) && cp != NULL) {
return cp[1];
}
return c;
}While we’re working on yylex(), we might as well fixup
our other use of static strings in hoc: Symbol names. We
only allowed 100-character identifiers before, whereas we could easily
support more than that. We replace our use of a static buffer with a
str object instead. (Note that because
install_symbol copies the string passed to
it, we are safe to pass it a str_view here.)
If you want to leave the identifier limit in place, you could do so
by leaving the hoc.h constant and while
condition check in place. Otherwise, you can remove
SYMBOL_NAME_LIMIT from hoc.h as well.
We’ll use a single static str for all lexing calls;
there’s no reason to allocate & free memory for these repeatedly.
Its internal memory use will expand to the maximum identifier length
used during hoc’s runtime.
int yylex(void) {
+ static str *symbol_str = NULL;
int c;When we find the beginning of a symbol, we’ll reset our string, instead of setting up a buffer.
// Get numbers
...
if (isalpha(c)) {
Symbol *s;
- char buf[SYMBOL_NAME_LIMIT + 1];
- size_t nread = 0;
+ // reset symbol string for new input
+ if (symbol_str == NULL) {
+ symbol_str = str_new();
+ } else {
+ str_clear(symbol_str);
+ }
...Then, we can remove the SYMBOL_NAME_LIMIT checks and
replace character updates with str functions.
do {
- buf[nread++] = c;
+ str_append(symbol_str, c);
c = getchar();
- } while (nread < SYMBOL_NAME_LIMIT && (isalpha(c) || isdigit(c)));
+ } while (isalpha(c) || isdigit(c));
- // Just in case we exceeded the limit
- while ((isalpha(c) || isdigit(c))) {
- c = getchar();
- }
-
// at this point, we have a non-alphanumeric 'c'
ungetc(c, stdin);
- buf[nread] = '\0';
+ const char *buf = str_view(symbol_str);
if ((s=lookup(buf)) == NULL) {
s = install_symbol(buf, UNDEF, 0.0);
}
yylval.hoc_symbol = s;
return (s->type == UNDEF ? VAR : s->type);
}MachineCode to hold stringsWe’ll add yet another potential type for MachineCode
elements, allowing a word of our VM memory to point to a string.
struct MachineCode {
CodeType type;
union {
double literal;
Inst inst;
Symbol *symbol;
Addr addr;
+ str *string;
};
};MachineCode with str
addition hoc6/hoc.h
prstr
instructionOnce our parser can install prstr instructions, actually
printing the string is trivial. We use our str_view method
to create a C string from the string pointer; since we print it
immediately, we don’t need to do any defensive copying. (Make sure to
add a prototype for inst_prstr to hoc.h.)
int inst_prstr(void) {
str *s = (pc++)->string;
printf("%s", str_view(s));
return 0;
}(Note the lack of \n in the output format string. This
will be useful in the next output feature: printing multiple
expressions)
print StatementNow to combine our string-printing functionality with the existing expression-printing.
We’ll replace our print expr statement with a
generalized prlist nonterminal.
stmt: ...
- | PRINT_KW expr
- {
- install_instruction(inst_prexpr);
- $$ = $2;
- }
+ | PRINT_KW prlist
+ {
+ $$ = $2;
+ }
...This nonterminal handles any nonzero number of STRING or
expr elements, and prints each of them. We can base our
list-handling rules on our arglist production.
%%
prlist: expr
{
install_instruction(inst_prexpr);
$$ = $1;
}
| STRING
{
$$ = install_instruction(inst_prstr);
install_string($1);
}
| prlist ',' expr
{
install_instruction(inst_prexpr);
$$ = $1;
}
| prlist ',' STRING
{
install_instruction(inst_prstr);
install_string($3);
$$ = $1;
}To complete our minimal I/O implementation, we’ll add features to read data from input files or the user’s terminal. In fact, we’ll allow specifying multiple files on the command line, and proceed from one to the next as they’re exhausted.
The UPE authors add the input-reading code to hoc.y. For
hoc6, we’ll instead add a new inputfiles
module, though the functionality is the same. We’ll revisit input/output
organization in adhoc.
The call-graph for our user-input flow follows the following simplified diagram:
The main program initializes our input sources based on the
command-line arguments, using the init_inputfiles
interface:
/**
* Initializes the inputfiles module with the first `filecount`
* members of the `filenames` array; Each should be a file path.
*/
void init_inputfiles(char **argv, int filecount);To actually read data, we’ll use a new varread
instruction, installed by the parser when we see a read
expression. That machine instruction will use the only other function in
the inputfiles module: read_double().
/**
* Returns true if a double value could be read from the runtime
* input files, false otherwise. `dest` will be set to the value read,
* if any, or zero.
*
* If input data exists but is in an invalid format, causes an
* execution error.
*/
bool read_double(double *dest);The user input layer of hoc is built around a set of zero or more
input files, specified on the command line. Any one of these
files may be stdin, which is specified by the filename
-.
As clients use the read_double function to read data,
internally, the input layer reads from the the “current inputfile”.
Whenever an EOF is read, the layer transparently switches
to the next input file and restarts the read, so that the caller is
unaware.
Since the user specified a list of input files on the command line,
most of our state revolves around pointers into the
filenames array. We also need a FILE* for the
current input file.
static FILE *curr_inputfile; // the FILE* we're reading from
static char **next_filename; // next filename to read
static char **end_namep; // represents "no more files"Our inputfiles module must have its global state
initialized before it can be used; our initialization function properly
sets curr_inputfile and curr_namep.
void init_inputfiles(char **filenames, int filecount) {
curr_inputfile = NULL;
next_filename = filenames;
end_namep = filenames + filecount;
next_inputfile(); // ensures curr_inputfile is either valid or NULL
}At the beginning of main, we
call
init_inputfiles() with any elements of theargvarray after the program name, assuming all command-line arguments are input files. This ensures that all calls to our input functions have a validcurr_inputfile, and if it'sNULL`,
we are out of input.
int main(int argc, char *argv[]) {
program_name = argv[0];
+ init_inputfiles(argv + 1, argc-1);
install_hoc_builtins();
...We call next_inputfile to switch to the next input file
whenever we read an EOF. It returns true if another input
file is found, or false if all specified inputs are exhausted.
/**
* Change curr_inputfile to point to the next input file, if
* any. Closes any existing input file (unless that file points to
* stdin).
*
* Returns true if there is more input to read; false otherwise.
*/
static bool next_inputfile(void) {
if (curr_inputfile != NULL && curr_inputfile != stdin) {
fclose(curr_inputfile);
}
curr_inputfile = NULL;
do {
if (next_filename == end_namep) {
return false;
}
// sync curr_inputfile with curr_namep
if (strcmp(*next_filename, "-") == 0) {
curr_inputfile = stdin;
} else {
curr_inputfile = fopen(*next_filename, "r");
if (curr_inputfile == NULL) {
warning("Could not open input filename: '%s'", *next_filename);
}
}
next_filename += 1;
} while (curr_inputfile == NULL);
return true;
}Finally, the read_double function is implemented to hide
the list of inputs from callers, and instead handles EOF
conditions itself.
bool read_double(double *dest) {
// curr_inputfile is NULL when all input is exhausted
if (curr_inputfile == NULL) {
warning("Cannot read input data - No input files remaining");
return false;
}
while (true) {
switch (fscanf(curr_inputfile, "%lf", dest)) {
case EOF:
if (!next_inputfile()) {
return false;
}
break; // otherwise try again
case 0:
exec_error("Could not read double from '%s' - invalid data format",
*curr_namep);
default:
return true;
}
}
}read expressionAfter adding a read keyword to builtins.c,
we can add a final expression production for our new syntax:
read(destVariable). The varread instruction we
install has two words; the second is the destination variable.
expr: NUMBER
...
| assign
| call
+ | READ_KW '(' VAR ')'
+ {
+ $$ = install_instruction(inst_varread);
+ install_ref($3);
+ }Our final instruction for hoc6 is varread,
which uses the read_double interface to assign a variable
value and push a success boolean onto the stack.
int inst_varread(void) {
Symbol *dest = (pc++)->symbol;
bool success = read_double(&dest->data.val);
stack_push(dat(success));
if (!success) {
dest->data.val = 0;
}
dest->type = VAR;
return 0;
}There are a huge number of opportunities to explore in hoc at this point! Naming just a few:
Hopefully you’re as excited as I was upon reaching the end of
hoc6 to keep exploring– there is so much low hanging fruit!
Can you build a language you’d actually want to use in a real task?
If there’s enough interest, I’ll add more details to the
adhoc source code where some of these ideas are explored.
Please feel free to drop me a
line; feedback is always welcome!
CFLAGS ?= -std=c2x -Wall -Wextra -pedantic -Wformat -Wformat-extra-args -Wformat-nonliteral
YFLAGS ?= -d
LDFLAGS=-lm
objects := builtins.o math.o symbol.o machine.o str.o hoc.o inputfiles.o
.PHONY:all
all: hoc
hoc: $(objects)
cc $(CFLAGS) $(objects) $(LDFLAGS) -o hoc
hoc.c hoc.tab.h: hoc.y
yacc $(YFLAGS) hoc.y -o hoc.tab.c
mv hoc.tab.c hoc.c
%.o: hoc.h hoc.tab.h %.c
cc $(CFLAGS) -c -o $@ $*.c
.PHONY:clean
clean:
rm -f hoc.tab.h
rm -f $(objects)
rm -f hoc.c
rm -f hoc
#include "hoc.h"
#include "hoc.tab.h"
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h> // rand
/* Defined in math.c */
extern double Abs(double), Acos(double), Atan(double), Atan2(double, double),
Cos(double), Exp(double), Integer(double), Lg(double), Ln(double),
Log10(double), Pow(double, double), Random(void), Sin(double), Sqrt(double);
static struct {
const char *name;
double value;
} constants[] = {
{"PI", 3.14159265358979323846},
{"E", 2.71828182845904523536},
{NULL, 0},
};
static struct {
const char *name;
int args;
union {
double (*call0)(void);
double (*call1)(double);
double (*call2)(double, double);
};
} builtins[] = {
{"abs", 1, .call1 = Abs},
{"acos", 1, .call1 = Acos},
{"atan", 1, .call1 = Atan},
{"atan2", 2, .call2 = Atan2},
{"cos", 1, .call1 = Cos},
{"exp", 1, .call1 = Exp},
{"int", 1, .call1 = Integer},
{"lg", 1, .call1 = Lg},
{"ln", 1, .call1 = Ln},
{"log10", 1, .call1 = Log10},
{"pow", 2, .call2 = Pow},
{"rand", 0, .call0 = Random},
{"sin", 1, .call1 = Sin},
{"sqrt", 1, .call1 = Sqrt},
{NULL, 0, .call0 = NULL},
};
static struct {
const char *name;
int token_type;
} keywords[] = {
{"if", IF_KW},
{"else", ELSE_KW},
{"func", FUNC_KW},
{"proc", PROC_KW},
{"print", PRINT_KW},
{"read", READ_KW},
{"while", WHILE_KW},
{"return", RETURN_KW},
{NULL, 0},
};
extern void command_list(void);
static struct {
const char *name;
void (*command_func)(void);
} commands[] = {
{"LIST", command_list},
{NULL, 0},
};
void install_hoc_builtins(void) {
for (int i = 0; constants[i].name != NULL; i++) {
install_symbol(constants[i].name, CONST, constants[i].value);
}
for (int i = 0; builtins[i].name != NULL; i++) {
Symbol *s = install_symbol(builtins[i].name, BUILTIN, 0.0);
// TODO: Why not have separate install_value vs install_function?
s->data.func.args = builtins[i].args;
switch (builtins[i].args) {
case 0:
s->data.func.call0 = builtins[i].call0;
break;
case 1:
s->data.func.call1 = builtins[i].call1;
break;
case 2:
s->data.func.call2 = builtins[i].call2;
break;
default:
fprintf(
stderr, "Invalid arg count for symbol '%s'\n", builtins[i].name);
exit(1);
}
}
for (int i = 0; keywords[i].name != NULL; i++) {
install_symbol(keywords[i].name, keywords[i].token_type, 0.0);
}
for (int i = 0; commands[i].name != NULL; i++) {
Symbol *s = install_symbol(commands[i].name, COMMAND, 0.0);
s->data.command_func = commands[i].command_func;
}
}
#ifndef HOC_INC
#define HOC_INC 1
/*
* Global types & declarations for use in hoc
*/
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
///----------------------------------------------------------------
/// strings
///----------------------------------------------------------------
typedef struct str str;
/** Allocate a string; runtime error on allocation failure */
str *str_new(void);
/** Append a character to the end of a string */
bool str_append(str *s, char c);
/**
* Get a "view" of string `s` for use as a valid, NUL-terminated C
* string. The pointer returned is non-owning, and will be
* invalidated by any manipulations made to the string.
*
* In particular, if the caller wishes to store the resulting char
* buffer for other uses, it should be copied.
*/
const char *str_view(str *s);
/** Clear the string's contents */
void str_clear(str *s);
/** Deallocate `s`, freeing its memory. */
void str_free(str *s);
///----------------------------------------------------------------
/// Addresses (Shared by multiple types)
///----------------------------------------------------------------
typedef struct MachineCode MachineCode;
typedef MachineCode *Addr;
///----------------------------------------------------------------
/// Symbols
///----------------------------------------------------------------
typedef struct Symbol Symbol;
/** Function descriptor for builtins: these always return doubles */
struct BuiltinFunc {
int args;
union {
/* We use an anonymous union for the function pointer; there's
no meaningful name for the union as a field */
double (*call0)(void);
double (*call1)(double);
double (*call2)(double, double);
};
};
struct Symbol {
short type; // Generated from our token types
char *name;
Symbol *next;
union {
double val;
Addr addr;
struct BuiltinFunc func;
void (*command_func)(void);
} data;
};
Symbol *install_symbol(const char *name, short type, double value);
Symbol *lookup(const char *name);
///----------------------------------------------------------------
/// Machine Code
///----------------------------------------------------------------
typedef int (*Inst)(void);
typedef enum CodeType {
CT_INST = 'I',
CT_LITERAL = 'L',
CT_SYM_REF = 'S',
CT_ADDR = 'A',
CT_CHRSTR = 'C',
} CodeType;
struct MachineCode {
CodeType type;
union {
double literal;
Inst inst;
Symbol *symbol;
Addr addr;
str *string;
}; // anonymous union; no need to name it
};
#define STOP_INST NULL
#define is_stop_inst(mc) ((mc)->type == CT_INST && (mc)->inst == STOP_INST)
///----------------------------------------------------------------
/// Machine Program
///----------------------------------------------------------------
extern MachineCode prog[];
extern Addr prog_start;
/**
* Installs the given `MachineCode` object into the next available
* location in the machine's current program. A runtime error will
* occur if the program is full.
*
* Returns the address at which the code was installed.
*/
Addr install_code(MachineCode mc);
/** The next available address in the machine's current program */
extern Addr progp;
/**
* Reserves all currently-installed code as a subroutine, under Symbol
* `subr_sym`. The value of `subr_sym` will be updated to the address
* of the installed routine. Ensures that the code will not be
* overwritten by future installations, even after a machine reset.
*/
void reserve_subr(Symbol *subr_sym);
#define install_literal(lit) \
install_code((struct MachineCode){.type = CT_LITERAL, .literal = lit})
#define install_ref(sp) \
install_code((struct MachineCode){.type = CT_SYM_REF, .symbol = sp})
#define install_instruction(instr) \
install_code((struct MachineCode){.type = CT_INST, .inst = instr})
#define install_address(address) \
install_code((struct MachineCode){.type = CT_ADDR, .addr = address})
#define install_string(s) \
install_code((struct MachineCode){.type = CT_CHRSTR, .string = s})
void execute(Addr start_addr);
///----------------------------------------------------------------
/// Instructions
///----------------------------------------------------------------
int inst_add(void);
int inst_and(void);
int inst_argget(void);
int inst_argset(void);
int inst_assignvar(void);
int inst_assignconst(void);
int inst_call(void);
int inst_div(void);
int inst_eval(void);
int inst_eq(void);
int inst_gt(void);
int inst_gte(void);
int inst_ifcode(void);
int inst_lt(void);
int inst_lte(void);
int inst_mul(void);
int inst_neq(void);
int inst_not(void);
int inst_or(void);
int inst_pop(void);
int inst_prexpr(void);
int inst_print(void);
int inst_prstr(void);
int inst_pushlast(void);
int inst_pushlit(void);
int inst_pushref(void);
int inst_sub(void);
int inst_subrexec(void);
int inst_ufuncret(void);
int inst_uprocret(void);
int inst_varread(void);
int inst_whilecode(void);
///----------------------------------------------------------------
/// Commands
///----------------------------------------------------------------
void command_list(void);
void machine_run_command(Symbol *);
///----------------------------------------------------------------
/// Machine Stack
///----------------------------------------------------------------
typedef union Datum {
double value;
Symbol *symbol;
} Datum;
void machine_startup(void);
void machine_reset_program(void);
///----------------------------------------------------------------
/// Runtime Functions
///----------------------------------------------------------------
/** input files */
/**
* Initializes the inputfiles module with the first `filecount`
* members of the `filenames` array; Each should be a file path.
*/
void init_inputfiles(char **filenames, int filecount);
/**
* Returns true if a double value could be read from the runtime
* input files, false otherwise. `dest` will be set to the value read,
* if any, or zero.
*
* If input data exists but is in an invalid format, causes an
* execution error.
*/
bool read_double(double *dest);
/** global error handling */
__attribute__((format(printf, 1, 2))) void warning(const char *msg, ...);
__attribute__((format(printf, 1, 2))) void exec_error(const char *msg, ...);
/** builtins */
void install_hoc_builtins(void);
#endif // HOC_INC
%{
///----------------------------------------------------------------
/// C Preamble
///----------------------------------------------------------------
/*
* "higher-order calculator" - Version 5
* From "The UNIX Programming Environment"
*/
#include <stdbool.h>
#include <stdio.h>
#include <ctype.h>
#include <signal.h>
#include <setjmp.h>
#include <stdarg.h>
#include <string.h>
#include "hoc.h"
///----------------------------------------------------------------
/// global state
///----------------------------------------------------------------
/** Tracks the symbol for the subroutine currently being parsed (if any). */
Symbol *current_subr = NULL;
jmp_buf begin;
///----------------------------------------------------------------
/// local declarations
///----------------------------------------------------------------
#define YACC_SUCCESS 0
int yylex(void);
void yyerror(const char *s);
void check_call_args(const Symbol *s, int actual);
%}
/*
* Grammar points:
*
* - EXPRESSIONS provide a value on the top of the satck.
* - STATEMENTS are not expressions, and do not provide a value.
*/
%union {
double hoc_value;
Symbol *hoc_symbol;
Addr hoc_address;
int hoc_integer;
str *hoc_string;
}
%token <hoc_value> NUMBER
%token <hoc_symbol> VAR CONST BUILTIN UNDEF UFUNC UPROC COMMAND
/* These keyword symbols are efectively singletons */
%token <hoc_symbol> IF_KW ELSE_KW PRINT_KW WHILE_KW FUNC_KW PROC_KW RETURN_KW READ_KW
%type <hoc_symbol> assignable subr_decl
%type <hoc_integer> arglist
%token <hoc_integer> PARAM_NUMBER
%token <hoc_string> STRING
%type <hoc_address> args_start assign call end expr if prlist stmt stmtlist while
%right '=' /* right-associative, much like C */
%left OR
%left AND
%left EQ GT GTE LT LTE NEQ
%left '+' '-' /* left-associative, same precedence */
%left '*' '/' /* left-assoc; higher precedence */
%right '^' /* exponents */
%left UNARY_MINUS NOT /* Even higher precedence
than mul or div. Can't use '-', as we've used
it already for subtraction. */
%%
start: list
{
install_instruction(STOP_INST);
YYACCEPT;
}
;
list: /* nothing */
| list terminator
| list stmt terminator
{
install_instruction(STOP_INST);
YYACCEPT;
}
| list assign terminator
{
install_instruction(inst_pop);
install_instruction(STOP_INST);
YYACCEPT;
}
| list expr terminator
{
install_instruction(inst_print);
install_instruction(STOP_INST);
YYACCEPT;
}
| list error terminator { yyerrok; }
;
assign: assignable '=' expr
{
$$ = $3;
install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_assignvar);
}
| assignable ':' '=' expr
{
$$ = $4;
install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_assignconst);
}
;
arglist: /* nothing */ { $$ = 0; }
| expr { $$ = 1; }
| arglist ',' expr { $$ = $1 + 1; }
;
stmt: expr { install_instruction(inst_pop); }
| '{' stmtlist '}' { $$ = $2; }
| subr_decl '(' ')' stmt
{
/*
* Note: the order of these two lines matters!
* `uprocret` should be part of the subr body,
* so reserve space *after* installing it.
*/
install_instruction(inst_uprocret);
reserve_subr($1);
current_subr = NULL;
$$ = $4;
}
| UPROC '(' args_start arglist ')'
{
install_instruction(inst_subrexec);
install_ref($1);
install_literal($4);
$$ = $3;
}
| RETURN_KW
{
if (current_subr == NULL) {
exec_error("Syntax error - "
"`return` keyword used outside of a subroutine");
}
if (current_subr->type != UPROC) {
exec_error("Syntax error - "
"bare return used outside of a procedure");
}
$$ = install_instruction(inst_uprocret);
}
| RETURN_KW expr
{
if (current_subr == NULL) {
exec_error("Syntax error - "
"return keyword used outside of a subroutine");
}
if (current_subr->type != UFUNC) {
exec_error("Syntax error - "
"`return <expr>` used outside of a function");
}
install_instruction(inst_ufuncret);
$$ = $2;
}
| while cond stmt end
{
$1[0].type = CT_INST;
$1[0].inst = inst_whilecode;
$1[1].type = CT_ADDR;
$1[1].addr = $3;
$1[2].type = CT_ADDR;
$1[2].addr = $4;
}
| if cond stmt end
{
$1[0].type = CT_INST;
$1[0].inst = inst_ifcode;
$1[1].type = CT_ADDR;
$1[1].addr = $3;
$1[3].type = CT_ADDR;
$1[3].addr = $4;
}
| if cond stmt end ELSE_KW stmt end
{
$1[0].type = CT_INST;
$1[0].inst = inst_ifcode;
$1[1].type = CT_ADDR;
$1[1].addr = $3;
$1[2].type = CT_ADDR;
$1[2].addr = $6;
$1[3].type = CT_ADDR;
$1[3].addr = $7;
}
| COMMAND
{
machine_run_command($1);
$$ = progp;
}
| PRINT_KW prlist
{
$$ = $2;
}
;
prlist: expr
{
install_instruction(inst_prexpr);
$$ = $1;
}
| STRING
{
$$ = install_instruction(inst_prstr);
install_string($1);
}
| prlist ',' expr
{
install_instruction(inst_prexpr);
$$ = $1;
}
| prlist ',' STRING
{
install_instruction(inst_prstr);
install_string($3);
$$ = $1;
}
;
stmtlist: /* nothing */ { $$ = progp; }
| stmtlist stmt terminator { $$ = $1; }
| stmtlist terminator
;
cond: '(' expr ')'
{
install_instruction(STOP_INST);
}
;
while: WHILE_KW
{
$$ = install_instruction(STOP_INST);
install_instruction(STOP_INST);
install_instruction(STOP_INST);
}
;
if: IF_KW
{
$$ = install_instruction(STOP_INST);
install_instruction(STOP_INST);
install_instruction(STOP_INST);
install_instruction(STOP_INST);
}
;
end: /* nothing */
{
install_instruction(STOP_INST);
$$ = progp;
}
;
call: BUILTIN '(' ')'
{
check_call_args($1, 0);
$$ = install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_call);
}
| BUILTIN '(' expr ')'
{
check_call_args($1, 1);
install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_call);
$$ = $3; /* Block starts with pushing args */
}
| BUILTIN '(' expr ',' expr ')'
{
check_call_args($1, 2);
$$ = install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_call);
$$ = $3; /* Block starts with pushing args */
}
;
subr_decl: FUNC_KW VAR
{
current_subr = $2;
$2->type=UFUNC;
$$ = $2;
}
| PROC_KW VAR
{
current_subr = $2;
$2->type=UPROC;
$$ = $2;
}
;
/*
* Every expr construct, when *executed*, must push a value onto the
* stack
*/
expr: NUMBER
{
$$ = install_instruction(inst_pushlit);
install_literal($1);
}
| '@'
{
$$ = install_instruction(inst_pushlast);
}
| assignable
{
$$ = install_instruction(inst_pushref);
install_ref($1);
install_instruction(inst_eval);
}
| assign
| call
| READ_KW '(' VAR ')'
{
$$ = install_instruction(inst_varread);
install_ref($3);
}
| UFUNC '(' args_start arglist ')'
{
install_instruction(inst_subrexec);
install_ref($1);
install_literal($4);
$$ = $3;
}
| PARAM_NUMBER
{
if (current_subr == NULL) {
exec_error("Syntax error - "
"Parameter references not allowed here");
}
$$ = install_instruction(inst_argget);
install_literal($1);
}
| PARAM_NUMBER '=' expr
{
if (current_subr == NULL) {
exec_error("Syntax error - "
"Parameter references not allowed here");
}
$$ = install_instruction(inst_argset);
install_literal($1);
}
| expr '+' expr { install_instruction(inst_add); }
| expr '-' expr { install_instruction(inst_sub); }
| expr '*' expr { install_instruction(inst_mul); }
| expr '/' expr { install_instruction(inst_div); }
| expr LT expr { install_instruction(inst_lt); }
| expr LTE expr { install_instruction(inst_lte); }
| expr GT expr { install_instruction(inst_gt); }
| expr GTE expr { install_instruction(inst_gte); }
| expr EQ expr { install_instruction(inst_eq); }
| expr NEQ expr { install_instruction(inst_neq); }
| expr AND expr { install_instruction(inst_and); }
| expr OR expr { install_instruction(inst_or); }
| expr '^' expr
{
install_instruction(inst_pushref);
install_ref(lookup("pow"));
install_instruction(inst_call);
}
| '-'expr %prec UNARY_MINUS /* prec makes us bind tighter than subtraction */
{
install_instruction(inst_pushlit);
install_literal(-1);
install_instruction(inst_mul);
$$=$2;
}
| NOT expr %prec NOT
{
install_instruction(inst_not);
$$=$2;
}
| '(' expr ')' { $$ = $2; }
;
args_start: /* nothing */ { $$ = progp; }
;
terminator: '\n'
| ';'
;
assignable: VAR
| CONST
%% // end of grammar
/* error tracking */
char *program_name;
int line_number = 1;
bool reached_eof = false;
int main(int argc, char *argv[]) {
program_name = argv[0];
init_inputfiles(argv + 1, argc - 1);
install_hoc_builtins();
machine_startup();
setjmp(begin);
machine_reset_program();
while (!reached_eof && yyparse() == YACC_SUCCESS) {
execute(NULL);
machine_reset_program();
}
return 0;
}
/**
* "Looks ahead 1 character for a possibility. If it's found, returns
* `match`. If not:
* - places character back onto input stream for next read
* - returns the `otherwise` result.
*/
static int lookahead(char expect, int match, int otherwise) {
int c = getchar();
if (c == expect) {
return match;
} else {
ungetc(c, stdin);
return otherwise;
}
}
/**
* If c is '\': Returns next input character, string-escaped.
* Otherwise: returns c
*/
int escaped(char c) {
// Each escaped character is followed by its output char
static char translations[] = "b\b"
"f\f"
"n\n"
"t\t";
if (c != '\\') {
return c;
}
c = getchar();
char *cp = strchr(translations, c);
// islower() used so that we don't match the output chars or NUL
if (islower(c) && cp != NULL) {
return cp[1];
}
return c;
}
/* our simple, hand-rolled lexer. */
int yylex(void) {
static str *symbol_str = NULL;
int c;
// Skip space
do {
c = getchar();
} while (c == ' ' || c == '\t');
// Skip comments
if (c == '/' && lookahead('/', 0, '/') == 0) {
while (c != '\n' && c != EOF) {
c = getchar();
}
ungetc(c, stdin);
}
/* Newline / EOF handling
*
* The first time we see an EOF, we "fake" a newline,
* so that the list production can recognize any in-progress
* statement. The next call will return the EOF.
*/
if (c == EOF && !reached_eof) {
ungetc(c, stdin);
reached_eof = true;
c = '\n';
}
if (c == '\n') {
line_number++;
}
// Parse parameters
if (c == '$') {
if (scanf("%d", &yylval.hoc_integer) != 1) {
exec_error("Invalid parameter specification");
} else if (yylval.hoc_integer <= 0) {
exec_error("Parameter number '%d' invalid - cannot be negative",
yylval.hoc_integer);
}
return PARAM_NUMBER;
}
// Parse strings
if (c == '"') {
str *s = str_new();
while (true) {
c = getchar();
if (c == '"') {
break;
}
if (c == EOF || c == '\n') {
exec_error("String wasn't closed: '%s'", str_view(s));
}
str_append(s, escaped(c));
}
yylval.hoc_string = s;
return STRING;
}
// Get numbers
if (c == '.' || isdigit(c)) {
ungetc(c, stdin);
scanf("%lf", &yylval.hoc_value);
return NUMBER;
}
if (isalpha(c)) {
Symbol *s;
// reset symbol string for new input
if (symbol_str == NULL) {
symbol_str = str_new();
} else {
str_clear(symbol_str);
}
do {
str_append(symbol_str, c);
c = getchar();
} while (isalpha(c) || isdigit(c));
// Just in case we exceeded the limit
while ((isalpha(c) || isdigit(c))) {
c = getchar();
}
// at this point, we have a non-alphanumeric 'c'
ungetc(c, stdin);
const char *buf = str_view(symbol_str);
if ((s = lookup(buf)) == NULL) {
s = install_symbol(buf, UNDEF, 0.0);
}
yylval.hoc_symbol = s;
return (s->type == UNDEF ? VAR : s->type);
}
if (isalpha(c)) {
Symbol *s;
static str *sbuf = NULL;
// setup string for new input
if (sbuf == NULL) {
sbuf = str_new();
} else {
str_clear(sbuf);
}
do {
c = getchar();
} while (isalpha(c) || isdigit(c));
ungetc(c, stdin);
const char *buf = str_view(sbuf);
if ((s = lookup(buf)) == NULL) {
s = install_symbol(buf, UNDEF, 0.0);
}
yylval.hoc_symbol = s;
return (s->type == UNDEF ? VAR : s->type);
}
switch (c) {
case '<':
return lookahead('=', LTE, LT);
case '>':
return lookahead('=', GTE, GT);
case '=':
return lookahead('=', EQ, '=');
case '!':
return lookahead('=', NEQ, NOT);
case '|':
return lookahead('|', OR, '|');
case '&':
return lookahead('&', AND, '&');
default:
return c;
}
}
void yyerror(const char *s) {
warning("%s", s);
if (current_subr != NULL) {
current_subr->type = UNDEF;
current_subr = NULL;
}
}
/*
* Verify Symbol's declared arg count matches actual,
* or display a helpful error message with mismatch
*/
void check_call_args(const Symbol *s, int actual) {
int expected = s->data.func.args;
if (expected != actual) {
exec_error("Wrong number of arguments for %s: expected %d, got %d",
s->name,
expected,
actual);
}
}
#define print_error_prefix() fprintf(stderr, "%s: ", program_name)
#define print_error_suffix() fprintf(stderr, " (on line %d)\n", line_number)
void warning(const char *msg, ...) {
va_list args;
va_start(args, msg);
print_error_prefix();
vfprintf(stderr, (msg), args);
print_error_suffix();
va_end(args);
}
void exec_error(const char *msg, ...) {
va_list args;
va_start(args, msg);
print_error_prefix();
vfprintf(stderr, (msg), args);
print_error_suffix();
va_end(args);
longjmp(begin, 0);
}
#include "hoc.h"
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
///----------------------------------------------------------------
/// global state
///----------------------------------------------------------------
static FILE *curr_inputfile; // the FILE* we're reading from
static char **next_filename; // next filename to read
static char **end_namep; // represents "no more files"
/**
* Change curr_inputfile to point to the next input file, if
* any. Closes any existing input file (unless that file points to
* stdin).
*
* Returns true if there is more input to read; false otherwise.
*/
static bool next_inputfile(void) {
if (curr_inputfile != NULL && curr_inputfile != stdin) {
fclose(curr_inputfile);
}
curr_inputfile = NULL;
do {
if (next_filename == end_namep) {
return false;
}
// sync curr_inputfile with curr_namep
if (strcmp(*next_filename, "-") == 0) {
curr_inputfile = stdin;
} else {
curr_inputfile = fopen(*next_filename, "r");
if (curr_inputfile == NULL) {
warning("Could not open input filename: '%s'", *next_filename);
}
}
next_filename += 1;
} while (curr_inputfile == NULL);
return true;
}
void init_inputfiles(char **filenames, int filecount) {
curr_inputfile = NULL;
next_filename = filenames;
end_namep = filenames + filecount;
next_inputfile(); // ensures curr_inputfile is either valid or NULL
}
bool read_double(double *dest) {
// curr_inputfile is NULL only when all input is exhausted
if (curr_inputfile == NULL) {
warning("Cannot read input data - No input files remaining");
return false;
}
while (true) {
switch (fscanf(curr_inputfile, "%lf", dest)) {
case EOF:
if (!next_inputfile()) {
return false;
}
break; // otherwise try again
case 0:
exec_error("Could not read double from '%s' - invalid data format",
*next_filename);
default:
return true;
}
}
}
#include "hoc.h"
#include "hoc.tab.h"
#include <assert.h>
#include <stdbool.h>
#include <stddef.h> // NULL
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/*
* MACHINE STATE
* =============
*
*
* == PROG == == FRAMES == == STACK ==
* ┌─────────────┐◀───┐ ┌─────────────┐ ┌─────────────┐
* │ MachineCode │ pc │ Frame │ │ Datum │
* ├─────────────┤ ├─────────────┤ ├─────────────┤
* │ MachineCode │ │ Frame │ │ Datum │
* ├─────────────┤ ├─────────────┤ ├─────────────┤
* │ MachineCode │ progp │ Frame │ framep │ Datum │ stackp
* ├─────────────┤◀────┘ ├─────────────┤◀───┘ ├─────────────┤ ◀──┘
* │ ... │ │ ... │ │ ... │
* └─────────────┘ └─────────────┘ └─────────────┘
*
*
* The Program
* -----------
* We have our "program" stored as a list of MachineCode
* objects, each encoding one of:
* - an instruction
* - a literal value
* - a symbol reference
*
* We "execute" our program by setting the program counter (pc)
* pointer to the first MachineCode (which *must* be an instruction),
* and handling each instruction we find. Within an instruction, the
* program counter might be updated so that by the *end* of any
* instruction, the program counter is always pointing to another
* instruction.
*
*
* The Stack
* ---------
*
* Each instruction manipulates the stack, pushing,
* popping, or combining values from the stack. The stack elements
* are Datum objects, which act much like MachineCodes, but with two
* differences.
*
* 1. Datums cannot represent an instruction; they resolve to a value
* 2. The type of a datum can always be resolved from their context,
* and so no "type" field is necessary.
*
*
* The Frame Stack
* ---------------
*
* When a user-defined subroutine is called, a new Frame is
* placed on a separate "frame stack". The frames contain enough
* information to:
*
* - "Jump" to the location of the subroutine
* - The `def_called` field is a Symbol with the start Addr
*
* - Locate any arguments on the (data) stack
* - The argp/arg_count fields locate the arguments passed
*
* - Return to previous prog location after the subroutine finishes
* - Via the `ret_pc` field
*
*
* PROG STACK
* ┌─────────────┐ ┌─────────────┐
* ┌─▶│ MachineCode │ │ Datum │
* │ ├─────────────┤ FRAME ├─────────────┤
* │ │ MachineCode │ ┌─────────────┐ │ Datum │
* │ ├─────────────┤ │def_called: │ ├─────────────┤
* │ │ ... │ │"myfunc" │ │ Datum │
* │ ├─────────────┤ ├─────────────┤ ├─────────────┤
* │ │ MachineCode │ ┌──┤ret_pc │ ┌─▶│ Datum │
* │ ├─────────────┤ │ ├─────────────┤ │ ├─────────────┤
* │ │ MachineCode │ │ │argp ├──┘ │ Datum │
* │ ├─────────────┤ │ ├─────────────┤ ─▶├─────────────┤
* │ │ MachineCode │◀─┘ │arg_count: 2 │─ ┘ │ Datum │
* │ ├─────────────┤ └─────────────┘ ├─────────────┤
* │ │ ... │ │ . │
* │ └─────────────┘ │ . │
* │ └─────────────┘
* │ SYMBOL TABLE
* │ ┌─────────────────────────────────────────────────────┐
* │ │ ┌──────────────┐ │
* │ │ │name: "myfunc"│ │
* │ │ ├──────────────┤ │
* └──┼─│addr │ │
* │ └──────────────┘ │
* └─────────────────────────────────────────────────────┘
*/
/*
* Prog
*/
#define PROGSIZE 2000
MachineCode prog[PROGSIZE];
Addr prog_start = prog; // No installation above this address
Addr progp; // Next free location for code installation
static Addr pc; // the current location in the executing program
static Addr highest_addr_written = prog; // Used in LISTing contents
/*
* Data stack
*/
#define STACKSIZE 256
static Datum stack[STACKSIZE]; // the machine stack
static Datum *stackp; // next free spot on the stack
static Datum last_popped = {.value = 0}; // The last value popped
static void stack_push(Datum d);
static Datum stack_pop(void);
/*
* Frames and Frame stack
*/
typedef struct Frame {
Symbol *subr_called; // the UFUNC/UPROC we have jumped to
Addr ret_pc; // return location
Datum *argp; // pointer to first argument
int arg_count; // number of arguments
} Frame;
#define FRAMELIMIT 100 // Recursion depth limit
static Frame frames[FRAMELIMIT];
static const Frame *OVERFLOW_FRAME = frames + FRAMELIMIT;
static Frame *fp = frames; // Next frame to use
static bool is_returning; // true when returning from a user subroutine
///----------------------------------------------------------------
/// Initialization
///----------------------------------------------------------------
/** Initialize machine state for startup */
void machine_startup(void) {
srand(time(NULL)); // use current time as seed for random generator
machine_reset_program();
}
/** Initialize machine state for a new program */
void machine_reset_program(void) {
is_returning = false;
stackp = stack;
progp = prog_start;
}
///----------------------------------------------------------------
/// Code Installation
///----------------------------------------------------------------
Addr install_code(MachineCode mc) {
if (progp >= prog + PROGSIZE) {
exec_error("Max program size exceeded");
}
*progp = mc;
Addr result = progp++; // return location of THIS instruction
if (result > highest_addr_written) {
highest_addr_written = result;
}
return result;
}
void reserve_subr(Symbol *subr_sym) {
subr_sym->data.addr = prog_start;
prog_start = progp;
}
///----------------------------------------------------------------
/// Execution
///----------------------------------------------------------------
void execute(Addr start_addr) {
is_returning = false;
pc = (start_addr == NULL) ? prog_start : start_addr;
while (!is_stop_inst(pc) && !is_returning) {
if (pc >= progp) {
exec_error("PC was past end-of-program");
}
if (pc->type != CT_INST) {
exec_error("Programming error, type '%c' was not INST!", pc->type);
}
(*(*pc++).inst)(); // increment PC and execute inst
}
}
/** Push new value onto stack */
static void stack_push(Datum d) {
if (stackp >= stack + STACKSIZE) {
exec_error("Stack overflow");
}
*(stackp++) = d; // copy
}
/** Pop most recent value from stack, or fail on underflow */
static Datum stack_pop(void) {
if (stackp <= stack) {
exec_error("Stack underflow");
}
Datum d = *(--stackp);
return (last_popped = d);
}
/**
* Push a new frame for a subroutine call onto the frame stack, or fail on
* overflow.
*
* The frame is initialized as a call to Symbol `s`, returning to address
* `caller_pc`. Caller must initialize all other frame data.
*/
static Frame *frame_push(Symbol *s, Addr caller_pc) {
if (fp == OVERFLOW_FRAME) {
exec_error("Call-Stack Overflow while calling '%s'", s->name);
}
fp->subr_called = s;
fp->ret_pc = caller_pc;
return fp++;
}
/** Pop most recent Frame from the frame stack, or fail on underflow */
static Frame *frame_pop(void) {
if (fp == frames) {
exec_error("Call-Stack Underflow while attempting to return");
}
return --fp;
}
/** Access top of frame stack, or fail if no Frame exists */
static Frame *frame_peek(void) {
if (fp == frames) {
exec_error("Call-Stack Underflow while attempting to peek at top frame");
}
return fp - 1;
}
/**
* Reset any machine state indicated by the given frame. This includes:
* - Removing any arguments from the stack for the call.
* - Resetting the program counter
* - Setting global "returning" flag
*/
static void return_from_frame(Frame *f) {
for (int i = f->arg_count; i > 0; i--) {
stack_pop(); // remove arg from stack
}
is_returning = true;
pc = f->ret_pc;
}
///----------------------------------------------------------------
/// Instructions
///----------------------------------------------------------------
#define dat(x) ((Datum){.value = x})
int inst_pushlit(void) {
assert(pc->type == CT_LITERAL);
Datum d;
d.value = pc->literal;
stack_push(d);
pc++;
return 0;
}
int inst_pushref(void) {
assert(pc->type == CT_SYM_REF);
Datum d;
d.symbol = pc->symbol;
stack_push(d);
pc++;
return 0;
}
int inst_pushlast(void) {
stack_push(last_popped);
return 0;
}
int inst_eval(void) {
Datum d = stack_pop(); // must be symbol
switch (d.symbol->type) {
case UNDEF:
exec_error("Undefined variable: '%s'", d.symbol->name);
break;
case BUILTIN:
case UFUNC:
case UPROC:
exec_error("Callable '%s' cannot be evaluated", d.symbol->name);
break;
default: // VAR,CONST
stack_push(dat(d.symbol->data.val));
break;
}
return 0;
}
int inst_whilecode(void) {
/*
┌─────────────────────────────┐
│ whilecode │
├─────────────────────────────┤
│ address(body) │
├─────────────────────────────┤
│ address(stmt after body) │
├─────────────────────────────┤
│ cond │
│ .. │
├─────────────────────────────┤
│ STOP │
├─────────────────────────────┤
│ body │
│ .... │
├─────────────────────────────┤
│ STOP │
├─────────────────────────────┤
│ stmt after body │
└─────────────────────────────┘
*/
Addr body_addr = (pc++)->addr;
Addr end_addr = (pc++)->addr;
Addr cond_addr = pc;
while (true) {
if (is_returning) {
return 0; // PC is already set to the appropriate address
}
execute(cond_addr);
if (stack_pop().value) {
execute(body_addr);
} else {
break;
}
}
// condition false
pc = end_addr;
return 0;
}
int inst_ifcode(void) {
/*
┌─────────────────────────────┐
│ ifcode │
├─────────────────────────────┤
│ address(ifbody) │
├─────────────────────────────┤
│ address(elsebody) │
├─────────────────────────────┤
│ address(stmt after if) │
├─────────────────────────────┤
│ cond │
│ .. │
├─────────────────────────────┤
│ STOP │
├─────────────────────────────┤
│ ifbody │
│ .... │
├─────────────────────────────┤
│ STOP │
├─────────────────────────────┤
│ elsebody │
│ .... │
├─────────────────────────────┤
│ STOP │
├─────────────────────────────┤
│ stmt after if │
└─────────────────────────────┘
*/
Addr ifbody_addr = pc[0].addr;
MachineCode maybe_elsebody = pc[1];
Addr end_addr = pc[2].addr;
Addr cond_addr = pc + 3;
execute(cond_addr);
Datum d = stack_pop(); // must be value
if (d.value) {
execute(ifbody_addr);
} else if (maybe_elsebody.type == CT_ADDR) {
execute(maybe_elsebody.addr);
}
if (!is_returning) {
pc = end_addr;
}
return 0;
}
int inst_call(void) {
Datum s = stack_pop(); // must be symbol
if (s.symbol->type != BUILTIN) {
exec_error("Cannot call non-builtin '%s'", s.symbol->name);
}
double result;
switch (s.symbol->data.func.args) {
case 0:
result = s.symbol->data.func.call0();
break;
case 1: {
Datum arg = stack_pop();
result = s.symbol->data.func.call1(arg.value);
break;
}
case 2: {
Datum arg2 = stack_pop();
Datum arg1 = stack_pop();
result = s.symbol->data.func.call2(arg1.value, arg2.value);
break;
}
}
stack_push(dat(result));
return 0;
}
int inst_assignvar(void) {
Datum lhs = stack_pop(); // must be symbol
Datum rhs = stack_pop(); // must be value
if (!(lhs.symbol->type == VAR || lhs.symbol->type == UNDEF)) {
exec_error("Cannot assign to symbol '%s'", lhs.symbol->name);
}
lhs.symbol->data.val = rhs.value;
lhs.symbol->type = VAR;
stack_push(dat(rhs.value));
return 0;
}
int inst_assignconst(void) {
Datum lhs = stack_pop(); // must be symbol
Datum rhs = stack_pop(); // must be value
if (lhs.symbol->type == CONST) {
exec_error("Cannot reassign constant '%s'", lhs.symbol->name);
} else if (!(lhs.symbol->type == VAR || lhs.symbol->type == UNDEF)) {
exec_error("Cannot assign to symbol '%s'", lhs.symbol->name);
}
lhs.symbol->data.val = rhs.value;
lhs.symbol->type = CONST;
stack_push(dat(rhs.value));
return 0;
}
int inst_pop(void) {
stack_pop();
return 0;
}
int inst_print(void) {
Datum d = stack_pop();
printf("\t%.8g\n", d.value);
fflush(stdout);
return 0;
}
int inst_prexpr(void) {
Datum d = stack_pop();
printf("%.8g", d.value);
fflush(stdout);
return 0;
}
int inst_prstr(void) {
str *s = (pc++)->string;
printf("%s", str_view(s));
fflush(stdout);
return 0;
}
int inst_add(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value + rhs.value));
return 0;
}
int inst_and(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value && rhs.value));
return 0;
}
int inst_or(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value || rhs.value));
return 0;
}
int inst_not(void) {
Datum d = stack_pop(); // must be value
d.value = !d.value;
stack_push(d);
return 0;
}
int inst_lt(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value < rhs.value));
return 0;
}
int inst_lte(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value <= rhs.value));
return 0;
}
int inst_gt(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value > rhs.value));
return 0;
}
int inst_gte(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value >= rhs.value));
return 0;
}
int inst_eq(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value == rhs.value));
return 0;
}
int inst_neq(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value != rhs.value));
return 0;
}
int inst_sub(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value - rhs.value));
return 0;
}
int inst_div(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
if (rhs.value == 0) {
exec_error("Division by zero");
} else {
stack_push(dat(lhs.value / rhs.value));
}
return 0;
}
int inst_mul(void) {
Datum rhs = stack_pop(); // must be value
Datum lhs = stack_pop(); // must be value
stack_push(dat(lhs.value * rhs.value));
return 0;
}
int inst_subrexec(void) {
Symbol *sym = (pc++)->symbol;
double arg_count = (pc++)->literal;
Frame *f = frame_push(sym, pc);
f->arg_count = (int)arg_count;
f->argp = stackp - f->arg_count;
execute(sym->data.addr);
is_returning = false;
return 0;
}
int inst_argget(void) {
int arg_number = (int)(pc++)->literal;
Frame *current_frame = frame_peek();
if (current_frame->arg_count < arg_number) {
exec_error("Subroutine '%s' uses parameter %d, but only got %d.",
current_frame->subr_called->name,
arg_number,
current_frame->arg_count);
}
stack_push(current_frame->argp[arg_number - 1]);
return 0;
}
int inst_argset(void) {
int arg_number = (int)(pc++)->literal;
Frame *current_frame = frame_peek();
if (current_frame->arg_count < arg_number) {
exec_error("Subroutine '%s' uses parameter %d, but only got %d.",
current_frame->subr_called->name,
arg_number,
current_frame->arg_count);
}
Datum d = stack_pop();
current_frame->argp[arg_number - 1] = d;
stack_push(d);
return 0;
}
int inst_uprocret(void) {
Frame *f = frame_pop();
if (f->subr_called->type != UPROC) {
exec_error("Function '%s' does not return a value", f->subr_called->name);
}
return_from_frame(f);
return 0;
}
int inst_ufuncret(void) {
Frame *f = frame_pop();
Datum result = stack_pop();
return_from_frame(f);
stack_push(result);
return 0;
}
int inst_varread(void) {
Symbol *dest = (pc++)->symbol;
bool success = read_double(&dest->data.val);
stack_push(dat(success));
if (!success) {
dest->data.val = 0;
}
dest->type = VAR;
return 0;
}
///----------------------------------------------------------------
/// Commands
///----------------------------------------------------------------
void machine_run_command(Symbol *command_symbol) {
command_symbol->data.command_func();
}
#define check_inst(inst_name) \
else if (c.inst == inst_##inst_name) do { \
printf(inst_template, line, cp, "INST", #inst_name); \
} \
while (false)
void command_list(void) {
// Lists contents of prog until we hit two stop instructions in a row
#define header_template "%6s | %11s | %10s | %-30s\n"
#define inst_template "%6ld | %11p | %10s | %-30s"
#define rightline_template "%6ld | %11p | %10s | %30s"
#define sym_addr_template "%6ld | %11p | %10s | %-7.7s: %11p (Line %3ld)"
#define lit_template "%6ld | %11p | %10s | %30f"
#define str_template "%6ld | %11p | %10s | %30.30s"
#define addr_template "%6ld | %11p | %10s | %11p (Line %3ld)"
printf(header_template, "LINE", "ADDRESS", "TYPE", "CODE");
printf("------------------------------------------------------------------"
"\n");
for (size_t i = 0; prog + i <= highest_addr_written; i++) {
size_t line = i + 1;
MachineCode c = prog[i];
void *cp = &prog[i];
if (prog + i == prog_start && prog != prog_start) {
printf("---------------------------->"
" END SUBR "
"<----------------------------\n");
}
switch (c.type) {
case (CT_LITERAL):
printf(lit_template, line, cp, "L", c.literal);
break;
case (CT_SYM_REF):
switch (c.symbol->type) {
case UFUNC:
case UPROC: {
size_t addr_line = (c.symbol->data.addr - prog) + 1;
printf(sym_addr_template,
line,
cp,
"SUBR",
c.symbol->name,
(void *)c.symbol->data.addr,
addr_line);
break;
}
default:
printf(rightline_template, line, cp, "SYM", c.symbol->name);
}
break;
case (CT_ADDR): {
size_t addr_line = (c.addr - prog) + 1;
printf(addr_template, line, cp, "ADDR", (void *)c.addr, addr_line);
break;
}
case (CT_CHRSTR): {
printf(str_template, line, cp, "CHAR", str_view(c.string));
break;
}
case (CT_INST):
if (is_stop_inst(&c)) {
printf(inst_template,
line,
cp,
"INST",
"------------ STOP ------------");
}
check_inst(add);
check_inst(and);
check_inst(argget);
check_inst(argset);
check_inst(assignvar);
check_inst(assignconst);
check_inst(call);
check_inst(div);
check_inst(eq);
check_inst(eval);
check_inst(gt);
check_inst(gte);
check_inst(ifcode);
check_inst(lt);
check_inst(lte);
check_inst(mul);
check_inst(neq);
check_inst(not );
check_inst(or);
check_inst(pop);
check_inst(prexpr);
check_inst(prstr);
check_inst(print);
check_inst(pushlast);
check_inst(pushlit);
check_inst(pushref);
check_inst(sub);
check_inst(subrexec);
check_inst(ufuncret);
check_inst(uprocret);
check_inst(varread);
check_inst(whilecode);
else {
printf(inst_template, line, cp, "INST", "UNKNOWN!?");
}
break;
default:
printf("Unknown Code Type: %c\n", c.type);
}
printf("\n");
}
}
#include "hoc.h"
#include <errno.h>
#include <fenv.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h> // rand
#define HOC_RAND_MAX 1000000L
static double check_math_err(double result, const char *op);
double Abs(double x) {
return check_math_err(fabs(x), "fabs");
}
double Acos(double x) {
return check_math_err(acos(x), "acos");
}
double Atan(double x) {
return check_math_err(atan(x), "atan");
}
double Atan2(double x, double y) {
return check_math_err(atan2(x, y), "atan2");
}
double Cos(double x) {
return check_math_err(cos(x), "cos");
}
double Exp(double x) {
return check_math_err(exp(x), "exp");
}
double Integer(double x) {
return (double)(long)x;
}
double Lg(double x) {
return check_math_err(log2(x), "lg");
}
double Ln(double x) {
return check_math_err(log(x), "log");
}
double Log10(double x) {
return check_math_err(log10(x), "log10");
}
double Pow(double x, double y) {
return check_math_err(pow(x, y), "exponentiation");
}
double Random(void) {
// We have HOC_RAND_MAX buckets
const int bucket_size = RAND_MAX / HOC_RAND_MAX;
int result;
do {
result = rand() / bucket_size;
} while (result >= HOC_RAND_MAX);
return result + 1;
}
double Sin(double x) {
return check_math_err(sin(x), "sin");
}
double Sqrt(double x) {
return check_math_err(sqrt(x), "sqrt");
}
static double check_math_err(double result, const char *op) {
static const char *domain_msg = "argument outside domain";
static const char *range_msg = "result outside range";
static const char *other_msg = "floating-point exception occurred";
const char *msg = NULL;
if ((math_errhandling & MATH_ERREXCEPT) && fetestexcept(FE_ALL_EXCEPT)) {
// Special case: Inexact results are not errors.
if (fetestexcept(FE_INEXACT)) {
goto done;
}
if (fetestexcept(FE_INVALID)) {
msg = domain_msg;
} else if (fetestexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW)) {
msg = range_msg;
} else { // unknown
msg = other_msg;
}
feclearexcept(FE_ALL_EXCEPT);
} else if (errno) {
if (errno == EDOM) {
msg = domain_msg;
} else if (errno == ERANGE) {
msg = range_msg;
} else {
msg = other_msg;
}
errno = 0;
}
if (msg != NULL) {
exec_error("math error during %s: %s", op, msg);
}
done:
return result;
}
#include "hoc.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
/*
* Invariants (Given instance `s`):
* 1. s.len <= s.capacity - 1;
* 2. sizeof(s.buf) == s.capacity
*
* These invariants exist so that anytime we want a view of the string
* for use with C functions, there is capacity within `s.buf` to
* append a NUL character.
*
* Any function that mutates the string should ensure these invariants.
*/
struct str {
char *buf;
size_t len;
size_t capacity;
};
#define INIT_CAP 20 /** Initial capacity of a string */
str *str_new(void) {
str *s = malloc(sizeof(str));
if (s == NULL) {
exec_error("Out of memory!");
}
s->capacity = INIT_CAP;
s->buf = malloc(s->capacity); // Ensures invariant 1
if (s->buf == NULL) {
free(s);
exec_error("Out of memory!");
}
s->len = 0; // Ensure invariant 2
return s;
}
bool str_append(str *s, char c) {
if (s->len == s->capacity - 1) {
/*
* If we'd break our invariant by adding a character,
* grow our buffer capacity and update fields to match
*/
char *tmp = realloc(s->buf, s->capacity * 2);
if (tmp == NULL) {
return false;
}
s->buf = tmp;
s->capacity = s->capacity * 2;
}
s->buf[s->len++] = c;
return true;
}
const char *str_view(str *s) {
s->buf[s->len] = '\0';
return s->buf;
}
void str_clear(str *s) {
s->len = 0;
}
void str_free(str *s) {
free(s->buf);
free(s);
}
#include "hoc.h"
#include "hoc.tab.h" // generated from yacc -d on our grammar
#include <stddef.h> // NULL
#include <stdlib.h> // malloc
#include <string.h> // strcmp
static Symbol *symbol_table = NULL;
void *emalloc(size_t nbytes);
Symbol *lookup(const char *name) {
Symbol *current = symbol_table;
while (current != NULL) {
if (strcmp(current->name, name) == 0) {
return current;
}
current = current->next;
}
return NULL;
}
Symbol *install_symbol(const char *name, short type, double value) {
size_t name_len = strlen(name);
Symbol *s = emalloc(sizeof(Symbol) // actual symbol
+ (sizeof(char)) * (name_len + 1)); // room for name
s->name = (char *)(s + 1);
strcpy(s->name, name);
s->name[name_len] = '\0';
s->type = type;
s->data.val = value;
s->next = symbol_table;
symbol_table = s;
return s;
}
void *emalloc(size_t nbytes) {
void *ptr = malloc(nbytes);
if (ptr == NULL) {
exec_error("Out of memory!");
}
return ptr;
}