Virtual Machines
A virtual machine, or bytecode interpreter, is a sort of simulation of a
computer, usually used to implement a script language. You define an
instruction set for this machine according to the needs of your script
language.
The instructions for the virtual computer are called "bytecodes". Each
bytecode uniquely identifies a function to be executed by a "bytecode
interpreter", which simply picks up the bytecodes in order and invokes the
appropriate functions. Java source code, for example, is compiled into
bytecode and a variety of bytecode interpreters are available to execute
this bytecode on different platforms.
Where does parsing come in? You need to compile statements in the script
language into bytecode instructions for your virtual machine. To do this,
the compiler needs to parse the statements, identify their component parts,
and transform them into bytecode.
The best way to create the compiler, from the point of view of freedom from
bugs, ease in making changes, and maintainability, is to generate it from a
formal grammar using a parser generator such as AnaGram. A grammar also has
the advantage of being self-documenting.
In the Extensible Interpreter Development Kit, the ScriptMethod class defines the
imaginary computer. It has functions for listing and executing compiled
bytecode.
|
Bytecode Interpreter Definitions
The bcidefs.h and bcidefs.cpp files contain the classes and
functions necessary to implement a simple bytecode interpreter. The
interpreter uses the opcodes defined in opcodes.h. Text representations of the
opcodes are found in the constant external array opcodeText[],
defined in bcidefs.cpp. Note that for listings of generated code to be
valid, any changes in the opcode enumeration must be reflected in the
opcodeText[] array.
Two classes are defined: ScriptMethod, which
implements a bytecode interpreter, and CodeFragment, which provides functions useful in
compiling code for the interpreter. Both the astci
and dci interpreters use these classes to compile and
execute bytecode.
The bytecode interpreter implements a very simple virtual machine
architecture. There are no registers. Instead, a pushdown
accumulator stack is used. Unary operations operate on the top stack element.
Binary operations operate on the top two stack elements replacing them with
the result of the operation. Apart from the DUMP
instruction there is only one instruction that references memory: the
LOCATE instruction which pushes a pointer to a
specified value onto the stack. FETCH and
STORE instructions then use this pointer to reference
memory.
The values of variables are stored in an array of Value objects contained in a Dataset object. Variable names are
identified in a dictionary, an instance of the AgDictionary<
AgString>
class. The dictionary assigns a unique integer to each variable name
included in the dictionary. The integer is used to index the array of Value
objects in the Dataset object to locate the value of the variable.
Constant values are stored in a constant array, accessed by immediate operators
that index the array of constants.
Instructions may consist of one or two bytecodes. The first bytecode is the
opcode. The second bytecode, if it exists, is the index of a variable
value in the data array, the index of a constant in the constant array, or
the offset of a branch instruction.
For convenience, bcidefs.h contains a typedef statement to define a
Bytecode type. When defined as a signed char, the interpreter
is limited to 128 variable names and 128 immediate constants. A more serious
limitation is that branch instructions are limited to 127 Bytecodes in the
forward direction and 128 Bytecodes in the reverse direction. To replace
these limitations with less restrictive ones, simply change the typedef
from signed char to signed short.
The implementation of the bytecode interpreter is found in the
apply() function of the
ScriptMethod class.
The Opcode enumeration is defined in opcodes.h.
The instructions that are implemented are as follows:
- Instructions with no operand. Instructions consist of a single bytecode.
- HLT
-
Stop execution of the interpreter. Cause an uninitialized Value object to be returned as the
value of the apply() function.
- FETCH
-
Dereference the pointer value on the top of the
stack and replace it with the value it references.
- I_FETCH
-
Dereference the pointer value on the top of the
stack, increment the value it references and then replace
the pointer with the incremented value.
- D_FETCH
-
Dereference the pointer value on the top of the
stack, decrement the value it references and then replace
the pointer with the decremented value.
- FETCH_I
-
Dereference the pointer value on the top of the stack, replace the
pointer with the referenced value, and then increment the referenced
value.
- FETCH_D
-
Dereference the pointer value on the top of the stack, replace the
pointer with the referenced value, and then decrement the referenced
value.
- STORE
-
Pop the value from the top of the stack, store it through the
pointer that is now on the top of the stack. Now dereference the
pointer on the top of the stack so that the value that was stored is
now the top element of the stack.
- POP
-
Remove top element from stack and discard it.
- SAP
-
Pop the value from the top of the stack, and store it
store it in the location specified by the next value on the stack.
The location value is also popped from the stack and discarded.
- PRINT
-
Pop the top element from the stack and print its value.
- RETURN
-
Pop the top element from the stack and save it as the return value
from the apply() function. Then stop execution of the interpreter.
-
Binary operators pop two values from the stack, compute and push the
result on the stack.
- ADD
- Add
- SUB
- Subtract
- MUL
- Multiply
- DIV
- Divide
- IDIV
- Integer divide
- RDIV
- Real divide
- MOD
- Remainder
- POW
- Raise to power
- AND
- And
- IOR
- Inclusive or
- XOR
- Exclusive or
- LS
- Left shift
- RS
- Right shift
-
Binary operators on memory. These operators pop the top element from the stack,
dereference the next stack element, fetch its value, compute the result,
store the result in the referenced location, and replace the reference on the
stack with the result of the computation.
- ADDM
- Add the top element to the referenced variable
- SUBM
- Subtract the top element from the referenced variable.
- MULM
- Multiply the referenced variable by the top element.
- DIVM
- Divide the referenced variable by the top element.
- >MODM
- Set the value of the referenced variable to the remainder after dividing
it by the top element.
- ANDM
- And the referenced variable with the top element of the stack.
- IORM
- Set the referenced variable to the inclusive of it with the top element
of the stack.
- XORM
- Set the referenced variable to the exclusive of it with the top element
of the stack.
- LSM
- Left shift the value of the referenced variable the number of bits specified
in the top element of the stack.
- RSM
- Right shift the value of the referenced variable the number of bits specified
in the top element of the stack.
- POWM
- Raise the value of the referenced variable to the power
specified by the top element of the stack.
- Comparison operators pop two values from the stack and push a result of
one or zero back on to the stack depending on whether the condition is true or
false.
- LT
- less than
- LE
- less than or equal
- GT
- greater than
- GE
- greater than or equal
- EQ
- equal to
- NE
- not equal to
-
Unary operators pop a value from the stack, compute and push the result
back on the stack.
- NEG
- Negate top element on stack
- NOT
- Logical NOT of top element on stack
- COM
- bitwise complement
- Cast operators set the type of the top element on the stack.
- CAST_LONG
-
Make value long. If the value is not integral it is truncated. If the
type is not integer or real, an InterpreterError exception is
thrown.
- CAST_DOUBLE
-
Make value double. If the type is not integer or real, an InterpreterError exception is
thrown.
-
The following instructions have a second parameter, K, which is either
the index of a variable or the index of a constant.
- LOCATE K
- Push a pointer to the K'th variable onto the stack.
- PUSHC K
- Push the K'th constant onto the stack.
- DUMP K
- Print the name and value of the K'th variable.
-
Branch instructions. K is the offset in number of bytecodes
relative to the address of the instruction after the branch instruction.
That is, BR -2 is an infinite loop.
- LAND K
- Branch on false. Remove the top element of the stack if the branch is not taken.
- LOR K
- Branch on true. Remove the top element of the stack if the branch is not taken.
- BR K
- Branch unconditionally.
- BRF K
- Branch on false. Always remove the top element of the stack.
- BRT K
- Branch on true. Always remove the top element of the stack.
- CALL K
- Call the K'th function in the
functionTable.
- RPT K
- If the top element of the stack is less than
or equal to zero, pop the top of the stack and branch. This can be used
to implement counted loops.
The Constant class is derived from the Value class. Its purpose is to handle a
rather peculiar situation with constants such as 0 and 0.0, integer and
floating point representations of zero, respectively. The Value
class rightly does not distinguish these two values. Nevertheless, in
order to handle some arithmetic operators, in particular the logical operators,
the actual type of the constant must be known. If the constantList field of ScriptMethod were
implemented using the Value class, constants with the same value but
different type would be recorded as integer or real depending on whether
the first instance of the constant in the script were integer or real. This
would make it impossible to select the correct arithmetic operation,
integer or double, when one of the operands was a constant.
The Constant class, therefore, overloads the equality operator to
distinguish between integer and floating point zero, so that correct
arithmetic operations can be selected. It also overloads the hash function
in order to take the type field into account, so that integer and real
constants with the same value can be distinguished.
Because constructors cannot be inherited, all the constructors of the
Value class must be replicated.
The ScriptMethod class implements the bytecode interpreter. A ScriptMethod
object is constructed from an AgDictionary<AgString>
object, an array of bytecodes and an array of constants. The dictionary
identifies the variables used in the bytecode. The class has two methods: a
list method which generates a listing of the
bytecode and an apply method which executes the bytecode using a specific
Dataset object.
AgDictionary<AgString> &dictionary;
-
Identifies variable names. Note that the dictionary must be defined
externally. This allows several interpreters to share the same dictionary.
AgArray<Bytecode> bytecode;
-
Contains the bytecode to be executed.
AgArray<Constant> constantList;
-
Contains the constant values used by the script.
ScriptMethod(const char *text,
AgDictionary<AgString> &d);
-
This ScriptMethod constructor is not implemented as part of bcidefs.cpp.
The implementation is found in the actual bytecode compiler module.
- ScriptMethod(const ScriptMethod &m);
-
Copy constructor.
~ScriptMethod();
-
The destructor has nothing to do.
Value
apply(DataSet &d) const;
-
Executes the stored bytecode using the specified DataSet object.
The DataSet object must have been constructed using the same
AgDictionary<AgString>
as was used to construct the ScriptMethod.
If the bytecode exits because of a return statement, the value of the expression,
if any, specified in the return statement, will be returned as the value of the
apply() function. If the script exited without executing a return
statement, or no expression was specified with the return statement, the
Value object returned has its type set
to uninitType.
Values in the DataSet may be initialized before calling the apply()
function and may be queried afterwards. A given Dataset object can be used
with any number of ScriptMethod objects, so long as they are constructed from
the same dictionary. Thus one ScriptMethod object can be used to initialize
values for another, for instance.
const ScriptMethod &list(ostream &f) const;
-
This function may be used to write the generated bytecode to
a file. The list function converts the offsets in branch instructions
to "absolute" addresses, in order to make it easier to follow the
code.
A CodeFragment object contains a sequence of bytecodes. CodeFragment
objects can be concatenated to produce longer CodeFragment objects. This
makes it possible to build up complex constructs from simpler sequences.
Class methods are defined for appending instructions to a CodeFragment
object, for concatenating CodeFragment objects and for implementing loop
and conditional constructs. Using CodeFragments, it is possible to
compile bytecode in a very intuitive manner.
In addition to the code itself, a CodeFragment object contains a list of
branch locations in the code that represent loop exit or loop continue
statements. The patchBranches() method can be used to set these
branches to appropriate values once the actual continue and exit points
of a loop are known.
The CodeFragment class contains a number of static member functions
which create code for conditional and loop constructs by appropriately
combining previously generated code.
CodeFragment()
-
Constructs an empty CodeFragment.
CodeFragment(const CodeFragment &)
-
Copy constructor.
~CodeFragment()
-
There is nothing for the destructor to do.
AgStack<Bytecode> bytecode;
-
The AgStack template class provides a convenient mechanism for
accumulating bytecode.
AgStack<int> branchList;
-
The branchList stack contains the offset locations within the bytecode of
the operands of loop continue and loop break branches in the CodeFragment
object. The operand is set to 0 for continue, 1 for break. When two
CodeFragment objects are concatenated, the concat() function
updates the offsets appropriately.
int size();
-
Returns the length of the code in the object in Bytecodes.
AgArray<Bytecode> getBytecode() ;
-
Creates an AgArray<Bytecode> object containing the bytecode in the
CodeFragment object.
CodeFragment &append(Opcode op);
-
Appends the specified opcode to the CodeFragment and returns a reference
to the CodeFragment itself.
CodeFragment &append(Opcode op, int offset) ;
-
Appends the specified opcode and the integer parameter to the CodeFragment.
The parameter is truncated to the size specified by the Bytecode typedef.
For fetch and store instructions, the parameter is the index of the variable
name in the AgDictionary<AgString>
object. For branch instructions, the integer
is the signed offset for the branch. The function returns a reference to the
CodeFragment object.
CodeFragment &appendContinue(Opcode op);
-
Appends the specified opcode to the CodeFragment object and also an
operand with the value equal to zero. The offset of the operand within
the bytecode is pushed onto the branchList stack for later completion by
the patchBranches() function.
CodeFragment &appendBreak(Opcode op);
-
Appends the specified opcode to the CodeFragment object and also an
operand with the value equal to one. The offset of the operand within the
bytecode is pushed onto the branchList stack for later completion by the
patchBranches() function.
CodeFragment &concat(const CodeFragment &code) ;
-
The specified fragment is appended to the CodeFragment object. The function
returns a reference to the compound object. The offsets in the branchList
belonging to the argument object are incremented by the size of the
CodeFragment object and added to the branchList of the object.
CodeFragment &patchBranches();
-
The patchBranches function scans the branchList and sets the
target location of each loop continue branch to be the first instruction
of the fragment. The target location of each break branch is set to be
the first location following the fragment. The branchList is then
reset.
int containsBreak();
-
Returns non-zero if the CodeFragment contains a break instruction, zero
otherwise.
-
static CodeFragment ifStatement(CodeFragment &condition, CodeFragment &statement);
-
This function combines the code that evaluates the condition with
the code that executes the statement to create the code for the complete
if statement.
-
static CodeFragment ifElse(CodeFragment &condition, CodeFragment &onTrue, CodeFragment &onFalse);
-
Given the code to evaluate the condition, the code to be
executed if the condition is true and the code to be executed
if the condition is false, this function combines these pieces of
code with the necessary branches to create the code for the
entire if-else statement.
-
static CodeFragment whileLoop(CodeFragment &condition, CodeFragment &statement);
-
This function combines the code for the loop condition and the
body of the loop with the appropriate branch instructions. To do so,
it calls appendBreak(BRF)to append the an exit branch to the
condition code and appendContinue(BR) to append a
continue branch to the statement code. It then concatenates the two
pieces of code and calls patchBranches() to calculate and insert
the necessary offsets for these branches as well as for any
break or continue statements that might be found
inside the body of the loop.
-
static CodeFragment doLoop(CodeFragment &statement, CodeFragment &condition);
-
The do-while loop differs from a while loop only in that the
loop condition is not executed the first time through the loop.
Therefore, the code can be generated by prefacing the loop with
a jump over the condition code and its conditional branch. Note
that this is done by placing the code for the condition physically
before the loop body instead of following the loop body as
it actually occurs in the script.
-
static CodeFragment repeatLoop(CodeFragment &statementList, CodeFragment &condition);
-
The repeat until loop differs from a do-while loop in two respects. First,
the first argument is effectively a list of statements rather than a
single statement, although this makes no difference whatsoever to this
function. Second, the treatment of the condition is the opposite of the
treatment for the do-while loop. In the repeat-until loop the loop
continues if the statement is false, and exits if the statement is true.
-
static CodeFragment forLoop(CodeFragment &initializer, CodeFragment &condition, CodeFragment &increment, CodeFragment &statement);
-
The code for a for-loop is laid out in the following order. First
the initializing code, then the increment code, followed by the condition
code and finally the body of the loop. Between the intializing code and
the increment code there is an unconditional branch over the increment
code, since the increment code must not be evaluated the first time
through the loop. The beginning of the increment code is the continue
location for the loop. With this layout, coding the loop is the same
as for the while loop.
static CodeFragment simpleForLoop (const CodeFragment &lValue,
const CodeFragment &begin,
const CodeFragment &increment,
const CodeFragment &end,
const CodeFragment &statement);
-
This function lays out the code for a simple Pascal-like for loop. Because
the loop parameters cannot be changed in the body of the loop, it is
possible to calculate, before entering the loop, the number of times the
statement is to be executed. This simplifies considerably the problem of
dealing with negative increments.
|