Parse Trees and Abstract Syntax Trees
A parse tree is a representation of the information obtained by parsing a
particular input. A parse tree contains a node
for each token identified during the parse.
An abstract syntax tree is a condensed form of parse tree which
omits elements such as parentheses and semicolons, which are syntactically
important but have no semantic value.
Note that the grammar that defines the parser itself is generally a
directed graph rather than a tree, since it has cycles wherever tokens
are defined recursively.
The advantage of creating a syntax tree is that once the whole tree is
available, more sophisticated analysis can be done than can be done while
parsing. You can traverse the tree as many times as necessary to collect
various kinds of data or examine the entire tree with a view to, say,
compiler optimization.
Internally, tree nodes are represented as C++ class objects. As each
grammatical construct (token) is identified in the input stream by the
parser, the associated reduction procedure creates an object of the
appropriate class. The pointer to the new object is stored on the parser
stack so the pointer can be saved in the parent node object when it in turn
is created. Objects are thus created for higher and higher level tokens,
forming a tree structure. Eventually, the whole process works back to the
grammar token (also known as the start or goal token) and the parse is
complete, with the full abstract syntax tree present in memory.
|
Abstract Syntax Tree Definitions
The astdefs.h
and astdefs.cpp files define classes used to
create an abstract syntax tree for the CLL and PLL syntaxes used in the Extensible Script Language Development
Kit. An abstract syntax tree is a means of recording the results of parsing
a script. It contains all the essential elements of the original script,
but organized for easy access. Two of the interpreter architectures used in
the kit use abstract syntax trees: astci
and astxi. Both use the same parser, ast, which builds the abstract syntax tree.
astxi then executes the script by walking the tree.
astci compiles bytecode by walking
the tree and then executes the resulting bytecode.
An abstract syntax tree consists of a root node and subnodes. Each subnode
may, in turn have its own subnodes. The AbstractSyntaxTree class encapsulates
the entire tree. The AstNode class is used to
describe a particular node. Each node has a type field which specifies what
kind of node it is. Each different type of node has a specific number of
subnodes, specified in the global array nSubnodes. The AstNode class provides for
accessing the subnodes of a node either as members of an array, or
explicitly by name.
Specific abstract syntax tree node types correspond generally to rules in
the syntax specification, although it sometimes happens that one node type
may represent more than one different rule.
Each different type of node, conceptually at least, has a unique name,
although the name does not appear explicitly as a variable or as a
structure in the program. Rather, in all cases,
the node name is prefaced with the string type
to denote the corresponding node type, or with subnodes to
denote the structure that identifies the subnodes of this particular
sort of node.
Nodes which contain more information than just pointers to subnodes are
implemented as classes derived from AstNode. One such class,
ListNode, contains a list of node pointers.
In what follows, syntax examples are taken largely from the CLL syntax.
The PLL usage is essentially identical.
The NodeType enum statement contains an identifier for each different type
of node needed to represent a tree describing the syntax specified in
either cll.syn or pll.syn.
The naming convention is to preface the name of the node with type.
Each distinct node type has a fixed number of subnodes. Nodes with a
variable number of subnodes are implemented as instances of ListNode. The nSubnodes array records the
number of subnodes for each type of node specified in the Nodetype enumeration.
When adding or deleting node types, care must be taken to make sure that
the entries in the nSubnodes array correspond correctly to the types in the
NodeType enumeration.
The function
AbstractSyntaxTree buildTree(const char *);
is the external interface of the abstract syntax tree parser.
Although it is defined in
astdefs.h, it is implemented in
ast.syn rather than
astdefs.cpp. It
parses the specified script text and creates, if successful, a tree
representation of the script. If unsuccessful, it throws an
ErrorDiagnostic object.
The AbstractSyntaxTree class encapsulates the pointer
to the root node of the tree. It also contains a stack where pointers to
all the nodes of the tree are recorded. The destructor for the tree can
then assure that all nodes have been properly deleted, even if, because of
an error, the creation of the tree has terminated abnormally.
The class also contains member functions which are wrappers for the
constructors of the AstNode class and the classes derived from it. These
wrapper functions ensure that all nodes that are created are properly
deleted when the time comes. Since the constructors and destructors for the
node classes are protected and AbstractSyntaxTree is declared as
a friend of each of the node classes, nodes can only be created by
the wrapper functions belonging to AbstractSyntaxTree,
guaranteeing that all nodes are properly recorded.
AgStack<AstNode *> nodeStack;
-
The nodeStack contains a list of all the nodes created in the course of
building the tree. When the tree is destroyed, these nodes are deleted
even though the tree may not have been completed, as could happen, for
instance, if there were a syntax error during creation of the tree.
AstNode *root;
-
The root is NULL until the tree has been completely built. It then
points to the root node of the tree. The root pointer, like the
subnode pointers in a node, is a reference only and does not need to
be deleted.
AbstractSyntaxTree();
-
Creates an AbstractSyntaxTree object with the root pointer
initialized to NULL.
AbstractSyntaxTree(const AbstractSyntaxTree &t);
-
The copy constructor does not create a new instance of the tree.
It simply provides another reference to the previously existing tree.
~AbstractSyntaxTree();
-
The destructor interrogates the nodeStack object to determine whether
there are multiple AbstractSyntaxTree objects referring to the same
underlying tree. If so, the destructor simply exits. If the current
reference is the last remaining reference to the tree, all nodes that
were created in the course of building the tree are deleted.
AstNode *stackNode(AstNode *n;
-
The stackNode function is used by the constructor wrapper functions
to record the node pointer on the nodeStack.
virtual FileLocation context();
- The context function is defined only as a placeholder. In the
ast.syn file, the Tree class is derived from
AbstractSyntaxTree and implements the context() function in
terms of the actual parser.
AstNode *node(NodeType t, ... );
-
The node function wraps the AstNode constructor,
automatically providing the context field and recording the node
pointer in the nodeStack.
AstNode *listNode(const AgStack<AstNode *>);
-
The listNode function wraps the ListNode constructor,
automatically providing the context field and recording the node
pointer in the nodeStack.
AstNode *assignmentNode(const AgString &, Opcode, AstNode *);
-
The assignmentNode function wraps the AssignmentNode constructor, automatically
providing the context field and recording the node pointer in the
nodeStack.
AstNode *binaryNode(Opcode, AstNode *, AstNode *);
-
The BinaryNode function wraps the BinaryNode constructor, automatically
providing the context field and recording the node pointer in the
nodeStack.
AstNode *unaryNode(Opcode, AstNode *);
-
The unaryNode function wraps the UnaryNode constructor, automatically
providing the context field and recording the node pointer in the
nodeStack.
AstNode *constantNode(const Value &);
-
The constantNode function wraps the ConstantNode constructor, automatically
providing the context field and recording the node pointer in the
nodeStack.
AstNode *variableNode(const AgString &);
-
The variableNode function wraps the VariableNode constructor, automatically
providing the context field and recording the node pointer in the
nodeStack.
AstNode *functionCallNode(const AgString &name, const AgStack<AstNode *> &n);
-
The functionCallNode function wraps the
FunctionCallNode constructor providing the context field and
recording the node pointer in the nodeStack.
The AstNode class is the base class for all AbstractSyntaxTree nodes.
It also is used to represent any node that contains only pointers to
subnodes. Nodes which have more content than merely a list of subnodes
are implemented as derived classes.
Since all AstNode objects are created by AbstractSyntaxTree member functions,
AbstractSyntaxTree is declared as a friend class, and the constructor and
destructor are declared as protected so that nodes cannot be inadvertently
created or destroyed outside the context of a given AbstractSyntaxTree.
const NodeType type;
-
The type field defines the content of the node. It can only be
set when the node is initially constructed and cannot be subsequently
altered. Thus a pointer to an object whose type field indicates it
is an instance of a derived class may be safely cast to a pointer to
the derived class.
FileLocation context;
-
The context field identifies the beginning location in the script source
of the text described by this node. It is used in diagnostic messages to
identify the location of a problem.
- Anonymous Union
-
The purpose of the anonymous union in the definition of AstNode is
to allow access to subnodes individually or as a group. When walking
a tree, if all nodes are to be dealt with in the same way, it is much
easier to do if they are seen as elements in an array. Thus the first
element in the anonymous union is an array of five subnode pointers.
Five is adequate, since the most complex node in the tree, the SimpleFor
node, has just five subnodes.
On the other hand, it is often necessary to deal differently with
different subnodes. In this case, it is desirable to access the subnodes
by name. Furthermore, the appropriate names depend on the type of the
node. Thus, the union contains a struct for each different node type.
The name of the struct field is given by subnodes followed by
the logical name of the node.
AstNode(const FileLocation &, NodeType t, ... );
-
In addition to the FileLocation reference and the NodeType, the constructor can
accept up to five pointers to AstNode objects. These values default to NULL. Five is
adequate, since no rule in the interpreter syntax identifies more than five subnodes.
virtual ~AstNode();
-
Since the AbstractSyntaxTree destructor handles destruction of node objects,
there is nothing for the AstNode destructor to do. The destructor is
virtual to allow the destructors for derived classes to be called
properly.
The abstract syntax tree nodes implemented as instances of AstNode are
as follows:
Null
-
This node is used where there is no information to be stored, as in
empty expression statements or for the last item in linked lists. Instead
of using a Null node, it would also be possible to simply use a null
pointer. However, were this to be done, all tree walking functions would
have to check explicitly for a null pointer. Use of a null node means the
null case can be dealt with in switch statements just like any other
node.
Script
-
The Script node represents the "start" production of the script syntax:
script $
-> statements, eof
The Script node is always the root node of the syntax tree.
The Script node contains a pointer to an instance of the ListNode
which lists all the statements in the script.
IfStatement
-
The IfStatement node has pointers to two subnodes: the condition
to be tested, and the statement to be executed if the condition turns out
to be true.
IfElseStatement
-
The IfElseStatement node has pointers to three subnodes: the condition
to be tested, the statement to be executed if the condition is true, and
the statement to be executed if the statement is false.
WhileStatement
-
The WhileStatement node has pointers to two subnodes: the condition that
controls the loop, and the statement which forms the body of the loop.
ForStatement
-
The ForStatement node has pointers to four subnodes: the initialization
code for the loop, the loop condition, the increment code, and finally,
the statement which forms the body of the loop.
SimpleFor
-
The SimpleFor node represents a Pascal type for loop. It has pointers to
five subnodes: the control variable, the initial value, the loop increment,
the final value and finally, the statement which forms the body of
the loop.
DoStatement
-
The DoStatement node has pointers to two subnodes: the statement which
forms the body of the loop, and the condition which controls the loop.
Note that even though the DoStatement node has subnodes that are the same
as those for a WhileStatement, the two types of nodes must be maintained
distinct because the functionality they represent is different.
RepeatStatement
-
The RepeatStatement node has pointers to two subnodes: the statement which
forms the body of the loop, and the condition which controls the loop.
Note that the RepeatStatement differs from the DoStatement only in the
treatment of the loop condition.
Break
-
The Break node represents a break statement in the script.
It has no subnodes.
Continue
-
The Continue node represents a continue statement in the script.
It has no subnodes.
ExpressionStatement
-
The ExpressionStatement node contains a pointer to an expression node.
The extra level of reference is necessary, since compiling an expression
statement is not the same as compiling an expression.
StatementBlock
-
The StatementBlock node is actually an instance of
ListNode. It simply lists all the statements
in a compound statement.
CommaExpression
-
The CommaExpression node contains pointers to the two
expressions linked together by the comma.
ConditionalExpression
-
The ConditionalExpression node contains a pointer to the condition expression and
pointers to the expressions to be evaluated depending on the outcome
of testing the condition.
LogicalOrExpression
-
The LogicalOrExpression node contains pointers to the two conditions.
Although it would be possible to combine the LogicalOrExpression and
LogicalAndExpression nodes into a single LogicalExpression node,
the differences in handling the two operations are such that doing so
would complicate rather than simplify matters.
LogicalAndExpression
-
The LogicalAndExpression node contains pointers to the two conditions.
See the discussion of the LogicalOrExpression for a discussion of the
possibility of making a single LogicalExpression node.
Dump
-
The Dump node represents an entire dump statement. It is actually an
instance of a ListNode. The list contains
pointers to Variable nodes which identify the
variables whose values are to be dumped.
Print
-
The Print node represents an entire print statement. Like the Dump node, it is actually an instance of a ListNode. The list contains pointers to expression
nodes whose values are to be printed in sequential order.
The following classes are all derived from AstNode.
Thus they inherit the type field, the context field, and the anonymous
union that belong to AstNode. In each case, they contain specific
additional information necessary to represent a node in the syntax tree.
The List node is used for syntactic constructs which create simple
lists. The lists are created using an
AgStack<AstNode *>
object. For example:
(AgStack<AstNode *>) statement list
-> =AgStack<AstNode *>();
-> statement list:s, statement:x =s.push(x);
Once the list has been created, a ListNode object is created to hold it.
Three types of node actually are instance of ListNode:
In all cases, it would be possible to create the list using ordinary
nodes and subnodes. However, if the statement list token
were implemented in this way, to execute or compile a statement list, one
level of recursion would be required for each statement in the list, since
the order of statements in the list would be the reverse of the order of
statements in the script. Since the entire script is actually just a
statement list, the depth of recursion could easily become excessive.
-
const AgArray<AstNode
*> list;
-
An array can be used, since, once created, no more entries will added to
the list. The constructor argument is an AgStack structure, but is
converted to an AgArray by the ListNode constructor. The benefit of using
an array is that less storage is required and access to the k'th element
of the array is independent of the value of k, whereas access to the k'th
element of an AgStack increases with the logarithm of k.
-
ListNode(const FileLocation &, AgStack<AstNode *>);
-
The node list is initialized with the contents of
the stack.
~ListNode();
-
There is nothing for the destructor to do, since the AgArray object takes
care of itself.
The assignment node is used to represent assignment expressions in the
CLL language and assignment statements in the PLL language. The relevant CLL syntax is as follows:
assignment expression
-> lvalue, assignment op, assignment expression
The Pll syntax is as follows:
simple statement
-> lvalue, ":=", expression, ';'
The node has two subnodes representing the lvalue and the assignment expression.
The assignment operator is the only member field of the derived class.
Opcode opcode;
-
This is the opcode used in the assignment statement.
It is defined by the token assignment op in ast.syn:
(Opcode) assignment op
-> '=' =STORE;
-> "+=" =ADDM;
-> "-=" =SUBM;
-> "*=" =MULM;
-> "/=" =DIVM;
-> "%=" =MODM;
-> "|=" =IORM;
-> "&=" =ANDM;
-> "^=" =XORM;
-> "<<=" =LSM;
-> ">>=" =RSM;
AssignmentNode(const FileLocation &, AstNode *d, Opcode op, AstNode *x);
-
There is no null constructor for the class.
~AssignmentNode();
-
There is nothing for the destructor to do.
The BinaryNode class represents all the binary arithmetic operations in
the script language. It consists of pointers to the nodes representing
the two operands and the binary operator that characterizes the node.
Opcode opcode;
-
This is the opcode that describes the binary operation. The
possible values are give by the following productions in ast.syn:
(Opcode) equality op
-> "==" =EQ;
-> "!=" =NE;
(Opcode) relational op
-> '<' =LT;
-> "<=" =LE;
-> '>' =GT;
-> ">=" =GE;
(Opcode) shift op
-> "<<" =LS;
-> ">>" =RS;
(Opcode) additive op
-> '+' =ADD;
-> '-' =SUB;
(Opcode) multiplicative op
-> '*' =MUL;
-> '/' =DIV;
-> '%' =MOD;
BinaryNode(const FileLocation &, Opcode op, AstNode *, AstNode *);
-
There is no null constructor for the class.
~BinaryNode();
-
There is nothing for the destructor to do.
The UnaryNode class represents all the unary arithmetic operations in the
script language including the standard unary arithmetic operators, the
cast operators, and the memory reference instructions which dereference a
pointer. The class contains a pointer to the node representing the
operand and the unary operator that characterizes the node.
Opcode opcode;
-
This is the opcode that describes the unary operation. There are
three classes of unary operators. The first class consists of memory
reference operators which dereference a pointer to fetch or store data.
These operators are:
- FETCH
- Fetch value addressed by top stack element.
- I_FETCH
- Increment, then fetch value addressed by top of stack.
- D_FETCH
- Decrement, then fetch value addressed by top of stack.
- FETCH_I
- Fetch, then increment value addressed by top of stack.
- FETCH_D
- Fetch, then increment value addressed by top of stack.
- STORE
- Store second item on stack in location addressed by the
top of stack. Don't pop the value.
The second set of unary operators is given by the following productions
in ast.syn:
(Opcode) unary op
-> '-' =NEG;
-> '!' =NOT;
-> '~' =COM;
The third set of unary operators consists of the cast operators, given
by the following productions in ast.syn:
primary
-> '(', "long", ')', primary:x =
AST.unaryNode(CAST_LONG, x);
-> '(', "double", ')', primary:x =
AST.unaryNode(CAST_DOUBLE, x);
UnaryNode(const FileLocation &, Opcode op, AstNode *);
-
There is no null constructor for the class.
~UnaryNode();
-
There is nothing for the destructor to do.
The ConstantNode class represents numeric constants in the script. The value
of the constant is stored as an instance of the Value
class.
Value value;
- Records the value and type of the constant.
ConstantNode(const FileLocation &, const Value &);
-
There is no null constructor for the class.
~ConstantNode();
-
There is nothing for the destructor to do.
The VariableNode class is used to represent references to variables.
There are only two such usages: in the dump statement and as an
lvalue. The class contains only a string representation of the name
of the variable.
AgString name;
-
The name field contains the name of the variable stored as an
AgString object.
It would be possible to use an index into an
AgDictionary<AgString>
instead of an AgString object, and thus save memory space, however
such savings would come at the cost of more complex interfaces.
VariableNode(const FileLocation &, const AgString &);
-
There is no null constructor for the class.
~VariableNode();
-
There is nothing for the destructor to do, since the AgString object takes
care of itself.
The FunctionCallNode class is the interface which allows calls to external
functions. It contains the ascii name of the function to be called and
a list of the argument expressions.
AgString name;
-
The name of the function which is to be invoked.
const AgStack<AstNode *> argList;
-
Contains the list of argument expressions.
FunctionCallNode(const FileLocation &,
const AgString &,
const AgStack<AstNode *> &);
-
There is no null constructor for the class.
~FunctionCallNode();
-
There is nothing for the destructor to do, since the AgString object takes
care of itself.
|