Parsifal Software


XIDEK
Extensible Interpreter Development Kit
Reference documentation



Parse Trees and Abstract Syntax Trees

A parse tree is a representation of the information obtained by parsing a particular input. A parse tree contains a node for each token identified during the parse.

An abstract syntax tree is a condensed form of parse tree which omits elements such as parentheses and semicolons, which are syntactically important but have no semantic value.

Note that the grammar that defines the parser itself is generally a directed graph rather than a tree, since it has cycles wherever tokens are defined recursively.

The advantage of creating a syntax tree is that once the whole tree is available, more sophisticated analysis can be done than can be done while parsing. You can traverse the tree as many times as necessary to collect various kinds of data or examine the entire tree with a view to, say, compiler optimization.

Internally, tree nodes are represented as C++ class objects. As each grammatical construct (token) is identified in the input stream by the parser, the associated reduction procedure creates an object of the appropriate class. The pointer to the new object is stored on the parser stack so the pointer can be saved in the parent node object when it in turn is created. Objects are thus created for higher and higher level tokens, forming a tree structure. Eventually, the whole process works back to the grammar token (also known as the start or goal token) and the parse is complete, with the full abstract syntax tree present in memory.

Abstract Syntax Tree Definitions

Introduction

The astdefs.h and astdefs.cpp files define classes used to create an abstract syntax tree for the CLL and PLL syntaxes used in the Extensible Script Language Development Kit. An abstract syntax tree is a means of recording the results of parsing a script. It contains all the essential elements of the original script, but organized for easy access. Two of the interpreter architectures used in the kit use abstract syntax trees: astci and astxi. Both use the same parser, ast, which builds the abstract syntax tree. astxi then executes the script by walking the tree. astci compiles bytecode by walking the tree and then executes the resulting bytecode.

An abstract syntax tree consists of a root node and subnodes. Each subnode may, in turn have its own subnodes. The AbstractSyntaxTree class encapsulates the entire tree. The AstNode class is used to describe a particular node. Each node has a type field which specifies what kind of node it is. Each different type of node has a specific number of subnodes, specified in the global array nSubnodes. The AstNode class provides for accessing the subnodes of a node either as members of an array, or explicitly by name.

Specific abstract syntax tree node types correspond generally to rules in the syntax specification, although it sometimes happens that one node type may represent more than one different rule.

Each different type of node, conceptually at least, has a unique name, although the name does not appear explicitly as a variable or as a structure in the program. Rather, in all cases, the node name is prefaced with the string type to denote the corresponding node type, or with subnodes to denote the structure that identifies the subnodes of this particular sort of node.

Nodes which contain more information than just pointers to subnodes are implemented as classes derived from AstNode. One such class, ListNode, contains a list of node pointers.

In what follows, syntax examples are taken largely from the CLL syntax. The PLL usage is essentially identical.

External Definitions

The NodeType Enumeration

The NodeType enum statement contains an identifier for each different type of node needed to represent a tree describing the syntax specified in either cll.syn or pll.syn.

The naming convention is to preface the name of the node with type.

The nSubnodes Array

Each distinct node type has a fixed number of subnodes. Nodes with a variable number of subnodes are implemented as instances of ListNode. The nSubnodes array records the number of subnodes for each type of node specified in the Nodetype enumeration.

When adding or deleting node types, care must be taken to make sure that the entries in the nSubnodes array correspond correctly to the types in the NodeType enumeration.

The buildTree declaration

The function
AbstractSyntaxTree buildTree(const char *);
is the external interface of the abstract syntax tree parser. Although it is defined in astdefs.h, it is implemented in ast.syn rather than astdefs.cpp. It parses the specified script text and creates, if successful, a tree representation of the script. If unsuccessful, it throws an ErrorDiagnostic object.


The AbstractSyntaxTree Class

Introduction

The AbstractSyntaxTree class encapsulates the pointer to the root node of the tree. It also contains a stack where pointers to all the nodes of the tree are recorded. The destructor for the tree can then assure that all nodes have been properly deleted, even if, because of an error, the creation of the tree has terminated abnormally.

The class also contains member functions which are wrappers for the constructors of the AstNode class and the classes derived from it. These wrapper functions ensure that all nodes that are created are properly deleted when the time comes. Since the constructors and destructors for the node classes are protected and AbstractSyntaxTree is declared as a friend of each of the node classes, nodes can only be created by the wrapper functions belonging to AbstractSyntaxTree, guaranteeing that all nodes are properly recorded.


Member Fields

AgStack<AstNode *> nodeStack;
The nodeStack contains a list of all the nodes created in the course of building the tree. When the tree is destroyed, these nodes are deleted even though the tree may not have been completed, as could happen, for instance, if there were a syntax error during creation of the tree.

AstNode *root;
The root is NULL until the tree has been completely built. It then points to the root node of the tree. The root pointer, like the subnode pointers in a node, is a reference only and does not need to be deleted.


Constructors/Destructor

AbstractSyntaxTree();
Creates an AbstractSyntaxTree object with the root pointer initialized to NULL.

AbstractSyntaxTree(const AbstractSyntaxTree &t);
The copy constructor does not create a new instance of the tree. It simply provides another reference to the previously existing tree.

~AbstractSyntaxTree();
The destructor interrogates the nodeStack object to determine whether there are multiple AbstractSyntaxTree objects referring to the same underlying tree. If so, the destructor simply exits. If the current reference is the last remaining reference to the tree, all nodes that were created in the course of building the tree are deleted.


Protected Member Functions

AstNode *stackNode(AstNode *n;
The stackNode function is used by the constructor wrapper functions to record the node pointer on the nodeStack.

virtual FileLocation context();
The context function is defined only as a placeholder. In the ast.syn file, the Tree class is derived from AbstractSyntaxTree and implements the context() function in terms of the actual parser.

Public Member Functions

AstNode *node(NodeType t, ... );
The node function wraps the AstNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *listNode(const AgStack<AstNode *>);
The listNode function wraps the ListNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *assignmentNode(const AgString &, Opcode, AstNode *);
The assignmentNode function wraps the AssignmentNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *binaryNode(Opcode, AstNode *, AstNode *);
The BinaryNode function wraps the BinaryNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *unaryNode(Opcode, AstNode *);
The unaryNode function wraps the UnaryNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *constantNode(const Value &);
The constantNode function wraps the ConstantNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *variableNode(const AgString &);
The variableNode function wraps the VariableNode constructor, automatically providing the context field and recording the node pointer in the nodeStack.

AstNode *functionCallNode(const AgString &name, const AgStack<AstNode *> &n);
The functionCallNode function wraps the FunctionCallNode constructor providing the context field and recording the node pointer in the nodeStack.


The AstNode Class

Introduction

The AstNode class is the base class for all AbstractSyntaxTree nodes. It also is used to represent any node that contains only pointers to subnodes. Nodes which have more content than merely a list of subnodes are implemented as derived classes.

Since all AstNode objects are created by AbstractSyntaxTree member functions, AbstractSyntaxTree is declared as a friend class, and the constructor and destructor are declared as protected so that nodes cannot be inadvertently created or destroyed outside the context of a given AbstractSyntaxTree.


Member Fields

const NodeType type;
The type field defines the content of the node. It can only be set when the node is initially constructed and cannot be subsequently altered. Thus a pointer to an object whose type field indicates it is an instance of a derived class may be safely cast to a pointer to the derived class.

FileLocation context;
The context field identifies the beginning location in the script source of the text described by this node. It is used in diagnostic messages to identify the location of a problem.

Anonymous Union
The purpose of the anonymous union in the definition of AstNode is to allow access to subnodes individually or as a group. When walking a tree, if all nodes are to be dealt with in the same way, it is much easier to do if they are seen as elements in an array. Thus the first element in the anonymous union is an array of five subnode pointers. Five is adequate, since the most complex node in the tree, the SimpleFor node, has just five subnodes.

On the other hand, it is often necessary to deal differently with different subnodes. In this case, it is desirable to access the subnodes by name. Furthermore, the appropriate names depend on the type of the node. Thus, the union contains a struct for each different node type. The name of the struct field is given by subnodes followed by the logical name of the node.

Constructor/Destructor

AstNode(const FileLocation &, NodeType t, ... );
In addition to the FileLocation reference and the NodeType, the constructor can accept up to five pointers to AstNode objects. These values default to NULL. Five is adequate, since no rule in the interpreter syntax identifies more than five subnodes.

virtual ~AstNode();
Since the AbstractSyntaxTree destructor handles destruction of node objects, there is nothing for the AstNode destructor to do. The destructor is virtual to allow the destructors for derived classes to be called properly.


Nodes Implemented by AstNode

The abstract syntax tree nodes implemented as instances of AstNode are as follows:
Null
This node is used where there is no information to be stored, as in empty expression statements or for the last item in linked lists. Instead of using a Null node, it would also be possible to simply use a null pointer. However, were this to be done, all tree walking functions would have to check explicitly for a null pointer. Use of a null node means the null case can be dealt with in switch statements just like any other node.

Script
The Script node represents the "start" production of the script syntax:
  script $
    -> statements, eof
The Script node is always the root node of the syntax tree. The Script node contains a pointer to an instance of the ListNode which lists all the statements in the script.

IfStatement
The IfStatement node has pointers to two subnodes: the condition to be tested, and the statement to be executed if the condition turns out to be true.

IfElseStatement
The IfElseStatement node has pointers to three subnodes: the condition to be tested, the statement to be executed if the condition is true, and the statement to be executed if the statement is false.

WhileStatement
The WhileStatement node has pointers to two subnodes: the condition that controls the loop, and the statement which forms the body of the loop.

ForStatement
The ForStatement node has pointers to four subnodes: the initialization code for the loop, the loop condition, the increment code, and finally, the statement which forms the body of the loop.

SimpleFor
The SimpleFor node represents a Pascal type for loop. It has pointers to five subnodes: the control variable, the initial value, the loop increment, the final value and finally, the statement which forms the body of the loop.

DoStatement
The DoStatement node has pointers to two subnodes: the statement which forms the body of the loop, and the condition which controls the loop. Note that even though the DoStatement node has subnodes that are the same as those for a WhileStatement, the two types of nodes must be maintained distinct because the functionality they represent is different.

RepeatStatement
The RepeatStatement node has pointers to two subnodes: the statement which forms the body of the loop, and the condition which controls the loop. Note that the RepeatStatement differs from the DoStatement only in the treatment of the loop condition.

Break
The Break node represents a break statement in the script. It has no subnodes.

Continue
The Continue node represents a continue statement in the script. It has no subnodes.

ExpressionStatement
The ExpressionStatement node contains a pointer to an expression node. The extra level of reference is necessary, since compiling an expression statement is not the same as compiling an expression.

StatementBlock
The StatementBlock node is actually an instance of ListNode. It simply lists all the statements in a compound statement.

CommaExpression
The CommaExpression node contains pointers to the two expressions linked together by the comma.

ConditionalExpression
The ConditionalExpression node contains a pointer to the condition expression and pointers to the expressions to be evaluated depending on the outcome of testing the condition.

LogicalOrExpression
The LogicalOrExpression node contains pointers to the two conditions. Although it would be possible to combine the LogicalOrExpression and LogicalAndExpression nodes into a single LogicalExpression node, the differences in handling the two operations are such that doing so would complicate rather than simplify matters.

LogicalAndExpression
The LogicalAndExpression node contains pointers to the two conditions. See the discussion of the LogicalOrExpression for a discussion of the possibility of making a single LogicalExpression node.

Dump
The Dump node represents an entire dump statement. It is actually an instance of a ListNode. The list contains pointers to Variable nodes which identify the variables whose values are to be dumped.

Print
The Print node represents an entire print statement. Like the Dump node, it is actually an instance of a ListNode. The list contains pointers to expression nodes whose values are to be printed in sequential order.


Classes Derived from AstNode

The following classes are all derived from AstNode. Thus they inherit the type field, the context field, and the anonymous union that belong to AstNode. In each case, they contain specific additional information necessary to represent a node in the syntax tree.

The ListNode Class

Introduction

The List node is used for syntactic constructs which create simple lists. The lists are created using an AgStack<AstNode *> object. For example:
  (AgStack<AstNode *>) statement list
    ->                     =AgStack<AstNode *>();
    -> statement list:s, statement:x  =s.push(x);
Once the list has been created, a ListNode object is created to hold it. Three types of node actually are instance of ListNode: In all cases, it would be possible to create the list using ordinary nodes and subnodes. However, if the statement list token were implemented in this way, to execute or compile a statement list, one level of recursion would be required for each statement in the list, since the order of statements in the list would be the reverse of the order of statements in the script. Since the entire script is actually just a statement list, the depth of recursion could easily become excessive.


MemberFields

const AgArray<AstNode *> list;
An array can be used, since, once created, no more entries will added to the list. The constructor argument is an AgStack structure, but is converted to an AgArray by the ListNode constructor. The benefit of using an array is that less storage is required and access to the k'th element of the array is independent of the value of k, whereas access to the k'th element of an AgStack increases with the logarithm of k.


Constructor/Destructor

ListNode(const FileLocation &, AgStack<AstNode *>);
The node list is initialized with the contents of the stack.

~ListNode();
There is nothing for the destructor to do, since the AgArray object takes care of itself.


The AssignmentNode Class

Introduction

The assignment node is used to represent assignment expressions in the CLL language and assignment statements in the PLL language. The relevant CLL syntax is as follows:
  assignment expression
    -> lvalue, assignment op, assignment expression
The Pll syntax is as follows:
  simple statement
   -> lvalue, ":=", expression, ';'
The node has two subnodes representing the lvalue and the assignment expression. The assignment operator is the only member field of the derived class.


Member Fields

Opcode opcode;
This is the opcode used in the assignment statement. It is defined by the token assignment op in ast.syn:
  (Opcode) assignment op
    -> '='                     =STORE;
    -> "+="                    =ADDM;
    -> "-="                    =SUBM;
    -> "*="                    =MULM;
    -> "/="                    =DIVM;
    -> "%="                    =MODM;
    -> "|="                    =IORM;
    -> "&="                    =ANDM;
    -> "^="                    =XORM;
    -> "<<="                   =LSM;
    -> ">>="                   =RSM;

Constructor/Destructor

AssignmentNode(const FileLocation &, AstNode *d, Opcode op, AstNode *x);
There is no null constructor for the class.

~AssignmentNode();
There is nothing for the destructor to do.

The BinaryNode Class

Introduction

The BinaryNode class represents all the binary arithmetic operations in the script language. It consists of pointers to the nodes representing the two operands and the binary operator that characterizes the node.

Member Fields

Opcode opcode;
This is the opcode that describes the binary operation. The possible values are give by the following productions in ast.syn:
  (Opcode) equality op
    -> "=="                    =EQ;
    -> "!="                    =NE;

    (Opcode) relational op
    -> '<'                     =LT;
    -> "<="                    =LE;
    -> '>'                     =GT;
    -> ">="                    =GE;

    (Opcode) shift op
    -> "<<"                    =LS;
    -> ">>"                    =RS;

    (Opcode) additive op
    -> '+'                     =ADD;
    -> '-'                     =SUB;

    (Opcode) multiplicative op
    -> '*'                     =MUL;
    -> '/'                     =DIV;
    -> '%'                     =MOD;

Constructor/Destructor

BinaryNode(const FileLocation &, Opcode op, AstNode *, AstNode *);
There is no null constructor for the class.

~BinaryNode();
There is nothing for the destructor to do.

The UnaryNode Class

Introduction

The UnaryNode class represents all the unary arithmetic operations in the script language including the standard unary arithmetic operators, the cast operators, and the memory reference instructions which dereference a pointer. The class contains a pointer to the node representing the operand and the unary operator that characterizes the node.

Member Fields

Opcode opcode;
This is the opcode that describes the unary operation. There are three classes of unary operators. The first class consists of memory reference operators which dereference a pointer to fetch or store data. These operators are:
FETCH
Fetch value addressed by top stack element.
I_FETCH
Increment, then fetch value addressed by top of stack.
D_FETCH
Decrement, then fetch value addressed by top of stack.
FETCH_I
Fetch, then increment value addressed by top of stack.
FETCH_D
Fetch, then increment value addressed by top of stack.
STORE
Store second item on stack in location addressed by the top of stack. Don't pop the value.

The second set of unary operators is given by the following productions in ast.syn:

  (Opcode) unary op
    -> '-'                     =NEG;
    -> '!'                     =NOT;
    -> '~'                     =COM;

The third set of unary operators consists of the cast operators, given by the following productions in ast.syn:

primary
 -> '(', "long", ')', primary:x =
                     AST.unaryNode(CAST_LONG, x);
 -> '(', "double", ')', primary:x =
                   AST.unaryNode(CAST_DOUBLE, x);

Constructor/Destructor

UnaryNode(const FileLocation &, Opcode op, AstNode *);
There is no null constructor for the class.

~UnaryNode();
There is nothing for the destructor to do.

The ConstantNode Class

Introduction

The ConstantNode class represents numeric constants in the script. The value of the constant is stored as an instance of the Value class.

Member Fields

Value value;
Records the value and type of the constant.

Constructor/Destructor

ConstantNode(const FileLocation &, const Value &);
There is no null constructor for the class.

~ConstantNode();
There is nothing for the destructor to do.

The VariableNode Class

Introduction

The VariableNode class is used to represent references to variables. There are only two such usages: in the dump statement and as an lvalue. The class contains only a string representation of the name of the variable.

Member Fields

AgString name;
The name field contains the name of the variable stored as an AgString object. It would be possible to use an index into an AgDictionary<AgString> instead of an AgString object, and thus save memory space, however such savings would come at the cost of more complex interfaces.

Constructor/Destructor

VariableNode(const FileLocation &, const AgString &);
There is no null constructor for the class.

~VariableNode();
There is nothing for the destructor to do, since the AgString object takes care of itself.


The FunctionCallNode Class

Introduction

The FunctionCallNode class is the interface which allows calls to external functions. It contains the ascii name of the function to be called and a list of the argument expressions.

Member Fields

AgString name;
The name of the function which is to be invoked.

const AgStack<AstNode *> argList;
Contains the list of argument expressions.

Constructor/Destructor

FunctionCallNode(const FileLocation &, const AgString &, const AgStack<AstNode *> &);
There is no null constructor for the class.

~FunctionCallNode();
There is nothing for the destructor to do, since the AgString object takes care of itself.



Table of Contents | Parsifal Software Home Page


XIDEK Extensible Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.