Home
Trial Copy
Intro. to Parsing
Users Say...
Special Features
Notation Summary
New 2.01 Features
File Trace
Grammar Trace
Glossary
Examples
Expression evaluator (freeware)
XIDEK interpreter kit (freeware)
Lex/Yacc Comparison
If-else ambiguity
Contact
Parsifal
|
Glossary of Parsing Terms
- Action
- A "parser action" is one of the basic executable elements of a
parsing engine. In a traditional parsing engine
there are four actions: the
shift action, the reduce
action, the accept action, and the
error action. The parsing engines which AnaGram
generates use a substantially greater number of actions, in order to provide
certain short cuts in the interest of greater speed and greater compactness of
the tables which control the action of the parsing engine.
- Accept Action
- The accept action is one of the four actions
of a traditional parsing engine. The accept action
is performed when the parser has succeeded in identifying
the goal token for the grammar.
When the parser executes the accept action, it returns to the calling program.
The accept action is thus the last action of
the parsing engine and occurs only once for each successful execution
of the parser.
- Associativity
- This is a term used to describe binary
operators . It describes how you interpret a sequence of operations which
all involve the same operator. Consider a - b - c, for instance. Here the
convention is that we subtract b from a, and then c from the difference. We say
that subtraction is left associative, since if there is a sequence of
subtraction operations, the leftmost one is to be performed first. As a second
example, consider exponentiation. FORTRAN uses ** as an operator to indicate
exponentiation. In order to conform as closely as possible to ordinary
mathematical notation, the expression a**b**c is interpreted to mean that first
b is raised to the power c, and then the resultant value is used as the power to
which a is raised. We say that exponentiation is right associative since the
rightmost of a sequence of exponentiation operations is to be performed first.
If a programming language does not allow for unparenthesized sequences of a
particular operator, the operator is said to be non-associative.
Another way to view left and right associativity is as implied parenthesization.
Left associative operators are parenthesized from the left. Right associative
operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and
a**b**c = (a**(b**c))
AnaGram offers two ways to specify associativity of
binary operators. The first
way is to use conventional Backus-Naur Form syntax
descriptions. You would use recursion to describe both subtraction and
exponentiation. Subtraction, since it is left associative, uses left recursion,
and exponentiation, being right associative, uses right recursion. Thus
difference -> value | difference, '-', value
exponential -> value | value, "**", exponential
could be used to describe differences and exponents.
The second way to specify associativity is to use an ambiguous grammar and
precedence declarations. (See Chapter 9, AnaGram User's Guide.)
- Backtracking
- In order to make parse tables more compact and parsers faster, it is
common to use default reductions. In case of
error, it is necessary to undo default reductions before diagnostics can be
properly determined. In AnaGram, this undo operation is called backtracking.
- Backus-Naur Form
- Backus-Naur Form, or BNF, is a conventional notation for describing
context free grammars. AnaGram uses an
extended notation for grammars, which, except for
semantically determined
productions, can be shown to be equivalent to BNF. The term "BNF" is
often used colloquially to refer to a grammar specification.
AnaGram's syntax specification language differs from BNF in the following
respects:
- In conventional BNF, a symbol not enclosed in angle brackets (< >) is
taken to represent itself literally. In AnaGram, literal character
representations must be enclosed in single quotes and literal strings within
double quotes.
- In BNF, the right side of a production is simply a list of symbols without
any delimiters or punctuation. AnaGram requires that the rule elements which
make up a grammar rule, or right side, be joined by commas.
- BNF makes no provision for identifying reduction procedures or their
arguments. AnaGram provides both reduction procedures, introduced by an "="
at the end of the production, and named arguments, introduced by a ":"
at the end of any token on the right side of the production.
- AnaGram allows virtual productions to be used freely.
- BNF is "pure" in that if you wish to define a nonterminal token
called "digit" to represent the digits from zero to nine, you must
provide ten explicit productions to define it. AnaGram treats the concept of "terminal
token" as used in language theory as an abstraction, and interposes
character set functionality between actual character input and the terminal
tokens of BNF. You can define digit to be the character range '0-9', for
instance, and AnaGram will determine and define precisely the minimum number of
productions necessary, taking into account any other usage of the characters
'0-9' in your grammar. This makes your grammar more compact, more manageable,
and easier to modify.
- AnaGram allows for
semantically determined productions, which provide a
significant mechanism for melding semantic analysis with syntactic
analysis.
- Binary operator
- A binary operator is an operator that works on two operands to create a
result. It is often written between its operands and is sometimes called an
infix operator. In ordinary programming languages "+" and "-"
are binary operators.
- Binding
- Most programming languages, rather than executing arithmetic operators in
simple left to right order, conform to the traditional conventions of ordinary
algebra, which dictate that, except where parenthesis indicate otherwise,
exponentiation is done before multiplication, multiplication before addition,
and addition before comparison. One says that exponentiation is "more
tightly binding" than multiplication, multiplication is more tightly
binding than addition, and so on. The sense of the word here is that the
operator binds together its operands into a single entity. An operand which
falls between two different operators is "bound" by the more tightly
binding operator. An operator that is more tightly binding than another is also
said to have higher precedence.
- Character sets
- One of the traditional problems with syntax directed programming is that
caused by the fact that most formal language specifications have productions of
the form:
letter -> a | b | c | d ... | z
Since the letters are not often distinguished in the grammar, this large
number of essentially identical productions causes a correspondingly large
number of states in the parser. This problem is often attacked by using a "lexical scanner" which simply specifies
a "letter token" whose value indicates precisely which letter is
intended. This works fine as long as there is nothing in the grammar which
distinguishes any letter at all. If there is, and there is usually some special
use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the
result is to abandon the use of syntax directed programming for elemental units
of the language and use a more complex lexical scanner which identifies names,
numbers, key words and operators. Such lexical scanners are often built using
conventional programming languages or simple pattern recognizing languages such
as LEX.
AnaGram avoids this problem by incorporation the notion of character set into
its input specifications. Briefly, the way it works is the following: You
specify the set of characters which make up any ostensibly terminal token.
AnaGram then analyzes the overlaps among these definitions and creates a minimal
set of tokens which match your specifications exactly. For instance, suppose you
have the following definitions:
letter = 'a-z' + 'A-Z'
hex digit = '0-9' + 'a-f' + 'A-F'
AnaGram will define a token for letter which has two productions:
letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
With this technique, you have the freedom to specify your grammar in the
easiest possible manner, without the penalty of an absurdly large, slow parser.
Character sets may be specified in terms of ranges of characters, as in the
above example, as unions, denoted by "+", as
intersections, denoted by "&", as
differences, denoted by "-", and as
complements, using "~". You may also
specify a set consisting of a single character either with a literal character
or with a numeric specification. If you specify a character numerically, you may
use either decimal, octal or hexadecimal notation, as in C.
- Character Universe
- In ordinary set theory, the "universe" is the set of all
entities which may be elements of a set. It must be defined so that the
complement of a set can mean something unambiguous. In AnaGram, the character
universe normally comprises the ascii characters and the IBM extended character
set, that is, all characters in the range from 0 through 255. If, however, you
have defined input characters outside this range, the character universe will be
extended to the least character you have used and to the greatest character you
have used in your grammar.
- Characteristic Rule
- Each parser state is characterized by a particular set of grammar rules,
and for each such rule, a marked token which is the next token expected. These
rules are called the characteristic rules for the state. In the course of doing
grammar analysis, AnaGram determines the characteristic rules for each parser
state. After analyzing your grammar, you may inspect the State Definition Table
to see the characteristic rules for any state in your parser.
- Characteristic Token
- Every state in a parser, except state zero, can be characterized by the
one, unique token which causes a jump to that state. That token is called the
characteristic token of the state, because to get to that parser state you must
have just seen precisely that token in the input. Note that several states could
have the same characteristic token.
When you have a list of states, such as is given by the parser state stack, it
is equivalent to a list of characteristic tokens. This list of tokens is the
list of tokens that have been recognized so far by the parser. Some of these
tokens, of course, may be nonterminal tokens and may thus represent the result
of reducing a sequence of previously recognized tokens.
- Complement of a set
- In set theory, the complement of a set, S, is the collection of all
elements of the character universe which are not elements of S. Using AnaGram's
notation, the complement of S is written ~S. The complement operator has higher
precedence than the
difference, intersection, or union operators.
- Completed Rule
- A "completed rule" is a characteristic rule whose index is
pointing beyond the last rule element. In other words, it has been completely
matched and will be reduced by the next input.
If a state has more than one completed rule, the decision as to which to reduce
is made based on the next input token. If there is only one completed rule in a
state, it will be reduced by default unless the default reductions switch has
been reset, i.e., turned off.
- Conflict
- Conflicts arise during a grammar analysis when AnaGram cannot determine
how to treat a given input token. There are two sorts of conflicts: shift-reduce
conflicts and reduce-reduce conflicts. Conflicts may arise either because the
grammar is inherently ambiguous, or simply because the grammar analyzer cannot
look far enough ahead to resolve the conflict. In the latter case, it is often
possible to rewrite the grammar in such a way as to eliminate the conflict. Null
productions are a common source of conflict.
There are a number of ways to deal with conflicts. If you understand the
conflict well, you may simply choose to ignore it. When AnaGram encounters a
shift-reduce conflict while building parse tables it resolves it by choosing the
shift action. When AnaGram encounters a reduce-reduce conflict while building
parse tables, it resolves it by selecting the grammar rule which occurred first
in the grammar.
A second way to deal with conflicts is to set operator precedence parameters. If
you set these parameters, AnaGram will use them preferentially to resolve
conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts
window.
A third way to resolve a conflict is to declare some tokens as sticky or as
subgrammars. This is particularly useful for productions whose sole purpose is
to skip over uninteresting input.
The fourth way to deal with conflicts is to rewrite the grammar to eliminate
them. Many people prefer this approach since it yields the highest level of
confidence in the resulting program.
- Context Free Grammars
- Context free grammars are grammars wherein the definition of a grammar
unit, or nonterminal token, does not depend on the context in which the
nonterminal is used. That means that if you define "widget" in a
certain way, that definition is valid in all contexts in which "widget"
might occur. Context free grammars can be represented in Backus-Naur Form.
AnaGram's syntax specification language has no facility for representing
grammars which are not context free.
- Default reductions
- A "default reduction" is a parser action which may be used in
your parser in any state which has precisely one completed rule.
If a given parser state has, among its characteristic rules, exactly one
completed rule, it is usually faster to reduce it on any input than to check
specifically for correct input before reducing it. The only time this default
reduction causes trouble is in the event of a syntax error. In this situation
you may get an erroneous reduction. Normally when you are parsing a file, this
is inconsequential because you are not going to continue performing
semantic actions in the presence of
errors. But, if you are using your parser to handle real-time
interactive input, you have to be able to continue semantic
processing after notifying your user that he has entered erroneous
input. In this case you would want default reductions to have been
turned off so that productions are reduced
only when there is correct input.
- Difference of two sets
- In set theory, the difference of two sets, A and B, is defined to be the
set of all elements of A that are not elements of B. In an AnaGram syntax file,
you represent the difference of two character sets by using the "-"
operator. Thus the difference of A and B is A - B. The difference operator is
left associative.
- Error Action
- The error action is one of the four actions of a traditional parsing
engine. The error action is performed when the parser has encountered an input
token which is not admissible in the current state. The further behavior of a
traditional parser is undefined.
- Event Driven Parser
-
An event driven parser is one in which
the relationship between the host program and the parser is turned
inside out. In a conventional parser, the host program calls the
parser, the parser analyzes the complete input text and returns to
the host program only when it has finished with the entire input.
In an event driven parser, the parser does not read its input
directly from a file or from memory. Instead, the host program, after
initializing the parser, calls it once for each input token. Each
time the parser is called, it updates its state
appropriately, calls any reduction
procedures that need to be called and finally when it needs more
input, returns to the host program. The effect is that parsing can
occur in parallel with other processing performed by the host
program. This technique is especially useful in situations where the
token stream to be parsed is developed on the fly, as when using lexical scanners, for instance.
- Expansion Rule
- In analyzing a grammar, we are often interested in the full range of input
that can be expected at a certain point. The expansion of a token or state shows
us all the expected input. An expansion yields a set of marked rules. Looking
at the marked token shows us what input to expect.
The set of expansion rules of a nonterminal token shows all the expected input
that can occur whenever the token appears in the grammar. The set consists of
all the grammar rules produced by the token, plus all the rules produced by the
first token of any rule in the set. The marked tokens for the expansion rules of a
token are always the first element in the rule.
The expansion of a state consists of its characteristic rules plus the expansion
rules of the marked token in each characteristic rule.
- Grammar
- Traditionally, a grammar is a set of productions which taken together
specify precisely a set of acceptable input sequences in terms of an abstract
set of terminal tokens. The set of acceptable input sequences is often called
the "language" defined by the grammar.
In AnaGram, the term grammar also includes configuration segments, definitions
of character sets, virtual productions, etc. that augment the collection of
productions. A grammar is often called a "syntax", and the rules of
the grammar are often called syntactic rules.
- Grammar Analysis
- The major function of AnaGram is the analysis of context free grammars
written in a particular variant of Backus-Naur Form.
The analysis of a grammar proceeds in four stages. In the first, the input
grammar is analyzed and a number of tables are built which describe all of the
productions and components of the grammar.
In the second stage, AnaGram analyzes all of the character sets defined in the
grammar, and where necessary, defines auxiliary tokens and productions.
In the third stage, AnaGram identifies all of the states of the parser and
builds the go-to table for the parser.
In the fourth stage, Anagram identifies reduction tokens for each completed
grammar rule in each state and checks for conflicts.
- Grammar Rule
- A "grammar rule" is the right side of a production. It consists
of a sequence of rule elements. Each rule element identifies some token, which
can be either a terminal token or nonterminal token. It is "matched"
by a corresponding sequence of tokens in the input stream to the parser. The
left side of the production identifies one or more nonterminal tokens, or
reduction tokens, to which the rule reduces when matched. If there is more than
one reduction token, there should be a reduction procedure to make the choice.
- Grammar Token
- The grammar token is the token which represents the "top level"
in your grammar. Some people refer to it as the "goal" or "goal
token" and others as the "start token". Whatever it is called, it
is the single token which describes the complete input to your parser.
- Intersection
- In set theory, the intersection of two sets, A and B, is defined to be the
set of all elements of A which are also elements of B. In an AnaGram syntax
file, the intersection of two character sets is represented with the "&"
operator. The intersection operator has lower precedence than the complement
operator, but higher precedence than the union and difference operators. The
intersection operator is left associative.
- LALR parsing
- LALR, or "lookahead LR" parsing, a slightly less powerful
variant of LR parsing, produces much smaller parsing tables. Thus an LALR
parser has very significant advantages in size over its LR counterpart. It is
possible to construct grammars which can be parsed by an LR parser but cannot be
parsed by an LALR parser. That is, the LALR parser will find these grammars to
be ambiguous and the LR parser will not. In practice, however, such grammars
seem to be rare and the ambiguities can usually be eliminated by minor revisions.
AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret
syntax files.
- Lexical Scanner
- A lexical scanner is a program or procedure used as a preprocessor for a
parser. It scans input characters and lumps them together into tokens. Many
systems for generating parsers, most notably YACC, can scarcely be used without
a lexical scanner.
AnaGram, although it can support the use of a lexical scanner, does not need a
lexical scanner for most practical problems. AnaGram avoids the need for a
lexical scanner by using character sets and disregard and lexeme statements.
If your problem does in fact need a lexical scanner, you can use AnaGram itself
to write it, so you don't need to know different languages for the scanner and
for the parser.
- Lookahead Token
- The current input token to a parser is often called the "lookahead"
token. In many situations, it is not shifted into the input buffer immediately,
but acts like a catalyst to cause a number of rules to be reduced before it is
eventually shifted in.
- LR Parsing
- An LR(k) parser is a type of deterministic bottom-up shift-reduce parser
using at most k lookahead symbols to determine its next action at any point in
the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the
technical details of LR(k) parsing is beyond the scope of this manual. The
interested reader may consult any of a variety of works covering the subject.
AnaGram produces parsers which employ a variant of LR parsing known as LALR
parsing. It is not necessary to understand the details of LR and LALR parsing
theory in order to use AnaGram.
- Marked Rule
- A "marked rule" is a grammar rule together with a "marked token"
that indicates how much of the rule has already been matched and what input
should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each parser
state is defined by a small number of characteristic rules which indicate what
matches have already been made and what input can be expected. Note that when a
state has more than one characteristic rule, any two characteristic rules with
the same number of tokens to the left of the marked token match identically up to the
marked token. Even if one has fewer tokens to the left than the other, its tokens
match the other exactly up to the marked token.
When marked rules are displayed in AnaGram windows, the marked token is displayed in
a distinctive font which the developer can select.
- Non-associative
- A binary operator, say #, is said to be non-associative if the sequence x
# y # z is not permissible. If this sequence is to be interpreted as (x#y)#z,
the operator is said to be left associative. If the sequence is to be
interpreted as x#(y#z), the operator is said to be right associative. (See
associativity.)
- Nonterminal Token
- A "nonterminal token" is a token which results from reducing the
sequence of tokens which match a grammar rule and replacing them with the
appropriate reduction token. Nonterminal tokens are to be distinguished from
terminal tokens or input tokens.
- Null Production
- A "null production" is one that has no tokens on the right side
whatsoever. Null productions essentially are identified by the first following
input token. Null productions are extremely convenient syntactic elements when
you wish to make some input optional. For example, suppose that you wish to
allow an optional semicolon at some point in your grammar. You could write the
following pair of productions:
optional semicolon -> | ';'
Note that a null production can never follow a '|'. The above could also
be written on multiple lines thus:
optional semicolon
->
-> ';'
You can always rewrite your grammar to eliminate null productions if you
wish, but you usually pay a price in conciseness and clarity. Sometimes,
however, it is necessary to do such a rewrite in order to avoid conflicts, to
which null productions are especially prone.
If you have a null production with no reduction procedure specified, your parser
will automatically assign the value zero to the reduction token.
Null productions can be generated by virtual productions.
- Parameter Assignment
- In any rule in an AnaGram grammar, the
semantic value of any rule element may be
passed to a reduction procedure by
means of a parameter assignment. Simply follow the rule element with a colon and a C variable name.
The C variable name can then be used in the reduction procedure to
refer to the semantic value of the token it is attached to. AnaGram
will automatically provide necessary declarations.
Here are some examples of rule elements with parameter
assignments:
'0-9':d
integer:n
expression:x
declaration : declaration_descriptor
- Parser
- A parser is a program, or more likely a procedure within a program, which
scans a sequence of input characters or input tokens and accumulates them in an
input buffer or stack as determined by a set of productions which constitute a
grammar. When the parser discovers a sequence of tokens as defined by a grammar
rule, or right side of a production, it "reduces" the sequence to a
single reduction token as defined by the left side of the grammar rule. This
nonterminal token now replaces the tokens which matched the grammar rule and the
search for matches continues. If an input token is encountered which will not
yield a match for any rule, it is considered a syntax error and some kind of
error recovery may be required to continue. If a match, or reduction action,
yields the grammar token, sometimes called the goal token or start token, the
parser deems its work complete and returns to whatever procedure may have called
it.
Tokens may have semantic values. If the
input values configuration switch is on, your parser will expect
semantic values to be provided by the input process along with the
token identification code. If the input values switch is off, your
parser will take the ascii value of the input character, that is, the
actual input code, as the value of the character. When the parser reduces a production, it can call a reduction procedure or semantic action to analyze the values of
the constituent tokens. This reduction procedure can then return a
value which characterizes the reduced token.
- Parser Control Block
-
A "Parser Control Block" is a structure which contains
all of the data necessary to describe the instantaneous
state of an AnaGram parser.
A programmer may may add declarations to the parser control
block in an AnaGram grammar by using the extend pcb statement.
- Parser Generator
-
A parser generator, such as AnaGram, is a program that
converts a grammar, a rule-based description of the
input to a program, into a conventional, procedural
module called a parser. The parsers AnaGram generates
are simple C modules which can be compiled on almost
any platform. AnaGram parsers are also compatible with
C++.
- Parser State Stack
-
The parser state stack is a stack maintained by an AnaGram
parser and which is an integral part of the parsing
process. At any point in the parse of your input
stream, the parser state stack provides a summary of
what has been found so far. In an AnaGram parser, the parser state stack is
stored in the Parser Control Block.
- Parser Value Stack
-
In parallel with the parser state stack,
an AnaGram parser
maintains a "value stack", each entry of
which corresponds to the semantic value of the token
identified at that state. Since the semantic values of
different tokens might well have different data types,
AnaGram provides the opportunity to define the data type for any token.
AnaGram then builds a typedef statement creating a data type which is a union
of the all the types you have defined. AnaGram uses this data type to define
the value stack.
- Parsing Engine
- A parser consists of three basic components: A set of syntax tables, a set
of reduction procedures and a parsing engine. The parsing engine is the body of
code that interprets the parsing table, invokes input functions, and calls the
reduction procedures. The Build Parser command configures a parsing engine
according to the implicit requirements of the syntax specification and according
to the explicit values of the configuration parameters.
The parsing engine itself is a simple automaton, characterized by a set of
states and a set of inputs. The inputs are the tokens of your grammar. Each
state is represented by a list of tokens which are admissible in that state and
for each token an action to perform and a parameter which further defines the
action. In a traditional LALR parser, there are only four actions: the shift
action, the reduce action, the accept action and the error action. AnaGram, in
doing its grammar analysis, identifies a number of special cases, and creates a
number of extra actions which make for faster processing, but which can be
represented as combinations of these primitive actions.
When a shift action is performed, the current state number is pushed onto the
parser state stack and the new state number is determined by the current state
number and the current lookahead token. Different tokens cause different new
states.
When a reduce action is performed, the length of the rule being reduced is
subtracted from the parser stack index and the new state number is read from the
top of the parser state stack. The reduction token for the rule being reduced is
then used as an input token.
Each state in the grammar, with the exception of state zero, has a
characteristic token which must have been recognized in order to jump to that
state. Therefore, the parser state stack, which is essentially a list of state
numbers, can also be thought of as a list of token numbers. This is the list of
tokens that have been seen so far in the parse of your input stream. Some of
these tokens, of course, may be nonterminal tokens which have resulted from
reducing other sequences of tokens.
- Partition
- If you use character sets in your grammar, AnaGram will compute a "partition"
of the character universe. This partition is a collection of non-overlapping
character sets such that every one of the sets you have defined can be written
as a union of partition sets. Each partition set is identified by a unique
reference number called the partition set number.
Each partition set is assigned a unique token. If one of your character sets
requires more than one partition set to represent it, AnaGram will create
appropriate productions and add them to your grammar so your parser can make the
necessary distinctions.
- Precedence
- "Precedence" is an attribute of binary operators and
unary
operators which can be used to resolve conflicts. Operator precedence is also
referred to as binding. Suppose that # and @ are two binary operators. The
question is how to interpret x # y @ z. If # has higher precedence than @, it is
interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it
would be x#(y@z). If # and @ have the same precedence then the question must be
resolved by associativity.
Note that all operators at the same precedence level must have the same
associativity.
The situation is somewhat simpler for unary operators. If # and @ were both
prefix operators, or both suffix operators, there would be no ambiguity in
interpretation, since neither # @ x and x # @ offer more than one possible
interpretation. The only difficulty arises when both a prefix and a suffix
operator are applied to the same operand.
Suppose that # is a prefix operator and @ is a suffix operator. The question
then is how to interpret # x @. If # has higher precedence than @, it would be
interpreted as (# x) @. On the other hand, if # has lower precedence than @, it
would be interpreted as # (x @). If # and @ have the same precedence then the
question must be resolved by associativity.
- Production
- Productions are the mechanism used to describe how complex input
structures are built up out of simpler ones. Each production has a left side and
a right side. The right side, or grammar rule, is a sequence of rule elements,
which may represent either terminal tokens or nonterminal tokens. The left side
is a list of reduction tokens. In most cases there would be only a single
reduction token. Productions with more than one token on the left side are
called semantically determined productions. The
"
-> " symbol is used
to separate the left side from the right side.
- Reduce Action
- The reduce action is one of the four actions of a traditional parsing
engine. The reduce action is performed when the parser has succeeded in matching
all the elements of a grammar rule and the next input token is not erroneous.
Reducing the grammar rule amounts to subtracting the length of the rule from the
parser stack index, identifying the reduction token, stacking its
semantic value
and then doing a shift action with the reduction token as though it had been
input directly.
- Reduce-Reduce Conflict
- A grammar has a "reduce-reduce" conflict at some state if a
single token turns out to be a reducing token for more than one completed rule.
- Reducing Token
- In a state with more than one completed rule, your parser must be able to
determine which one was actually found. AnaGram deals with this problem by
looking at all the states the parser will branch to once each rule is reduced.
The acceptable input tokens for those states are the "reducing tokens"
for the completed rules in the state under investigation. If there is a single
token which is a reducing token for more than one rule, then the grammar is said
to have a reduce-reduce conflict at that state. If in a particular state there
is both a shift action and a reduce action for the same token the grammar is
said to have a shift-reduce conflict in that state.
A reducing token is not the same as a reduction token.
- Reduction Procedure
- A "reduction procedure" is a function you write which your
parser executes when it has identified the grammar rule to which the reduction
procedure is attached in your grammar. There
are two formats for reduction
procedures, depending on the size and complexity of the procedure. A
reduction procedure is often referred to as a "semantic action".
- Reduction Token
- A token which appears on the left side of a production is called a
reduction token. It is so called because when the grammar rule on the right side
of the production is matched in the input stream, your parser will "reduce"
the sequence of tokens which matches the rule by replacing the sequence of
tokens with the reduction token. Note that, if more than one reduction token is
specified, your reduction procedure should choose the exact one. If it does not,
your parser will use the leftmost syntactically correct one in the list as the default.
A reduction token is not the same as a reducing token.
- Resynchronization
- "Resynchronization" is the process of getting your parser back
in step with its input after encountering a syntax error. As such, it is one
method of error recovery. Of course, you would resynchronize only if it
is necessary to continue after the error. There are several options available
when using AnaGram. You could use the auto resynch option, which causes AnaGram
to incorporate an automatic resynchronizing procedure into your parser, or you
could use the error resynch option, which is similar to the technique used by
YACC programmers.
- Rule Element
- A grammar rule is a list of "rule elements", separated by
commas. Rule elements may be token names, character sets, keywords, immediate
actions, or virtual productions. When AnaGram encounters a rule element for
which no token presently exists, it creates one.
- Semantic Action
-
A semantic action is a piece of C or C++ code that is executed when
a grammar rule has been identified by a
parser. Semantic actions are also called
reduction procedures, since they
are executed when a grammar rule is reduced to the token on the left side of the
production.
- Semantic Value
-
Tokens, whether terminal
or nonterminal, may have a semantic value.
In the case of terminal tokens, this may be a value assigned by the
lexical scanner or, if the parser is
using direct character input, it will be the ascii value of the
character itself. The values of nonterminal tokens are created by
reduction procedures. As a parse
progresses, token values are shifted
onto the stack, so that in a reduction procedure, the values of the
tokens that comprise the grammar rule that
is being reduced are available
for inspection.
- Semantically Determined Production
- A "semantically determined production" is one which has more
than one reduction token specified on the left side of the production. You would
write such a production when the reduction tokens are syntactically
indistinguishable. The reduction procedure may then specify which of the listed
reduction tokens the grammar rule is to reduce to based on semantic
considerations. If there is no reduction procedure, or the reduction procedure
does not specify a reduction token, the parser will use the leftmost syntactically correct token.
- Set Expression
- A set expression is an algebraic expression used to define a character set
in terms of individual characters, ranges of characters, and/or other sets of
characters as constructed using complements, unions, intersections, and
differences.
- Shift Action
- The shift action is one of the four actions of a traditional parsing
engine. The shift action is performed when the input token matches one of the
acceptable input tokens for the current parser state. The
semantic value of the
token and the current state number are stacked, the parser stack index is
incremented and the state number is set to a value determined by the previous
state and the input token.
- Shift-Reduce Conflict
- A "shift-reduce" conflict occurs if in some state there is a
single token that should be shifted, because it is legitimate input for one of
the rules of the state, but should also be used to reduce some other rule
because it is a reducing token for that rule.
- Syntax Directed Parsing
-
Syntax directed parsing, or formal parsing, is an
approach to building parsers based on formal language
theory. Given a suitable description of a language,
called a grammar, there are algorithms which can be
used to create parsers for the language automatically.
In this context, the set of all possible inputs to a
program may be considered to constitute a language, and
the rules for formulating the input to the program
constitute the grammar for the language.
The parsers built from a grammar have the advantage
that they can recognize any input that conforms to the
rules, and can reject as erroneous any input that fails
to conform.
Since the program logic necessary to parse input is
often extremely intricate, programs which use formal
parsing are usually much more reliable than those built
by hand. They are also much easier to maintain, since
it is much easier to modify a grammar specification
than it is to modify complex program logic.
- Terminal Token
- A "terminal token" is a token which does not appear on the left
side of a production. It represents, therefore, a basic unit of input to your
parser. If the input to your parser consists of ascii characters, you may define
terminal tokens explicitly as ascii characters or as sets of ascii characters.
If you have an input procedure which produces numeric codes, you may define the
terminal tokens directly in terms of these numeric codes.
- Token
- Tokens are the units with which your parser works. There are two kinds of
tokens: terminal tokens and nonterminal tokens. These latter are identified by
the parser as sequences of tokens. The grouping of tokens into more complex
tokens is governed by the grammar rules, or productions in your grammar. In your
grammar, tokens may be denoted by token names, by virtual productions, by
immediate actions, by explicit character representations, or by expressions
which yield character sets.
A "token" should be thought of more or less as being something like a
cloakroom token. Imagine you had taken a kindergarten class to the museum and
all the kids checked their coats. Each one gets a token. What is the probability
of losing one? You take all of the tokens from the children, put them in a paper
bag, tie it up, and check it. You get one token in return. We would say that the
kids' coats represent the actual input to the system, their individual tokens
are the terminal tokens, and your one token which enables you to redeem the
paper bag is a nonterminal token. Nonterminal tokens are just single token
numbers to represent the result of compacting a sequence of more primitive
tokens.
Actually, the word "token" is commonly used to refer to a number of
apparently different things. The tokens you use in a grammar are abstract. The
tokens identified by a parser are concrete instances of the abstract tokens in
the grammar. Furthermore, we have to identify tokens in some manner, so we have
token names and token numbers with which to refer to particular tokens. As a
result the word "token", in any particular context, may refer directly
to either an abstract token, or a concrete instance of an abstract token, or may
refer only indirectly to the token by means of a token name or token number.
As an example, consider a C compiler which has a lexical scanner to identify
input tokens. According to Kernighan and Ritchie, there are six classes of C
tokens: identifiers, keywords, constants, string literals, operators, and other
separators. Consider a particular token, a hexadecimal constant, 0XF00. When the
lexical scanner hands this over to the parser, it has to provide an
identification code which says what kind of token it is, a hexadecimal constant,
as well as the
"semantic value" of the token,
3840. The identification code is usually an integer value,
determined by the designer of the lexical scanner, which by agreement
with the designer of the parser uniquely represents a hexadecimal
constant. This identification code can be thought of as an external
token number.
The grammar for the parser, on the other hand, has a token, "HEXconstant",
let us say, to represent the abstract usage of hexadecimal constants in the C
language. When AnaGram, or any other parser generator for that matter, analyzes
the grammar, it assigns its own reference number to identify "HEXconstant".
This reference number can be thought of as an internal token number to contrast
it with the external token number provided by the lexical scanner. The parsers
AnaGram builds contain a table, ag_ tcv, which provides a translation from the
external to the internal token number. Both of these token numbers, of course,
differ from the semantic value of the token, in this case, 3840.
Even when an AnaGram parser is using simple character input, the word "token"
may represent several different things. Suppose a grammar has a token 'A-Z',
which represents the set of upper case letters. There will be an internal token
number, assigned by AnaGram, which identifies the (abstract) token in parser
tables. The actual (concrete) input token to a parser will, however, be a single
letter, say "Q". In this case, the ascii value of the character "Q",
81, is used both as the external token number and as the semantic value of the
token.
- Unary Operator
- A "unary operator" is a token which represents
some sort of operation which operates on a single entity to produce a new
resultant entity. It is normally represented by a symbol which is written either
preceding the operand or following the operand. In the first case it is called a
"prefix operator" and in the second case a "suffix operator".
- Union
-
The union of two sets is the set of all elements that are to be found in one or
another of the two sets. In an AnaGram syntax file the union of two character
sets A and B is represented using the plus sign, as in A + B. The union operator
has the same precedence as the difference operator: lower than that of
intersection and complement. The union operator is left associative.
- Virtual Production
-
Virtual productions are a special short hand
representation of grammar rules which can be used to
indicate a choice of inputs. They are an important
convenience, especially useful when you are first
building a grammar.
AnaGram rewrites virtual productions, so that when you
look at the syntax tables in AnaGram, there will be
actual productions replacing the virtual productions.
A virtual production appears as one of the rule
elements in a grammar rule, i.e. as one of the members
of the list on the right side of a production.
The simplest virtual production is the "optional"
token. If x is an arbitrary token, x? can be used to
indicate an optional x.
Related virtual productions are x... and x?... where
the three dots indicate repetition. x... represents an
arbitrary number of occurrences of x, but at least one.
x?... represents zero or more occurrences of x.
The remaining virtual productions use curly or square
brackets to enclose a sequence of rules. The brackets
may be followed variously by nothing, a string of three
dots, or a slash, to indicate the choices to be made
from the rules. Note that rules may be used, not merely
tokens.
If r1 through rn are a set of grammar rules, then
{r1 | r2 | ... | rn}
is a virtual production that allows a choice of exactly
one of the rules. Similarly,
{r1 | r2 | ... | rn}...
is a virtual production that allows a choice of one or
more of the rules. And, finally,
{r1 | r2 | ... | rn}/...
is a virtual production that allows a choice of one or
more of the rules subject to the side condition that
rules must alternate, that is, that no rule can follow
itself immediately without the interposition of some
other rule. This is a case that is not particularly
easy to write by hand, but is quite useful in a number
of contexts.
If the above virtual productions are written with [ ]
instead of { }, they all become optional. [ ] is an
optional choice, [ ]... is zero or more choices, and
[ ]/... is zero or more alternating choices.
Null productions are not permitted in virtual
productions in those cases where they would cause an
intrinsic ambiguity.
You may use a definition statement to assign a name to
a virtual production.
|