Parsifal Software

Home

Trial Copy

Intro. to Parsing

Users Say...

Special Features

Notation Summary

New 2.01 Features

File Trace

Grammar Trace

Glossary

Examples

Expression evaluator (freeware)

XIDEK interpreter kit (freeware)

Lex/Yacc Comparison

If-else ambiguity

Contact Parsifal

AnaGram


Glossary of Parsing Terms


Action
Accept Action
Associativity
Backtracking
Backus-Naur Form
Binary operator
Binding
Character sets
Character Universe
Characteristic Rule
Characteristic Token
Complement of a set
Completed Rule
Conflict
Context Free Grammars
Default reductions
Difference of two sets
Error Action
Event Driven Parser
Expansion Rule
Grammar
Grammar Analysis
Grammar Rule
Grammar Token
Intersection
LALR parsing
Lexical Scanner
Lookahead Token
LR Parsing
Marked Rule
Non-associative
Nonterminal Token
Null Production
Parameter Assignment
Parser
Parser ControlBlock
Parser Generator
Parser State Stack
Parser Value Stack
Parsing Engine
Partition
Precedence
Production
Reduce Action
Reduce-Reduce Conflict
Reducing Token
Reduction Procedure
Reduction Token
Resynchronization
Rule Element
Semantic Action
Semantic Value
Semantically Determined Production
Set Expression
Shift Action
Shift-Reduce Conflict
Syntax Directed Parsing
Terminal Token
Token
Unary Operator
Union
Virtual Production


Action
A "parser action" is one of the basic executable elements of a parsing engine. In a traditional parsing engine there are four actions: the shift action, the reduce action, the accept action, and the error action. The parsing engines which AnaGram generates use a substantially greater number of actions, in order to provide certain short cuts in the interest of greater speed and greater compactness of the tables which control the action of the parsing engine.

Accept Action
The accept action is one of the four actions of a traditional parsing engine. The accept action is performed when the parser has succeeded in identifying the goal token for the grammar. When the parser executes the accept action, it returns to the calling program. The accept action is thus the last action of the parsing engine and occurs only once for each successful execution of the parser.

Associativity
This is a term used to describe binary operators . It describes how you interpret a sequence of operations which all involve the same operator. Consider a - b - c, for instance. Here the convention is that we subtract b from a, and then c from the difference. We say that subtraction is left associative, since if there is a sequence of subtraction operations, the leftmost one is to be performed first. As a second example, consider exponentiation. FORTRAN uses ** as an operator to indicate exponentiation. In order to conform as closely as possible to ordinary mathematical notation, the expression a**b**c is interpreted to mean that first b is raised to the power c, and then the resultant value is used as the power to which a is raised. We say that exponentiation is right associative since the rightmost of a sequence of exponentiation operations is to be performed first. If a programming language does not allow for unparenthesized sequences of a particular operator, the operator is said to be non-associative.

Another way to view left and right associativity is as implied parenthesization. Left associative operators are parenthesized from the left. Right associative operators are parenthesized from the right. Thus a - b - c = ((a-b) - c) and a**b**c = (a**(b**c))

AnaGram offers two ways to specify associativity of binary operators. The first way is to use conventional Backus-Naur Form syntax descriptions. You would use recursion to describe both subtraction and exponentiation. Subtraction, since it is left associative, uses left recursion, and exponentiation, being right associative, uses right recursion. Thus

    difference  -> value | difference, '-', value
    exponential -> value | value, "**", exponential
could be used to describe differences and exponents.

The second way to specify associativity is to use an ambiguous grammar and precedence declarations. (See Chapter 9, AnaGram User's Guide.)

Backtracking
In order to make parse tables more compact and parsers faster, it is common to use default reductions. In case of error, it is necessary to undo default reductions before diagnostics can be properly determined. In AnaGram, this undo operation is called backtracking.

Backus-Naur Form
Backus-Naur Form, or BNF, is a conventional notation for describing context free grammars. AnaGram uses an extended notation for grammars, which, except for semantically determined productions, can be shown to be equivalent to BNF. The term "BNF" is often used colloquially to refer to a grammar specification.

AnaGram's syntax specification language differs from BNF in the following respects:

  1. In conventional BNF, a symbol not enclosed in angle brackets (< >) is taken to represent itself literally. In AnaGram, literal character representations must be enclosed in single quotes and literal strings within double quotes.
  2. In BNF, the right side of a production is simply a list of symbols without any delimiters or punctuation. AnaGram requires that the rule elements which make up a grammar rule, or right side, be joined by commas.
  3. BNF makes no provision for identifying reduction procedures or their arguments. AnaGram provides both reduction procedures, introduced by an "=" at the end of the production, and named arguments, introduced by a ":" at the end of any token on the right side of the production.
  4. AnaGram allows virtual productions to be used freely.
  5. BNF is "pure" in that if you wish to define a nonterminal token called "digit" to represent the digits from zero to nine, you must provide ten explicit productions to define it. AnaGram treats the concept of "terminal token" as used in language theory as an abstraction, and interposes character set functionality between actual character input and the terminal tokens of BNF. You can define digit to be the character range '0-9', for instance, and AnaGram will determine and define precisely the minimum number of productions necessary, taking into account any other usage of the characters '0-9' in your grammar. This makes your grammar more compact, more manageable, and easier to modify.
  6. AnaGram allows for semantically determined productions, which provide a significant mechanism for melding semantic analysis with syntactic analysis.

Binary operator
A binary operator is an operator that works on two operands to create a result. It is often written between its operands and is sometimes called an infix operator. In ordinary programming languages "+" and "-" are binary operators.

Binding
Most programming languages, rather than executing arithmetic operators in simple left to right order, conform to the traditional conventions of ordinary algebra, which dictate that, except where parenthesis indicate otherwise, exponentiation is done before multiplication, multiplication before addition, and addition before comparison. One says that exponentiation is "more tightly binding" than multiplication, multiplication is more tightly binding than addition, and so on. The sense of the word here is that the operator binds together its operands into a single entity. An operand which falls between two different operators is "bound" by the more tightly binding operator. An operator that is more tightly binding than another is also said to have higher precedence.

Character sets
One of the traditional problems with syntax directed programming is that caused by the fact that most formal language specifications have productions of the form:
    letter -> a | b | c | d ... | z
Since the letters are not often distinguished in the grammar, this large number of essentially identical productions causes a correspondingly large number of states in the parser. This problem is often attacked by using a "lexical scanner" which simply specifies a "letter token" whose value indicates precisely which letter is intended. This works fine as long as there is nothing in the grammar which distinguishes any letter at all. If there is, and there is usually some special use of h, x, q, o, e, or even a-f, the situation gets more complicated. Often the result is to abandon the use of syntax directed programming for elemental units of the language and use a more complex lexical scanner which identifies names, numbers, key words and operators. Such lexical scanners are often built using conventional programming languages or simple pattern recognizing languages such as LEX.

AnaGram avoids this problem by incorporation the notion of character set into its input specifications. Briefly, the way it works is the following: You specify the set of characters which make up any ostensibly terminal token. AnaGram then analyzes the overlaps among these definitions and creates a minimal set of tokens which match your specifications exactly. For instance, suppose you have the following definitions:

    letter = 'a-z' + 'A-Z'
    hex digit = '0-9' + 'a-f' + 'A-F'
AnaGram will define a token for letter which has two productions:
    letter -> 'a-f' + 'A-F' | 'g-z' + 'G-Z'
With this technique, you have the freedom to specify your grammar in the easiest possible manner, without the penalty of an absurdly large, slow parser.

Character sets may be specified in terms of ranges of characters, as in the above example, as unions, denoted by "+", as intersections, denoted by "&", as differences, denoted by "-", and as complements, using "~". You may also specify a set consisting of a single character either with a literal character or with a numeric specification. If you specify a character numerically, you may use either decimal, octal or hexadecimal notation, as in C.

Character Universe
In ordinary set theory, the "universe" is the set of all entities which may be elements of a set. It must be defined so that the complement of a set can mean something unambiguous. In AnaGram, the character universe normally comprises the ascii characters and the IBM extended character set, that is, all characters in the range from 0 through 255. If, however, you have defined input characters outside this range, the character universe will be extended to the least character you have used and to the greatest character you have used in your grammar.

Characteristic Rule
Each parser state is characterized by a particular set of grammar rules, and for each such rule, a marked token which is the next token expected. These rules are called the characteristic rules for the state. In the course of doing grammar analysis, AnaGram determines the characteristic rules for each parser state. After analyzing your grammar, you may inspect the State Definition Table to see the characteristic rules for any state in your parser.

Characteristic Token
Every state in a parser, except state zero, can be characterized by the one, unique token which causes a jump to that state. That token is called the characteristic token of the state, because to get to that parser state you must have just seen precisely that token in the input. Note that several states could have the same characteristic token.

When you have a list of states, such as is given by the parser state stack, it is equivalent to a list of characteristic tokens. This list of tokens is the list of tokens that have been recognized so far by the parser. Some of these tokens, of course, may be nonterminal tokens and may thus represent the result of reducing a sequence of previously recognized tokens.

Complement of a set
In set theory, the complement of a set, S, is the collection of all elements of the character universe which are not elements of S. Using AnaGram's notation, the complement of S is written ~S. The complement operator has higher precedence than the difference, intersection, or union operators.

Completed Rule
A "completed rule" is a characteristic rule whose index is pointing beyond the last rule element. In other words, it has been completely matched and will be reduced by the next input.

If a state has more than one completed rule, the decision as to which to reduce is made based on the next input token. If there is only one completed rule in a state, it will be reduced by default unless the default reductions switch has been reset, i.e., turned off.

Conflict
Conflicts arise during a grammar analysis when AnaGram cannot determine how to treat a given input token. There are two sorts of conflicts: shift-reduce conflicts and reduce-reduce conflicts. Conflicts may arise either because the grammar is inherently ambiguous, or simply because the grammar analyzer cannot look far enough ahead to resolve the conflict. In the latter case, it is often possible to rewrite the grammar in such a way as to eliminate the conflict. Null productions are a common source of conflict.

There are a number of ways to deal with conflicts. If you understand the conflict well, you may simply choose to ignore it. When AnaGram encounters a shift-reduce conflict while building parse tables it resolves it by choosing the shift action. When AnaGram encounters a reduce-reduce conflict while building parse tables, it resolves it by selecting the grammar rule which occurred first in the grammar.

A second way to deal with conflicts is to set operator precedence parameters. If you set these parameters, AnaGram will use them preferentially to resolve conflicts. Any conflicts so resolved will be listed in the Resolved Conflicts window.

A third way to resolve a conflict is to declare some tokens as sticky or as subgrammars. This is particularly useful for productions whose sole purpose is to skip over uninteresting input.

The fourth way to deal with conflicts is to rewrite the grammar to eliminate them. Many people prefer this approach since it yields the highest level of confidence in the resulting program.

Context Free Grammars
Context free grammars are grammars wherein the definition of a grammar unit, or nonterminal token, does not depend on the context in which the nonterminal is used. That means that if you define "widget" in a certain way, that definition is valid in all contexts in which "widget" might occur. Context free grammars can be represented in Backus-Naur Form. AnaGram's syntax specification language has no facility for representing grammars which are not context free.

Default reductions
A "default reduction" is a parser action which may be used in your parser in any state which has precisely one completed rule.

If a given parser state has, among its characteristic rules, exactly one completed rule, it is usually faster to reduce it on any input than to check specifically for correct input before reducing it. The only time this default reduction causes trouble is in the event of a syntax error. In this situation you may get an erroneous reduction. Normally when you are parsing a file, this is inconsequential because you are not going to continue performing semantic actions in the presence of errors. But, if you are using your parser to handle real-time interactive input, you have to be able to continue semantic processing after notifying your user that he has entered erroneous input. In this case you would want default reductions to have been turned off so that productions are reduced only when there is correct input.

Difference of two sets
In set theory, the difference of two sets, A and B, is defined to be the set of all elements of A that are not elements of B. In an AnaGram syntax file, you represent the difference of two character sets by using the "-" operator. Thus the difference of A and B is A - B. The difference operator is left associative.

Error Action
The error action is one of the four actions of a traditional parsing engine. The error action is performed when the parser has encountered an input token which is not admissible in the current state. The further behavior of a traditional parser is undefined.

Event Driven Parser
An event driven parser is one in which the relationship between the host program and the parser is turned inside out. In a conventional parser, the host program calls the parser, the parser analyzes the complete input text and returns to the host program only when it has finished with the entire input.

In an event driven parser, the parser does not read its input directly from a file or from memory. Instead, the host program, after initializing the parser, calls it once for each input token. Each time the parser is called, it updates its state appropriately, calls any reduction procedures that need to be called and finally when it needs more input, returns to the host program. The effect is that parsing can occur in parallel with other processing performed by the host program. This technique is especially useful in situations where the token stream to be parsed is developed on the fly, as when using lexical scanners, for instance.

Expansion Rule
In analyzing a grammar, we are often interested in the full range of input that can be expected at a certain point. The expansion of a token or state shows us all the expected input. An expansion yields a set of marked rules. Looking at the marked token shows us what input to expect.

The set of expansion rules of a nonterminal token shows all the expected input that can occur whenever the token appears in the grammar. The set consists of all the grammar rules produced by the token, plus all the rules produced by the first token of any rule in the set. The marked tokens for the expansion rules of a token are always the first element in the rule.

The expansion of a state consists of its characteristic rules plus the expansion rules of the marked token in each characteristic rule.

Grammar
Traditionally, a grammar is a set of productions which taken together specify precisely a set of acceptable input sequences in terms of an abstract set of terminal tokens. The set of acceptable input sequences is often called the "language" defined by the grammar.

In AnaGram, the term grammar also includes configuration segments, definitions of character sets, virtual productions, etc. that augment the collection of productions. A grammar is often called a "syntax", and the rules of the grammar are often called syntactic rules.

Grammar Analysis
The major function of AnaGram is the analysis of context free grammars written in a particular variant of Backus-Naur Form.

The analysis of a grammar proceeds in four stages. In the first, the input grammar is analyzed and a number of tables are built which describe all of the productions and components of the grammar.

In the second stage, AnaGram analyzes all of the character sets defined in the grammar, and where necessary, defines auxiliary tokens and productions.

In the third stage, AnaGram identifies all of the states of the parser and builds the go-to table for the parser.

In the fourth stage, Anagram identifies reduction tokens for each completed grammar rule in each state and checks for conflicts.

Grammar Rule
A "grammar rule" is the right side of a production. It consists of a sequence of rule elements. Each rule element identifies some token, which can be either a terminal token or nonterminal token. It is "matched" by a corresponding sequence of tokens in the input stream to the parser. The left side of the production identifies one or more nonterminal tokens, or reduction tokens, to which the rule reduces when matched. If there is more than one reduction token, there should be a reduction procedure to make the choice.

Grammar Token
The grammar token is the token which represents the "top level" in your grammar. Some people refer to it as the "goal" or "goal token" and others as the "start token". Whatever it is called, it is the single token which describes the complete input to your parser.

Intersection
In set theory, the intersection of two sets, A and B, is defined to be the set of all elements of A which are also elements of B. In an AnaGram syntax file, the intersection of two character sets is represented with the "&" operator. The intersection operator has lower precedence than the complement operator, but higher precedence than the union and difference operators. The intersection operator is left associative.

LALR parsing
LALR, or "lookahead LR" parsing, a slightly less powerful variant of LR parsing, produces much smaller parsing tables. Thus an LALR parser has very significant advantages in size over its LR counterpart. It is possible to construct grammars which can be parsed by an LR parser but cannot be parsed by an LALR parser. That is, the LALR parser will find these grammars to be ambiguous and the LR parser will not. In practice, however, such grammars seem to be rare and the ambiguities can usually be eliminated by minor revisions.

AnaGram generates LALR parsers and, in addition, uses LALR parsing to interpret syntax files.

Lexical Scanner
A lexical scanner is a program or procedure used as a preprocessor for a parser. It scans input characters and lumps them together into tokens. Many systems for generating parsers, most notably YACC, can scarcely be used without a lexical scanner.

AnaGram, although it can support the use of a lexical scanner, does not need a lexical scanner for most practical problems. AnaGram avoids the need for a lexical scanner by using character sets and disregard and lexeme statements.

If your problem does in fact need a lexical scanner, you can use AnaGram itself to write it, so you don't need to know different languages for the scanner and for the parser.

Lookahead Token
The current input token to a parser is often called the "lookahead" token. In many situations, it is not shifted into the input buffer immediately, but acts like a catalyst to cause a number of rules to be reduced before it is eventually shifted in.

LR Parsing
An LR(k) parser is a type of deterministic bottom-up shift-reduce parser using at most k lookahead symbols to determine its next action at any point in the parse. If (k) is omitted, then k is assumed to be 1. Discussion of the technical details of LR(k) parsing is beyond the scope of this manual. The interested reader may consult any of a variety of works covering the subject.

AnaGram produces parsers which employ a variant of LR parsing known as LALR parsing. It is not necessary to understand the details of LR and LALR parsing theory in order to use AnaGram.

Marked Rule
A "marked rule" is a grammar rule together with a "marked token" that indicates how much of the rule has already been matched and what input should be expected, starting with the marked token, if the remainder of the rule is to be matched. Each parser state is defined by a small number of characteristic rules which indicate what matches have already been made and what input can be expected. Note that when a state has more than one characteristic rule, any two characteristic rules with the same number of tokens to the left of the marked token match identically up to the marked token. Even if one has fewer tokens to the left than the other, its tokens match the other exactly up to the marked token.

When marked rules are displayed in AnaGram windows, the marked token is displayed in a distinctive font which the developer can select.

Non-associative
A binary operator, say #, is said to be non-associative if the sequence x # y # z is not permissible. If this sequence is to be interpreted as (x#y)#z, the operator is said to be left associative. If the sequence is to be interpreted as x#(y#z), the operator is said to be right associative. (See associativity.)

Nonterminal Token
A "nonterminal token" is a token which results from reducing the sequence of tokens which match a grammar rule and replacing them with the appropriate reduction token. Nonterminal tokens are to be distinguished from terminal tokens or input tokens.

Null Production
A "null production" is one that has no tokens on the right side whatsoever. Null productions essentially are identified by the first following input token. Null productions are extremely convenient syntactic elements when you wish to make some input optional. For example, suppose that you wish to allow an optional semicolon at some point in your grammar. You could write the following pair of productions:
    optional semicolon -> | ';'
Note that a null production can never follow a '|'. The above could also be written on multiple lines thus:
    optional semicolon
      ->
      -> ';'
You can always rewrite your grammar to eliminate null productions if you wish, but you usually pay a price in conciseness and clarity. Sometimes, however, it is necessary to do such a rewrite in order to avoid conflicts, to which null productions are especially prone.

If you have a null production with no reduction procedure specified, your parser will automatically assign the value zero to the reduction token.

Null productions can be generated by virtual productions.

Parameter Assignment
In any rule in an AnaGram grammar, the semantic value of any rule element may be passed to a reduction procedure by means of a parameter assignment. Simply follow the rule element with a colon and a C variable name. The C variable name can then be used in the reduction procedure to refer to the semantic value of the token it is attached to. AnaGram will automatically provide necessary declarations.

Here are some examples of rule elements with parameter assignments:

  '0-9':d
  integer:n
  expression:x
  declaration : declaration_descriptor

Parser
A parser is a program, or more likely a procedure within a program, which scans a sequence of input characters or input tokens and accumulates them in an input buffer or stack as determined by a set of productions which constitute a grammar. When the parser discovers a sequence of tokens as defined by a grammar rule, or right side of a production, it "reduces" the sequence to a single reduction token as defined by the left side of the grammar rule. This nonterminal token now replaces the tokens which matched the grammar rule and the search for matches continues. If an input token is encountered which will not yield a match for any rule, it is considered a syntax error and some kind of error recovery may be required to continue. If a match, or reduction action, yields the grammar token, sometimes called the goal token or start token, the parser deems its work complete and returns to whatever procedure may have called it.

Tokens may have semantic values. If the input values configuration switch is on, your parser will expect semantic values to be provided by the input process along with the token identification code. If the input values switch is off, your parser will take the ascii value of the input character, that is, the actual input code, as the value of the character. When the parser reduces a production, it can call a reduction procedure or semantic action to analyze the values of the constituent tokens. This reduction procedure can then return a value which characterizes the reduced token.

Parser Control Block
A "Parser Control Block" is a structure which contains all of the data necessary to describe the instantaneous state of an AnaGram parser.

A programmer may may add declarations to the parser control block in an AnaGram grammar by using the extend pcb statement.

Parser Generator
A parser generator, such as AnaGram, is a program that converts a grammar, a rule-based description of the input to a program, into a conventional, procedural module called a parser. The parsers AnaGram generates are simple C modules which can be compiled on almost any platform. AnaGram parsers are also compatible with C++.

Parser State Stack
The parser state stack is a stack maintained by an AnaGram parser and which is an integral part of the parsing process. At any point in the parse of your input stream, the parser state stack provides a summary of what has been found so far. In an AnaGram parser, the parser state stack is stored in the Parser Control Block.

Parser Value Stack
In parallel with the parser state stack, an AnaGram parser maintains a "value stack", each entry of which corresponds to the semantic value of the token identified at that state. Since the semantic values of different tokens might well have different data types, AnaGram provides the opportunity to define the data type for any token. AnaGram then builds a typedef statement creating a data type which is a union of the all the types you have defined. AnaGram uses this data type to define the value stack.

Parsing Engine
A parser consists of three basic components: A set of syntax tables, a set of reduction procedures and a parsing engine. The parsing engine is the body of code that interprets the parsing table, invokes input functions, and calls the reduction procedures. The Build Parser command configures a parsing engine according to the implicit requirements of the syntax specification and according to the explicit values of the configuration parameters.

The parsing engine itself is a simple automaton, characterized by a set of states and a set of inputs. The inputs are the tokens of your grammar. Each state is represented by a list of tokens which are admissible in that state and for each token an action to perform and a parameter which further defines the action. In a traditional LALR parser, there are only four actions: the shift action, the reduce action, the accept action and the error action. AnaGram, in doing its grammar analysis, identifies a number of special cases, and creates a number of extra actions which make for faster processing, but which can be represented as combinations of these primitive actions.

When a shift action is performed, the current state number is pushed onto the parser state stack and the new state number is determined by the current state number and the current lookahead token. Different tokens cause different new states.

When a reduce action is performed, the length of the rule being reduced is subtracted from the parser stack index and the new state number is read from the top of the parser state stack. The reduction token for the rule being reduced is then used as an input token.

Each state in the grammar, with the exception of state zero, has a characteristic token which must have been recognized in order to jump to that state. Therefore, the parser state stack, which is essentially a list of state numbers, can also be thought of as a list of token numbers. This is the list of tokens that have been seen so far in the parse of your input stream. Some of these tokens, of course, may be nonterminal tokens which have resulted from reducing other sequences of tokens.

Partition
If you use character sets in your grammar, AnaGram will compute a "partition" of the character universe. This partition is a collection of non-overlapping character sets such that every one of the sets you have defined can be written as a union of partition sets. Each partition set is identified by a unique reference number called the partition set number.

Each partition set is assigned a unique token. If one of your character sets requires more than one partition set to represent it, AnaGram will create appropriate productions and add them to your grammar so your parser can make the necessary distinctions.

Precedence
"Precedence" is an attribute of binary operators and unary operators which can be used to resolve conflicts. Operator precedence is also referred to as binding. Suppose that # and @ are two binary operators. The question is how to interpret x # y @ z. If # has higher precedence than @, it is interpreted as (x#y)@z. On the other hand if # has lower precedence than @, it would be x#(y@z). If # and @ have the same precedence then the question must be resolved by associativity.

Note that all operators at the same precedence level must have the same associativity.

The situation is somewhat simpler for unary operators. If # and @ were both prefix operators, or both suffix operators, there would be no ambiguity in interpretation, since neither # @ x and x # @ offer more than one possible interpretation. The only difficulty arises when both a prefix and a suffix operator are applied to the same operand.

Suppose that # is a prefix operator and @ is a suffix operator. The question then is how to interpret # x @. If # has higher precedence than @, it would be interpreted as (# x) @. On the other hand, if # has lower precedence than @, it would be interpreted as # (x @). If # and @ have the same precedence then the question must be resolved by associativity.

Production
Productions are the mechanism used to describe how complex input structures are built up out of simpler ones. Each production has a left side and a right side. The right side, or grammar rule, is a sequence of rule elements, which may represent either terminal tokens or nonterminal tokens. The left side is a list of reduction tokens. In most cases there would be only a single reduction token. Productions with more than one token on the left side are called semantically determined productions. The "->" symbol is used to separate the left side from the right side.

Reduce Action
The reduce action is one of the four actions of a traditional parsing engine. The reduce action is performed when the parser has succeeded in matching all the elements of a grammar rule and the next input token is not erroneous. Reducing the grammar rule amounts to subtracting the length of the rule from the parser stack index, identifying the reduction token, stacking its semantic value and then doing a shift action with the reduction token as though it had been input directly.

Reduce-Reduce Conflict
A grammar has a "reduce-reduce" conflict at some state if a single token turns out to be a reducing token for more than one completed rule.

Reducing Token
In a state with more than one completed rule, your parser must be able to determine which one was actually found. AnaGram deals with this problem by looking at all the states the parser will branch to once each rule is reduced. The acceptable input tokens for those states are the "reducing tokens" for the completed rules in the state under investigation. If there is a single token which is a reducing token for more than one rule, then the grammar is said to have a reduce-reduce conflict at that state. If in a particular state there is both a shift action and a reduce action for the same token the grammar is said to have a shift-reduce conflict in that state.

A reducing token is not the same as a reduction token.

Reduction Procedure
A "reduction procedure" is a function you write which your parser executes when it has identified the grammar rule to which the reduction procedure is attached in your grammar. There are two formats for reduction procedures, depending on the size and complexity of the procedure. A reduction procedure is often referred to as a "semantic action".

Reduction Token
A token which appears on the left side of a production is called a reduction token. It is so called because when the grammar rule on the right side of the production is matched in the input stream, your parser will "reduce" the sequence of tokens which matches the rule by replacing the sequence of tokens with the reduction token. Note that, if more than one reduction token is specified, your reduction procedure should choose the exact one. If it does not, your parser will use the leftmost syntactically correct one in the list as the default.

A reduction token is not the same as a reducing token.

Resynchronization
"Resynchronization" is the process of getting your parser back in step with its input after encountering a syntax error. As such, it is one method of error recovery. Of course, you would resynchronize only if it is necessary to continue after the error. There are several options available when using AnaGram. You could use the auto resynch option, which causes AnaGram to incorporate an automatic resynchronizing procedure into your parser, or you could use the error resynch option, which is similar to the technique used by YACC programmers.
Rule Element
A grammar rule is a list of "rule elements", separated by commas. Rule elements may be token names, character sets, keywords, immediate actions, or virtual productions. When AnaGram encounters a rule element for which no token presently exists, it creates one.

Semantic Action
A semantic action is a piece of C or C++ code that is executed when a grammar rule has been identified by a parser. Semantic actions are also called reduction procedures, since they are executed when a grammar rule is reduced to the token on the left side of the production.

Semantic Value
Tokens, whether terminal or nonterminal, may have a semantic value. In the case of terminal tokens, this may be a value assigned by the lexical scanner or, if the parser is using direct character input, it will be the ascii value of the character itself. The values of nonterminal tokens are created by reduction procedures. As a parse progresses, token values are shifted onto the stack, so that in a reduction procedure, the values of the tokens that comprise the grammar rule that is being reduced are available for inspection.

Semantically Determined Production
A "semantically determined production" is one which has more than one reduction token specified on the left side of the production. You would write such a production when the reduction tokens are syntactically indistinguishable. The reduction procedure may then specify which of the listed reduction tokens the grammar rule is to reduce to based on semantic considerations. If there is no reduction procedure, or the reduction procedure does not specify a reduction token, the parser will use the leftmost syntactically correct token.

Set Expression
A set expression is an algebraic expression used to define a character set in terms of individual characters, ranges of characters, and/or other sets of characters as constructed using complements, unions, intersections, and differences.

Shift Action
The shift action is one of the four actions of a traditional parsing engine. The shift action is performed when the input token matches one of the acceptable input tokens for the current parser state. The semantic value of the token and the current state number are stacked, the parser stack index is incremented and the state number is set to a value determined by the previous state and the input token.

Shift-Reduce Conflict
A "shift-reduce" conflict occurs if in some state there is a single token that should be shifted, because it is legitimate input for one of the rules of the state, but should also be used to reduce some other rule because it is a reducing token for that rule.

Syntax Directed Parsing

Syntax directed parsing, or formal parsing, is an approach to building parsers based on formal language theory. Given a suitable description of a language, called a grammar, there are algorithms which can be used to create parsers for the language automatically. In this context, the set of all possible inputs to a program may be considered to constitute a language, and the rules for formulating the input to the program constitute the grammar for the language.

The parsers built from a grammar have the advantage that they can recognize any input that conforms to the rules, and can reject as erroneous any input that fails to conform.

Since the program logic necessary to parse input is often extremely intricate, programs which use formal parsing are usually much more reliable than those built by hand. They are also much easier to maintain, since it is much easier to modify a grammar specification than it is to modify complex program logic.

Terminal Token
A "terminal token" is a token which does not appear on the left side of a production. It represents, therefore, a basic unit of input to your parser. If the input to your parser consists of ascii characters, you may define terminal tokens explicitly as ascii characters or as sets of ascii characters. If you have an input procedure which produces numeric codes, you may define the terminal tokens directly in terms of these numeric codes.

Token
Tokens are the units with which your parser works. There are two kinds of tokens: terminal tokens and nonterminal tokens. These latter are identified by the parser as sequences of tokens. The grouping of tokens into more complex tokens is governed by the grammar rules, or productions in your grammar. In your grammar, tokens may be denoted by token names, by virtual productions, by immediate actions, by explicit character representations, or by expressions which yield character sets.

A "token" should be thought of more or less as being something like a cloakroom token. Imagine you had taken a kindergarten class to the museum and all the kids checked their coats. Each one gets a token. What is the probability of losing one? You take all of the tokens from the children, put them in a paper bag, tie it up, and check it. You get one token in return. We would say that the kids' coats represent the actual input to the system, their individual tokens are the terminal tokens, and your one token which enables you to redeem the paper bag is a nonterminal token. Nonterminal tokens are just single token numbers to represent the result of compacting a sequence of more primitive tokens.

Actually, the word "token" is commonly used to refer to a number of apparently different things. The tokens you use in a grammar are abstract. The tokens identified by a parser are concrete instances of the abstract tokens in the grammar. Furthermore, we have to identify tokens in some manner, so we have token names and token numbers with which to refer to particular tokens. As a result the word "token", in any particular context, may refer directly to either an abstract token, or a concrete instance of an abstract token, or may refer only indirectly to the token by means of a token name or token number.

As an example, consider a C compiler which has a lexical scanner to identify input tokens. According to Kernighan and Ritchie, there are six classes of C tokens: identifiers, keywords, constants, string literals, operators, and other separators. Consider a particular token, a hexadecimal constant, 0XF00. When the lexical scanner hands this over to the parser, it has to provide an identification code which says what kind of token it is, a hexadecimal constant, as well as the "semantic value" of the token, 3840. The identification code is usually an integer value, determined by the designer of the lexical scanner, which by agreement with the designer of the parser uniquely represents a hexadecimal constant. This identification code can be thought of as an external token number.

The grammar for the parser, on the other hand, has a token, "HEXconstant", let us say, to represent the abstract usage of hexadecimal constants in the C language. When AnaGram, or any other parser generator for that matter, analyzes the grammar, it assigns its own reference number to identify "HEXconstant". This reference number can be thought of as an internal token number to contrast it with the external token number provided by the lexical scanner. The parsers AnaGram builds contain a table, ag_ tcv, which provides a translation from the external to the internal token number. Both of these token numbers, of course, differ from the semantic value of the token, in this case, 3840.

Even when an AnaGram parser is using simple character input, the word "token" may represent several different things. Suppose a grammar has a token 'A-Z', which represents the set of upper case letters. There will be an internal token number, assigned by AnaGram, which identifies the (abstract) token in parser tables. The actual (concrete) input token to a parser will, however, be a single letter, say "Q". In this case, the ascii value of the character "Q", 81, is used both as the external token number and as the semantic value of the token.

Unary Operator
A "unary operator" is a token which represents some sort of operation which operates on a single entity to produce a new resultant entity. It is normally represented by a symbol which is written either preceding the operand or following the operand. In the first case it is called a "prefix operator" and in the second case a "suffix operator".

Union
The union of two sets is the set of all elements that are to be found in one or another of the two sets. In an AnaGram syntax file the union of two character sets A and B is represented using the plus sign, as in A + B. The union operator has the same precedence as the difference operator: lower than that of intersection and complement. The union operator is left associative.

Virtual Production

Virtual productions are a special short hand representation of grammar rules which can be used to indicate a choice of inputs. They are an important convenience, especially useful when you are first building a grammar.

AnaGram rewrites virtual productions, so that when you look at the syntax tables in AnaGram, there will be actual productions replacing the virtual productions.

A virtual production appears as one of the rule elements in a grammar rule, i.e. as one of the members of the list on the right side of a production.

The simplest virtual production is the "optional" token. If x is an arbitrary token, x? can be used to indicate an optional x.

Related virtual productions are x... and x?... where the three dots indicate repetition. x... represents an arbitrary number of occurrences of x, but at least one. x?... represents zero or more occurrences of x.

The remaining virtual productions use curly or square brackets to enclose a sequence of rules. The brackets may be followed variously by nothing, a string of three dots, or a slash, to indicate the choices to be made from the rules. Note that rules may be used, not merely tokens.

If r1 through rn are a set of grammar rules, then

   {r1 | r2 | ... | rn}  
is a virtual production that allows a choice of exactly one of the rules. Similarly,
   {r1 | r2 | ... | rn}...  
is a virtual production that allows a choice of one or more of the rules. And, finally,
   {r1 | r2 | ... | rn}/...  
is a virtual production that allows a choice of one or more of the rules subject to the side condition that rules must alternate, that is, that no rule can follow itself immediately without the interposition of some other rule. This is a case that is not particularly easy to write by hand, but is quite useful in a number of contexts.

If the above virtual productions are written with [ ] instead of { }, they all become optional. [ ] is an optional choice, [ ]... is zero or more choices, and [ ]/... is zero or more alternating choices.

Null productions are not permitted in virtual productions in those cases where they would cause an intrinsic ambiguity.

You may use a definition statement to assign a name to a virtual production.



AnaGram parser generator
Copyright ©1993-2002, Parsifal Software.
All Rights Reserved.


Parsifal Software
Links to: Home page | Trial Copy | Syntax Directed Parsing | Glossary