AnaGram Parser Generator: Summary of Notation

Introduction
Lexical Conventions
Names
Character Representations
Character Ranges
Character Sets
Keywords
Tokens
Productions
Reduction Procedures
Definitions
Configuration Section
Embedded C

Introduction

The rules for using AnaGram are given in Chapters 8 and 9 of the AnaGram User's Guide. This page contains a brief summary. AnaGram's online Help also has entries that explain the various terms. An annotated example, a simple four-function calculator, is included along with its syntax file in the AnaGram trial copy download.

Lexical Conventions

AnaGram allows the free use of spaces, tabs and comments. Both C style and C++ style comments are allowed. Blank lines are allowed, but only between statements.

AnaGram statements may continue onto following lines as long as they are clearly incomplete. Normally this rule is satisfied by dangling punctuation or open parentheses, brackets, or braces. In no case can a statement continue over a blank line.
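
As a rough sketch of these conventions (the production is borrowed from the example under Productions, and the layout is only one possibility), a statement can be split after a dangling comma, with comments placed freely:

```
/* C style comment */
// C++ style comment

dinner -> appetizer, salad,      // dangling comma: the statement is incomplete
          main course, dessert   // so it continues on this line
```

A blank line between the two lines of the rule would end the statement prematurely.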

Names

Symbol names must begin with a letter or underscore, and may contain letters, digits, or underscores. They may also contain embedded spaces, tabs, and comments. Any sequence of embedded white space, however, is replaced by a single blank character.

The names eof, error, and grammar have special meanings.
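
A small sketch of how embedded spaces behave, using hypothetical token names: under the rule above, both spellings below should name the same token, since each run of white space counts as a single blank.

```
integer part   -> digit
integer   part -> integer part, digit    // same token as "integer part"
```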

Character Representations

You may represent a character using the same rules as for character constants in C. You may also use signed integers in decimal, octal, or hexadecimal format, again following the rules for C. You may specify control characters using ^, e.g., ^C.
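
For illustration, the following definition statements (a name, an equal sign, and the entity being named) show each representation. The names on the left are hypothetical, apart from eof, and the value chosen for eof is an assumption about how the input happens to be terminated:

```
upper a  = 'A'       // C character constant
newline  = '\n'      // C escape sequence
blank    = 32        // decimal integer
tab      = 011       // octal integer, as in C
escape   = 0x1b      // hexadecimal integer, as in C
ctrl c   = ^C        // control character
eof      = -1        // signed integer (assumes end of input is signalled by -1)
```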

Character Ranges

Character ranges may be specified either in the form 'a-z' or with two simple characters separated by "..", e.g., 32..255.
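
A brief sketch of both forms, written as definition statements with hypothetical names:

```
lower case = 'a-z'       // range inside a single pair of quotes
upper case = 'A'..'Z'    // two character constants separated by ".."
printable  = 32..255     // two integers separated by ".."
```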

Character Sets

Use the following operators for more complex character sets:

	+	union
	-	difference
	&	intersection
	~	complement

AnaGram interprets a single character to mean the set that contains only the character itself.
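
The operators might be combined as in the following sketch. All names are hypothetical, eof is assumed to be defined elsewhere, and the use of parentheses for grouping is an assumption that should be checked against the User's Guide:

```
letter       = 'a-z' + 'A-Z'                            // union
alphanumeric = letter + '0-9'                           // union with a named set
consonant    = 'a-z' - ('a' + 'e' + 'i' + 'o' + 'u')    // difference
lower hex    = ('a-f' + 'A-F') & 'a-z'                  // intersection
text char    = ~(eof + '\n')                            // complement
```

In the consonant definition, each quoted vowel stands for the one-element set containing that character.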

Keywords

A character string enclosed in double quotes, such as "while" or "/*", is a keyword. The rules for writing keyword strings are the same as for literal strings in C. AnaGram parsers use special lookahead logic to recognize keywords, so a keyword is not equivalent to the corresponding sequence of single characters.
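
As a sketch of keywords in use (expression and statement are hypothetical tokens, and eof is assumed to be defined elsewhere):

```
while statement -> "while", '(', expression, ')', statement
comment         -> "/*", ~eof?..., "*/"
```

Here "while" is matched as a unit, which is not the same as writing 'w', 'h', 'i', 'l', 'e' as five separate rule elements.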

Tokens

The units of a grammar are called tokens. Terminal tokens may be character sets, keywords, immediate actions, or virtual productions. Nonterminal tokens are defined in terms of other tokens by means of productions.

Productions

A production consists of one or more token names on the left, an arrow, and a grammar rule on the right, with rule elements separated by commas. Here is a simple example:

   dinner -> appetizer, salad, main course, dessert

The order of the elements in a rule is significant, but productions themselves may appear in any order in a syntax file. A production with more than one name on the left is called a semantically determined production.

Additional productions with the same left side may be joined by using | or another arrow. The arrow, if used, must start a new line:

   variable name -> letter | variable name, digit

is equivalent to

   variable name
     -> letter
     -> variable name, digit

If the token on the left side of a production is called grammar or is tagged with a following dollar sign, it is taken to be the grammar token, or goal token for the grammar.
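
Either of the following forms, sketched with hypothetical token names, should mark the goal token. A syntax file would use one or the other, not both:

```
grammar -> statement?..., eof        // the name "grammar" identifies the goal token

// alternatively, tag any other name with a dollar sign:
program $ -> statement?..., eof
```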

The names on the left side of a production may be preceded by a type cast indicating the data type of the semantic value of the named tokens.
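
A minimal sketch of a type cast, assuming hypothetical tokens integer part and fraction part; the code that would compute the value is omitted here:

```
(double) real
  -> integer part, '.', fraction part
```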

A grammar rule is a sequence of rule elements joined by commas. The rule elements may be character sets, keywords, token names, virtual productions, or immediate actions.

A virtual production is a token name or character set expression followed by ? or ?..., or a sequence of one or more rules, joined by |, inside brackets or braces and optionally followed by an ellipsis (...). The ? indicates an optional token. Braces indicate a choice among the listed rules. Brackets indicate an optional choice. The ellipsis represents unlimited repetition.
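
The forms above might be used as follows. This is only a sketch, and every token name in it is hypothetical:

```
integer       -> ['+' | '-'], digit, digit?...        // optional sign, then one or more digits
argument list -> argument, [',', argument]...         // zero or more further arguments
boolean       -> {"true" | "false"}                   // exactly one of the listed rules
block         -> "begin", {statement, ';'}..., "end"  // one or more statements
```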

Reduction Procedures

A reduction procedure is a piece of C or C++ code following a grammar rule that is to be executed when the rule is recognized in the parser's input stream. A reduction procedure may take either of two forms. The short form is a single expression followed by a semicolon. The long form is a block of code enclosed in braces. In either case the procedure is preceded by an equal sign. Short form procedures may not continue onto another line.

Reduction procedures may access the semantic values of tokens in the grammar rule to which they are attached. To each token whose value is needed, append a colon and the variable name by which the reduction procedure refers to that value. In a short form reduction procedure, the value of the expression is assigned to the reduction token, the token on the left side of the production. In a long form procedure, use the return statement to assign a value to the token on the left side of the production.
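
The following sketch shows both forms side by side. The token factor is assumed to be defined elsewhere, and the division-by-zero guard is purely illustrative:

```
(double) expression
  -> term
  -> expression:x, '+', term:t      = x + t;
  -> expression:x, '-', term:t      = x - t;

(double) term
  -> factor
  -> term:t, '*', factor:f          = t * f;
  -> term:t, '/', factor:f          = {
       if (f == 0) return 0;        /* hypothetical guard, not part of AnaGram */
       return t / f;
     }
```

The first three procedures are short form; the last is long form and uses return to supply the value of term.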

An immediate action differs from a reduction procedure in that it may occur in the middle of a grammar rule. To distinguish it from a reduction procedure, it begins with an exclamation point rather than an equal sign.
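
A hedged sketch of an immediate action, written in the long (braced) form. The variable depth is hypothetical and would have to be declared in embedded C, and expression is assumed to be defined elsewhere:

```
parenthesized expression
  -> '(', !{ depth++; }, expression:x, ')'     = { depth--; return x; }
```

The immediate action runs as soon as the opening parenthesis has been seen, while the reduction procedure runs only after the whole rule has been matched.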

Definitions

You may assign names to frequently used character sets, virtual productions, keywords, or immediate actions by using a definition statement consisting of a name, an equal sign and the entity to be named. For example:

   digit   = '0-9'

Configuration Section

A configuration section is a block of special statements enclosed in square brackets. Each statement is either an attribute statement or an assignment of a value to a configuration parameter or switch. All of these are described in AnaGram's online Help.
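
A configuration section might look like the sketch below. The statement names are recalled from AnaGram's four-function calculator example rather than taken from this page, so treat them as assumptions and confirm each one in the online Help. The tokens white space and real are assumed to be defined elsewhere in the grammar:

```
[
  default token type = double    // assigns a value to a configuration parameter
  disregard white space          // attribute statement
  lexeme { real }                // attribute statement
  traditional engine             // turns on a configuration switch
]
```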

Embedded C

You may include C or C++ code to support your reduction procedures at any point in your grammar by enclosing it in braces. The beginning brace must be on a fresh line, and no other statement may follow on the same line as the terminating brace. A block of embedded C at the very beginning of a syntax file is called the C prologue.
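
As a sketch of the layout, the following outline places a C prologue at the top of a syntax file and a second block of embedded C further down. The variable and the helper function are hypothetical:

```
{
  /* C prologue: the first thing in the syntax file */
  #include <stdio.h>
  static int errors = 0;            /* hypothetical counter */
}

// productions and definitions go here

{
  /* embedded C later in the file */
  static void report(double x) {    /* hypothetical helper */
    printf("%g\n", x);
  }
}
```

Each opening brace starts a fresh line, and nothing follows either closing brace on its line.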
