AnaGram

Return to Parsifal Software Home Page ----------------------

Four Function Calculator:
An AnaGram Example

----------------------

The following example is a complete program: The output produced by AnaGram from this example can be compiled, linked and run without any support modules other than the standard run-time library provided by any C compiler. In the interest of brevity, the example has been kept very simple.

FFCALC.SYN implements a simple four function calculator which reads its input from stdin. The calculator has 52 registers, labeled 'a' through 'z' and 'A' through 'Z'. FFCALC evaluates arithmetic expressions and assignment statements and prints the results to stdout. The expressions may contain '+', '-', '*', and '/' operators as well as parentheses. In addition, FFCALC supports the free use of white space and C style comments in the input. It also contains complete error handling, including syntax error diagnostics and resynchronization after syntax errors.

For purposes of annotation, line numbers have been inserted at the left margin. The line numbers are not part of the AnaGram syntax. Immediately following the example are some brief explanatory notes keyed to the line numbers.

Line  1:  {/*   FOUR FUNCTION CALCULATOR: FFCALC.SYN           */}

Line  2:  // -- CONFIGURATION SECTION ----------------------------
Line  3:  [
Line  4:    default token type = double
Line  5:    disregard white space
Line  6:    lexeme { real}
Line  7:    traditional engine
Line  8:  ]

Line  9:  // -- FOUR FUNCTION CALCULATOR -------------------------
Line 10:  (void) calculator $
Line 11:   -> [calculation?, '\n']..., eof

Line 12:  (void) calculation
Line 13:   -> expression:x                      =printf("%g\n",x);
Line 14:   -> name:n, '=', expression:x                         ={
Line 15:                    printf("%c = %g\n",n+'A',value[n]=x);}
Line 16:   -> error

Line 17:  expression
Line 18:   -> term
Line 19:   -> expression:x, '+', term:t                     = x+t;
Line 20:   -> expression:x, '-', term:t                     = x-t;

Line 21:  term
Line 22:   -> factor
Line 23:   -> term:t, '*', factor:f                         = t*f;
Line 24:   -> term:t, '/', factor:f                         = t/f;

Line 25:  factor
Line 26:   -> name:n                                   = value[n];
Line 27:   -> real
Line 28:   -> '(', expression:x, ')'                          = x;
Line 29:   -> '-', factor:f                                  = -f;

Line 30:  // -- LEXICAL UNITS ------------------------------------
Line 31:  digit   = '0-9'
Line 32:  eof     = -1

Line 33:  (void) white space
Line 34:   -> ' ' + '\t' + '\r' + '\f' + '\v'
Line 35:   -> "/*", ~eof?..., "*/"              // C style comment

Line 36:  (int) name
Line 37:   -> 'a-z' + 'A-Z':c                             = c-'A';

Line 38:  real
Line 39:   -> integer part:i, '.', fraction part:f          = i+f;
Line 40:   -> integer part, '.'?
Line 41:   -> '.', fraction part:f                            = f;

Line 42:  integer part
Line 43:   -> digit:d                                     = d-'0';
Line 44:   -> integer part:x, digit:d              = 10*x + d-'0';

Line 45:  fraction part
Line 46:   -> digit:d                               = (d-'0')/10.;
Line 47:   -> digit:d, fraction part:f          = (d-'0' + f)/10.;

Line 48:  { /* -- EMBEDDED C ---------------------------------- */
Line 49:    double value[64];                      /* registers */
Line 50:    void main(void) {
Line 51:      ffcalc();
Line 52:    }
Line 53:  } // -- END OF EMBEDDED C ------------------------------

Notes to example

General note: When an AnaGram grammar is written to use direct character input, the terminal tokens are written as character sets. A single character is construed to be the set consisting only of the character itself. Otherwise character sets can be defined by ranges, e.g., 'a-z', or by set expressions using +, -, &, or ~ to represent union, difference, intersection, or complement respectively. If the sets used in the grammar are not pairwise disjoint, and they seldom are, AnaGram calculates a disjoint covering of the character universe, and extends the grammar appropriately. The semantic value of a terminal token is the ascii character code, so that semantic distinctions may still be made even when characters are syntactically equivalent.

Line 1. Braces { } are used to denote embedded C or C++ code that should be passed unchanged to the parser. Embedded C at the very beginning of the syntax file is placed at the beginning of the parser file. All other embedded C is placed following a set of definitions and declarations AnaGram needs for the code it generates. AnaGram saves up all the reduction procedures, or semantic actions, and places them after all the embedded C.

Line 3. Brackets [ ] are used to denote configuration sections. Configuration sections contain settings for configuration parameters and switches, and a number of attribute statements that provide metasyntactic information.

Line 4. This statement sets the default token type for nonterminal tokens to double. The default value for "default token type" is void. You can override the type for a particular token using an explicit cast. See line 10. The default type for terminal tokens is int. AnaGram uses the token type declarations to set up calls and definitions of reduction procedures and also to set up the parser value stack.

Line 5. The disregard statement tells AnaGram to extend the grammar so that the generated parser will skip all instances of white space which are not contained within lexemes. "White space" is a token defined at line 33. There is nothing magic about the name. Any other name could have been used.

Line 6. The lexeme statement identifies a list of nonterminal tokens within which the "disregard" statement is inoperative. real is defined at line 38.

Line 7. "traditional engine" is a configuration switch. Simply asserting it turns it on. You can also write: traditional engine = ON. To turn off a switch use ~: thus ~traditional engine would guarantee the switch is off, whatever its default value. Alternatively set traditional engine = OFF.

AnaGram parsers normally use a parsing engine with more than the standard four parsing actions: shift, reduce, error and accept. The extra actions are compound actions. The result of using these actions is to speed up the parser and to reduce the size of the state table by about fifty per cent. The traditional engine switch turns this optimization off, so the parser will only use the four traditional actions. This is usually only done for clarity when using the File Trace or Grammar Trace options described below.

Line 9. AnaGram supports both C and C++ style comments. Nesting of C comments is controlled by the "nest comments" switch.

Line 10. An explicit cast can be used to override the token type for nonterminal tokens. Types can be just about any C or C++ type, including template types. Basically, the only exceptions are types containing () or [].

The simplest way to specify the goal token for a grammar is to mark it with a dollar sign. You can also simply name it "grammar", or set the "grammar token" parameter in a configuration section.

Line 11. For "->" read "produces". '?' following a token name makes it optional. Tokens in a rule are separated by commas. Multiple rules with the same left side can also be separated with '|'.

The rules for character constants are the same as for C. Brackets [] indicate the rule is optional. Braces { } would be used if the rule were not optional. Brackets and braces can include multiple rules separated by |. The ellipsis ... indicates unlimited repetition. These constructs are referred to as "virtual productions". eof is defined at line 32.

Lines 10 and 11 taken together specify that this grammar describes a possibly empty sequence of lines terminated with an eof character. Each line contains an optional "calculation" followed by a newline character.

Line 13. To assign the value of a token (stored on the parser value stack) to a c variable for use in a semantic action, or reduction procedure, simply follow the token name with a colon and the name of the variable.

Short form reduction procedures are simple C or C++ expressions terminated with a semicolon. They cannot include a newline character. The name of the C variable is local to this particular procedure. Normally the value of the reduction procedure is assigned to the token on the left side of the production. In this case, since calculation is of type "void", the result of the printf call is discarded.

Line 14. When reduction procedures won't fit on a single line or are more complex than a single expression, they can be enclosed in braces { }. Use a return statement to return a value.

Line 16. The error token can be used to resynchronize a parser after encountering a syntax error. It works more or less like the error token in YACC. In this case it matches any portion of a "calculation" up to a syntax error and then everything up to the next newline, as determined by the production on line 11. AnaGram also provides an alternative form of error continuation called "automatic resynchronization" which uses a heuristic approach derived from the grammar. By default, AnaGram parsers provide syntax error diagnostics. The user may provide his own if he wishes.

Line 18. If a grammar rule does not have a reduction procedure, the value of the first token in the rule is assigned to the token on the left side of the production.

Line 19. Since the default type specification given on line 4 was "double", x and t have type double, and the reduction procedure returns their sum, also double.

Line 24. Note that in the interest of simplicity, this reduction procedure omits any provision for divide by zero errors.

Line 31. Definition statements may be used to provide shorthand names. '0-9' is a character range, as discussed above.

Line 32. Input characters can also be defined using decimal, octal or hex notation. They are not limited to any particular range, so that it is possible to define the end of file token as the standard stream I/O end of file value.

Line 33. Note that AnaGram permits embedded blanks in token names.

Line 34. The set consisting of blank, tab, return, form feed or vertical tab.

Line 35. Keywords are strings of characters enclosed in double quotes. Standard C rules apply for literal strings. Keywords stand outside the character space and are recognized in preference to individual characters.

~ indicates the complement of a character set, so that ~eof is any character except end of file. The character universe is the set of characters on the range 0..255 unless there are characters outside this range, in which case it is extended to the smallest contiguous range which includes the outside characters. ?... allows zero or more comment characters. This rule describes a standard C comment (no nesting allowed).

Line 36. The value of name is an int, an index into the value table.

Line 37. The '+' is set union. Therefore c is any alphabetic character.

Line 50. If you don't have any embedded C in your syntax file, AnaGram will create a main program automatically. Since there was already embedded C at line 1, AnaGram won't automatically create a main program, so we need to define one explicitly.

Line 51. The default function name for the parser is taken from the file name, in lower case. There is a configuration parameter available to set it to something else if necessary. Lacking any contrary specification, the parser will read its input from stdin.


----------------------

Parsifal Software
Link to: Home page | Trial Copy | Syntax Directed Parsing | Glossary