IPC/IPCSEM - Interactive Parser Constructor / with Semantic functions

This manual describes IPC/IPCSEM, a family of tools for generating parsers that perform tokenization and pattern-matching on text. The manual includes the following sections:

1.0 Introduction to IPC
1.1 Running IPC
1.2 Grammar File (syntax specification rules)
1.3 Message File
1.4 Parse Table (PT.h)
1.5 Conflict Resolution

2.0 Introduction to Parser
2.1 Running the Parser
2.2 Parser Message File

3.0 Introduction to Semantic Functions
3.1 Examples


1.0 Introduction to IPC

Interactive Parser Constructor is a tool for creating parsers: programs that recognize lexical patterns in text. IPC reads the given input file for a description of a parser to generate. The description is a context-free grammar (CFG) in the form of BNF-style production rules. IPC generates as output a C include file, PT.h, which defines data structures containing grammar-specific information suitable for parsing text which can be generated from that CFG. This include file is compiled with parser.c to produce an executable parser.

IPC offers several features that enhance the computational power of a parser. The interactive capability of IPC allows the user to view and resolve grammar conflicts before creating a parser, which minimizes error-edit-run cycles. The capacity to add semantic functions to syntax rules extends the parser and allows it to implement a variety of operations, from creating simple parse trees to generating complete target code.


1.1 Running IPC

The IPC program can be operated either interactively or via command-line arguments. By default, the interactive mode will prompt for the following information:

Once this information has been provided, IPC begins to read the grammar file. If the grammar file does not contain any syntax errors or rule conflicts, the message file and the parse table file are generated. If syntax errors are encountered, you must correct the grammar file and re-run IPC. Rule conflicts, however, can be resolved interactively through question-and-answer prompts.


1.2 Grammar File

The format for IPC grammar files is a version of BNF syntax in the following form:
LHS0 = t0 LHS1 t1 ... tn-1 LHSn tn ;

where t0...tn are terminal tokens and LHS1...LHSn are non-terminals. The following input specifies a simple expression grammar which can be used to parse text like "a + b * 10 + 20" and "c * 6 + 8 * 2".

Z = E ;
E = E "+" T ;
E = T ;
T = T "*" F ;
T = F ;
F = "#id" ;
F = "#int" ;

By default, IPC recognizes four built-in terminals: "#id", which matches identifiers (variables), "#int", "#real" and "#string". Here is another simple example that extends the first one by allowing assignment statements and parentheses.

Z = A ;
A = "#id" "=" E ;
E = E "+" T ;
E = T ;
T = T "*" F ;
T = F ;
F = "#id" ;
F = "#int" ;
F = "(" E ")" ;

Since IPC generates bottom-up parsers, rules with LHS non-terminals "lower down" in the grammar are reduced first, followed by rules with LHS non-terminals "higher up." In the context of the previous grammar, the expression "a + b * 10 + 20" would be parsed correctly, with "b * 10" having higher precedence. Likewise, F = "(" E ")" sits at the lowest level, so a parenthesized expression is reduced before the "*" and "+" rules that contain it (as it should be). A somewhat more complicated example:

    DeclList     = DeclList ";" Decl ;                          SemFunc1
    DeclList     = Decl ;                                       SemFunc2
    Decl         = IdList ":" Type ;                            SemFunc3
    IdList       = IdList "," "#id" ;                           SemFunc4
    IdList       = "#id" ;                                      SemFunc5
    Type         = ScalarType ;                                 SemFunc6
    Type         = "array" "(" ScalarTypeList ")" "of" Type ;   SemFunc7
    ScalarType   = "#id" ;                                      SemFunc8
    ScalarType   = Bound ".." Bound ;                           SemFunc9
    Bound        = Sign IntLiteral ;                            SemFunc10
    Bound        = "#id" ;                                      SemFunc11
    ...
    

This is the beginning of a grammar for the declaration statements of a Pascal-like language. The last item on each line, SemFunc1...SemFunc11, is a semantic function that will be called when the rule to its left is used in a reduction.


1.3 Message File

The IPC message file contains three major sections. The first section shows the FIRST and FOLLOW sets for each non-terminal, followed by the item SETs with their transitions. The second section, PARSE TABLE, is the readable text version of the data structures in PT.h; each entry has the format: symbol, action, and a set or rule number. The last section contains IPC’s internal data structure statistics, showing the number of productions, alternative rule sets, rows and columns in the parse table, etc.

    GRAMMAR FILE "grammar":
      No Format Errors
 
    FIRST ( Z ):  { #id #int }
    FOLLOW( Z ):  { $ }
    FIRST ( E ):  { #int #id }
    FOLLOW( E ):  { + $ }
    FIRST ( T ):  { #id #int }
    FOLLOW( T ):  { * $ + }
    FIRST ( F ):  { #int #id }
    FOLLOW( F ):  { + $ * }
     
    SET   0:
      ITEMS
          <Z> -> * <E>, { $ }
          <E> -> * <E> "+" <T>, { + $ }
          <E> -> * <T>, { + $ }
          <T> -> * <T> "*" <F>, { * + $ }
          <T> -> * <F>, { * + $ }
          <F> -> * "#id", { * + $ }
          <F> -> * "#int", { * + $ }
      TRANSITIONS
         on "#int" goto Set 5
         on "#id" goto Set 4
         on <F> goto Set 3
         on <T> goto Set 2
         on <E> goto Set 1
     
    SET   1:
      ITEMS
          <Z> -> <E> *, { $ }
          <E> -> <E> * "+" <T>, { $ + }
      TRANSITIONS
         on "+" goto Set 6
     
    SET   2:
      ITEMS
          <E> -> <T> *, { $ + }
          <T> -> <T> * "*" <F>, { $ + * }
      TRANSITIONS
         on "*" goto Set 7
     ...
    SET   9:
      ITEMS
          <T> -> <T> "*" <F> *, { $ + * }
     
                                PARSE TABLE
    State   0:
         "#id", S, 4
         "#int", S, 5
         <E>, G, 1
         <F>, G, 3
         <T>, G, 2
     
    State   1:
         "$", A
         "+", S, 6
     
    State   2:
         "$", R, 2
         "*", S, 7
         "+", R, 2
     ...
    State   9:
         "$", R, 3
         "*", R, 3
         "+", R, 3
     
    LR(1) Data Structure Statistics for the Grammar "grammar":
    Statistic                                  Corresponding Constant
    -----------------------------------------------------------------
    Grammar Data Structures:
      Number of productions                4   MAX_PROD
      Number of alternates                 7   MAX_ALT (MAX_RULE)
      Number of elements                  11   MAX_ELEM
      Length of the grammar name space    24   MAX_SPACE
    Item Sets Data Structures:
      Number of item sets                 10   MAX_SET (MAX_ROW)
      Total number of kernel items        13   MAX_ITEM
      Number of items in largest set       7   MAX_TEMP
    Parse Table Data Structures:
      Number of parse table rows          10   MAX_ROW
      Number of parse table columns       11   MAX_COLUMN
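
The FIRST sets at the top of this message file can be reproduced by hand. The sketch below is one textbook way to compute them for the example grammar, not necessarily how IPC does it internally. The symbol numbers follow the sorted vocabulary shown in PT.h; the names compute_first, RULE_LHS and RULE_FIRST are made up for this illustration. Because the grammar has no empty rules, only the first RHS symbol of each rule matters.

```c
/* Symbol numbers follow the sorted vocabulary (G_LIST) in PT.h. */
enum { S_ID = 1, S_INT = 2, S_EOF = 3, S_MUL = 4, S_ADD = 5,
       S_E = 6, S_F = 7, S_T = 8, S_Z = 9, N_SYM = 10 };

/* For each of the 7 rules: its LHS and the first symbol of its RHS.
   (Hypothetical tables written out by hand from the grammar file.) */
static const int RULE_LHS[7]   = { S_Z, S_E, S_E, S_T, S_T, S_F,  S_F   };
static const int RULE_FIRST[7] = { S_E, S_E, S_T, S_T, S_F, S_ID, S_INT };

/* In this numbering every symbol below S_E is a terminal. */
static int is_terminal(int sym) { return sym < S_E; }

/* Fills first[sym] with a bitmask of terminals (bit n set = symbol n). */
void compute_first(unsigned first[N_SYM])
{
    int changed = 1;

    /* FIRST of a terminal is the terminal itself. */
    for (int s = 0; s < N_SYM; s++)
        first[s] = is_terminal(s) ? 1u << s : 0u;

    /* Repeat until no set grows any further (fixed point). */
    while (changed) {
        changed = 0;
        for (int r = 0; r < 7; r++) {
            unsigned merged = first[RULE_LHS[r]] | first[RULE_FIRST[r]];
            if (merged != first[RULE_LHS[r]]) {
                first[RULE_LHS[r]] = merged;
                changed = 1;
            }
        }
    }
}
```

After the call, the entries for Z, E, T and F each hold exactly the bits for "#id" and "#int", which agrees with the FIRST lines in the listing.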
    


1.4 Parse Table

This C "include" file, called "PT.h" by default, contains all the necessary data for generation of a parser capable of analyzing text of the given grammar. Here is the content of PT.h for the first example grammar:

    typedef struct pt_entry { int Action, Data; } pt_node;

    static pt_node PT[10][10] = {
     {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{71, 1},{71, 3},{71, 2},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{65, 0},{69,-1},{83, 6},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 2},{83, 7},{82, 2},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 4},{82, 4},{82, 4},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 5},{82, 5},{82, 5},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 6},{82, 6},{82, 6},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{69,-1},{71, 3},{71, 8},{69,-1}},
     {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{69,-1},{71, 9},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 1},{83, 7},{82, 1},{69,-1},{69,-1},{69,-1},{69,-1}},
     {{69,-1},{69,-1},{69,-1},{82, 3},{82, 3},{82, 3},{69,-1},{69,-1},{69,-1},{69,-1}}};

    typedef struct Rule_Form { int Name, Length; } R_NODE;

    static R_NODE RULES[7] = {
     { 9, 1},{ 6, 3},{ 6, 1},{ 8, 3},{ 8, 1},{ 7, 1},{ 7, 1}};

    static int G_COUNT = 10;
    static char *G_LIST[] = { "", "#id", "#int", "$", "*", "+", "E", "F", "T", "Z" };
    static int END_FILE = 3;

In addition to the PT array, the RULES array contains all the grammar rules in the order in which they were specified in the grammar file. R_NODE.Name is an index into the G_LIST array, which holds the entire terminal/non-terminal vocabulary, sorted alphabetically, while R_NODE.Length is the number of RHS elements for that rule.
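
As a rough illustration of how these tables fit together, the sketch below looks up the LHS name of a rule through G_LIST and turns a PT action code back into a letter. RULES and G_LIST are copied from the listing above; rule_lhs, action_letter and print_rules are hypothetical helper names, not part of PT.h. The action codes are plain ASCII values, so 83, 71, 82 and 65 are the S, G, R and A of the PARSE TABLE section; reading 69 ('E') as an error entry is an assumption based on its Data value of -1.

```c
#include <stdio.h>

typedef struct Rule_Form { int Name, Length; } R_NODE;

/* Copied from the PT.h listing for the first example grammar. */
static R_NODE RULES[7] = { {9,1},{6,3},{6,1},{8,3},{8,1},{7,1},{7,1} };
static char *G_LIST[] = { "", "#id", "#int", "$", "*", "+", "E", "F", "T", "Z" };

/* Name of the LHS non-terminal of rule r (rules are numbered from 0). */
const char *rule_lhs(int r)
{
    return G_LIST[RULES[r].Name];
}

/* PT action codes are ASCII letters: 83 'S', 71 'G', 82 'R', 65 'A', 69 'E'. */
char action_letter(int code)
{
    return (char)code;
}

/* Prints one line per rule: its number, LHS name and RHS length. */
void print_rules(void)
{
    for (int r = 0; r < 7; r++)
        printf("rule %d: %s, %d RHS element(s)\n",
               r, rule_lhs(r), RULES[r].Length);
}
```

For example, rule 3 resolves to "T" with three RHS elements, which is the rule T = T "*" F.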


1.5 Conflict Resolution

When the IPC program detects grammar ambiguity -- a grammar is ambiguous if more than one derivation exists for some input string -- it will interactively prompt for clarification to resolve the conflict(s). Here is a blatant example of an ambiguous grammar:

Z = A ;
A = "#id" "=" E ;
E = E "+" E ;
E = E "*" E ;
E = "(" E ")" ;
E = "#id" ;
E = "#int" ;

Using the following dialog, IPC resolves the ambiguity and generates a correct parser. Comments are listed to the right to clarify the user's actions. Also, note the * (called the dot) in each item; it is a good indicator of which item you must choose.

    Parse Table Conflicts, Set 14:

    1 = Resolve Conflicts
    2 = Print Conflicts
    3 = Print Derivation Trace
    Enter function number (1/2/3):  1

    Set 14, S/R Conflict, Input_Symbol "+" -- Items that conflict:

    1.  <E> -> <E> "+" <E> *, { $ + * }
    2.  <E> -> <E> * "+" <E>, { $ + * }
    Solve by which item?  Enter item number:  1     // Reduce. + has the same precedence

    Set 14, S/R Conflict, Input_Symbol "*" -- Items that conflict:

    1.  <E> -> <E> "+" <E> *, { $ + * }
    2.  <E> -> <E> * "*" <E>, { $ + * }
    Solve by which item?  Enter item number:  2     // Shift. * has greater precedence

    Parse Table Conflicts, Set 15:

    1 = Resolve Conflicts
    2 = Print Conflicts
    3 = Print Derivation Trace
    Enter function number (1/2/3):  1

    Set 15, S/R Conflict, Input_Symbol "+" -- Items that conflict:

    1.  <E> -> <E> "*" <E> *, { $ + * }
    2.  <E> -> <E> * "+" <E>, { $ + * }
    Solve by which item?  Enter item number:  1     // Reduce. + has less precedence

    Set 15, S/R Conflict, Input_Symbol "*" -- Items that conflict:

    1.  <E> -> <E> "*" <E> *, { $ + * }
    2.  <E> -> <E> * "*" <E>, { $ + * }
    Solve by which item?  Enter item number:  1     // Reduce. * has the same precedence

    Parse Table Conflicts, Set 20:

    1 = Resolve Conflicts
    2 = Print Conflicts
    3 = Print Derivation Trace
    Enter function number (1/2/3):  1

    Set 20, S/R Conflict, Input_Symbol "+" -- Items that conflict:

    1.  <E> -> <E> "+" <E> *, { ) + * }
    2.  <E> -> <E> * "+" <E>, { ) + * }
    Solve by which item?  Enter item number:  1     // Reduce. + has the same precedence

    Set 20, S/R Conflict, Input_Symbol "*" -- Items that conflict:

    1.  <E> -> <E> "+" <E> *, { ) + * }
    2.  <E> -> <E> * "*" <E>, { ) + * }
    Solve by which item?  Enter item number:  2     // Shift. * has greater precedence

    Parse Table Conflicts, Set 21:

    1 = Resolve Conflicts
    2 = Print Conflicts
    3 = Print Derivation Trace
    Enter function number (1/2/3):  1

    Set 21, S/R Conflict, Input_Symbol "+" -- Items that conflict:

    1.  <E> -> <E> "*" <E> *, { ) + * }
    2.  <E> -> <E> * "+" <E>, { ) + * }
    Solve by which item?  Enter item number:  1     // Reduce. + has less precedence

    Set 21, S/R Conflict, Input_Symbol "*" -- Items that conflict:

    1.  <E> -> <E> "*" <E> *, { ) + * }
    2.  <E> -> <E> * "*" <E>, { ) + * }
    Solve by which item?  Enter item number:  1     // Reduce. * has the same precedence

    Program over, messages in ipc_message output in PT.h.
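
The eight answers given in this dialog all follow one pattern: shift only when the incoming operator has higher precedence than the operator in the completed rule, otherwise reduce. The sketch below merely restates that pattern in C to make it easier to see. It is not part of IPC, and the names prec and resolve are invented for this illustration.

```c
/* Precedence levels implied by the dialog: "*" binds tighter than "+". */
static int prec(char op)
{
    return op == '*' ? 2 : 1;
}

/* rule_op   : operator in the completed item (item 1 in each prompt)
   lookahead : the Input_Symbol named in the prompt
   Returns 'S' to shift (answer 2) or 'R' to reduce (answer 1).
   Equal precedence reduces, which makes both operators left-associative. */
char resolve(char rule_op, char lookahead)
{
    return prec(lookahead) > prec(rule_op) ? 'S' : 'R';
}
```

With these answers, "a + b * c" groups as a + (b * c) and "a + b + c" groups as (a + b) + c.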
    


2.0 Introduction to Parser

The actual parsing of a language (an expression, strings, program source code, etc.) is the responsibility of a program called the parser. To build a parser, you must compile and link parser.c. This may be accomplished using the following UNIX command.

$ cc parser.c -o parser

By default, parser.c uses the parse table file PT.h that was generated by IPC. If for some reason you specified a different filename for the parse table file, you have to modify parser.c and change the #include "PT.h" to reflect your parse table include file.


2.1 Running the Parser

The parser that you have created can operate either interactively or by using command-line arguments. The interactive mode offers the following prompts.

If the parser is successful in parsing the source text file, it will print "Text syntactically correct" and then exit to the command prompt.


2.2 Parser Message File

The verbose version of the parser’s message file contains all the information on how your source file was parsed. Using the grammar

Z = E ;
E = E "+" T ;
E = T ;
T = T "*" F ;
T = F ;
F = "#id" ;
F = "#int" ;

parsing the expression "a + b * 10 + 20" results in this message file:

    Parsing text "expression":

    STACK CONTENTS                                       .....   NEXT LEXEME
    -----------------------------------------------------------------------
    0                                                    .....   "a"
    0 "a" 4                                              .....   "+"
    0 <F> 3                                              .....   "+" 
    0 <T> 2                                              .....   "+" 
    0 <E> 1                                              .....   "+" 
    0 <E> 1   "+" 6                                      .....   "b" 
    0 <E> 1   "+" 6   "b" 4                              .....   "*" 
    0 <E> 1   "+" 6   <F> 3                              .....   "*" 
    0 <E> 1   "+" 6   <T> 8                              .....   "*"
    0 <E> 1   "+" 6   <T> 8   "*" 7                      .....   "10" 
    0 <E> 1   "+" 6   <T> 8   "*" 7   "10" 5             .....   "+" 
    0 <E> 1   "+" 6   <T> 8   "*" 7   <F> 9              .....   "+" 
    0 <E> 1   "+" 6   <T> 8                              .....   "+" 
    0 <E> 1                                              .....   "+" 
    0 <E> 1   "+" 6                                      .....   "20" 
    0 <E> 1   "+" 6   "20" 5                             .....   "EOF" 
    0 <E> 1   "+" 6   <F> 3                              .....   "EOF" 
    0 <E> 1   "+" 6   <T> 8                              .....   "EOF" 
    0 <E> 1                                              .....   "EOF" 
    0 <E> 1                                              .....   "EOF"
    

Each message line shows the contents of the stack and the next lexeme (token) to be parsed. As each lexeme is processed, the stack contents form a pattern that the parser can recognize and reduce via a rule whose RHS matches the pattern on the stack. Each lexeme or non-terminal is followed by a SET number (or, in the case of "#id", "#int", ..., the rule number) that was used for its derivation. All the SETs are listed in the IPC message file.
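
The trace above can be followed with a small table-driven loop. The sketch below is a simplified stand-in for parser.c, whose source is not shown in this manual, so treat it as an assumption about how the tables are meant to be used: columns of PT are taken to be G_LIST indices, and the action codes are taken to be the ASCII letters S, R, G and A. The function name lr_accepts is hypothetical, and unlike the real parser it keeps only set numbers on its stack, not symbols.

```c
typedef struct pt_entry { int Action, Data; } pt_node;
typedef struct Rule_Form { int Name, Length; } R_NODE;

/* Tables copied from the PT.h listing in section 1.4. */
static pt_node PT[10][10] = {
 {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{71, 1},{71, 3},{71, 2},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{65, 0},{69,-1},{83, 6},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 2},{83, 7},{82, 2},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 4},{82, 4},{82, 4},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 5},{82, 5},{82, 5},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 6},{82, 6},{82, 6},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{69,-1},{71, 3},{71, 8},{69,-1}},
 {{69,-1},{83, 4},{83, 5},{69,-1},{69,-1},{69,-1},{69,-1},{71, 9},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 1},{83, 7},{82, 1},{69,-1},{69,-1},{69,-1},{69,-1}},
 {{69,-1},{69,-1},{69,-1},{82, 3},{82, 3},{82, 3},{69,-1},{69,-1},{69,-1},{69,-1}}};
static R_NODE RULES[7] = { {9,1},{6,3},{6,1},{8,3},{8,1},{7,1},{7,1} };

/* tokens[] holds G_LIST indices (1 "#id", 2 "#int", 4 "*", 5 "+") and
   must end with 3 ("$"). Returns 1 on accept, 0 on a syntax error. */
int lr_accepts(const int *tokens, int count)
{
    int stack[64];              /* stack of set (state) numbers */
    int top = 0, pos = 0;

    stack[0] = 0;
    while (pos < count) {
        pt_node e = PT[stack[top]][tokens[pos]];

        switch (e.Action) {
        case 'S':               /* shift: push the new set, consume the token */
            if (top + 1 >= 64) return 0;
            stack[++top] = e.Data;
            pos++;
            break;
        case 'R': {             /* reduce: pop the RHS, then follow the goto */
            R_NODE rule = RULES[e.Data];
            pt_node g;
            top -= rule.Length;
            g = PT[stack[top]][rule.Name];
            if (g.Action != 'G') return 0;
            stack[++top] = g.Data;
            break;
        }
        case 'A':               /* accept */
            return 1;
        default:                /* 'E' or anything unexpected */
            return 0;
        }
    }
    return 0;                   /* ran out of tokens before accepting */
}
```

Passing the indices for "a + b * 10 + 20" followed by "$", that is { 1, 5, 1, 4, 2, 5, 2, 3 }, walks through the same sequence of sets as the trace and ends in accept.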


3.0 Introduction to Semantic Functions

By default, the parser that you generate can only tell you whether the source text is syntactically correct. This is not sufficient to generate target code. For example, a CFG does not allow type checking of variables. Hence, there is motivation for extending the parser by writing semantic functions.

As was indicated earlier, semantic functions are C functions that the parser calls before it reduces the stack via a production rule. Each rule in your grammar file can be associated with a different semantic function. IPC includes the names of these functions in the parse table that it creates for use by the parser. The functions themselves must be compiled and linked with the parser.
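
One plausible way to picture this association is a table of function pointers indexed by rule number. The sketch below is only that, a picture: the manual does not show how PT.h actually stores the function names, so SEM_FUNC, SEM_TABLE, run_semantic, on_add and on_mul are all hypothetical, and the STACK type here is a one-field placeholder for the real one defined by the parser.

```c
#include <stddef.h>

/* Placeholder for the parser's stack type (the real one is in parser.c). */
typedef struct { int top; } STACK;

/* Every semantic function takes the parser's stack and returns an int. */
typedef int (*SEM_FUNC)(STACK *);

static int calls[2];            /* call counters, for demonstration only */

static int on_add(STACK *s) { (void)s; calls[0]++; return 0; }
static int on_mul(STACK *s) { (void)s; calls[1]++; return 0; }

/* One slot per grammar rule, in grammar-file order.
   NULL means the rule has no semantic function attached. */
static SEM_FUNC SEM_TABLE[7] = { NULL, on_add, NULL, on_mul, NULL, NULL, NULL };

/* Called by the (hypothetical) driver just before it reduces by rule r. */
int run_semantic(int r, STACK *s)
{
    if (r < 0 || r >= 7 || SEM_TABLE[r] == NULL)
        return 0;
    return SEM_TABLE[r](s);
}
```

With this layout, a reduction by rule 1 (E = E "+" T) would invoke on_add, and rules without an entry would simply be skipped.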

Since the semantic functions become part of the parser, they can access all the information and structures that the parser has at its disposal. This combination can create a powerful tool that is capable of performing any compiler-related task, from simple expression rewriting to full target-language code generation.


3.1 Examples

The first step in creating a semantic function is to augment the grammar file. Here is the simple expression grammar with some semantic functions:

    Z = E ;         function0
    E = E "+" T ;   function1
    E = T ;         function2
    T = T "*" F ;   function3
    T = F ;         function4
    F = "#id" ;     function5
    F = "#int" ;    function6
    

Next, we must implement these functions to perform a task. For our example, we will write these functions to create a post-fix representation of an expression input in in-fix form.

    #include <stdio.h>		// for printf

    int function6(STACK *L1) 	// the parser's stack is passed as the only argument
    {
    	int top;
    
    	top = L1->top;		// array index to the top of the stack
    	//
    	// now that we know where the top is, we need to look at the rule.
    	// rule F = "#int" has 1 item on its RHS. So that is the only symbol
    	// that we have to print.
    	// 
    	printf("%s ", L1->Data[top].Symbol);
    	
    	return (0);		// some compilers will complain if we don't do this
    }
    
    int function5(STACK *L1)
    {
    	//
    	// for our example, this function will perform the same task as
    	// function6, so we can just call that one.
    	//
    	return function6(L1);
    }
    
    int function4(STACK *L1)
    {
    	// we don't need this function to do anything and could have excluded it
    	// from our grammar.
    	//
    	return (0);
    }
    
    int function3(STACK *L1)
    {
    	int top;
    
    	top = L1->top;
    	//
    	// looking at the rule, T = T "*" F, we can see that the operator is in top-1
    	// since the operator is the symbol that we want, we have to decrement the
    	// stack top (our local variable) to get it.
    	//
    	top = top - 1;
    	printf("%s ", L1->Data[top].Symbol);
    	
    	return(0);
    }
    
    int function2(STACK *L1)
    {
    	// don't need this function either.
    	//
    	return(0);
    }
    
    int function1(STACK *L1)
    {
    	//
    	// this function is identical to function3. The rule E = E "+" T has the
    	// same RHS order and length as the rule for function3.
    	//
    	return function3(L1);
    }
    
    int function0(STACK *L1)
    {
    	//
    	// this function will output a single return to end our print string
    	//
    	printf("\n");
    
    	return(0);
    }
    

Running the resulting parser with the input "a + b * 10 + 20" will output "a b 10 * + 20 +".
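
To try the two interesting functions without the full parser, you can replay a few reductions by hand. The sketch below assumes a minimal STACK layout inferred only from the fields used above (top, Data[] and Symbol); the real definition belongs to parser.c and may well differ. ENTRY, emit, emit_operand, emit_operator and demo are made-up names, and the output goes to a buffer instead of printf so that it can be inspected.

```c
#include <string.h>

/* Hypothetical stand-ins: only the fields the semantic functions touch. */
typedef struct { const char *Symbol; } ENTRY;
typedef struct { int top; ENTRY Data[32]; } STACK;

static char out[128];           /* collects what printf would have shown */

/* Appends a symbol and a trailing space, never overflowing the buffer. */
static void emit(const char *sym)
{
    strncat(out, sym, sizeof out - strlen(out) - 1);
    strncat(out, " ", sizeof out - strlen(out) - 1);
}

/* Same logic as function6: the operand is on top of the stack. */
int emit_operand(STACK *L1)  { emit(L1->Data[L1->top].Symbol);     return 0; }

/* Same logic as function3: the operator sits one below the top. */
int emit_operator(STACK *L1) { emit(L1->Data[L1->top - 1].Symbol); return 0; }

/* Replays the three reductions that matter for the input "b * 10". */
const char *demo(void)
{
    STACK s;

    out[0] = '\0';
    s.top = 0;
    s.Data[0].Symbol = "b";
    emit_operand(&s);           /* F = "#id" with "b" on top */

    s.Data[0].Symbol = "T";     /* "b" has been reduced up to T */
    s.Data[1].Symbol = "*";
    s.Data[2].Symbol = "10";
    s.top = 2;
    emit_operand(&s);           /* F = "#int" with "10" on top */

    s.Data[2].Symbol = "F";     /* "10" has been reduced to F */
    emit_operator(&s);          /* T = T "*" F, operator at top - 1 */
    return out;
}
```

Here demo() yields "b 10 * " (with a trailing space), which is the middle portion of the post-fix string shown above.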

The semantic functions that you write need not be the only functions in your C file. You may add as many support functions as you need. For our next and final example, we implement an abstract syntax tree (AST) using the same grammar, by making calls to support routines not shown (e.g., makeTree() ).

    int function6(STACK *L1)
    {
    	int  top;
    	ITEM item;
    
    	top = L1->top;
    	item = createItem(L1->Data[top].Symbol);
    	pushStack(item);
    	
    	return (0);
    }
    
    int function5(STACK *L1)
    {
    	return function6(L1);
    }
    
    int function4(STACK *L1)
    {
    	return (0);
    }
    
    int function3(STACK *L1)
    {
    	int  top;
    	ITEM item1, item2, node;
    	char binOp;
    
    	top   = (L1->top) - 1;
    	binOp = *(L1->Data[top].Symbol);	// this is OK since we only have * and +
    	item2 = popStack();
    	item1 = popStack();
    	node  = makeTree(item1, item2, binOp);
    
    	pushStack(node);
    	
    	return(0);
    }
    
    int function2(STACK *L1)
    {
    	return(0);
    }
    
    int function1(STACK *L1)
    {
    	return function3(L1);
    }
    
    int function0(STACK *L1)
    {
    	ITEM rootNode;
    
    	rootNode = popStack();
    	printTree(rootNode);
    
    	return(0);
    }
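
The support routines used above are not part of this manual. The sketch below shows one possible set of them, purely as an assumption about what they might look like: ITEM is taken to be a pointer to a tree node, the work stack is a fixed-size array that is separate from the parser's own STACK, and printTree writes a parenthesized prefix form. None of these choices is prescribed by IPC.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char op;                    /* '+' or '*' for interior nodes, 0 for leaves */
    char text[32];              /* lexeme, used by leaves only */
    struct node *left, *right;
} *ITEM;

static ITEM work[64];           /* work stack for partially built trees */
static int  depth;

/* Creates a leaf holding a copy of the lexeme (truncated to 31 chars). */
ITEM createItem(const char *symbol)
{
    ITEM n = calloc(1, sizeof *n);
    if (n != NULL)
        strncpy(n->text, symbol, sizeof n->text - 1);
    return n;
}

/* Creates an interior node for a binary operator. */
ITEM makeTree(ITEM left, ITEM right, char op)
{
    ITEM n = calloc(1, sizeof *n);
    if (n != NULL) {
        n->op = op;
        n->left = left;
        n->right = right;
    }
    return n;
}

void pushStack(ITEM item) { if (depth < 64) work[depth++] = item; }
ITEM popStack(void)       { return depth > 0 ? work[--depth] : NULL; }

/* Prints the tree in prefix form, e.g. (+ a (* b 10)). */
void printTree(ITEM n)
{
    if (n == NULL)
        return;
    if (n->op == 0) {
        printf("%s", n->text);
        return;
    }
    printf("(%c ", n->op);
    printTree(n->left);
    printf(" ");
    printTree(n->right);
    printf(")");
}
```

Because function3 pops the right operand first and the left operand second, the operands end up in source order under each operator node.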
    

As should be clear by now, semantic functions can significantly extend the parser. You can use them to create data structures as necessary to implement intermediate forms (ASTs, DAGs, three-address code, etc.), or even to implement a code generator which takes the intermediate form as input and generates target code from it. Hence, semantic functions provide sufficient flexibility to allow the parser to metamorphose into whatever compiler-related tool is desired, even into a complete compiler.