Refactoring the ANSI C grammar

A nice blog article by Eli Bendersky about the distinction between ASTs and CSTs.

After having clarified the concepts Eli tried to make the case for ASTs. He presents the C grammar or more precisely the part of the C grammar that deals with declarations and mentions the obscurity of the grammar that is immediately reflected in the structure of the CST. ASTs serve as some sort of cleanup. This is all true: the C grammar has obscure parts and his ASTs are cleanups but does it justify the extra effort being spent for creating and maintaining ASTs?

A grammar is just another computer program written in (E)BNF and you can strive for clean and accessible BNF programs just like you do in any other programming language. So what about refactoring? Now the ANSI C grammar might not be among the finest parts of the C language but it’s still authoritative and it’s also not easy to show that another grammar describes the same language syntax.

I’ve nevertheless attempted to rewrite the declaration part of the C grammar. There are a few omissions here. For example storage class specifiers and type qualifiers like `const` are missing and there are also no function declarations. It shall be easy though to add them given the syntactical description of the top level `declaration` rule described below. There are also just a few type names for testing purposes. Determining type names is an entirely different issue we do not want to treat in this blog post.

declaration: typename declarator
declarator: simple_declarator | func_ptr_declarator | ptr_declarator | array_declarator
 
simple_declarator: simple_declarator_base | '(' simple_declarator ')'
simple_declarator_base: NAME | '(' NAME ')'
 
ptr_declarator: ptr_declarator_base | '(' ptr_declarator ')'
ptr_declarator_base: ptr_or_ref declarator
 
func_ptr_declarator: func_ptr_declarator_base | '(' func_ptr_declarator ')'
func_ptr_declarator_base:  '(' ( ptr_declarator | array_declarator ) ')' arglist
 
array_declarator: array_declarator_base | '(' array_declarator ')'
array_declarator_base:  ( ptr_declarator | simple_declarator | func_ptr_declarator ) subscript_list      
 
subscript_list: ('[' [NAME | Intnumber] ']')+
ptr_or_ref: '*'+ ['&'] | '&'
arglist: '(' [parameters] ')'
parameters: declaration (',' declaration)*
typename: 'int' | 'float' | 'char' | 'void'

This grammar can be used to accurately parse following expressions:

int x;
int *x;
int (x);
int (*x);
int ((*x));
int ((*x)());
int ((*x)())[];
int ((*x)(int y))[];
int ((*x)(int y, char z[4]))[];
int ((*x)(int y, char z[4]))[][1];
char **foo [][8];
char (**foo [][8]);
char (**foo [][8])();
char ((**foo [][8])());
char *((**foo [][8])());
char (*(**foo [][8])());
char (*(**foo [][8])())[];
char *(*(**foo [][8])())[];

There is still some overhead because we have pairings like `simple_declarator` and `simple_declarator_base`. These distinctions have the only purpose of putting the basic form into parentheses.

A parse tree display for

int (*x)();

looks like this in EasyExtend:

  declaration  -- S`116994 -- 258
    typename  -- S`117008 -- 272
      NAME  -- T`116737 -- 1     L`1
        int
    declarator  -- S`116995 -- 259
      func_ptr_declarator  -- S`117000 -- 264
        func_ptr_declarator_base  -- S`117001 -- 265
          LPAR  -- T`116743 -- 7     L`1
            (
          ptr_declarator  -- S`116998 -- 262
            ptr_declarator_base  -- S`116999 -- 263
              ptr_or_ref  -- S`117005 -- 269
                STAR  -- T`116752 -- 16     L`1
                  *
              declarator  -- S`116995 -- 259
                simple_declarator  -- S`116996 -- 260
                  simple_declarator_base  -- S`116997 -- 261
                    NAME  -- T`116737 -- 1     L`1
                      f
          RPAR  -- T`116744 -- 8     L`1
            )
          arglist  -- S`117006 -- 270
            LPAR  -- T`116743 -- 7     L`1
              (
            RPAR  -- T`116744 -- 8     L`1
              )
  SEMI  -- T`116749 -- 13     L`1
    ;

Still much clutter but now it is easy to identify the semantically relevant entities and search for them. In EasyExtend we use simple functions like `find_node` and `find_all` which keep a CST node and a `node id` parameter and perform a depth-first search.

This entry was posted in C, Grammars. Bookmark the permalink.