A nice blog article by Eli Bendersky about the distinction between ASTs and CSTs.
After having clarified the concepts Eli tried to make the case for ASTs. He presents the C grammar or more precisely the part of the C grammar that deals with declarations and mentions the obscurity of the grammar that is immediately reflected in the structure of the CST. ASTs serve as some sort of cleanup. This is all true: the C grammar has obscure parts and his ASTs are cleanups but does it justify the extra effort being spent for creating and maintaining ASTs?
A grammar is just another computer program written in (E)BNF and you can strive for clean and accessible BNF programs just like you do in any other programming language. So what about refactoring? Now the ANSI C grammar might not be among the finest parts of the C language but it’s still authoritative and it’s also not easy to show that another grammar describes the same language syntax.
I’ve nevertheless attempted to rewrite the declaration part of the C grammar. There are a few omissions here. For example storage class specifiers and type qualifiers like `const` are missing and there are also no function declarations. It shall be easy though to add them given the syntactical description of the top level `declaration` rule described below. There are also just a few type names for testing purposes. Determining type names is an entirely different issue we do not want to treat in this blog post.
declaration: typename declarator declarator: simple_declarator | func_ptr_declarator | ptr_declarator | array_declarator simple_declarator: simple_declarator_base | '(' simple_declarator ')' simple_declarator_base: NAME | '(' NAME ')' ptr_declarator: ptr_declarator_base | '(' ptr_declarator ')' ptr_declarator_base: ptr_or_ref declarator func_ptr_declarator: func_ptr_declarator_base | '(' func_ptr_declarator ')' func_ptr_declarator_base: '(' ( ptr_declarator | array_declarator ) ')' arglist array_declarator: array_declarator_base | '(' array_declarator ')' array_declarator_base: ( ptr_declarator | simple_declarator | func_ptr_declarator ) subscript_list subscript_list: ('[' [NAME | Intnumber] ']')+ ptr_or_ref: '*'+ ['&'] | '&' arglist: '(' [parameters] ')' parameters: declaration (',' declaration)* typename: 'int' | 'float' | 'char' | 'void' |
This grammar can be used to accurately parse following expressions:
int x; int *x; int (x); int (*x); int ((*x)); int ((*x)()); int ((*x)())[]; int ((*x)(int y))[]; int ((*x)(int y, char z[4]))[]; int ((*x)(int y, char z[4]))[][1]; char **foo [][8]; char (**foo [][8]); char (**foo [][8])(); char ((**foo [][8])()); char *((**foo [][8])()); char (*(**foo [][8])()); char (*(**foo [][8])())[]; char *(*(**foo [][8])())[]; |
There is still some overhead because we have pairings like `simple_declarator` and `simple_declarator_base`. These distinctions have the only purpose of putting the basic form into parentheses.
A parse tree display for
int (*x)(); |
looks like this in EasyExtend:
declaration -- S`116994 -- 258 typename -- S`117008 -- 272 NAME -- T`116737 -- 1 L`1 int declarator -- S`116995 -- 259 func_ptr_declarator -- S`117000 -- 264 func_ptr_declarator_base -- S`117001 -- 265 LPAR -- T`116743 -- 7 L`1 ( ptr_declarator -- S`116998 -- 262 ptr_declarator_base -- S`116999 -- 263 ptr_or_ref -- S`117005 -- 269 STAR -- T`116752 -- 16 L`1 * declarator -- S`116995 -- 259 simple_declarator -- S`116996 -- 260 simple_declarator_base -- S`116997 -- 261 NAME -- T`116737 -- 1 L`1 f RPAR -- T`116744 -- 8 L`1 ) arglist -- S`117006 -- 270 LPAR -- T`116743 -- 7 L`1 ( RPAR -- T`116744 -- 8 L`1 ) SEMI -- T`116749 -- 13 L`1 ; |
Still much clutter but now it is easy to identify the semantically relevant entities and search for them. In EasyExtend we use simple functions like `find_node` and `find_all` which keep a CST node and a `node id` parameter and perform a depth-first search.