EBNF Grammars

Definition

A context free grammar is a set of recursive rules describing the syntax of a programming language. The rules are also called productions. Each production
has the following form

        grammar: ( rule | NEWLINE )*
    rule: NAME ':' rhs NEWLINE
    rhs: alt ( '|' alt )*
    alt: item+
    item: '[' rhs ']' | atom [ '*' | '+' ]
    atom: '(' rhs ')' | NAME | STRING

This is kind of a self description of EBNF style rules we are using in EasyExtend.

The names NAME,NEWLINE and STRING have no further productions. They are called terminals. The names of productions like rule, rhs or atom are called nonterminals.
By convention we use lower case letters for nonterminals and upper case letters for terminals. The grammars grammar shows us how to build terms or expressions in the grammar
language formalism. This formalism is called EBNF which means Extended Backus Naur Form [1].

The meaning of those terms is as follows

Term	Meaning
A B	Sequence: match expr using first A and then B
A \| B	Alternative: match expr using either A or B
[A]	Optional: term can be used or omitted in a match of expr
A*	Optional repitition: any sequence of terms A can be used to match expr but also none
A+	Mandatory repitition: any sequence of terms A can be used to match expr but at least one is required.
(A)	Grouping: used to separate one term from another one
' ...'	Single quote: terminal literal
" ... "	Double quote: terminal literal

[1] There are many equivalent ways of representing EBNF but they differ just in the way punctation is used and it is easy to transform them into each other.
The one we use in EasyExtend is described by the grammar above. It conforms with Pythons own grammar description style.

EBNF and top down parsing

EBNF grammars are used for top down parsing. This means one starts with a top level nonterminal e.g. grammar and tries to match an expression using the right hand side of grammar. This is

        ( rule | NEWLINE )*

the parser has to interpret the rule as follows: use either rule or NEWLINE arbitrary many times to match the expression. So the parser can branch into matching the expression using the terminal
NEWLINE or the nonterminal rule. For the nonterminal rule we can repeat the attempt to match the expression and try

        NAME ':' rhs NEWLINE

The top down parser will always start with the leftmost symbol it finds and matches the expression. Suppose we want to match the following modified grammar rule:

        expression = "grammar: rule | NEWLINE*"

Here we go:

        NAME ':' rhs NEWLINE
        ^
        NAME ':' rhs NEWLINE          -- grammar
              ^
        NAME ':' rhs NEWLINE          -- :
                 ^
                 alt ( '|' alt )*
                 ^
   item+
                 ^
                 '[' rhs ']' | atom [ '*' | '+' ]
                   ^
                              '(' rhs ')' | NAME | STRING
                                ^
                              NAME               --   rule
                 alt ( '|' alt )*
                        ^
                       '|'                                     --   |
                 alt ( '|' alt )*
                   ^
               item+
                           ^
                           '[' rhs ']' | atom [ '*' | '+' ]
                     ^
                             '(' rhs ')' | NAME | STRING
                                 ^
                                            NAME      --   NEWLINE
     '[' rhs ']' | atom [ '*' | '+' ]
                                         ^
                                        '*'        --   *

The parse tree we get has the following shape

Precedences

Looking at the parse tree above we notice immediately that item binds atom and the star symbol '*' together. The binding strength or precedence of '*' to the term it follows is higher than the binding strength of | to the subsequent term. The associativity of the operators | and * is entirely encoded in the way the production rules refer to each other.

Having rules ordered in the way

                           mult_expr: add_expr ('*' add_expr)*
                           add_expr: NUMBER ('+' NUMBER)*

implies that '+' has a higher precedence than '*'.

Having them ordered the other way round

                           mult_expr: NUMBER ('*' NUMBER)*
                           add_expr: mult_expr ('+' mult_expr)*

implies that '*' has a higher precedence than '+'.

The advantage of encoding precedences this way is that we don't need any additional explicit constructions e.g. numerical precedences. The disadvantage is that we eventually need more rules.

Pythons own grammar doesn't use numerical but implicit precedences. EasyExtend uses Pythons grammar and extends it. It uses implicit precedences only.

Keywords

In Python we have rules like this one:

return_stmt: 'return' [testlist]

Obviously 'return' is a terminal symbol. But unlike NAME that covers all alphanumerical sequences starting with an alphabetic character words like 'return' are special when used in the grammar. They are
exactly the keywords of the language and they can't be used as identifiers.

Comments

You can use Python style line comments

# bla bla bla ...

everywhere in the grammar files. They are not mentioned in the set of grammar rules above because they get filtered out in the token stream. More about this in the section Lexing & Parsing with EasyExtend.

Summary

EasyExtend uses EBNF as a language to encode language syntax and a top down parser for parsing source code. Precedences are encoded implicitely in the recursion of the grammar rules. Keywords of the language are alpha numeric strings used immediately in the grammar.