Archive for March, 2009

Contextual Keywords

Posted in csharp, EasyExtend, Grammars, Python on March 22nd, 2009 by kay – Be the first to comment

Keywords and Contextual Keywords

A language keyword is a reserved word that cannot be used as an identifier for variables, constants, classes etc. Often it is an element of language syntax, and syntactical structure is simplified by means of keywords. On many occasions a keyword can be thought of as a “semantic marker”, i.e. it denotes a location where something particularly meaningful is happening: the while keyword points to a while-loop, the class keyword to a class definition etc. Other keywords like as or in link expressions together and have a fixed location within a statement.

A contextual keyword is sort of a hybrid. It acts much like a keyword but is not reserved. The word as was such a contextual keyword in Python until it became a proper keyword in version 2.5. The C# language defines lots of contextual keywords besides the regular ones. MSDN defines a contextual keyword as:

A contextual keyword is used to provide a specific meaning in the code, but it is not a reserved word [in C#].

Contextual keywords in C# are used in the grammar description of the language just like regular keywords. For example, the add_accessor_declaration is defined by the following C# rule:

add_accessor_declaration: [attributes] "add" block

Keywords in EasyExtend

In EasyExtend keywords are implicitly defined by their use in a grammar file. The “add” string in the add_accessor_declaration automatically becomes a keyword. Technically a keyword is just an ordinary name, and the tokenizer produces a NAME token for it. This is easy to verify: when we inspect the token stream of the function definition def foo(): pass we notice that def and pass are mapped onto the NAME token:

>>> def foo(): pass
 Line  | Columns | Token Value         | Token Name    | Token Id     |
 1     | 0-3     | 'def'               | NAME          | 1 -- 1       |
 1     | 4-7     | 'foo'               | NAME          | 1 -- 1       |
 1     | 7-8     | '('                 | LPAR          | 7 -- 7       |
 1     | 8-9     | ')'                 | RPAR          | 8 -- 8       |
 1     | 9-10    | ':'                 | COLON         | 11 -- 11     |
 1     | 11-15   | 'pass'              | NAME          | 1 -- 1       |
 1     | 15-15   | '\n'                | NEWLINE       | 4 -- 4       |
 2     | 0-0     | ''                  | ENDMARKER     | 0 -- 0       |
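This is not specific to EasyExtend: CPython's standard tokenize module behaves the same way and emits keywords as plain NAME tokens. A small sketch (Python 3 syntax, which postdates the article, but the tokenizer's behavior is the same):

```python
import io
import token
import tokenize

# Tokenize the same function definition and inspect the token types.
toks = list(tokenize.generate_tokens(io.StringIO("def foo(): pass").readline))

# The keywords 'def' and 'pass' come out as NAME tokens,
# just like the ordinary identifier 'foo'.
assert toks[0].type == token.NAME and toks[0].string == "def"
assert toks[1].type == token.NAME and toks[1].string == "foo"
assert toks[5].type == token.NAME and toks[5].string == "pass"
```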

In the parse tables keywords are preserved, and we will find a state description like (‘pass’, 1, 273) which explicitly refers to the pass keyword. It is clearly distinguished from token ids of type NAME, which are used otherwise (e.g. for foo). The pass token in the token stream has precisely the following structure: [1, ‘pass’, 1, (11, 15)]. Usually the token id is all the parser needs to know, but when the parser encounters a NAME token whose token value is a keyword, the keyword itself is turned into the token type. We can summarize this in the following function:

def compute_token_type(tok):
    tok_type  = tok[0]      # the standard token type
    tok_value = tok[1]
    if tok_type == token.NAME:
        if tok_value in keywords:
            return tok_value    # tok_value is a keyword and used as a token type
    return tok_type
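The function above can be exercised standalone; here is a self-contained sketch with a toy keywords set standing in for the set EasyExtend extracts from the grammar:

```python
import token

# toy stand-in for the keyword set extracted from the grammar
keywords = {"def", "pass", "while", "class"}

def compute_token_type(tok):
    tok_type  = tok[0]      # the standard token type
    tok_value = tok[1]
    if tok_type == token.NAME:
        if tok_value in keywords:
            return tok_value    # keyword used as a token type
    return tok_type

# the 'pass' token from the stream above is typed by its own string ...
assert compute_token_type([token.NAME, "pass", 1, (11, 15)]) == "pass"
# ... while an ordinary identifier keeps the NAME token type
assert compute_token_type([token.NAME, "foo", 1, (4, 7)]) == token.NAME
```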

Contextual Keywords in EasyExtend

Unlike keywords, which can simply be determined from the grammar, contextual keywords have to be made explicit elsewhere. When I considered contextual keywords in EasyExtend I declined to create special infrastructure for them and used a simple trick instead. Suppose one defines the following token in a Token.ext file:

    CONTEXT_KWD: 'add' | 'remove'

This is just an ordinary token definition. When the token module is generated from Token + Token.ext we’ll find the following settings:

    CONTEXT_KWD = 119896


    token_map = {
        'add|remove': 119896,
        ...
    }

This obviously provides sufficient information to determine the contextual keywords from the CONTEXT_KWD token definition. We exploit it in the next function definition:

def generate_contextual_kwds(parse_token):
    try:
        ctx_kwd = parse_token.CONTEXT_KWD
        t_map   = swap_dict(parse_token.token_map)      # swap_dict turns dict keys
                                                        # into values and values into keys
        contextual_keywords = set( t_map[ctx_kwd].split("|") )
        assert contextual_keywords <= keywords
        return contextual_keywords
    except AttributeError:
        # no CONTEXT_KWD token was defined
        return set()
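Here is a runnable sketch of the whole mechanism; swap_dict and the ParseToken stand-in are illustrative reconstructions, not the actual EasyExtend code:

```python
def swap_dict(d):
    # invert a dict: values become keys, keys become values
    return {v: k for k, v in d.items()}

# toy stand-in for the generated token module
class ParseToken:
    CONTEXT_KWD = 119896
    token_map = {'add|remove': 119896}

# toy keyword set; 'add' and 'remove' appear in the grammar, so they are keywords too
keywords = {'add', 'remove', 'def', 'pass'}

def generate_contextual_kwds(parse_token):
    try:
        ctx_kwd = parse_token.CONTEXT_KWD
        t_map = swap_dict(parse_token.token_map)
        contextual_keywords = set(t_map[ctx_kwd].split("|"))
        assert contextual_keywords <= keywords
        return contextual_keywords
    except AttributeError:
        return set()

assert generate_contextual_kwds(ParseToken) == {'add', 'remove'}
assert generate_contextual_kwds(object()) == set()   # no CONTEXT_KWD defined
```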

Now we have to understand the semantics of contextual keywords. The following code illustrates the behavior of the parser in the presence of contextual keywords.

def compute_next_selection(token_stream, parse_table):
    tok = token_stream.current()
    tok_type = compute_token_type(tok)
    selection = parse_table.selectable_ids()   # provides all token ids and nonterminal ids
                                               # which are admissible at this state
    for s in selection:
        if s == tok_type or tok_type in first_set(s):
            return s
    # here we start to deal with contextual keywords
    if tok_type in contextual_keywords:
        # replace the token type of the contextual keyword by NAME
        tok_type = token.NAME
        for s in selection:
            if s == tok_type or tok_type in first_set(s):
                return s
    raise ParserError

When a keyword is needed in the particular context ( tok_type was accepted in the first for-loop ), either the tok_type itself is returned or the node id of a non-terminal that derives the tok_type ( i.e. tok_type is in the first_set of the node id ). If this fails, it is still permitted to examine the NAME token instead of the contextual keyword, in the same way as before. So if a particular context requires a contextual keyword, this keyword is provided; otherwise it is checked whether the context just requires a NAME, and the NAME is provided instead. This has the practical implication that a contextual keyword can still be used as an identifier of a variable.
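This two-pass behavior can be simulated with toy stand-ins for the parse table; selection lists and first_set below are simplified placeholders, not EasyExtend's real data structures:

```python
import token

contextual_keywords = {"add", "remove"}

def first_set(s):
    # toy first set: the nonterminal 'expr' may start with an ordinary NAME
    return {token.NAME} if s == "expr" else set()

def compute_next_selection(tok_type, selection):
    # first pass: try the contextual keyword as a keyword
    for s in selection:
        if s == tok_type or tok_type in first_set(s):
            return s
    # second pass: degrade the contextual keyword to a NAME
    if tok_type in contextual_keywords:
        tok_type = token.NAME
        for s in selection:
            if s == tok_type or tok_type in first_set(s):
                return s
    raise SyntaxError("nothing selectable")

# where the context demands the contextual keyword, it acts as one ...
assert compute_next_selection("add", ["add"]) == "add"
# ... where only a NAME fits, 'add' is usable as an ordinary identifier
assert compute_next_selection("add", ["expr"]) == "expr"
```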

CPython vs IronPython

Posted in Python on March 10th, 2009 by kay – Be the first to comment

Two video shots tell more than 2000 words.



The thunk_stmt and programmable semantics In Python

Posted in DSL, Grammars, Python on March 9th, 2009 by kay – 4 Comments

It seems as if there is just a single direction of syntactical improvement for Python these days: the introduction of Ruby style blocks. Tav suggests reusing Python’s with-statement:

with_stmt: "with" expression (["as" target] | ["do" [parameter_list]]) ":" suite

Guido responds promptly and rejects the semantic dispersion of the with-statement: no change of meaning by means of a mere magical do keyword. In Tav’s vision the do-branch of the with-statement is turned into an anonymous closure that is called by a special __do__ function. This __do__ shall also be customizable, which finally leads to DSLs:

But more enterprising __do__ functions could do interesting things like rewriting the AST of the BLOCK_CODE. Given that this is already done by many many Python libraries, it’s not much of a stretch of the imagination. The benefit? Super nice Pythonic DSLs!

Well, it is quite easy to turn this idea upside down. Instead of re-purposing a special-purpose statement like with, one can simply turn each expression and “small statement”, e.g. raise, assert, return, assignment etc., into a compound statement using the following definition:

thunk_stmt: small_stmt ':' suite

One has to elaborate on it a little and change some rules of Python’s existing grammar for LL(1) conformance:

stmt: compound_stmt
single_input: NEWLINE | compound_stmt NEWLINE
compound_stmt: ( if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | 
                 funcdef | classdef | thunk_stmt )
thunk_stmt: small_stmt ( ':' suite | (';' small_stmt)* [';'] NEWLINE )

Here the thunk_stmt is a fusion of the definition stated above and the simple_stmt of Python’s grammar. This definition doesn’t lead to programmable syntax but to programmable semantics, which is what Tav intends to support with a customizable __do__.

An obvious use case would be property definitions:

x = property():
    def get(self):
        return self._x
    def set(self, value):
        self._x = value

It simply appends a block of code to a valid assignment x = property().
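In standard Python the same effect needs a workaround. A rough sketch of what the thunk_stmt version might desugar to (the _make_x helper name is invented for illustration):

```python
class C:
    # the appended block becomes a local namespace whose functions
    # are handed to property()
    def _make_x():
        def get(self):
            return self._x
        def set(self, value):
            self._x = value
        return property(get, set)
    x = _make_x()
    del _make_x   # keep the class namespace clean

c = C()
c.x = 5
assert c.x == 5
```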

Same goes with

which might be extended to
    if emp.salary > developer.salary:
        return fireEmployee(emp)
    return extendContract(emp)

Without an appended block the expression

might simply mean emp: emp)

It should be clear that Python needs a protocol to support this, which means here that a __special__ function is attached to an object and that for user-defined classes this __special__ function can be overridden.
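The article leaves this protocol unspecified; one hypothetical shape (all names here are invented) would compile the appended block into a namespace and hand it to the object's hook:

```python
# Hypothetical protocol sketch: __special__ receives the block's
# namespace and decides what object to build from it.
class PropertyBuilder:
    def __special__(self, namespace):
        return property(namespace.get("get"), namespace.get("set"))

# stand-in for the compiled code block appended to the statement
block = "def get(self): return self._x\ndef set(self, value): self._x = value"
ns = {}
exec(block, ns)

prop = PropertyBuilder().__special__(ns)
assert isinstance(prop, property)
```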

Lonely Python is Intelligent Design and how to liberate from it

Posted in DSL, Grammars, Python on March 7th, 2009 by kay – 2 Comments

Python is doomed. Well, of course it is not yet doomed. Python is an evolutionary dead end living lonely on its own fitness plateau. It is clear that Python won’t have offspring because of significant whitespace aka indentation sensitivity. No children, no future.

You might ask: where is the news? There is none; we have known this forever and a decade, but sometimes it is in the news again. Read this post by the author of a programming language called Reia. He struggles endlessly with indentation-delimited blocks and finally abandons them for Ruby style end terminators (not before a popularity poll in the Web 2.0 age and consulting the author of the failed Logix project, who is now a Ruby advocate). Finally his language looks like Ruby, which demonstrates the evolutionary fitness of Ruby. Python, on the other hand, is a piece of Intelligent Design. Guido van Rossum managed a “beautiful set of compromises” when he designed the Python grammar. So it ends up in overspecialization, whereas languages like JavaScript flow more with nature. To illustrate this nature you might examine the file jQuery.js that implements the popular jQuery library. The impatient might immediately scroll to the last line. It’s sweet.

Some people have argued that Haskell style layouts may save the soul of significant whitespace. Transplanted to Python this would imply explicit block delimiters that can also be omitted in favor of the now classical indentation style. If you are not willing to decide, just give multiple-choice freedom to everyone.

In the rest of the article I’m discussing optional explicit delimiters for Python, built into the grammar and making the statement/expression distinction weightless. I promise not to cringe when it becomes “ugly”. Not behaving like little girls seems to be an evolutionary dead end among male, adult programmers as well, but I take it with stoicism.

Grammar Extension

atom: ( '(' [embedded_stmt|yield_expr] ')' |
        '[' [listmaker] ']' |
        '{' [dictmaker] '}' |
        '`' testlist1 '`' |
        NAME | NUMBER | STRING+ )
trailer: '(' [arglist|embedded_suite] ')' | '[' subscriptlist ']' | '.' NAME
embedded_stmt: ( embedded_funcdef |
    embedded_if_stmt |
    embedded_while_stmt |
    embedded_for_stmt |
    embedded_try_stmt |
    embedded_with_stmt |
    embedded_classdef |
    embedded_small )
embedded_small: small_stmt (';' small_stmt)*
embedded_suite: embedded_stmt+
embedded_funcdef: [decorators] 'def' NAME parameters ':' embedded_suite
embedded_if_stmt: ( 'if' test ':' embedded_suite ('elif' test ':' embedded_suite)*
                   ['else' ':' embedded_suite])
embedded_while_stmt: 'while' test ':' embedded_suite ['else' ':' embedded_suite]
embedded_for_stmt: ( 'for' exprlist 'in' testlist ':' embedded_suite
                     ['else' ':' embedded_suite] )
embedded_try_stmt: ('try' ':' suite ((except_clause ':' embedded_suite)+
                    ['else' ':' embedded_suite] ['finally' ':' embedded_suite] |
                   'finally' ':' embedded_suite))
embedded_with_stmt: 'with' test [ with_var ] ':' embedded_suite
embedded_classdef: 'class' NAME ['(' [testlist] ')'] ':' embedded_suite

This is a conservative grammar extension of the official Python 2.5 grammar. Conservative means that all expressions / statements written in Python 2.5 remain valid. The non-terminals that substitute existing ones are atom and trailer. In detail, the testlist_gexp non-terminal in atom is replaced by the newly defined embedded_stmt non-terminal, and instead of an arglist an embedded_suite can be used. The embedded_stmt non-terminal is the gateway for all kinds of statements embedded into expressions. The statement-expression distinction becomes de facto obsolete. What happens to significant whitespace? Post-processing the token stream filters whitespace inside of parentheses. Hence the token stream won’t contain NEWLINE, INDENT and DEDENT tokens within expressions. All those embedded_xxx non-terminals are only adjustments of existing ones, reflecting the absence of the mentioned whitespace tokens in expressions. Notice that you can’t mix: once you are living inside a grammatical expression, whitespace becomes definitely insignificant.

The grammar is not LL(1) and you need a stronger parser than Python’s own to handle it. There is also no behavior associated with the new definitions; semantics shall be entirely neglected and only the future shape of things highlighted. Here are a few examples:

Multiline Lambda

lambda x,y: (
    z = x+y;              # semicolons always separate small statements
    print z;
    return z
)(3, 2)

Long Assertions

assert ( if x>0:
    x == 2
        x == -2)

The Return of the Return

def foo(x):
    return ( return )

My Property

x = property(
    def fget(self):
        return self._x
    def fset(self, value):
        self._x = value
)

Function in the middle

x + (def fact(n):
     if n <= 1:
         return 1
     return fact(n-1))(7) + z