Easy Extend



                       


                 
                       Author: Kay Schluehr
                       Mail: easyextend@fiber-space.de
                       Date: 16.05.2006
                       Version: 1.5

                                

Abstract


EasyExtend is a constructive approach to extending the Python language in pure Python. EasyExtend is developed as a Python framework that depends only on tools provided by the CPython interpreter suite ( compiler ) and the standard library, as well as some pieces of code borrowed from the PyPy project. In contrast to toolkits for writing extension modules in C ( or RPython in the future ), the EasyExtend framework is dedicated to extending the language itself by adding new grammar rules and transforming parse trees. Acting directly on the nodes of syntax trees makes EasyExtend safe and extensible. Moreover, the parser and the transformations are considerably fast. While EasyExtend can obviously be used to define "little languages" embedded in Python, it can also be used to create Python oriented tools, such as a code coverage tool based on code generation. Examples of both use-cases will be provided.

Overview


0. Where to find it

    How to find the source tree and what it contains.

1. EasyExtend as a Grammar transformer

    Explains the very basics of EE - the world of CST surgery. It might not be delicious to see blood
    but it's no reason to faint.

2. Fibers

    We need to label our extensions. From now on we speak about fibers. In this chapter fibers will be dissected.

3. The console is your friend

    Starting an interactive console that cooperates with your fiber definitions should work immediately.

4.  If you love mankind...

    ...you write testcases and accessible documentation ( even if it is *bad* english ;) and...

0. Where to find it?

EasyExtend is distributed as a Python package and will be installed into the site-packages directory of your Python distribution using a distutils setup.py install script. The package can be downloaded here. It is a small package containing just a Python parser and a few basic framework modules. It is self-contained in the sense that it does not depend on any 3rd party packages outside the stdlib.

The development of extension languages is separated from core EE and they can be installed independently. The examples provided together with this documentation are distributed in a FibersAndDocs archive at the download page. The Windows installer instead selects a dedicated directory. It is as loosely coupled to the particular Python XX distribution as possible and can be found at <MyPythonInstallation>\EE together with the subdirectories \doc and \Fibers. After installation the PYTHONPATH environment variable has to be set to the Fibers directory. Under Windows this step can be omitted since the registry will be updated.

For running testcases of the Gallery extension language you might switch to the gallery\test directory and call:
 
    C:\Python24\EE\Fibers\gallery\test>python ..\fibers.py test_all.py

Similarly, you can check out the code coverage tool:

    C:\Python24\EE\Fibers\coverage\test>python ..\fibers.py test_demo.py


1. EasyExtend as a Grammar transformer


1.1 Introduction

EasyExtend starts deep down on the grammar level. Each Python source distribution contains a rather unremarkable file simply called Grammar that defines Python's syntax rules. The CPython distribution contains a parser suite that reads the grammar file and creates a parser table. The parser table generator is called pgen and the resulting parser table is a long list of C structures generated into graminit.c. The parser table is accessed by a deterministic finite automaton (DFA) that parses Python source files into concrete syntax trees (CSTs) reflecting the grammar structure directly ( parsermodule.c ).

Luckily people felt the need to translate this parser chain to pure Python, and recently the PyPy project made the code easily accessible to everyone. The PyPy package pypy/module/parser ( release 0.7 ) contains the Python parser that is also used by EasyExtend, with just a few slight modifications omitting PyPy related imports and names. This parser, together with the stdlib's tokenizer, is considerably fast, so there is no intention to replace it by the C equivalent in the future.

1.2 Grammar extensions

Each extension language is defined by its own grammar rules. Analogous to CPython's graminit.c, EasyExtend uses PyPgen.py to generate the file PyGrammar.py, which contains the DFA parser rules, as well as a new instance of the module symbol.py. The symbol.py module associates names of grammar rules with numbers that are used as tokens in PyGrammar.py. EasyExtend uses the native CPython bytecode compiler, which takes care of the validity of the grammar rules. This compiler still relies on graminit.c and the native parser chain. This means that the existing sequence of nonterminals as it is defined in the stdlib's symbol.py must not be changed! Only proper extensions of symbol.py are permitted. Those can easily be achieved by respecting the following rules:

    A. You may alter the structure of an existing grammar rule but never delete it or modify the nonterminal identifier.
    B. If you define a new grammar rule you have to append it at the end of the file.

Following (B), the PyPgen tool, which is responsible for creating the token sequence, creates new, not yet used token numbers that are appended to the symbol.py file.

Below a commented section of the Gallery Grammar ( compliant with Python 2.4 ) is shown:

# Grammar for Gallery

...

# A modified Python grammar rule
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | funcdef | classdef | on_stmt |
               repeat_stmt | switch_stmt

...

# new grammar definitions
# add your own rules here:


on_stmt: 'on' NAME '=' test ':' suite ['else' ':' suite]
repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT | test NEWLINE )

switch_stmt: 'switch' expr ':' NEWLINE INDENT case_stmt DEDENT ['else' ':' suite]
case_stmt: 'case' expr ':' suite ('case' expr ':' suite)*
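
Rule (B) then yields a symbol.py whose tail simply continues the stdlib numbering. A sketch of what the generated tail might look like (the concrete numbers below are made up for illustration):

```python
# tail of a generated symbol.py for the Gallery rules above
# (numbers are illustrative; PyPgen appends them after the last stdlib nonterminal)
on_stmt     = 334
repeat_stmt = 335
switch_stmt = 336
case_stmt   = 337

# like the stdlib symbol.py, keep a reverse mapping from number to rule name
sym_name = {}
for _name, _value in list(globals().items()):
    if isinstance(_value, int) and not _name.startswith("_"):
        sym_name[_value] = _name
```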




1.3 The CST module

Once an extension grammar is defined, the parser ( found in EasyExtend/parser ) is used to generate a concrete syntax tree (CST) from a language module ( implemented in eecompiler.py ). If you want to see how it works in detail, create a module named test.py with a few definitions and another module test_driver.py into which you copy the following code snippet; put them both in the /gallery directory and observe what happens when running test_driver.py.

from EasyExtend.parser.StdTokenizer import StdTokenizer
import EasyExtend.parser.DFAParser as DFAParser
import PyGrammar
import symbol

fileObj   = open("test.py","U")
tokenizer = StdTokenizer("test.py", fileObj.readline)
parseTree = DFAParser.parsetok(tokenizer, PyGrammar.grammarObj, symbol.file_input)
print parseTree

The parseTree object is represented as a nested list that has the same structure as the lists created by Python's stdlib parser package. If you put the expression a+b into test.py you will obtain almost the same list as from parser.suite("a+b").tolist(). The only difference may be additional line number information in the expression parsed from the file.

1.3.1  CST Synthesis

The next step after creation of the CST is tree transformation. Any CST of an extension language F has to be transformed back into a Python CST. The module cst.py defines functions used to create CSTs constructively, i.e. each production rule of Python's grammar is encoded in a function. Using cst functions will always create valid CSTs. The system of functions defined in cst.py can be considered a somewhat awkward to use applicative functional language. Application can be smoothed for the programmer using wrappers defined in surgery.py. Creating those is currently work in progress. EasyExtend refrains from creating ASTs from extension languages since this would introduce an additional transformation step.

Instead of the straight transformation

             Module(F).py -> CST(F) -> CST(Py) -> Module(Py).pyc

we would have the modified transformation chain

             Module(F).py -> CST(F) -> CST(Py) -> AST(Py) -> Module(Py).pyc

which does not eliminate CST handling but introduces an additional API for ASTs.

Acting on the CST level has shown an additional advantage: source code is represented more directly. Back transformation from CSTs to source code is trivial in EE ( see the cst2source.py module ) and necessary when it comes to customizing the Python console.

Each token of Python's symbol.py is mapped 1-1 and name by name onto a function of cst.py. So it is quite easy to
relate Python's Grammar, Python's symbol.py and cst.py:

    Grammar:    single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
    cst.py:     def single_input(*args): BLOCK
    symbol.py:  single_input = 256

    Grammar:    decorator: '@' dotted_name [ '(' [arglist] ')' ] NEWLINE
    cst.py:     def decorator(*args): BLOCK
    symbol.py:  decorator = 259


The surgery.py module extends the cst.py module by defining wrapper functions, functions for searching in CSTs etc. It is usually surgery.py that gets imported by EE clients whenever CST information is needed.

Example: create the CST of the following simple function using surgery.py

    def f(x):
        pass

from EasyExtend.surgery import *

PASS     = any_stmt(pass_stmt())    # any_stmt(..) is a surgery.py defined wrapper
ARG      = fpdef(Name("x"))
VARARGS  = varargslist(ARG)
PARAMS   = parameters(VARARGS)
SUITE    = suite(PASS)
f_cst    = file_input(stmt(compound_stmt(funcdef(Name("f"),PARAMS,SUITE))))


The function any_stmt is not an image of a symbol.py token but a wrapper used to create a stmt node from an arbitrary concrete statement node ( here pass_stmt ). The surgery.py module also defines a wrapper function CST_Function that simplifies the definition of f_cst considerably.

from EasyExtend.surgery import *

PASS   = any_stmt(pass_stmt())
BODY   = suite(PASS)
f_cst  = file_input(CST_Function(BODY,"f",("x",)))

While this is very close to AST handling the underlying objects are still CSTs.

Check the correctness of your CST:

import parser

p_tree = parser.suite("def f(x):\n  pass").tolist()

>>> f_cst == p_tree
True


1.3.2  CST Analysis

Surgery wouldn't be very useful if it didn't contain functions for analyzing CSTs. If you look e.g. for a symbol.repeat_stmt node in a CST of Gallery, the surgery.py module provides the function find_node and the generator find_all for this purpose:

find_node( tree, node_id, level=10000)
    Seeks a node with id = node_id in tree. It returns the first node found in a depth-first search, or None.
    The maximal depth of the searched tree is constrained by level, which has a high value by default. If level
    is set to 0 only the direct subnodes of tree will be inspected.

find_all( tree, node_id, level=10000)
    Returns a generator that seeks and yields any node with id = node_id in tree. The search mode is depth-first.
    The maximal depth of the searched tree is constrained by level, which has a high value by default. If level
    is set to 0 only the direct subnodes of tree will be inspected.
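
Since CSTs are plain nested lists, the two functions can be sketched in a few lines. The following is a guessed re-implementation matching the documented semantics, not EE's actual code:

```python
# sketch of find_all / find_node over nested-list CSTs (assumed EE semantics)
def find_all(tree, node_id, level=10000):
    """Depth-first generator over all subnodes of tree with the given node id."""
    for subnode in tree[1:]:
        if isinstance(subnode, list):
            if subnode[0] == node_id:
                yield subnode
            elif level > 0:
                # descend only while the depth budget is not exhausted
                for hit in find_all(subnode, node_id, level - 1):
                    yield hit

def find_node(tree, node_id, level=10000):
    """First hit of find_all, or None if no such node exists."""
    for hit in find_all(tree, node_id, level):
        return hit
    return None
```

With a toy tree [257, [258, [259, 'x']], [259, 'y']], find_node(tree, 259) returns the deeper [259, 'x'] first (depth-first order), while level=0 only sees the direct child [259, 'y'].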

1.3.3  Some truths about the CST module

Any CST generating function checks the validity of its input nodes by executing an assert statement on the expected node type. Honestly, creating and testing the cst module is boring and tedious, but now it has been done, and small variations for different Python releases can be added quickly. One aspect of both usage and implementation is redundancy elimination. Whenever a terminal symbol such as a keyword or punctuation can be omitted because it can be deduced from context, the omission is mandatory. Needlessly passing a terminal symbol into a function will likely cause an exception.

Examples:

    1) funcdef: [decorators] 'def' NAME parameters ':' suite

        'def' and ':' shall be dropped.

    2) fpdef: NAME | '(' fplist ')'

        '(' and ')' shall be dropped.

    3) raise_stmt: 'raise' [test [',' test [',' test]]]

        the 'raise' keyword as well as all commas shall be dropped.

    4) import_from: 'from' dotted_name 'import' ('*' | '(' import_as_names ')' | import_as_names)

        The keywords 'from' and 'import' shall be dropped. Using either '*', '(' or import_as_names
        as the subsequent symbol is mandatory because the expression context has to be established.


Probably the safest way to use the CST generator functions is to copy some pieces of test code from test_surgery.py and modify them for your own purposes.
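
The validity checks and the redundancy elimination can be illustrated with a toy trio of constructors. Everything below is invented for illustration (a three-rule mini grammar with made-up node ids); cst.py's real functions cover the full Python grammar:

```python
# toy cst.py-style constructors over nested lists (hypothetical node ids)
PASS_STMT, SMALL_STMT, SIMPLE_STMT = 270, 269, 268   # illustrative numbers

def pass_stmt():
    return [PASS_STMT, 'pass']

def small_stmt(node):
    # validity check in the style of cst.py: assert on the expected node type
    assert node[0] == PASS_STMT, "unexpected node type %s" % node[0]
    return [SMALL_STMT, node]

def simple_stmt(node):
    # the trailing NEWLINE terminal is deducible from context, so the caller
    # never passes it in -- redundancy elimination as described above
    assert node[0] == SMALL_STMT, "unexpected node type %s" % node[0]
    return [SIMPLE_STMT, node, 'NEWLINE']
```

Calling simple_stmt(small_stmt(pass_stmt())) builds a valid (toy) node, while passing a wrong node type triggers the assert, just as the real cst functions do.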

1.4 A modified compiler

One of the more displeasing surprises with early EasyExtend has been line number confusion. The CPython compiler used by EasyExtend has a subtle way of creating a table that relates bytecode offsets and line numbers for debugging and traceback purposes. The underlying assumption of the table creation process is that both kinds of numbers increase monotonically.
This assumption conflicts with the non-monotonic character of general transformations. In the case of EasyExtend the bytecode increments shall be correlated with the line numbers of the extension source, not with the Python code resulting from translation that finally gets compiled. So the algorithm for creating and retrieving pairs of bytecode increments and line number offsets is not applicable to EasyExtend. Read the following compiler document if you are interested in more details.
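
A toy encoder makes the monotonicity assumption tangible. This is a simplified sketch of the idea only, not CPython's actual co_lnotab format:

```python
# toy model of the lineno table: pairs of (bytecode increment, line increment)
def encode_increments(pairs):
    out = []
    for byte_incr, line_incr in pairs:
        # the monotonicity assumption: both increments are never negative
        assert byte_incr >= 0 and line_incr >= 0, "monotonic increase assumed"
        out.extend([byte_incr, line_incr])
    return bytes(out)
```

Transformed code whose generated lines jump backwards relative to the original source would need a negative line increment, which the scheme above cannot express.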


 
                 

2. Fibers

                                                                                                             

2.1  Some guiding words

Fresh words are rare, and you have to reuse existing ones with the effect of producing noise and ambiguities - at least sometimes. I coined fiber as a term for extension languages ( no, not language extensions encoded in C or other languages! ). You might object that it is already used as a synonym for microthread in concurrency contexts, but I don't worry a lot, since microthread is already quite established in the Python domain, e.g. in the Stackless branch. Besides the imaginative snaky imagery from Indian mythology, the term fiber is a reminder of the technical use of fiber in mathematics, such as abstract algebra or differential geometry. In particular the latter introduces the notion of a fiber over a base space and a fiber bundle, where the fiber, the base space and the fiber bundle are all spaces, and the fiber bundle is glued together smoothly from fibers: one fiber for each point of the base space. The allegorical translation into the Python area makes Python the base space and extension languages fibers. Defining a glue mechanism for fibers is currently an open problem.

2.2 The framework

If you consider a framework an "application with holes", the holes of the EE framework are the particular fibers that contain the language definitions. While the EasyExtend package is a class and function library used by any fiber, the complete fiber is passed back to EasyExtend as a module! So the fiber module serves as the language interface, which has to satisfy a few conventions. A minimal fiber looks like this:

# minimal fiber that provides EE framework interface
import EasyExtend   # EasyExtend package
import conf         # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar    # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''


Since we have to pass the fiber to EE we need to import it somehow. This can be done in a separate main.py, for instance, or we import the fiber within the fiber itself and pass it afterwards:

# minimal fiber that provides EE framework interface
import EasyExtend   # EasyExtend package
import conf         # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar    # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''

if __name__ == '__main__':
    import sys
    import minimal_fiber as fiber     # import this fiber itself
    if len(sys.argv) > 1:
        py_module = sys.argv[1]
        EasyExtend.run_module(py_module, fiber)
    else:
        console = EasyExtend.create_console("MinimalFiber", fiber)
        console.interact()

As long as no deviant grammar is defined, the behaviour of the minimal fiber shall be the same as that of the Python interpreter.
  

2.2.1  eetransformer

The eetransformer module provides mappings between the CST of a fiber F and a Python CST. For this purpose the F-CST is traversed and the F specific nodes are replaced. Each node that shall be visited has to be made public using a method with a signature like this one, defined in the Gallery fiber:

handle_repeat_stmt( node)
    Transforms the content of node, where node is the repeat_stmt node of the fiber CST. According to the
    transformation rule for the repeat statement defined for Gallery it has to be replaced by a while_stmt node
    of a Python CST.

An essential part and the main difficulty of the transformer algorithm is moving the returned nodes to the right place in the CST. This aspect is not solved analytically for every kind of node, but it turned out that a stable heuristics is applicable. Depending on the node type that shall be replaced, return either:
  • a node of type test, which represents the most general form of a Python expression in the CST, or
  • one or more nodes of type stmt, which represent arbitrary Python statements in the CST.
For convenience the surgery module offers the functions any_test and any_stmt that wrap expressions or statements of any kind such that test and stmt nodes result:

any_test( node)
    Returns a correctly built test node whenever the input is one of the following nodes: test, and_test, lambdef,
    not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom. Additionally,
    each node that is part of the atom grammar rule at the bottom of this chain, such as Name, Number etc., fits
    as well.

any_stmt( node)
    Returns a correctly built stmt node whenever the input is one of the following nodes: stmt, simple_stmt,
    compound_stmt, small_stmt, if_stmt, for_stmt, while_stmt, try_stmt, break_stmt, continue_stmt, return_stmt,
    raise_stmt, yield_stmt, expr_stmt, print_stmt, del_stmt, pass_stmt, flow_stmt, import_stmt, global_stmt,
    exec_stmt, assert_stmt.
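
The wrapping itself is mechanical: each node in such a chain has a single-child production, so a node can be lifted step by step toward the root. A sketch with an abbreviated chain and string node ids (the real any_test works on numeric symbol.py ids and also handles lambdef and the atom contents):

```python
# sketch of the idea behind any_test: lift a node along a one-child grammar chain
CHAIN = ["atom", "power", "factor", "term", "arith_expr", "shift_expr",
         "and_expr", "xor_expr", "expr", "comparison", "not_test",
         "and_test", "test"]

def any_test(node):
    """Wrap node in single-child parent nodes until a 'test' node results."""
    assert node[0] in CHAIN, "cannot lift %r to test" % node[0]
    for parent in CHAIN[CHAIN.index(node[0]) + 1:]:
        node = [parent, node]   # one wrapper per skipped production
    return node
```

An atom node gets twelve wrappers, a test node is returned unchanged; the original node always survives as the innermost list.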
Example: in Gallery the repeat loop is transformed as follows:

class FastTransformer(Transformer):
    def handle_repeat_stmt(self, node):
        "repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT |
                                                      test NEWLINE )"
        _suite = find_node(node,symbol.suite)
        _test  = find_node(node, symbol.test, level=0)
        _until = if_stmt(_test, suite(any_stmt(break_stmt())))
        _suite.insert(-1, any_stmt(_until))
        return any_stmt(while_stmt(any_test(Number(1)),_suite))
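
Hand-translating the result shows what the transformed CST evaluates to; the repeat loop from the console example in section 3 behaves like this plain Python:

```python
# Python equivalent of the Gallery loop:
#     x = 0
#     repeat:
#         x += 1
#     until: x == 3
x = 0
while 1:
    x += 1
    if x == 3:      # the generated if_stmt inserted at the end of the suite
        break
assert x == 3
```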


2.2.1.1  Being __like_main__

With EasyExtend the combination python main.py represents a specialized interpreter, not python alone. But running e.g.
python main.py gallery_module.py has the drawback that the convenient block statement of gallery_module.py

if __name__ == '__main__':
      ...


if present in gallery_module.py, will not be executed, since the __main__ module is not gallery_module.py but main.py! The EE solution is a generic transformation of the block statement into a function called __like_main__. This function is not meant to be defined explicitly by the user ( please don't define it, otherwise I will select a weirder name ) but is created by the eecompiler during parse tree transformation:


if __name__ == '__main__':
    BLOCK  

 ==> 
def __like_main__():
    BLOCK

if __name__ == '__main__':
    __like_main__()

Running the compiled module gallery_module.pyc from python causes execution of __like_main__() because gallery_module.py is now the __main__ module. If gallery_module.py is instead called via python main.py, the __like_main__ function is called by main.py:


# main.py

...

if __name__ == '__main__':
    mod_name = sys.argv[1]
    mod = create_module(mod_name)
    try:
        mod.__like_main__()
    except AttributeError:
        pass
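
The call pattern of main.py can be mimicked with a stand-in module object; FiberModule, PlainModule and run_like_main below are hypothetical names invented for illustration:

```python
class FiberModule:
    """Stands in for a fiber-compiled module that got the generic transformation."""
    @staticmethod
    def __like_main__():
        return "block executed"

class PlainModule:
    """Stands in for a module without an if __name__ == '__main__' block."""

def run_like_main(mod):
    # main.py's call pattern: run the block if present, otherwise do nothing
    try:
        return mod.__like_main__()
    except AttributeError:
        return None
```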
      



2.2.2  eeimporter

The eeimporter module and the Importer class defined in it play an important role in defining the boundaries of a fiber. Let m = m(F) be a module of the fiber F; then we need to know how m has to import dependent modules. There are some basic alternatives:
  1. All files belong to a fiber F. This is basically what Python asserts.
  2. Define a FIBERPATH environment variable. Each module in the specified directories or their subdirectories will be imported as a module of fiber F.
  3. Define a regular expression to identify naming characteristics. That's how the coverage fiber defines dependencies between a module test_A.py and a module A.py.
  4. Use a general criterion. Gallery assumes that Fibers\gallery is the root directory where gallery related modules are defined. Some of those modules are excluded from import.
  5. Others to be defined.
An instance of the Importer class or a derived class will be added to sys.meta_path to create an "import hook". This is a mostly undocumented Python feature that I'm not going to explain here. What I know about the behaviour of those import hooks I derived from experimentation and from reading the test code of test_importhooks.py.

Note: If you try to import a module that was already imported at another place Python takes the module from sys.modules. The import hook won't work in this case.

Here is a list of modules that are already imported when the coverage fiber's Importer registers itself at sys.meta_path. Ironically, all of the framework modules are already imported at that point, so the quality of the EE unittests cannot be checked with coverage - at least not directly.

The Importer class provides two methods that are relevant if you want to define your own custom imports:

register_importer( )
    Registers this importer at sys.meta_path.

accept_module( fs_path)
    Defines fiber specific acceptance rules for the module at fs_path, where fs_path is the file system path of
    the module. The accept_module method replaces find_module as the method that shall be overridden in Importer
    subclasses.
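
Written against today's importlib, the same mechanism might look as follows. This is a sketch under modern Python 3 conventions (EE itself used the older find_module protocol, and accept_module below checks module names instead of file system paths):

```python
import sys
import importlib.abc
import importlib.util

class SketchImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Minimal sys.meta_path hook, illustrating EE's Importer idea."""
    def __init__(self, accepted):
        self.accepted = accepted

    def register_importer(self):
        sys.meta_path.insert(0, self)

    def accept_module(self, fullname):
        # fiber specific acceptance rule; here: a plain name check
        return fullname in self.accepted

    def find_spec(self, fullname, path=None, target=None):
        if not self.accept_module(fullname):
            return None          # fall through to the regular import machinery
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None              # use the default module creation

    def exec_module(self, module):
        module.answer = 42       # a real fiber would compile transformed source

SketchImporter({"fiber_demo"}).register_importer()
import fiber_demo
assert fiber_demo.answer == 42   # served by the hook, not by any file
```

Note that the sys.modules caveat from above applies here too: once fiber_demo is cached, the hook is no longer consulted.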

2.2.3  The conf module

The conf module is used to define some basic fiber related settings. Currently only the prompt attribute and the ext attribute are supported, where prompt refers to the sys.ps1 that is shown in the interactive console and ext refers to the bytecode file extension.

import gallery

>>> gallery.conf.prompt
'gal> '



2.2.4  fibermod

Fibers that define their own grammar rules / tokens need specialized versions of Python's stdlib modules symbol.py, token.py and tokenize.py. The symbol.py module is regenerated whenever a Grammar file changes. The token.py and tokenize.py modules have to be adapted by hand when new punctuation is added. I used to define a fibermod package in the root package of the fiber that contains copies of those modules. The modules are imported in the fiber's __init__.py where they substitute the standard imports.

2.4  Fiber compiled modules

We have seen how to create a fiber and how to execute modules with specific language definitions in the fiber's context. But although we needed the fiber to compile those modules, they still compile to ordinary Python bytecode, and we should be able to mix the compiled modules freely. Thus both of these execution calls should work in principle:

                 (A)  python fiber.py gallery_module.py

                 (B)  python gallery_module.pyc

In case (A) the whole EE framework and the fiber module are available all along. Functions can be defined in the fiber and function calls can be generated into the module. But those functions/objects don't reside in the module but somewhere else. Since they are language specific they should be accessible by all modules of the language. For this the EE framework offers the __publish__ fiber attribute.

__publish__ : list
    The fiber attribute __publish__ is a list of names. It is read at startup, and for each name in __publish__
    the following assignment is made: __builtin__.__dict__[name] = getattr(fiber, name).
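
A minimal sketch of the publishing protocol (builtins is Python 3's name for __builtin__; DemoFiber, gal_range and publish are invented for illustration):

```python
import builtins   # __builtin__ in the Python 2 of this text

class DemoFiber:
    __publish__ = ["gal_range"]

    @staticmethod
    def gal_range(n):
        return list(range(n))

def publish(fiber):
    # the startup protocol: every published fiber attribute becomes a builtin
    for name in fiber.__publish__:
        builtins.__dict__[name] = getattr(fiber, name)

publish(DemoFiber)
assert gal_range(3) == [0, 1, 2]   # visible as a bare name, no import needed
```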
Now in case (A) the execution of the __publish__ protocol can simply be triggered by the startup functions involving the fiber. The defined objects are immediately accessible in gallery_module.py. But in (B) the names are not present and execution will likely fail. Thus we have to import the fiber explicitly in the module. In EE we can abstract from the import of the concrete fiber and use the statement

       import __fiber__

instead. This import statement is more a declaration than an actual execution. It is transformed by EE according to:


import __fiber__  ==>  import <fiber_name>.<fiber_mod> as __fiber__

The <fiber_name> and <fiber_mod> placeholders are replaced by the actual fiber values. The new module valued __fiber__ variable can be used to reflect on the fiber. Using import __fiber__ in place of a more concrete import has another advantage. Suppose a new fiber F2 extends a given fiber F1 and we want to use names defined in F2 in a module M that was compiled with F1. Then we simply have to recompile M using F2 without touching the fiber import.

Note: The import __fiber__ declaration shall be placed in the module namespace. Don't use it within a class or function definition.

You might check this out by importing e.g. the test_gallery module after being compiled to bytecode.

2.5  Suffixes

There is currently no support for alternative suffixes. While I would prefer using an ee suffix over the py suffix for fibers, suffixes are not configurable in CPython but hardcoded in import.c.

3. The console is your friend

The eeconsole module of the EE package provides what is needed to run an interactive console that is aware of a particular fiber. The eeconsole is based on Python's standard InteractiveConsole object, which executes only Python commands. If you type a fiber command at the console prompt it has to be re-translated into a Python command ( or a sequence of Python commands ). So eeconsole essentially realizes an interplay between eetransformer, which creates Python CSTs from fiber definitions, and cst2source, which restores Python source code from CSTs.
   

3.1  Launching a fiber specific console


You can define a prompt variable in the conf.py module. This variable is accessed by the eeconsole and its value
replaces the standard prompt sys.ps1. The second prompt sys.ps2 is rendered as a dotted line whose length
corresponds to that of sys.ps1. Starting the gallery console under Windows may look like this:

c:\Python24\EE\Fibers\gallery>python fiber.py
(GenericConsole)
gal> x = 0
gal> repeat:
....    x+=1
....    print x
.... until: x==3
....
1
2
3
gal>

3.2  Debugging

Debugging with the pdb module should work as usual, but don't forget to pass globals() explicitly into pdb.run().

Example: Using module gallery.test.funcs and function

def f_repeat_until(x):
    repeat:
        x-=1
    until:
        x<7
    return x

we get the following session trace on the Gallery console:

gal> import gallery.test.funcs as funcs
gal> import pdb
gal> pdb.run("funcs.f_repeat_until(7.5)",globals())
> <string>(1)?()
(Pdb) s
--Call--
> c:\python24\lang\gallery\test\funcs.py(6)f_repeat_until()
->
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(8)f_repeat_until()
-> repeat:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(9)f_repeat_until()
-> x-=1
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(10)f_repeat_until()
-> until:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(12)f_repeat_until()
-> return x
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()
(Pdb) n
--Return--
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()->6.5
(Pdb) n
--Return--
> <string>(1)?()->None
(Pdb) n
gal>




4. If you love mankind...

... think for yourself and go for enlightenment.

Out of phase

The dynamics of language evolution may devalue any of your fiber inventions/intentions, either by providing analogous language semantics in the base language, which makes the fiber most likely superfluous, or by introducing new conflicting syntax. The most amenable compromise for a peaceful coexistence is the creation of a habitat. This is the main goal of the Transit meta-fiber, which is currently in the concept phase. Transit needs some reserved syntax of its own to let other fibers hook safely into the base language. In a sense Transit tries to express an inter-DSL design pattern. If there is some substance in this idea, sooner or later a little piece of syntax space will be requested for Transit. But would a Python Enhancement Proposal ( PEP ) be the right approach? Since it is about a Python Not-Enhancement, should the hypothetical PEP be written for ultimate failure?

With a smile

Are we moving towards a "language oriented programming" paradise, or into a DSL hell created by liberated Pythonistas? Personally, I doubt both. Designing means dancing in fetters, as Walter Gropius said. If EasyExtend helps with learning some steps of the software design dance, it is well done.