Archive for March, 2010

Inheritance and the C preprocessor

Posted in Algorithms, C on March 24th, 2010 by kay – 3 Comments

Defining n-ary trees using the C preprocessor

In this article I introduce a compile time C technique used to define inheritance. Instead of giving a lengthy motivation I’ll jump directly to the algorithm and discuss it later. I hope lovers of C and its preprocessor find it useful. #defines first!

#define TOP 0
#define SET_CHILD(n,parent) ( parent==TOP ? n: \
                            ( parent<(1<<4) ? (n<<4) + parent : \
                            ( parent<(1<<8) ? (n<<8) + parent : (n<<12)+parent)))
 
#define IS_SUBNODE(child, parent) ((child & parent) == parent)
 
#define SELECT(X, a, best) ( a > best && IS_SUBNODE(X, a)? a : best)
 
#define SELECT_FROM_5(X, a, b, c, d, e) SELECT(X, a, \
                                        SELECT(X, b, \
                                        SELECT(X, c, \
                                        SELECT(X, d, \
                                        SELECT(X, e, 0)))))
 
#define SELECT_FROM_4(X, a, b, c, d) SELECT_FROM_5(X, a, b, c, d, 0)
#define SELECT_FROM_3(X, a, b, c)    SELECT_FROM_5(X, a, b, c, 0, 0)
#define SELECT_FROM_2(X, a, b)       SELECT_FROM_5(X, a, b, 0, 0, 0)

The SET_CHILD macro is used to define up to 15 child nodes of a given root for a n-ary tree of depth 5 with a single root node, named TOP. This is encoded within a single number of type word which is adequate for most embedded compilers. For 32 or 64 bit processors one can either support more child nodes or a deeper tree.

SET_CHILD is assigning a name to n-th child of a given parent. One starts with TOP as the parent of all nodes and recurses down:

#define A SET_CHILD(1, TOP)
#define B SET_CHILD(2, TOP)
...
#define A1 SET_CHILD(1, A)
#define A2 SET_CHILD(2, A)
...
#define B1 SET_CHILD(1, B)
#define B2 SET_CHILD(2, B)
...
#define A11 SET_CHILD(1, A1)
#define A12 SET_CHILD(2, A1)
...
#define A21 SET_CHILD(1, A2)
#define A22 SET_CHILD(2, A2)
...

By construction no more than 15 child nodes for a given parent are permitted. If more are used, macros like IS_CHILD will fail to work correctly.

Once a tree is created with the appropriate nodes, one can use IS_CHILD to check for child/parent relationships. The tree is constructed s.t. IS_CHILD(A, B) returns 1 iff A is a direct child of B or a grandchild of B etc. otherwise 0. So IS_CHILD(A22, A) evaluates to 1 just like IS_CHILD(A22, A2) or IS_CHILD(A22, TOP) but IS_CHILD(A22, A1) is 0.

The C preprocessor doesn’t support overloading and the flavors I checked didn’t support varargs wich wouldn’t probably be much helpful in this case either. So I defined a group of 5 SELECT_FROM_xx macros being distinguished only be the number of arguments. The number 5 isn’t magic and one can extend the range of SELECT_FROM_xx macros by need.

How is SELECT_FROM_xx used? The first argument X is an arbitary node of the tree. If one of the susequent nodes a, b, … c is identical with X, X will be the value of SELECT_FROM_xx(X, a, b, …, c). Otherwise the most-direct-parent of X among the nodes a, …c will be returned. If none of them is a parent of X the return value is TOP.

Example:

If we set

#define X A22

then we get

SELECT_FROM_2(X, A, B)        // = A
SELECT_FROM_3(X, A, B, A1)    // = A
SELECT_FROM_3(X, A, B, A2)    // = A2
SELECT_FROM_3(X, A2, B, A)    // = A2
SELECT_FROM_3(X, A2, A, A22)  // = A2
SELECT_FROM_2(X, A1, B)       // = TOP

Inheritance

With the definitions above we can influence conditional compilation:

#if SELECT_FROM_3(X,A2,A,B) == A2
        const int a = 0;
#else if SELECT_FROM_3(X,A2,A,B) == A
        const int a = 1;
#else if SELECT_FROM_3(X,A2,A,B) == B
        const int a = 2;
#else
        const int a = -1;
#endif

The virtue of the construction lies in its robustness. Suppose X is A22 then the first branch is selected but this remains true also if we build a “subclass” A22k , k = 1, …, 9, A, …, F of A22 and assign e.g.

#define X A225

So if we use conditional compilation for a given System S and create a subsystem T of S e.g. a new version of S, we have to adapt our C code only in places where T differs explicitely from S. This robustness is also the major virtue of using inheritance / polymorphism in OOP. It has led to disrespect of using case-statements in OOP since those do not exploit polymorphism and cause in turn less robust code. We see that case- or if-else statements can be confined with the very same idea and robustness even on the level of the C preprocessor. The additional charme of using the C preprocessor is that child/parent relationships are computed at compile time and do not cause any runtime performance penalty.

Restricted backmatching

Posted in TBP on March 13th, 2010 by kay – Be the first to comment

In practice we often encounter situations when our preferred approach to problem solving breaks down. Just look at the recent Google implementation of  a regexp engine RE2, created by Russ Cox, the guy who has written a revival paper for Thompson NFAs  a while ago with a few follow-ups which build on those ideas.  Now once again backmatching is greyed from the feature matrix which means: no implementation. The project page intro states:

The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.

When one takes a closer look at backmatching one starts to wonder what the restrictions to backmatching in trace based parsing ( TBP ) approaches to pattern matching really are?  One can ask the question also with a more positive intent: what cases of backmatching are compliant with TBP? We’ll find there are quite a lot. Some are obviously falling apart like exotic applications of regexps for solving NP-complete problems like 3-SAT – but could it be in the end that only esoteric applications of backmatching are excluded from TBP? The reader might be the final judge.

General backmatching

When we think about backmatching in regular expressions we might have expressions like this

 "... (P) ... \1"

in mind where (P) defines a simple pattern and \1 refers to the value of the matched pattern. So there is a functional relationship between (P) and \1.

Actually this perspective is a little simplistic in the general case. Consider the following simple regexp:

([ab]*)b*\1

Here the match of  ([ab]*) depends on what \1 will match but it is also highly ambiguous. If we match the following string

s = “bb”

the first b can be matched by ([ab]*) and the last “b” by \1 but the whole string can also be matched by b*.

Here is another more complicated example

(([ab]*)b*\2)*[ab]*\1b*

It stems from Thomas Lord with whom I discussed this topic on LtU and who corrected my initial naive assumptions about backmatching. Not only depends the match of ([ab]*) on the match of \1 but also on the match of \2 which depends on the match of \1 as well. Of course \1 depends on both of the matches of ([ab]*) and (([ab]*)b*\2). It’s all tangled.

General backmatching like the one above can be used to solve NP-complete problems which exploits these tanglements and find resolutions. See this article for more examples. With exponential time backtracking algorithms built into regexp engines one gets solutions for free or by magic. In some sense this is cool but if you’d write a requirements document for a regexp engine would you demand that it shall solve 3-SAT problems? If you say “yes”, show me the real world application which uses this power.

Functional backmatching

If we restrict backmatching to simple functional relations between (P) and \1 we can still express a broad range of practically relevant use cases. Here we give an approach to formalize those restrictions which can be checked by a regexp compiler.

In an expression

    ... (P) ... \1

the back-reference \1 can be separated from P when following holds:

  1. P doesn’t contain back-references which means it is self contained.

  2. It is possible to write the expression in the way

    ... L(P)R ... \1

where L and R are left and right delimiters of P which means P has no characters in common with L and R. L can be empty when (P) is at the start of an expression.

The first condition can be checked syntactically. The second condition can be expressed using the following two equations on sets

2.1 LAST-SET(L)  /\ FIRST-SET(P) = {}
2.2 LAST-SET(P)  /\ FIRST-SET(R) = {}

If additionally following condition is true

2.3 FIRST-SET(P) /\ LAST-SET(P) = {}

R might be empty and an expression

    ... L(P)\1 ...

is permitted.

End Cycles

The current conditions are still to restrictive. For example regexp (a)\1 violates condition (2.3) but shall be permitted. What we really want to exclude is that \1 is adjecent to what I call a non empty endcycle.

An endcycle of P has the following definition:

END-CYCLE(P) = FOLLOW-SET( LAST-SET(P) )

Take for example the regexpP = (a*|b|c). Here LAST-SET(P) = {a, b, c} and  FOLLOW-SET({a,b,c}) = {a} which means that a is in the endcycle of P.

With endcycles in mind we can weaken the conditions of (2) considerably:

If P has no endcycle i.e.

    END-CYCLE(P) = {}

we permit

    ... L(P)\1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}

If on the other hand

    END-CYCLE(P) != {}

we permit

    ... L(P)R ... \1 ...

if the following holds:

    END-CYCLE(L) /\ FIRST-SET(P) = {}
    END-CYCLE(P) /\ FIRST-SET(R) = {}

Final  Remarks

No matter how the conditions are defined it has to be granted that matching (P) is terminated before backmatching. If this isn’t checked statically during regexp compilation one can still defer checks until runtime. Much like any other dynamic check it is less clear what will happen to an expression but there isn’t much mental overhead and the implementation is kept simpler.