Building sets from set() and ANY
When defining a regular expression there is a well known duality between a set of characters specified by enclosing characters or character ranges within square brackets `[…]` and defining the pattern of the complement set `[^…]` by preceding the caret symbol ^ .- [0-9] — matches all digits
- [^0-9] — matches all non-digits
The nature of the complement set mentioned here is less precisely defined and is implementation dependent in practice. If a regexp matcher knows about `ASCII` only then `[^0-9]` is the set of all `ASCII` characters which are not digits.
Suppose now we want to build a character matching set from sets in the sense of a `set` data-type as inhabited in Python as a builtin. This is easy because we start from the empty `set()` and add just more elements. Each finite set can be constructed in this way and we can establish a set ~ pattern correspondence:
`set(range(9)) ~ [0-9]`
But how do we find a corresponding set for `[^0-9]`?
This cannot be finitely constructed from the bottom up using `set()` as a start value but it has to be finitely de-constructed from the set of all elements. This looks like we are running into unlimited comprehension and the Russel paradox but actually the set theory we are using is quite weak and element membership can be decided just by doing hashtable lookups.
Instead of constructing a set from `set()` we want to (de-)construct it from `ANY`. The name `ANY` is not chosen by chance because we want to express the correspondence with `.` pattern:
`ANY ~ .`
( Actually `ANY` is a little stronger because `.` doesn’t match newline characters unless explicitly specified by the `DOTALL` flag in the `re.compile` function. )
Using `ANY` we can easily state the above correspondence:
`ANY – set(range(9)) ~ [^0-9]`
Implementation of ANY
We define `ANY` as an instance of `univset`. The `univset` has a set-like interface and implements set operations consistently on sets of the type `ANY-A`, `ANY-B`,… where `A`, `B`… are finite sets constructed from `set()`. For example`ANY-A ∩ ANY-B` = `ANY – ( A ∪ B )`
`ANY-A ∪ B` = `ANY – ( A – B )`
class univset(object): def __init__(self): self._diff = set() def __sub__(self, other): S = univset() if type(other) == set: S._diff = self._diff | other return S else: S._diff = self._diff | other._diff return S def __rsub__(self, other): return other & self._diff def __contains__(self, obj): return not obj in self._diff def __and__(self, other): return other - self._diff def __rand__(self, other): return other - self._diff def __repr__(self): if self._diff == set(): return "ANY" else: return "ANY - %s"%self._diff def __or__(self, other): S = univset() S._diff = self._diff - other return S def __xor__(self, other): return (self - other) | (other - self) def add(self, elem): if elem in self._diff: self._diff.remove(elem) def update(self, elem): self._diff = self._diff - other def __ror__(self, other): return self.__or__(other) def union(self, other): return self.__or__(other) def difference(self, other): return self.__sub__(other) def intersection(self, other): return self.__and__(other) def symmetric_difference(self, other): return self.__xor__(other) def issubset(self, other): if type(other) == set: return False if issubset(other._diff, self._diff): return True return False def issuperset(self, other): if self._diff & other: return False return True def __lt__(self, other): return self.issubset(other) def __eq__(self, other): if type(other) == set: return False try: return self._diff == other._diff except AttributeError: return False def __ne__(self, other): return not self.__eq__(other) def __le__(self, other): return self.__lt__(other) or self.__eq__(other) def __gt__(self, other): return self.issuperset(other) def __gt__(self, other): return self.issuperset(other) or self == other |
Examples
ANY = univset() NON_DIGITS = ANY - set(range(9)) assert 8 not in NON_DIGITS assert 'a' in NON_DIGITS assert 0 in NON_DIGITS | set([0]) |
Nice, but your class is missing the iter method. 😀
Very cool exercise but I don’t see why you can’t simply have the negation of the original statement in the body of the code rather than making an entire class that pretends to be the compliment of the set. Good idea but I fail to see the practicality.
@mynameisfiber: Why bother having a False value if you can always represent False as (not True)? Sometimes it’s nice to have a special object that represents exactly what you need to represent.
@mynameisfiber
I didn’t want to motivate ANY and univset with applications in EasyExtend because this is not accessible to the reader.
Just so much. EE doesn’t use Pythons regular expressions for lexical analysis because of ordered choice: if you have a pattern A | B it matches differently than B | A in the general case. This shall be avoided. But in order to avoid this one needs a disambiguation strategy. Suppose A and B are sets and C = A /\ B is the intersection then we can write A | B = C | A-B | B-A for expressions represented by those sets. The latter expression is commutative because all of the sets are disjoint! So making all character sets disjoint by building intersections works as a disambiguation strategy.
This has been implemented in EE for long. What’s missing is a way to express the “set of all characters that are not digits” and that’s what ANY and univset are for in the specific context I needed. They are more general though be construction.
@Nerdbeard
as long as the elements are abstract entities ( e.g. objects of type Element not encapsulating any concrete value ) I don’t see any problem. This is not so interesting though unless we make assertion about those elements but then we are doing logic programming or theorem proving… Not sure where this leads?