{"id":322,"date":"2009-04-16T12:09:56","date_gmt":"2009-04-16T11:09:56","guid":{"rendered":"http:\/\/fiber-space.de\/wordpress\/?p=322"},"modified":"2009-04-16T12:09:56","modified_gmt":"2009-04-16T11:09:56","slug":"universal-set-in-python","status":"publish","type":"post","link":"http:\/\/fiber-space.de\/wordpress\/2009\/04\/16\/universal-set-in-python\/","title":{"rendered":"Universal Set in Python"},"content":{"rendered":"<h3>Building sets from set() and ANY<\/h3>\n<p\/>\nWhen defining a regular expression there is a well known duality between a <em>set of characters<\/em> specified by enclosing characters or character ranges within square brackets `[&#8230;]` and defining the pattern of the complement set `[^&#8230;]` by preceding the caret symbol ^ .<\/p>\n<ul>\n<li>[0-9]       &#8212; matches all digits<\/li>\n<li>[^0-9]    &#8212; matches all non-digits<\/li>\n<\/ul>\n<p>The nature of the complement set mentioned here is less precisely defined and is implementation dependent in practice. If a regexp matcher knows about `ASCII` only then `[^0-9]` is the set of all `ASCII` characters which are not digits.<\/p>\n<p>Suppose now we want to build a character matching set from sets in the sense of a `set` data-type as inhabited in Python as a builtin. This is easy because we start from the empty `set()` and add just more elements. Each finite set can be constructed in this way and we can establish a set ~ pattern correspondence:<\/p>\n<p>`set(range(9))  ~  [0-9]`<\/p>\n<p>But how do we find a corresponding set for `[^0-9]`?<\/p>\n<p>This cannot be finitely constructed from the bottom up using `set()` as a start value but it has to be finitely de-constructed from the <em>set of all elements<\/em>. This looks like we are running into unlimited comprehension and the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Russell's_paradox\">Russel paradox<\/a> but actually the set theory we are using is quite weak and element membership can be decided just by doing hashtable lookups.<\/p>\n<p>Instead of constructing a set from `set()` we want to (de-)construct it from `ANY`. The name `ANY` is not chosen by chance because we want to express the correspondence with `.` pattern:<\/p>\n<p>`ANY  ~  .`<\/p>\n<p>( Actually `ANY` is a little stronger because `.` doesn&#8217;t match newline characters unless explicitly specified by the `DOTALL` flag in the `re.compile` function. )<\/p>\n<p>Using `ANY` we can easily state the above correspondence:<\/p>\n<p>`ANY  &#8211; set(range(9)) ~  [^0-9]`<\/p>\n<h3>Implementation of ANY<\/h3>\n<p\/>\nWe define `ANY` as an instance of `univset`.  The `univset` has a set-like interface and implements set operations consistently on sets of the type `ANY-A`, `ANY-B`,&#8230; where `A`, `B`&#8230; are finite sets constructed from `set()`. For example<\/p>\n<p>`ANY-A \u2229 ANY-B` = `ANY &#8211; ( A \u222a B )`<br \/>\n`ANY-A \u222a B` = `ANY &#8211; ( A &#8211; B )`<\/p>\n<p\/>\n<pre lang=\"python\">class univset(object):\r\n    def __init__(self):\r\n        self._diff = set()\r\n\r\n    def __sub__(self, other):\r\n        S = univset()\r\n        if type(other) == set:\r\n            S._diff = self._diff | other\r\n            return S\r\n        else:\r\n            S._diff = self._diff | other._diff\r\n            return S\r\n\r\n    def __rsub__(self, other):\r\n        return other &amp; self._diff\r\n\r\n    def __contains__(self, obj):\r\n        return not obj in self._diff\r\n\r\n    def __and__(self, other):\r\n        return other - self._diff\r\n\r\n    def __rand__(self, other):\r\n        return other - self._diff\r\n\r\n    def __repr__(self):\r\n        if self._diff == set():\r\n            return \"ANY\"\r\n        else:\r\n            return \"ANY - %s\"%self._diff\r\n\r\n    def __or__(self, other):\r\n        S = univset()\r\n        S._diff = self._diff - other\r\n        return S\r\n\r\n    def __xor__(self, other):\r\n        return (self - other) | (other - self)\r\n\r\n    def add(self, elem):\r\n        if elem in self._diff:\r\n            self._diff.remove(elem)\r\n\r\n    def update(self, elem):\r\n        self._diff = self._diff - other\r\n\r\n    def __ror__(self, other):\r\n        return self.__or__(other)\r\n\r\n    def union(self, other):\r\n        return self.__or__(other)\r\n\r\n    def difference(self, other):\r\n        return self.__sub__(other)\r\n\r\n    def intersection(self, other):\r\n        return self.__and__(other)\r\n\r\n    def symmetric_difference(self, other):\r\n        return self.__xor__(other)\r\n\r\n    def issubset(self, other):\r\n        if type(other) == set:\r\n            return False\r\n        if issubset(other._diff, self._diff):\r\n            return True\r\n        return False\r\n\r\n    def issuperset(self, other):\r\n        if self._diff &amp; other:\r\n            return False\r\n        return True\r\n\r\n    def __lt__(self, other):\r\n        return self.issubset(other)\r\n\r\n    def __eq__(self, other):\r\n        if type(other) == set:\r\n            return False\r\n        try:\r\n            return self._diff == other._diff\r\n        except AttributeError:\r\n            return False\r\n\r\n    def __ne__(self, other):\r\n        return not self.__eq__(other)\r\n\r\n    def __le__(self, other):\r\n        return self.__lt__(other) or self.__eq__(other)\r\n\r\n    def __gt__(self, other):\r\n        return self.issuperset(other)\r\n\r\n    def __gt__(self, other):\r\n        return self.issuperset(other) or self == other<\/pre>\n<p><strong>Examples<\/strong><\/p>\n<pre lang=\"python\">ANY = univset()\r\nNON_DIGITS = ANY - set(range(9))\r\nassert 8 not in NON_DIGITS\r\nassert 'a' in NON_DIGITS\r\nassert 0 in NON_DIGITS | set([0])<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Building sets from set() and ANY When defining a regular expression there is a well known duality between a set of characters specified by enclosing characters or character ranges within square brackets `[&#8230;]` and defining the pattern of the complement &hellip; <a href=\"http:\/\/fiber-space.de\/wordpress\/2009\/04\/16\/universal-set-in-python\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[18,4],"tags":[17],"_links":{"self":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/322"}],"collection":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/comments?post=322"}],"version-history":[{"count":10,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/322\/revisions"}],"predecessor-version":[{"id":332,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/posts\/322\/revisions\/332"}],"wp:attachment":[{"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/media?parent=322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/categories?post=322"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/fiber-space.de\/wordpress\/wp-json\/wp\/v2\/tags?post=322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}