Franck Pommereau - Blog - From string formatting to parsing

With Python’s string formatting operator, you can do:

>>> format = "I'd like to eat %u %s of spam with %u eggs"
>>> data = (3, "pounds", 10)
>>> text = format % data
>>> text
"I'd like to eat 3 pounds of spam with 10 eggs"

But how can we perform the reverse operation: given text and format, how do we recover data?

The idea is to analyse format and replace each % directive with a corresponding regexp. For instance, %u is replaced with [0-9]+, which matches an unsigned integer. For each such replacement, we must do two more things: first, wrap the regexp in a named group so that the matched text can be retrieved; second, record the expected type of the data (e.g., int for %u) so that the matched text can be converted properly.
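To make the idea concrete, here is a minimal sketch for a single %u directive, written by hand rather than generated. The group name Grp0 and the variable names are only illustrative choices, and the snippet assumes Python 3.

```python
import re

# "%u eggs" with %u replaced by a named group matching an unsigned integer
regexp = re.compile(r"^(?P<Grp0>[0-9]+) eggs$")
types = [int]  # converter recorded for Grp0

match = regexp.match("10 eggs")
# retrieve the matched text through the group name, then convert it
value = types[0](match.group("Grp0"))
```

The named group gives access to the matched text, and the recorded type turns that text back into the original value.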

Here is a draft implementation:

import re

# two helper functions

def unrepr (text) :
    "reverse of repr"
    return eval(text)

def hexint (text) :
    "converts a hexadecimal string, with or without 0x prefix, to integer"
    return int(text, 16)

# main class

class Parser :
    def __call__ (self, data, pattern) :
        regexp, types = self.compile(pattern)
        match = regexp.match(data)
        if not match :
            raise ValueError("cannot match pattern %r on %r" % (pattern, data))
        groups = [data[match.start("Grp%s" % i):match.end("Grp%s" % i)]
                  for i in range(len(types))]
        return tuple(t(g) for t, g in zip(types, groups))
    def compile (self, pattern) :
        chars = []
        types = []
        i = 0
        while i < len(pattern) :
            if pattern[i] != "%" :
                chars.append(pattern[i])
                i += 1
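            elif i + 1 == len(pattern) :
                # a lone "%" at the end of the pattern has no directive
                raise ValueError("incomplete format at end of pattern")
            elif pattern[i+1] == "%" :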
                chars.append("%")
                i += 2
            elif pattern[i+1] == "s" :
                chars.append("(?P<Grp%s>.*?)" % len(types))
                types.append(str)
                i += 2
            elif pattern[i+1] == "r" :
                chars.append("(?P<Grp%s>(?P<Par%s>['\"]).*?(?P=Par%s))"
                             % (len(types), len(types), len(types)))
                types.append(unrepr)
                i += 2
            elif pattern[i+1] == "c" :
                chars.append("(?P<Grp%s>.)" % len(types))
                types.append(str)
                i += 2
            elif pattern[i+1] == "u" :
                chars.append("(?P<Grp%s>[0-9]+)" % len(types))
                types.append(int)
                i += 2
            elif pattern[i+1] in "di" :
                chars.append("(?P<Grp%s>[+-]?[0-9]+)" % len(types))
                types.append(int)
                i += 2
            elif pattern[i+1] in "fF" :
                chars.append(r"(?P<Grp%s>[+-]?[0-9]*\.[0-9]*)" % len(types))
                types.append(float)
                i += 2
            elif pattern[i+1] in "xX" :
                # %x output has no 0x prefix, so the prefix must be optional
                chars.append("(?P<Grp%s>(?:0[xX])?[0-9A-Fa-f]+)" % len(types))
                types.append(hexint)
                i += 2
            else :
                raise ValueError("unsupported format '%%%s'" % pattern[i+1])
        return re.compile("^" + "".join(chars) + "$"), types

With this code saved as a module named unformat, we can now write:

>>> import unformat
>>> p = unformat.Parser()
>>> p("I'd like to eat 3 pounds of spam with 10 eggs",
...   "I'd like to eat %u %s of spam with %u eggs")
(3, 'pounds', 10)

Of course, since the format is turned into a regexp and its literal text is not escaped, the format string can itself contain regexp syntax. For instance, we may use .* instead of spam and eggs:

>>> import unformat
>>> p = unformat.Parser()
>>> p("I'd like to eat 3 pounds of spam with 10 eggs",
...   "I'd like to eat %u %s of .* with %u .*")
(3, 'pounds', 10)
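The flip side is that regexp metacharacters appearing in the literal part of a format are interpreted too. The sketch below, assuming Python 3, builds by hand the regexp that the format "%u eggs (large)" would yield, with and without escaping the literal text through re.escape. The variable names are illustrative only.

```python
import re

text = "10 eggs (large)"

# literal text copied as is: "(large)" is read as a group, not as parentheses
unescaped = re.compile("^(?P<Grp0>[0-9]+) eggs (large)$")

# literal text escaped: the parentheses now match themselves
escaped = re.compile("^(?P<Grp0>[0-9]+)" + re.escape(" eggs (large)") + "$")

unescaped_matches = unescaped.match(text) is not None
escaped_matches = escaped.match(text) is not None
```

Here only the escaped version matches the text. Escaping literal text would make the parser stricter, at the price of losing the .* trick shown above.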

Possible improvements include:

more general and precise regexps (e.g., those for %r and %s are not correct)
support for all formatting directives
support for named directives (like %(name)s)
caching of compiled strings
avoid eval in unrepr because it may be unsafe with foreign data (use ast.literal_eval instead)
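
Two of these improvements can be sketched briefly, assuming Python 3: unrepr based on ast.literal_eval, and caching of compiled formats with functools.lru_cache. The names DIRECTIVES, compile_format and parse are hypothetical, only a subset of directives is handled, and the %r regexp is deliberately simplistic (single-quoted strings without escapes).

```python
import ast
import re
from functools import lru_cache

def unrepr(text):
    "reverse of repr, restricted to Python literals"
    return ast.literal_eval(text)

# directive -> (regexp, converter); hypothetical table covering a subset only
DIRECTIVES = {
    "s": (".*?", str),
    "u": ("[0-9]+", int),
    "r": ("'[^']*'", unrepr),
}

@lru_cache(maxsize=None)
def compile_format(pattern):
    "compile a format string once; later calls reuse the cached result"
    chars, types = [], []
    i = 0
    while i < len(pattern):
        if pattern[i] != "%":
            chars.append(pattern[i])
            i += 1
            continue
        if i + 1 == len(pattern):
            raise ValueError("incomplete format at end of pattern")
        code = pattern[i + 1]
        if code == "%":
            chars.append("%")
        elif code in DIRECTIVES:
            regexp, convert = DIRECTIVES[code]
            chars.append("(?P<Grp%s>%s)" % (len(types), regexp))
            types.append(convert)
        else:
            raise ValueError("unsupported format '%%%s'" % code)
        i += 2
    return re.compile("^" + "".join(chars) + "$"), tuple(types)

def parse(text, pattern):
    "extract the data from text according to pattern"
    regexp, types = compile_format(pattern)
    match = regexp.match(text)
    if not match:
        raise ValueError("cannot match pattern %r on %r" % (pattern, text))
    return tuple(t(match.group("Grp%s" % n)) for n, t in enumerate(types))
```

With this sketch, parse("3 'pounds'", "%u %r") returns (3, 'pounds'), and a second call with the same format reuses the compiled regexp. Unlike eval, ast.literal_eval raises ValueError on input such as "__import__('os')" instead of executing it.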