With Python’s string formatting operator, you can do:
>>> format = "I'd like to eat %u %s of spam with %u eggs"
>>> data = (3, "pounds", 10)
>>> text = format % data
>>> text
"I'd like to eat 3 pounds of spam with 10 eggs"
But how do we perform the reverse operation: given text and format, how do we recover data?
The idea is to analyse format and replace each % directive with the
corresponding regexp. For instance, %u is replaced with [0-9]+, which
matches an unsigned integer. Then, for each such replacement, we must
do two more things: first, wrap the regexp in a named group so that
the matched text can be retrieved; second, record the type of the
expected data (e.g., int for %u) so that it can be converted properly.
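To make the idea concrete, here is a minimal hand-written sketch for a single format string, before any general parser is involved. The group names Grp0 to Grp2 and the types list are illustrative choices, not a fixed convention.

```python
import re

# Hand translation of "I'd like to eat %u %s of spam with %u eggs":
# each directive becomes a named group, and we remember the converter
# to apply to the text captured by that group.
regexp = re.compile(
    r"^I'd like to eat (?P<Grp0>[0-9]+) (?P<Grp1>.*?)"
    r" of spam with (?P<Grp2>[0-9]+) eggs$"
)
types = [int, str, int]

text = "I'd like to eat 3 pounds of spam with 10 eggs"
match = regexp.match(text)
# Convert each captured piece of text with its recorded type.
data = tuple(t(match.group("Grp%s" % i)) for i, t in enumerate(types))
print(data)  # (3, 'pounds', 10)
```

A general implementation only has to produce regexp and types automatically from the format string.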
Here is a draft implementation:
import re

# two helper functions

def unrepr (text) :
    "reverse of repr"
    # warning: eval is unsafe on untrusted input
    return eval(text)

def hexint (text) :
    "converts 0x... hexa string to integer"
    return int(text, 16)

# main class

class Parser :
    def __call__ (self, data, pattern) :
        regexp, types = self.compile(pattern)
        match = regexp.match(data)
        if not match :
            raise ValueError("cannot match pattern %r on %r" % (pattern, data))
        # text matched by each named group, in directive order
        groups = [match.group("Grp%s" % i) for i in range(len(types))]
        # convert each piece of text with its recorded type
        return tuple(t(g) for t, g in zip(types, groups))
    def compile (self, pattern) :
        chars = []  # regexp fragments
        types = []  # one converter per directive
        i = 0
        while i < len(pattern) :
            if pattern[i] != "%" :
                # literal text is copied as is, so regexp syntax is allowed
                chars.append(pattern[i])
                i += 1
            elif i + 1 == len(pattern) :
                raise ValueError("incomplete format at end of %r" % pattern)
            elif pattern[i+1] == "%" :
                chars.append("%")
                i += 2
            elif pattern[i+1] == "s" :
                chars.append("(?P<Grp%s>.*?)" % len(types))
                types.append(str)
                i += 2
            elif pattern[i+1] == "r" :
                # a quoted string closed by the same quote that opened it
                chars.append("(?P<Grp%s>(?P<Par%s>['\"]).*?(?P=Par%s))"
                             % (len(types), len(types), len(types)))
                types.append(unrepr)
                i += 2
            elif pattern[i+1] == "c" :
                chars.append("(?P<Grp%s>.)" % len(types))
                types.append(str)
                i += 2
            elif pattern[i+1] == "u" :
                chars.append("(?P<Grp%s>[0-9]+)" % len(types))
                types.append(int)
                i += 2
            elif pattern[i+1] in "di" :
                chars.append("(?P<Grp%s>[+-]?[0-9]+)" % len(types))
                types.append(int)
                i += 2
            elif pattern[i+1] in "fF" :
                # at least one digit after the dot, so float() cannot get "."
                chars.append(r"(?P<Grp%s>[+-]?[0-9]*\.[0-9]+)" % len(types))
                types.append(float)
                i += 2
            elif pattern[i+1] in "xX" :
                # the 0x prefix is optional: "%x" % 255 gives "ff"
                chars.append("(?P<Grp%s>[+-]?(?:0[xX])?[0-9A-Fa-f]+)" % len(types))
                types.append(hexint)
                i += 2
            else :
                raise ValueError("unsupported format '%%%s'" % pattern[i+1])
        return re.compile("^" + "".join(chars) + "$"), types

Assuming the code above is saved as unformat.py, we can now write:
>>> import unformat
>>> p = unformat.Parser()
>>> p("I'd like to eat 3 pounds of spam with 10 eggs",
... "I'd like to eat %u %s of spam with %u eggs")
(3, 'pounds', 10)
Of course, since the format is compiled to a regexp, the format string
can be more general. For instance, we may use .* instead of spam and eggs:
>>> import unformat
>>> p = unformat.Parser()
>>> p("I'd like to eat 3 pounds of spam with 10 eggs",
... "I'd like to eat %u %s of .* with %u .*")
(3, 'pounds', 10)
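The flip side of this generality is that literal text in the format string is also read as a regexp, so characters such as (, ? or . change meaning. The sketch below shows the problem and one possible workaround, escaping literal text with re.escape. The patterns are written by hand and the variable names are illustrative only.

```python
import re

text = "Total (net): 42"

# Literal parentheses copied verbatim are read as a regexp group,
# so this pattern does not match the text.
naive = re.compile("^Total (net): (?P<Grp0>[0-9]+)$")
print(naive.match(text))  # None

# Escaping the literal part with re.escape restores its literal meaning.
escaped = re.compile("^" + re.escape("Total (net): ") + "(?P<Grp0>[0-9]+)$")
print(int(escaped.match(text).group("Grp0")))  # 42
```

Escaping literal text gives up the .* trick shown above, so a parser has to choose one behaviour or offer both as an option.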
Possible improvements include:
- more general and precise regexps (e.g., those for %r and %s are not correct)
- support for all formatting directives
- support for named directives (like %(name)s)
- caching of compiled strings
- avoiding eval in unrepr because it may be unsafe with foreign data (use ast.literal_eval instead)
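Two of these improvements can be sketched briefly. The code below assumes Python 3 (for functools.lru_cache); compile_pattern is a hypothetical stand-alone helper that handles only %u and %r to stay short, and is not part of the module above.

```python
import ast
import functools
import re


def unrepr(text):
    "reverse of repr, restricted to Python literals"
    # ast.literal_eval accepts only literals (strings, numbers, tuples...),
    # so foreign data cannot trigger arbitrary code execution.
    return ast.literal_eval(text)


@functools.lru_cache(maxsize=128)
def compile_pattern(pattern):
    "hypothetical cached helper; supports only %u and %r"
    chars, types = [], []
    i = 0
    while i < len(pattern):
        if pattern.startswith("%u", i):
            chars.append("(?P<Grp%s>[0-9]+)" % len(types))
            types.append(int)
            i += 2
        elif pattern.startswith("%r", i):
            n = len(types)
            chars.append("(?P<Grp%s>(?P<Par%s>['\"]).*?(?P=Par%s))" % (n, n, n))
            types.append(unrepr)
            i += 2
        else:
            chars.append(pattern[i])
            i += 1
    # Return a tuple so the cached value cannot be mutated by callers.
    return re.compile("^" + "".join(chars) + "$"), tuple(types)


regexp, types = compile_pattern("%u copies of %r")
match = regexp.match("2 copies of 'spam'")
result = tuple(t(match.group("Grp%s" % i)) for i, t in enumerate(types))
print(result)  # (2, 'spam')
```

Calling compile_pattern again with the same format string returns the cached pair instead of rebuilding the regexp, and unrepr raises ValueError on anything that is not a plain literal.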