Some time ago, I was impressed by this video about Parsley, which demonstrates a very convincing parsing tool.
Yesterday, I needed to parse some BibTeX. After testing several packages without being really convinced, I decided to give Parsley a try.
I actually only need to parse a restricted version of BibTeX: in particular, all items must be enclosed in braces `{...}`, whereas full BibTeX also allows `"..."` or no delimiter at all. We need a few rules:
- `text` matches a text with no braces inside. `anything:x ?(x not in '{}') -> x` matches any character not in `'{}'`, and the rule matches a list of such `x` collected in `data`. So we concatenate them in `-> "".join(data)`.
- `string` matches a text with possibly nested braces. Notice how `string:s -> '{%s}' % s` restores the braces when a `string` is matched inside another `string`.
- `value` matches a `string` followed by a comma [1].
- `pair` matches a key/value like `author = {...},`.
- `item` matches the pairs inside a BibTeX reference.
- `entry` adds the type (e.g., `@Article`) and the reference key.
- Finally, `biblio` matches the whole content of a bib file.
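
To make the recursion in `string` more concrete, here is a hand-written Python sketch of what the `text` and `string` rules do. It is not Parsley code and does not claim to reflect how Parsley works internally. The function names `parse_text` and `parse_string` are invented for this illustration.

```python
def parse_text(src, pos):
    """Mirror of the `text` rule: one or more characters other than braces."""
    end = pos
    while end < len(src) and src[end] not in "{}":
        end += 1
    if end == pos:
        raise ValueError("expected text at position %d" % pos)
    return src[pos:end], end


def parse_string(src, pos):
    """Mirror of the `string` rule: '{' (text | string)+ '}'."""
    if pos >= len(src) or src[pos] != "{":
        raise ValueError("expected '{' at position %d" % pos)
    pos += 1
    parts = []
    while pos < len(src) and src[pos] != "}":
        if src[pos] == "{":
            # Nested string: the recursive call strips its braces,
            # so put them back, like '{%s}' % s in the grammar.
            inner, pos = parse_string(src, pos)
            parts.append("{%s}" % inner)
        else:
            chunk, pos = parse_text(src, pos)
            parts.append(chunk)
    if pos >= len(src):
        raise ValueError("missing closing '}'")
    if not parts:
        # The grammar uses '+', so an empty pair of braces is rejected.
        raise ValueError("empty braces at position %d" % pos)
    return "".join(parts), pos + 1


value, end = parse_string("{A {Nested {deeply}} title}", 0)
print(value)  # A {Nested {deeply}} title
```

Only the outermost braces are dropped. Every inner pair is restored on the way back up, which is the job of `'{%s}' % s` in the grammar.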
All together, this yields the following code:
import parsley

parser = parsley.makeGrammar(r"""
text = (anything:x ?(x not in '{}') -> x)+:data
    -> "".join(data)
string = '{' (text|(string:s -> '{%s}' % s))+:data '}'
    -> "".join(data)
value = string:data ','
    -> data
pair = ws (letter+):key ws '=' ws value:val ws
    -> "".join(key), val
item = pair:first pair*:rest
    -> [first] + rest
entry = ws '@' (letter+):kind ws '{'
    (anything:x ?(x not in ' \t\n\r{,}') -> x)+:key ','
    item:content '}' ws
    -> [('type', "".join(kind)), ('key', "".join(key))] + content
biblio = ws (entry:e ws -> e)*:items
    -> [dict(i) for i in items]
""", {})
And that’s it. By running `parser(bibdata).biblio()` I get my bib file turned into a `list` of `dict`. Moreover, not only does Parsley make it easy to build a parser, it also gives really helpful error messages on parsing errors, which is usually not the case for most parser generators.
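
To illustrate the shape of that result, here is a small sketch that works on a hand-written sample of the kind of list the `biblio` rule builds. The two entries and the helper `index_by_key` are made up for illustration. They are not output copied from a real run.

```python
# Hand-written sample mimicking the result shape: one dict per entry, with
# 'type' and 'key' added by the `entry` rule, plus one item per field.
entries = [
    {"type": "Article", "key": "doe2015", "author": "Jane Doe",
     "title": "A {Nested} title", "year": "2015"},
    {"type": "Book", "key": "roe2012", "author": "Richard Roe",
     "title": "Another title", "year": "2012"},
]


def index_by_key(items):
    """Map each citation key to its entry (hypothetical helper)."""
    return {entry["key"]: entry for entry in items}


bib = index_by_key(entries)
print(bib["doe2015"]["title"])  # A {Nested} title
```

One thing to keep in mind with this layout: `type` and `key` share the same dict as the regular fields, so an entry that has a field literally named `type` or `key` would overwrite them when `dict(i)` is built, since later pairs win.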
[1] My only disappointment is that Parsley could not handle a better rule for `item`: when I write `item = pair:first (',' pair)*:rest ','?` and drop the rule `value`, using `string` instead, Parsley complains about incorrect syntax at parse time. Maybe I should really read the doc…