We have seen how to use Parsley to parse BibTeX; now we need to parse the LaTeX code inside the BibTeX entries and convert it to something else: plain text or, here, Markdown. Once again, Parsley turns out to be really handy.
Of course, only LaTeX can actually parse full LaTeX [1], so we will build a very limited parser, which should nevertheless be enough to handle the simple LaTeX code found in a BibTeX file that aims to be portable (i.e., it should contain no fancy macros). Note also that we assume the code is correct: we won’t check it, and the generated Markdown may be wrong if the LaTeX code is invalid.
First we need a grammar to match the various tokens of LaTeX source code:
- `text` matches regular text with no white space and no special characters inside. This text is processed using method `text` of object `tex`, which is responsible for emulating LaTeX behaviour (see later)
- `blank` matches white space, including newlines, and is processed by `tex.blank`
- `name` and `macro` match a call to a macro, which can be a name as in `\emph` or a single character as in `\\` or `\'`
- `bgroup` and `egroup` respectively match `{` and `}` while calling the appropriate methods of `tex`
- `arg` matches a macro argument, which can be a group `{...}` or a single character
- `call` matches a call to a macro, together with its arguments. Note how `(-> tex.arity(m)):n` gets the number of arguments expected by `m` and binds it to `n`, which is then used in `arg{n}:a` to collect exactly `n` arguments in `a`. This is really a killer feature of Parsley!
- `comment` matches comments and drops them
- `math` matches `$` and toggles math mode. Note that we won’t parse maths, we will just typeset them in italics
- `data` matches any of the text blocks above
- finally, `doc` matches a full string of LaTeX source code with possibly nested groups
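
To make the role of `arg{n}:a` more concrete, here is a hand-written sketch of the same idea in plain Python, without Parsley: look up how many arguments a macro takes, then read exactly that many. The names `ARITY`, `read_arg` and `read_call` are made up for this illustration, and the arity table is hard-coded here. Unlike the grammar, this sketch does not skip white space after a macro name.

```python
# Hypothetical arity table: macro name -> number of arguments it consumes.
ARITY = {"emph": 1, "href": 2, "it": 0}

def read_arg(src, pos):
    """Read one argument at src[pos]: a {...} group or a single character.
    Returns (argument_text, position_after_argument)."""
    if src[pos] != "{":
        return src[pos], pos + 1
    depth, start = 0, pos + 1
    for i in range(pos, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return src[start:i], i + 1
    raise ValueError("unbalanced group")

def read_call(src, pos=0):
    """Read a macro call starting at the backslash src[pos].
    Returns (name, list_of_arguments, position_after_call)."""
    assert src[pos] == "\\"
    end = pos + 1
    while end < len(src) and src[end].isalnum():
        end += 1
    if end == pos + 1:
        end += 1  # single-character macro such as \' or \\
    name = src[pos + 1:end]
    args = []
    # The number of arguments read depends on the macro just matched.
    for _ in range(ARITY.get(name, 0)):
        arg, end = read_arg(src, end)
        args.append(arg)
    return name, args, end

source = r"\href{http://x.org}{site} tail"
name, args, end = read_call(source)
print(name, args, repr(source[end:]))
```

The hand-written version needs an explicit loop and position bookkeeping, which is exactly what the one-line `call` rule spares us.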
The resulting grammar is as follows:
grammar = r"""
text = (anything:x ?(x not in '{\\\n\t\r %$}') -> x)+:d
       -> tex.text("".join(d))
blank = (' '|'\n'|'\t'|'\r')+:d -> tex.blank("".join(d))
name = letterOrDigit+:d -> "".join(d)
macro = '\\' ((name:n ws -> n)|anything)
bgroup = '{' -> tex.bgroup()
egroup = '}' -> tex.egroup()
arg = ((bgroup !(tex.pushpar()) doc:a !(tex.poppar()) egroup -> a)
       |anything)
call = macro:m (-> tex.arity(m)):n arg{n}:a -> tex.call(m, *a)
comment = '%' (anything:x ?(x not in '\n'))+ '\n' -> ''
math = '$' -> tex.call('math')
data = (text|blank|comment|call|math)+:d -> "".join(d)
doc = (data|(bgroup:b doc:d egroup:e -> b+d+e))+:d -> "".join(d)
"""
Next, we need to build the class for object `tex` used in the grammar. In its constructor, it just compiles the grammar, passing itself as `tex` in the parser’s environment. Method `__call__` does the actual parsing by calling method `doc` of the parser.
# -*- coding: utf-8 -*-
import inspect
import parsley

class LaTeX (object) :
    def __init__ (self) :
        self.parser = parsley.makeGrammar(grammar, {"tex": self})
    def __call__ (self, data) :
        # reset the conversion state before each parse
        self.tags = [[]]
        self.newpar = True
        self.pars = []
        self.envs = []
        return self.parser(data).doc().strip()
The various attributes assigned by `__call__` keep track of the current state while parsing and converting code:

- `tags` is a stack of lists corresponding to the nested groups, which allows closing the open tags when exiting a group. For instance, if we parse `{\it hello world}` we need to close italics at the end of the group. In this respect, our parser differs from LaTeX in that macros like `\it` have cumulative effects: for instance `{\it hello \bf world}` will be rendered as `_hello **world**_`, that is, "hello" in italics followed by "world" in bold italics. To achieve this in LaTeX, we should have used `\itshape` and `\bfseries` instead of `\it` and `\bf`. Tag management is done by the following methods:

        def tag (self, tag) :
            self.tags[-1].append(tag)
        def bgroup (self) :
            self.tags.append([])
            return ""
        def egroup (self) :
            tags = self.tags.pop(-1)
            return "".join(self.call(tag, False) for tag in reversed(tags))

  See how `egroup` pops the tags and calls the appropriate method using `self.call` with a second parameter set to `False`, indicating that we are closing a tag (more below).
- `newpar` and `pars` manage the empty lines between paragraphs and the white space at the beginning of each paragraph. The former is `True` when we are currently beginning a paragraph and the latter is a stack to save/restore this information when we parse a nested `doc` in the grammar. This is the role of `tex.pushpar()` and `tex.poppar()` that we’ve encountered in rule `arg`. This white space management is done by the following methods:

        def text (self, txt) :
            self.newpar = False
            return txt
        def blank (self, txt) :
            if self.newpar :
                return ""
            elif txt.count("\n") > 1 :
                self.newpar = True
                return "\n\n"
            else :
                return " "
        def pushpar (self) :
            self.pars.append(self.newpar)
        def poppar (self) :
            self.newpar = self.pars.pop(-1)

  We see how `text` sets `newpar` to `False` and how `blank` avoids white space at the beginning of paragraphs.
- Finally, `envs` is a stack of environments corresponding to the nesting of `\begin{...}` and `\end{...}` in LaTeX source code.
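
The interplay between the tag stack and the paragraph flag can be tried in isolation. The sketch below is a stripped-down stand-in for the `tex` object: the class `MiniState`, its `MARKERS` table and the method `open_tag` are invented for this illustration (the real class dispatches through `call` instead), and the sequence of calls imitates by hand what the grammar would trigger on `{\it hello \bf world}` preceded by leading spaces and followed by an empty line and more text.

```python
class MiniState:
    """Hypothetical, simplified stand-in for the tex object."""
    MARKERS = {"it": "_", "bf": "**"}  # Markdown marker for each tag

    def __init__(self):
        self.tags = [[]]    # one list of open tags per nested group
        self.newpar = True  # True while at the start of a paragraph

    def open_tag(self, name):
        # Remember the tag in the current group and emit its opening marker.
        self.tags[-1].append(name)
        return self.MARKERS[name]

    def bgroup(self):
        self.tags.append([])
        return ""

    def egroup(self):
        # Close the tags opened in this group, most recent first.
        return "".join(self.MARKERS[t] for t in reversed(self.tags.pop()))

    def text(self, txt):
        self.newpar = False
        return txt

    def blank(self, txt):
        if self.newpar:
            return ""            # drop white space at paragraph start
        if txt.count("\n") > 1:
            self.newpar = True   # an empty line starts a new paragraph
            return "\n\n"
        return " "

s = MiniState()
out = "".join([
    s.blank("  "),                       # leading spaces are dropped
    s.bgroup(), s.open_tag("it"), s.text("hello"), s.blank(" "),
    s.open_tag("bf"), s.text("world"), s.egroup(),
    s.blank("\n\n  "), s.text("next"),   # paragraph break
])
print(repr(out))
```

Both tags are closed in reverse order when the group ends, and the run of white space containing two newlines collapses to a single paragraph separator.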
Then comes macro emulation: `shortcut` maps 1-character macros to names, then `call(MACRO, ...)` dispatches to `call_MACRO(...)`. Finally, `arity` uses module `inspect` to compute how many arguments a method `call_MACRO` expects. Note that we do not count a last argument `opentag=True`, which is used for tags in groups (this avoids having both `open_TAG` and `close_TAG` methods).
    shortcut = {"'" : "acute",
                "`" : "grave",
                '"' : "diaeresis",
                "^" : "circumflex",
                "~" : "tilde",
                "\\" : "newline",
                "$" : "dollar",
                }
    def call (self, name, *args) :
        # map 1-character macros to their names, then dispatch to call_NAME
        name = self.shortcut.get(name, name)
        handler = getattr(self, "call_%s" % name)
        return handler(*args)
    def arity (self, name) :
        name = self.shortcut.get(name, name)
        # getfullargspec replaces getargspec, which was removed in Python 3.11;
        # for a bound method, the returned args still include self
        spec = inspect.getfullargspec(getattr(self, "call_%s" % name))
        a, d = spec.args, spec.defaults
        if a[-1] == "opentag" and d and d[-1] is True :
            return len(a) - 2
        else :
            return len(a) - 1
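As a side note, the same arity computation can be sketched with `inspect.signature`, which already leaves out `self` for bound methods. The class `Demo` and its three handlers below are placeholders written for this illustration only; they are not part of the converter.

```python
import inspect

class Demo:
    """Hypothetical class with a few call_MACRO handlers."""

    def call_href(self, url, text):
        return "[%s](%s)" % (text, url)

    def call_it(self, opentag=True):
        return "_"

    def call_newline(self):
        return "<br/>"

    def arity(self, name):
        handler = getattr(self, "call_" + name)
        # For a bound method, signature() does not list self.
        params = list(inspect.signature(handler).parameters.values())
        # A trailing opentag=True is internal and is not a LaTeX argument.
        if params and params[-1].name == "opentag" and params[-1].default is True:
            params.pop()
        return len(params)

demo = Demo()
print(demo.arity("href"), demo.arity("it"), demo.arity("newline"))
```

With this variant there is no need to subtract one for `self`, at the price of a slightly longer test for the `opentag` parameter.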
Here comes the implementation of the macros that add accents to characters, like `\'a`. We use hard-coded dicts that should probably be extended to handle more cases, but the principle would remain the same.
    _acute = dict(zip("aeiouy", u"áéíóúý"))
    _grave = dict(zip("aeiouy", u"àèìòùỳ"))
    _diaeresis = dict(zip("aeiouy", u"äëïöüÿ"))
    _circumflex = dict(zip("aeiouy", u"âêîôûŷ"))
    _tilde = dict(zip("aon", u"ãõñ"))
    def accent (self, text, accent) :
        # accent the first character if we know how, else leave it unchanged
        return getattr(self, "_" + accent).get(text[0], text[0]) + text[1:]
    def call_acute (self, text) :
        return self.accent(text, "acute")
    def call_grave (self, text) :
        return self.accent(text, "grave")
    def call_diaeresis (self, text) :
        return self.accent(text, "diaeresis")
    def call_circumflex (self, text) :
        return self.accent(text, "circumflex")
    def call_tilde (self, text) :
        return self.accent(text, "tilde")
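One possible way to cover more cases without growing the dicts is to let Unicode do the composition: append the combining mark to the letter and normalise to NFC with the standard `unicodedata` module. This is only a sketch of an alternative, not what the code above does; the table `COMBINING` and the standalone function `accent` are named here for illustration. Note one difference in behaviour: when no precomposed character exists, the combining mark is kept instead of falling back to the bare letter.

```python
import unicodedata

# Unicode combining marks for the accents handled above.
COMBINING = {
    "acute": "\u0301",
    "grave": "\u0300",
    "diaeresis": "\u0308",
    "circumflex": "\u0302",
    "tilde": "\u0303",
}

def accent(text, name):
    """Put the named accent on the first character of text."""
    if not text:
        return text
    # NFC merges letter + combining mark into one character when possible.
    composed = unicodedata.normalize("NFC", text[0] + COMBINING[name])
    return composed + text[1:]

print(accent("e", "acute"), accent("n", "tilde"), accent("Abc", "grave"))
```

Upper-case letters then work for free, which the hard-coded dicts above do not handle.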
Then come a few useful macros; more can easily be added in the same way.
    def call_newline (self) :
        return self.text("<br/>")
    def call_href (self, url, text) :
        return self.text("[%s](%s)" % (text, url))
    def call_dollar (self) :
        return self.text("$")
    def call_math (self) :
        return self.text("*")
    def call_l (self) :
        return self.text("l")
    def call_L (self) :
        return self.text("L")
    def call_ae (self) :
        return self.text(u"æ")
    def call_oe (self) :
        return self.text(u"œ")
We come to text styling. Tags are macros like `\it` that apply immediately, inserting an opening marker, and push their names onto the current group using method `tag` so that, at the end of the group, they will be called again, but with `opentag=False` this time. When generating Markdown, the opening and closing tags are the same, but we could easily insert things like `<tag>` and `</tag>` instead. Other macros like `\emph` apply to a group, which is passed as a string after being processed recursively in grammar rule `arg`. Note that we have used marker `_` for italics and `*` for maths [2].
    def call_it (self, opentag=True) :
        if opentag :
            self.tag("it")
        return self.text("_")
    def call_bf (self, opentag=True) :
        if opentag :
            self.tag("bf")
        return self.text("**")
    def call_tt (self, opentag=True) :
        if opentag :
            self.tag("tt")
        return self.text("`")
    def call_emph (self, text) :
        return self.text("_%s_" % text)
    def call_textit (self, text) :
        return self.text("_%s_" % text)
    def call_texttt (self, text) :
        return self.text("`%s`" % text)
    def call_textbf (self, text) :
        return self.text("**%s**" % text)
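To illustrate the remark about distinct opening and closing tags, here is a small self-contained sketch that emits HTML instead of Markdown. Everything in it (the class `HtmlTags`, the `HTML` table, the single `switch` method standing in for `call_it`, `call_bf` and `call_tt`) is an assumption made for the example; it only shows how the `opentag` flag selects between the two forms.

```python
class HtmlTags:
    """Hypothetical variant that emits HTML tags instead of Markdown markers."""
    HTML = {"it": "i", "bf": "b", "tt": "code"}

    def __init__(self):
        self.tags = [[]]

    def tag(self, name):
        self.tags[-1].append(name)

    def switch(self, name, opentag=True):
        if opentag:
            self.tag(name)  # remember to close it at the end of the group
            return "<%s>" % self.HTML[name]
        return "</%s>" % self.HTML[name]

    def bgroup(self):
        self.tags.append([])
        return ""

    def egroup(self):
        # Re-invoke each tag with opentag=False, most recent first.
        return "".join(self.switch(t, False) for t in reversed(self.tags.pop()))

h = HtmlTags()
# Imitates by hand the calls triggered by {\it hello \bf world}
html = h.bgroup() + h.switch("it") + "hello " + h.switch("bf") + "world" + h.egroup()
print(html)
```

A single method per tag is enough in both cases, which is the point of the `opentag` parameter.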
Finally, we have macros for environments, i.e., `\begin{...}` and `\end{...}`, which are directly implemented as macros `\begin` and `\end` that manage the environment names on stack `envs` as explained above. These macros respectively call `begin_ENV` and `end_ENV` to perform the appropriate operations. This is how environment `itemize` is emulated, together with macro `\item`.
    def call_begin (self, name) :
        handler = getattr(self, "begin_%s" % name)
        self.bgroup()
        self.envs.append(name)
        return handler()
    def call_end (self, name) :
        handler = getattr(self, "end_%s" % name)
        self.egroup()
        # leave the environment before calling the handler, so that
        # end_itemize and call_item see the correct nesting depth
        self.envs.pop(-1)
        return handler()
    def begin_itemize (self) :
        newpar, self.newpar = self.newpar, True
        if newpar :
            return ""
        elif self.envs.count("itemize") > 1 :
            return "\n"
        else :
            return "\n\n"
    def call_item (self) :
        if self.newpar :
            self.newpar = False
            return " " * self.envs.count("itemize") + "* "
        else :
            return "\n" + " " * self.envs.count("itemize") + "* "
    def end_itemize (self) :
        self.newpar = True
        if "itemize" in self.envs :
            return "\n"
        else :
            return "\n\n"
In this implementation of `itemize`, we take care to handle `newpar` appropriately in order to insert the correct number of newlines. For instance, on `\begin{itemize}`, if there is already a paragraph separation above, we don’t need to insert more newlines. Notice also how we use stack `envs` to insert the correct indentation on `\item` and the correct number of newlines at the beginning or end of an `itemize`.
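The newline and indentation logic is easier to follow on a concrete run. The sketch below isolates it in a small class, `Items`, written only for this illustration: `begin_itemize` and `end_itemize` push and pop `envs` themselves, which assumes that `\end` removes the environment name from the stack before the nesting test is made. The calls imitate by hand a paragraph followed by a list with one nested list.

```python
class Items:
    """Hypothetical stand-alone copy of the itemize logic."""

    def __init__(self):
        self.envs = []
        self.newpar = True

    def text(self, txt):
        self.newpar = False
        return txt

    def begin_itemize(self):
        self.envs.append("itemize")
        newpar, self.newpar = self.newpar, True
        if newpar:
            return ""      # already separated from what precedes
        elif self.envs.count("itemize") > 1:
            return "\n"    # nested list: a single line break
        else:
            return "\n\n"  # top-level list: paragraph separation

    def item(self):
        indent = " " * self.envs.count("itemize")
        if self.newpar:
            self.newpar = False
            return indent + "* "
        return "\n" + indent + "* "

    def end_itemize(self):
        self.envs.pop()    # assumption: \end leaves the environment first
        self.newpar = True
        return "\n" if "itemize" in self.envs else "\n\n"

t = Items()
out = "".join([
    t.text("Intro"),
    t.begin_itemize(), t.item(), t.text("one"),
    t.begin_itemize(), t.item(), t.text("sub"), t.end_itemize(),
    t.item(), t.text("two"),
    t.end_itemize(),
])
print(repr(out))
```

The nested item is indented one space more than its parent, and only the outermost `\end{itemize}` produces a full paragraph separation.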
1. LaTeX is built on top of TeX, which is a terribly complex language to parse and interpret. This process is very clearly explained in Donald Knuth’s TeXbook, but it probably could not be implemented using traditional grammar-based parsing. My goal here is not to have a Python implementation of TeX, but instead to have a quick parser/translator that will do the job for simple situations.
2. Markdown allows both `_` and `*` for italics, but by using `_` for italics and `*` for maths we avoid problems with maths within italics, as in `\textit{variable $x$ is zero}`, which is correctly interpreted (i.e., maths will be consistently rendered in italics independently of the context) when translated to `_variable *x* is zero_` (what we do) but not when translated to `*variable *x* is zero*` or `_variable _x_ is zero_`, which typeset `x` in roman. On the other hand, `\textit{this \emph{is} important}` is translated to `_this _is_ important_`, which is not interpreted as we would like by python-markdown (because it is sensitive to spaces before/after `_`).