Getting Started

Overview

A lexer and codec to work with LaTeX code in Python.

The codec provides a convenient way of going between text written in LaTeX and unicode. Since it is not a LaTeX compiler, it is more appropriate for short chunks of text, such as a paragraph or the values of a BibTeX entry, and it is not appropriate for a full LaTeX document. In particular, its behavior on the LaTeX commands that do not simply select characters is intended to allow the unicode representation to be understandable by a human reader, but is not canonical and may require hand tuning to produce the desired effect.

The encoder makes a best effort to replace unicode characters outside of the range used as LaTeX input (ASCII by default) with a LaTeX command that selects the character. More technically, the unicode code point is replaced by a LaTeX command that selects a glyph that reasonably represents the code point. Unicode characters with special uses in LaTeX are replaced by their LaTeX equivalents. For example,

original text         encoded LaTeX
¥                     \yen
ü                     \"u
\N{NO-BREAK SPACE}    ~
~                     \textasciitilde
%                     \%
#                     \#
\textbf{x}            \textbf{x}
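
To make the idea of a character-to-command table concrete, here is a toy sketch that uses only the standard library. The names TOY_TABLE and toy_encode are hypothetical, and this is not how latexcodec is implemented; it only mirrors the example rows, plus the rule that a control word such as \yen must be separated from a following letter.

```python
import re

# Hypothetical toy table mirroring the example rows; not latexcodec's real table.
TOY_TABLE = {
    "¥": r"\yen",
    "ü": r'\"u',
    "\N{NO-BREAK SPACE}": "~",
    "~": r"\textasciitilde",
    "%": r"\%",
    "#": r"\#",
}

# A control word is a backslash followed by letters only (e.g. \yen, not \%).
CONTROL_WORD = re.compile(r"\\[A-Za-z]+")


def toy_encode(text):
    """Replace each character found in TOY_TABLE; leave everything else as is."""
    out = []
    for i, ch in enumerate(text):
        rep = TOY_TABLE.get(ch, ch)
        out.append(rep)
        nxt = text[i + 1:i + 2]  # next character, or "" at the end of the text
        # Separate a control word from a following letter so TeX can tell them apart.
        if CONTROL_WORD.fullmatch(rep) and nxt.isalpha():
            out.append(" ")
    return "".join(out)


print(toy_encode("5¥ für 10%"))
print(toy_encode("¥en"))
```

Existing LaTeX markup such as \textbf{x} contains no characters from the toy table, so it passes through unchanged, as in the last example row.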

The decoder makes a best effort to replace LaTeX commands that select characters with the unicode for the character they are selecting. For example,

original LaTeX     decoded unicode
\yen               ¥
\"u                ü
~                  \N{NO-BREAK SPACE}
\textasciitilde    ~
\%                 %
\#                 #
\textbf{x}         \textbf {x}
#                  #

In addition, comments are dropped (including the final newline that marks the end of a comment), paragraphs are canonicalized into double newlines, and other newlines are left as is. Spacing after LaTeX commands is also canonicalized.

For example,

hi % bye
there\par world
\textbf     {awesome}

is decoded as

hi there

world
\textbf {awesome}
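
The three normalizations in this example can be approximated with a few regular expressions. This is a rough sketch only: toy_normalize is a hypothetical name, it covers just the cases shown here, and the codec itself works with a lexer rather than regex substitutions.

```python
import re


def toy_normalize(latex):
    """Rough regex approximation of the three normalizations in the example."""
    # 1. Drop comments: an unescaped % up to and including the end of the line.
    #    (Corner cases such as an escaped backslash before % are not handled.)
    latex = re.sub(r"(?<!\\)%.*\n?", "", latex)
    # 2. Turn \par (plus trailing blanks) into a paragraph break, a double newline.
    latex = re.sub(r"\\par\b[ \t]*", "\n\n", latex)
    # 3. Put exactly one space between a control word and a following open brace.
    latex = re.sub(r"(\\[A-Za-z]+)[ \t]*\{", r"\1 {", latex)
    return latex


sample = "hi % bye\nthere\\par world\n\\textbf     {awesome}"
print(toy_normalize(sample))
```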

When decoding, LaTeX commands not directly selecting characters (for example, macros and formatting commands) are passed through unchanged. The same happens for LaTeX commands that select characters but are not yet recognized by the codec. Either case can result in a hybrid unicode string in which some characters are understood as literally the character and others as parts of unexpanded commands. Consequently, backslashes are sometimes left intact to denote the start of a potentially unrecognized control sequence.
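
The pass-through behaviour can be pictured with a small standard-library sketch. TOY_COMMANDS and toy_decode are hypothetical names, and the table is deliberately tiny; the point is only that a lookup miss leaves the control sequence, backslash included, in the output.

```python
import re

# Hypothetical toy table; the real codec recognizes far more commands.
TOY_COMMANDS = {
    r"\yen": "¥",
    r"\textasciitilde": "~",
    r"\%": "%",
    r"\#": "#",
}

# A control sequence: backslash + letters, or backslash + one other character.
CONTROL_SEQUENCE = re.compile(r"\\(?:[A-Za-z]+|.)")


def toy_decode(latex):
    """Translate known control sequences; pass unknown ones through untouched."""
    return CONTROL_SEQUENCE.sub(
        lambda m: TOY_COMMANDS.get(m.group(0), m.group(0)), latex
    )


# \mymacro is not in the table, so it survives with its backslash intact.
print(toy_decode(r"100\% of \mymacro{x} costs 5 \yen"))
```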

Given the numerous and changing packages providing such LaTeX commands, the codec will never be complete, and new translations of unrecognized unicode or unrecognized LaTeX symbols are always welcome.

Installation

Install the module with pip install latexcodec, or from source using python setup.py install.

Minimal Example

Simply import the latexcodec module to enable "latex" to be used as an encoding:

import latexcodec
text_latex = b"\\'el\\`eve"
assert text_latex.decode("latex") == u"élève"
text_unicode = u"ångström"
assert text_unicode.encode("latex") == b'\\aa ngstr\\"om'

There is also a ulatex encoding for text transforms. The simplest way to use this codec goes through the codecs module (as for all text transform codecs in Python):

import codecs
import latexcodec
text_latex = u"\\'el\\`eve"
assert codecs.decode(text_latex, "ulatex") == u"élève"
text_unicode = u"ångström"
assert codecs.encode(text_unicode, "ulatex") == u'\\aa ngstr\\"om'

By default, the LaTeX input is assumed to be ascii, as per standard LaTeX. However, you can also specify an extra codec as latex+<encoding> or ulatex+<encoding>, where <encoding> describes another encoding. In this case characters will be translated to and from that encoding whenever possible. The following code snippet demonstrates this behaviour:

import latexcodec
text_latex = b"\xfe"
assert text_latex.decode("latex+latin1") == u"þ"
assert text_latex.decode("latex+latin2") == u"ţ"
text_unicode = u"ţ"
assert text_unicode.encode("latex+latin1") == b'\\c t'  # ţ is not latin1
assert text_unicode.encode("latex+latin2") == b'\xfe'   # but it is latin2

When encoding using the ulatex codec, you have the option to pass through characters that cannot be encoded in the desired encoding, by passing 'keep' as the error handler. This is a useful fallback if you want to encode as much as possible whilst retaining the original characters wherever encoding fails. If instead you want to translate to LaTeX but keep as much of the unicode as possible, use the ulatex+utf8 codec, which should never fail.

import codecs
import latexcodec
text_unicode = u'⌨'  # \u2328 = keyboard symbol, currently not translated
try:
    # raises a value error as \u2328 cannot be encoded into latex
    codecs.encode(text_unicode, "ulatex+ascii")
except ValueError:
    pass
assert codecs.encode(text_unicode, "ulatex+ascii", "keep") == u'⌨'
assert codecs.encode(text_unicode, "ulatex+utf8") == u'⌨'
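
For readers who want a mental model of strict versus 'keep' handling, the following toy function imitates the two behaviours with plain Python. It is an illustration under assumptions, not the codec's code: TOY_TABLE and toy_encode are hypothetical, and the table holds only two translations.

```python
# Hypothetical toy table with two known translations.
TOY_TABLE = {"ü": r'\"u', "¥": r"\yen"}


def toy_encode(text, errors="strict"):
    """Encode to ASCII-only LaTeX; 'keep' passes untranslatable characters through."""
    out = []
    for ch in text:
        if ch in TOY_TABLE:
            out.append(TOY_TABLE[ch])      # known translation
        elif ord(ch) < 128:
            out.append(ch)                 # already ASCII
        elif errors == "keep":
            out.append(ch)                 # unknown: keep the original character
        else:
            raise ValueError("no LaTeX translation for %r" % ch)
    return "".join(out)


print(toy_encode("über ⌨", errors="keep"))
```

With the default errors="strict", the same input raises ValueError at the keyboard symbol, which parallels the try/except in the example above.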

Limitations

  • Not all unicode characters are registered. If you find any missing, please report them on the tracker: https://github.com/mcmtroffaes/latexcodec/issues
  • Unicode combining characters are currently not handled.
  • By design, the codec never removes curly brackets. This is because it is very hard to guess whether brackets are part of a command or not (this would require a full LaTeX parser). Moreover, BibTeX uses curly brackets as a guard against case conversion, in which case automatic removal of curly brackets may not be desired at all, even if they are not part of a command. Also see: http://stackoverflow.com/a/19754245/2863746
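
Regarding combining characters, one possible workaround (an assumption on our part, not something the project documents) is to normalize text to NFC with the standard unicodedata module before encoding. Sequences that have a precomposed equivalent then become single code points; whether the codec has a translation for the result still depends on its table.

```python
import unicodedata

# 'u' followed by U+0308 COMBINING DIAERESIS: two code points.
decomposed = "u\N{COMBINING DIAERESIS}"

# NFC composes the pair into the single code point U+00FC.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\N{LATIN SMALL LETTER U WITH DIAERESIS}")
```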