LaTeX Lexer

This module contains all classes for lexing LaTeX code, as well as general-purpose base classes for incremental LaTeX decoders and encoders, which can be useful if you are writing your own custom LaTeX codec.
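
For orientation, these lexer classes sit behind the 'latex' codec that the latexcodec package registers on import, so the usual entry point is the standard codecs machinery rather than the lexer itself. A minimal sketch of that typical use (the outputs in the comments are indicative and may vary slightly between versions):

    import codecs
    import latexcodec  # importing the package registers the 'latex' codec

    # Decode LaTeX escape sequences into a unicode string.
    print(codecs.decode(b"\\'el\\`eve", "latex"))    # u'élève'

    # Encode a unicode string back into LaTeX bytes.
    print(codecs.encode(u"\xe9l\xe8ve", "latex"))    # roughly b"\\'el\\`eve"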

class latexcodec.lexer.Token(name, text)

A collections.namedtuple() storing information about a matched token.

See also

LatexLexer.tokens

name

The name of the token as a str.

text

The matched token text as bytes. The constructor also accepts text as memoryview, in which case it is automatically converted to bytes. This ensures that the token is hashable.

__len__()

Length of the token text.

__nonzero__()

Whether the token contains any text.

decode(encoding)

Return the token text decoded with the given encoding.

Note

Control words get an extra space appended to ensure separation from the next token, so that decoded token sequences can be joined with str.join().

For example, the tokens b'\hello' and b'world' will correctly result in u'\hello world' (remember that LaTeX eats the space following a control word). Without the added space, the result would wrongly be u'\helloworld'.
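
As a small illustration, the sketch below builds two tokens by hand; the token names 'control_word' and 'chars' are assumptions about the lexer's naming and are used here only to show the extra-space rule:

    from latexcodec.lexer import Token

    # Token names are assumed for illustration; real tokens come from a
    # lexer, which assigns the names itself.
    hello = Token('control_word', b'\\hello')
    world = Token('chars', b'world')

    print(len(world))   # 5 -- length of the token text
    # decode() appends a space after a control word, so the joined
    # result reads '\hello world' rather than '\helloworld'.
    print(u''.join(t.decode('ascii') for t in (hello, world)))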

class latexcodec.lexer.LatexLexer(errors='strict')

Bases: codecs.IncrementalDecoder

A very simple lexer for TeX/LaTeX code.

flush_raw_tokens()

Flush the raw token buffer.

get_raw_tokens(bytes_, final=False)

Yield tokens without any further processing; a usage sketch follows the list below. Tokens are one of:

  • \<word>: a control word (i.e. a command)
  • \<symbol>: a control symbol (e.g. \^)
  • #<n>: a parameter
  • a series of byte characters
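
A minimal sketch of raw tokenization; the token names and the way runs of ordinary characters are grouped are implementation details, so the output is not spelled out here:

    from latexcodec.lexer import LatexLexer

    lexer = LatexLexer()
    # final=True flushes whatever is still buffered at the end of input.
    for tok in lexer.get_raw_tokens(b'\\emph{#1} % comment\n', final=True):
        print(tok.name, tok.text)
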
getstate()

Get state.

reset()

Reset state.

setstate(state)

Set state. The state must correspond to the return value of a previous getstate() call.
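
These methods follow the usual codecs.IncrementalDecoder state protocol: a snapshot taken with getstate() can later be restored with setstate(). The sketch below assumes nothing beyond that documented contract:

    from latexcodec.lexer import LatexLexer

    lexer = LatexLexer()
    # Feed a chunk that ends in the middle of a control word; the tail
    # stays buffered inside the lexer.
    list(lexer.get_raw_tokens(b'one \\emp', final=False))
    state = lexer.getstate()      # snapshot of the buffered input

    # Either continue feeding as normal...
    list(lexer.get_raw_tokens(b'h{two}', final=True))
    # ...or rewind to the snapshot and feed something else instead.
    lexer.setstate(state)
    list(lexer.get_raw_tokens(b'hasis{three}', final=True))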

class latexcodec.lexer.LatexIncrementalLexer(errors='strict')

Bases: latexcodec.lexer.LatexLexer

A very simple incremental lexer for TeX/LaTeX code. Roughly follows the state machine described in TeX by Topic, Chapter 2.

The generated tokens satisfy:

  • no newline characters: paragraphs are separated by \par control words
  • spaces following control tokens are compressed

get_tokens(bytes_, final=False)

Yield tokens while maintaining the lexer state. Whitespace after control words and (some) control symbols is skipped, and newlines are replaced by spaces or \par commands depending on the context.
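
A sketch of incremental use, feeding the input in chunks and marking only the last chunk as final; token names and grouping are again left to the implementation:

    from latexcodec.lexer import LatexIncrementalLexer

    lexer = LatexIncrementalLexer()
    chunks = [b'\\textbf{hel', b'lo}  world\n\n', b'next paragraph\n']
    for i, chunk in enumerate(chunks):
        final = (i == len(chunks) - 1)
        for tok in lexer.get_tokens(chunk, final=final):
            print(tok.name, tok.text)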

class latexcodec.lexer.LatexIncrementalDecoder(errors='strict')

Bases: latexcodec.lexer.LatexIncrementalLexer

Simple incremental decoder. Transforms lexed LaTeX tokens into unicode.

To customize decoding, subclass and override get_unicode_tokens().

decode(bytes_, final=False)

Decode LaTeX bytes_ into a unicode string.

This implementation calls get_unicode_tokens() and joins the resulting unicode strings together.

get_unicode_tokens(bytes_, final=False)

Decode every token using the inputenc encoding. Override this method to process the tokens in some other way (for example, for token translation).
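
As a sketch of the suggested customization, the hypothetical subclass below post-processes the tokens yielded by the base implementation. The comparison against u'---' assumes the lexer hands that ligature through as a single token:

    from latexcodec.lexer import LatexIncrementalDecoder

    class DashDecoder(LatexIncrementalDecoder):
        """Hypothetical decoder mapping the --- ligature to an em dash."""

        def get_unicode_tokens(self, bytes_, final=False):
            for text in LatexIncrementalDecoder.get_unicode_tokens(
                    self, bytes_, final=final):
                yield u'\u2014' if text == u'---' else text

    decoder = DashDecoder()
    # Expected u'yes\u2014no', assuming '---' arrives as one token.
    print(decoder.decode(b'yes---no', final=True))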

class latexcodec.lexer.LatexIncrementalEncoder(errors='strict')

Bases: codecs.IncrementalEncoder

Simple incremental encoder for LaTeX. Transforms unicode into bytes.

To customize encoding, subclass and override get_latex_bytes().

encode(unicode_, final=False)

Encode the unicode_ string into LaTeX bytes.

This implementation calls get_latex_bytes() and joins the resulting bytes together.

get_latex_bytes(unicode_, final=False)

Encode every character using the inputenc encoding. Override this method to process the unicode string in some other way (for example, for character translation).
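
A matching sketch for the encoder side: the hypothetical subclass below rewrites a character before delegating to the base implementation, which then encodes the remaining characters in the inputenc encoding as usual:

    from latexcodec.lexer import LatexIncrementalEncoder

    class DashEncoder(LatexIncrementalEncoder):
        """Hypothetical encoder writing em dashes as the --- ligature."""

        def get_latex_bytes(self, unicode_, final=False):
            # Rewrite the character first, so the base class never has
            # to encode a non-ASCII em dash.
            translated = unicode_.replace(u'\u2014', u'---')
            return LatexIncrementalEncoder.get_latex_bytes(
                self, translated, final=final)

    encoder = DashEncoder()
    print(encoder.encode(u'yes\u2014no', final=True))   # roughly b'yes---no'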