Class Tokenizer
object --+
         |
        Tokenizer
A tokenizer for LIGO Light Weight XML Stream and Array elements.
Converts (usually comma-) delimited text streams into sequences of Python
objects. An instance is created by calling the class with the delimiter
character as the single argument. Text is appended to the internal
buffer by passing it to the append() method. Tokens are extracted by
iterating over the instance.

The Tokenizer can extract tokens directly as various Python types. The
set_types() method is passed a sequence of the types to which tokens are
to be converted. The types are used in order, cyclically. For example,
passing [int] to set_types() causes all tokens to be converted to
integers, while [str, int] causes the first token to be returned as a
string, the second as an integer, the third as a string again, and so
on. The default is to extract all tokens as strings.

If a token type is set to None then the corresponding tokens are
skipped. For example, invoking set_types() with [int, None] causes the
first token to be converted to an integer, the second to be skipped, the
third to be converted to an integer, and so on. This can be used to
improve parsing performance when only a subset of the input stream is
required.
Example:
>>> from glue.ligolw import tokenizer
>>> t = tokenizer.Tokenizer(u",")
>>> t.set_types([str, int])
>>> list(t.append("a,10,b,2"))
['a', 10, 'b']
>>> list(t.append("0,"))
[20]
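The None-skipping behaviour works the same way. The following is a
sketch rather than a captured session, with the expected output derived
from the cycling rules described above:
>>> t = tokenizer.Tokenizer(u",")
>>> t.set_types([int, None])
>>> list(t.append(u"1,2,3,4,"))
[1, 3]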
Notes. The last token will not be extracted until a delimiter
character is seen to terminate it. Tokens can be quoted with '"'
characters, which will be removed before conversion to the target type.
An empty token (two delimiters with only whitespace between them) is
returned as None regardless of the requested type. To prevent a
zero-length string token from being interpreted as None, place it in
quotes.
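To illustrate the quoting and empty-token rules, here is another sketch
(assuming the default of extracting tokens as unicode strings):
>>> t = tokenizer.Tokenizer(u",")
>>> list(t.append(u'a,,"",'))
[u'a', None, u'']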
Instance Methods
    __init__(...)
        x.__init__(...) initializes x; see help(type(x)) for signature
    __new__(T, S, ...)
        a new object with type S, a subtype of T
    append(...)
        Append a unicode object to the tokenizer's internal buffer.
    next(x)
        the next value, or raise StopIteration
    set_types(...)
        Set the types to be used cyclically for token parsing.
    Inherited from object:
        __delattr__, __format__, __getattribute__, __hash__,
        __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__,
        __str__, __subclasshook__
Properties
    data
        The current contents of the internal buffer.
    Inherited from object:
        __class__
Method Details

__init__(...)
    (Constructor)
    x.__init__(...) initializes x; see help(type(x)) for signature
    Overrides: object.__init__

__new__(T, S, ...)
    Returns: a new object with type S, a subtype of T
    Overrides: object.__new__
append(...)
    Append a unicode object to the tokenizer's internal buffer. Also
    accepts str objects as input.
set_types(...)
    Set the types to be used cyclically for token parsing. This
    function accepts an iterable of callables. Each callable will be
    passed the token to be converted as a unicode string. Special
    fast-paths are included to handle the Python builtin types float,
    int, long, str, and unicode. The default is to return all tokens as
    unicode objects.
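    Since any callable is accepted, a token post-processor can be
    supplied directly. A minimal sketch (the lambda here is an
    illustrative callable, not part of the library, and the output is
    the expected result rather than a captured session):
    >>> from glue.ligolw import tokenizer
    >>> t = tokenizer.Tokenizer(u",")
    >>> t.set_types([float, lambda s: s.upper()])
    >>> list(t.append(u"3.5,spam,4.5,eggs,"))
    [3.5, u'SPAM', 4.5, u'EGGS']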