Class Tokenizer
object --+
         |
        Tokenizer
A tokenizer for LIGO Light Weight XML Stream and Array elements.
Converts (usually comma-) delimited text streams into sequences of Python
objects. An instance is created by calling the class with the delimiter
character as the single argument. Text is appended to the internal
buffer by passing it to the append() method. Tokens are extracted by
iterating over the instance.

The Tokenizer can extract tokens directly as various Python types. The
set_types() method is passed a sequence of the types to which tokens are
to be converted. The types are used in order, cyclically. For example,
passing [int] to set_types() causes all tokens to be converted to
integers, while [str, int] causes the first token to be returned as a
string, the second as an integer, the third as a string again, and so
on. The default is to extract all tokens as strings.

If a token type is set to None then the corresponding tokens are
skipped. For example, invoking set_types() with [int, None] causes the
first token to be converted to an integer, the second to be skipped, the
third to be converted to an integer, and so on. This can be used to
improve parsing performance when only a subset of the input stream is
required.
Example:
>>> from glue.ligolw import tokenizer
>>> t = tokenizer.Tokenizer(u",")
>>> t.set_types([str, int])
>>> list(t.append("a,10,b,2"))
['a', 10, 'b']
>>> list(t.append("0,"))
[20]
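The None-skipping behaviour works the same way. The following is a
sketch rather than a captured session, with the expected output derived
from the cycling rules described above:
>>> t = tokenizer.Tokenizer(u",")
>>> t.set_types([int, None])
>>> list(t.append(u"1,2,3,4,"))
[1, 3]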
Notes. The last token will not be extracted until a delimiter
character is seen to terminate it. Tokens can be quoted with '"'
characters, which will be removed before conversion to the target type.
An empty token (two delimiters with only whitespace between them) is
returned as None regardless of the requested type. To prevent a
zero-length string token from being interpreted as None, place it in
quotes.
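To illustrate the quoting and empty-token rules, here is another sketch
(assuming the default of extracting tokens as unicode strings):
>>> t = tokenizer.Tokenizer(u",")
>>> list(t.append(u'a,,"",'))
[u'a', None, u'']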
Instance Methods
    __init__(...)
        x.__init__(...) initializes x; see help(type(x)) for signature
    __new__(T, S, ...)
        a new object with type S, a subtype of T
    append(...)
        Append a unicode object to the tokenizer's internal buffer.
    next(x)
        the next value, or raise StopIteration
    set_types(...)
        Set the types to be used cyclically for token parsing.
    Inherited from object:
        __delattr__, __format__, __getattribute__, __hash__,
        __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__,
        __str__, __subclasshook__
Properties
    data
        The current contents of the internal buffer.
    Inherited from object:
        __class__
Method Details

__init__(...)
    (Constructor)
    x.__init__(...) initializes x; see help(type(x)) for signature
    Overrides: object.__init__

__new__(T, S, ...)
    Returns: a new object with type S, a subtype of T
    Overrides: object.__new__
append(...)
    Append a unicode object to the tokenizer's internal buffer. Also
    accepts str objects as input.
set_types(...)
    Set the types to be used cyclically for token parsing. This
    function accepts an iterable of callables. Each callable will be
    passed the token to be converted as a unicode string. Special
    fast-paths are included to handle the Python builtin types float,
    int, long, str, and unicode. The default is to return all tokens as
    unicode objects.
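    Since any callable is accepted, a token post-processor can be
    supplied directly. A minimal sketch (the lambda here is an
    illustrative callable, not part of the library, and the output is
    the expected result rather than a captured session):
    >>> from glue.ligolw import tokenizer
    >>> t = tokenizer.Tokenizer(u",")
    >>> t.set_types([float, lambda s: s.upper()])
    >>> list(t.append(u"3.5,spam,4.5,eggs,"))
    [3.5, u'SPAM', 4.5, u'EGGS']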