chemfp.io module

Helper module for working with file I/O.

Only the Location and ProgressBar classes are part of the public API. Location is meant to be used directly. ProgressBar objects are returned by the public API and should not be created directly.

The other parts of this module are not part of the public API. (Let me know if anything else should be part of the public API.)

class chemfp.io.Location(filename=None)

Bases: object

Get location and other internal reader and writer state information

A Location instance gives a way to access information like the current record number, line number, and molecule object.

>>> import chemfp
>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166",
...                        "ChEBI_lite.sdf.gz", id_tag="ChEBI ID") as reader:
...   for id, fp in reader:
...     if id == "CHEBI:3499":
...         print("Record starts at line", reader.location.lineno)
...         print("Record byte range:", reader.location.offsets)
...         print("Number of atoms:", reader.location.mol.GetNumAtoms())
...         break
... 
[08:18:12]  S group MUL ignored on line 103
Record starts at line 3599
Record byte range: (138171, 141791)
Number of atoms: 36

The available properties are:

  • filename - a string describing the source or destination
  • lineno - the line number for the start of the current record
  • mol - the toolkit molecule for the current record
  • offsets - the (start, end) byte positions for the current record
  • position - the approximate current position in the input file. (This should only be used for progress information.)
  • start_position - the position when the location was attached to the file. Usually 0.
  • end_position - the expected final position.
  • position_units - a short name for how the position is measured. For files this is usually “bytes”.
  • recno - the current record number
  • record - the record as a byte string
  • record_format - the record format, like “sdf” or “can”
  • output_recno - the number of records written successfully
  • row - the row when reading a CSV file
  • bytes_read - the number of bytes read. Currently only available during a UnicodeDecodeError.
  • obj - a user-specified object connected with this location and any of its copies.

Most of the readers and writers do not support all of the properties. Unsupported properties return None. The filename is a read/write attribute and the other attributes are read-only.

If you don’t pass a location to a reader or writer then it will create a new one based on its source or destination, respectively. You can also pass in your own Location, created with Location(filename) if you have an actual filename, or with Location.from_source() or Location.from_destination() if you have a more generic source or destination.

__init__(filename=None)

Use filename as the location’s filename

bytes_read

The number of bytes read to this point.

This is only valid in the location attached to a UnicodeDecodeError, which currently only occurs during CSV processing. It can be used with the error’s start, end, and object to get the error location relative to the entire file.

clear_registry()

(Do not use) Remove all registered handlers.

Part of the internal API, and subject to change.

copy()

(Do not use) Return a shallow copy of this location

Part of the internal API, and subject to change.

end()

(Do not use) Save the current values of any handlers and remove the handlers.

Part of the internal API, and subject to change.

end_position

The (expected) end position in the file

The end_position, along with position, is meant for progress information and is not necessarily the number of bytes in the current file.

For input files this is typically the file size, if available. Note that the actual file size may change while the file is being processed, in which case end_position will be invalid.

The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.

filename = None

A string describing the source or destination, or None (read/write)

first_line

The first line of the current record, as a string.

The newline and any preceding carriage return characters are not included.

The first line is decoded as UTF-8, with unknown characters replaced with ‘?’. Use first_line_bytes if this is not appropriate.
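The relationship between first_line_bytes and first_line can be sketched in plain Python. This is only an illustration of the behavior described here, not chemfp’s implementation; the helper names first_line_bytes_of() and first_line_of() are hypothetical.

```python
def first_line_bytes_of(record: bytes) -> bytes:
    """Return the first line of a record without its line terminator."""
    line, _, _ = record.partition(b"\n")
    # Drop any carriage returns that came just before the newline.
    return line.rstrip(b"\r")


def first_line_of(record: bytes) -> str:
    """Decode the first line as UTF-8, showing undecodable bytes as '?'."""
    text = first_line_bytes_of(record).decode("utf-8", errors="replace")
    # Python's "replace" handler emits U+FFFD; map it to '?' here
    # (an assumption about how the '?' replacement could be produced).
    return text.replace("\ufffd", "?")


# A Latin-1 encoded title line, which is not valid UTF-8.
record = b"caf\xe9 name\r\nsecond line\n"
print(first_line_bytes_of(record))
print(first_line_of(record))
```

When the record is not UTF-8 encoded, working from the byte string and decoding it with the correct encoding avoids the lossy ‘?’ replacement.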

first_line_bytes

The first line of the current record, as a byte string.

The newline and any preceding carriage return characters are not included.

If you want a text/Unicode string and the input record is UTF-8 encoded then use first_line.

classmethod from_destination(destination)

Create a Location instance based on the destination

If destination is a string then it’s used as the filename. If destination is None then the location filename is “<stdout>”. If destination is a file object then its name attribute is used as the filename, or None if there is no attribute.

classmethod from_source(source)

Create a Location instance based on the source

If source is a string then it’s used as the filename. If source is None then the location filename is “<stdin>”. If source is a file object then its name attribute is used as the filename, or None if there is no attribute.
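The filename rules for from_destination() and from_source() follow the same pattern, which can be sketched as a small stand-alone function. This is a minimal illustration of the documented rules, not the chemfp source; describe_source() and its default parameter are hypothetical names.

```python
import io


def describe_source(source, default="<stdin>"):
    """Pick a filename for a source (or destination) using the rules above."""
    if source is None:
        # No source given: use "<stdin>" (or "<stdout>" for a destination).
        return default
    if isinstance(source, str):
        # A string is used as the filename directly.
        return source
    # Otherwise assume a file-like object and use its name, if it has one.
    return getattr(source, "name", None)


print(describe_source("ChEBI_lite.sdf.gz"))
print(describe_source(None))
print(describe_source(None, default="<stdout>"))
print(describe_source(io.BytesIO(b"")))  # BytesIO has no name attribute
```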

get_registry()

(Do not use) Get the current registry as a dict.

Part of the internal API, and subject to change.

lineno

The current line number, starting from 1.

This will be 0 if file processing has not yet started.

mol

The molecule object for the current record.

offsets

The (start, end) byte offsets, starting from 0.

start is the record start byte position and end is one byte past the last byte of the record.
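Because end is one byte past the last byte of the record, the offsets work directly as a Python slice. The following sketch uses a small in-memory byte string rather than a real structure file; record_at() is a hypothetical helper, and the offsets are written by hand instead of coming from a reader.

```python
def record_at(data: bytes, offsets):
    """Extract a record from data given half-open (start, end) offsets."""
    start, end = offsets
    return data[start:end]


# Two toy records, each terminated by an SD-style "$$$$" line.
data = b"first record\n$$$$\nsecond record\n$$$$\n"

first = record_at(data, (0, 18))
second = record_at(data, (18, 37))
print(first)
print(second)

# The record length is simply end - start.
print(37 - 18, len(second))
```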

output_recno

The number of records actually written to the file or string.

The value recno - output_recno is the number of records sent to the writer but which had an error and could not be written to the output.

position

The (approximate) current position in the file.

The position, along with end_position, should only be used for progress information. They do not correspond to the start or end position of the current record, nor to the sum of the record sizes.

For example, for compressed files this may be the location of the end of the most recently read compressed data.

Several records may have the same position, and position may equal end_position even if more records are available.

The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.

position_units

The units used to measure position and end_position

This is typically “bytes” for input readers and “records” for output writers.

progress_bar(**kwargs)

Return a ProgressBar given the location and kwargs values.

The ProgressBar is a wrapper around tqdm. If the kwarg progress is True (the default) then the progress bar will be displayed. If it is False then a progress bar will not be used. If it is callable, then it will be called instead of the tqdm constructor, so it must understand the tqdm constructor parameters.

The remaining kwargs are passed to the tqdm constructor. If unspecified, the progress_bar() method will add some kwargs depending on what is available from the Location.

If desc is not specified, then a default is created based on the optional file_count kwarg followed by the location’s filename.

A ProgressBar only shows the progress for a single file. When processing multiple files, use file_count to add information to the default desc about how many files have been processed and how many are going to be processed.

If file_count is not None then it must be a 2-element tuple containing the current file index (starting at 0) and the number of files to process. The number of files to process may be None if that number is not known. The file_count (1, 5) results in the description “(2/5)” and the file_count (1, None) results in the description “(2)”.
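The mapping from file_count to the description prefix can be sketched as follows. This is an illustration of the rule stated above; file_count_prefix() is a hypothetical helper, not a chemfp function, and returning an empty string for None is an assumption.

```python
def file_count_prefix(file_count):
    """Format a (index, total) pair as a 1-based progress prefix."""
    if file_count is None:
        return ""  # assumption: no prefix when file_count is not given
    index, total = file_count
    if total is None:
        # The number of files is not known; show only the current file.
        return f"({index + 1})"
    return f"({index + 1}/{total})"


print(file_count_prefix((1, 5)))     # (2/5)
print(file_count_prefix((1, None)))  # (2)
```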

The progress bar will show one of several types of progress, depending on what information is available from the location.

If it’s possible to get a position (typically the byte position in the input file or the compressed input file) then that is used. The tqdm units is set if the location has a position_units and, if position_units is “bytes”, then tqdm’s unit_scale is set to 1. The tqdm initial is set to the current position.

Otherwise, if it’s possible to get a recno then it is used for the progress information, with the tqdm units set to “recs” and the tqdm initial set to the current record number (which is likely 0).

Otherwise, the tqdm units will be “its”, which is the tqdm default.

The tqdm total is set to the location’s end_position, if available. The end_position must only be available when it’s possible to get the current position.

The tqdm dynamic_ncols is set to True, to allow terminal resizing to update the tqdm output.

Again, if you set the values in the kwargs then they will not be overwritten but will be passed directly to the tqdm constructor (or to the progress callable).

Parameters:
  • progress (boolean or callable) – True to show a progress bar, False to not show one, or a callable to use instead of the tqdm constructor
  • file_count (None, or (i, None), or (i, N)) – optional file index and number of files, used as the start of ‘desc’
  • **kwargs – kwargs passed to the tqdm constructor or progress callable.
Returns:

a ProgressBar
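The way default values are filled in without overriding user-specified kwargs can be sketched with dict.setdefault(). This is a simplified model of the rules described for progress_bar(), not chemfp’s code. The function name default_tqdm_kwargs() and its parameters are hypothetical, and the sketch assumes the tqdm constructor keyword names unit, unit_scale, initial, total, and dynamic_ncols.

```python
def default_tqdm_kwargs(position=None, position_units=None,
                        end_position=None, recno=None, **kwargs):
    """Fill in tqdm kwargs from location values, keeping any user values."""
    if position is not None:
        # Position-based progress, typically bytes read from the input.
        if position_units is not None:
            kwargs.setdefault("unit", position_units)
            if position_units == "bytes":
                kwargs.setdefault("unit_scale", 1)
        kwargs.setdefault("initial", position)
        if end_position is not None:
            kwargs.setdefault("total", end_position)
    elif recno is not None:
        # Fall back to record-based progress.
        kwargs.setdefault("unit", "recs")
        kwargs.setdefault("initial", recno)
    # Otherwise leave the unit unset so the tqdm default applies.
    kwargs.setdefault("dynamic_ncols", True)
    return kwargs


print(default_tqdm_kwargs(position=0, position_units="bytes", end_position=1000))
print(default_tqdm_kwargs(recno=0, unit="mols"))  # the user's unit is kept
```

Using setdefault() means a value such as unit="mols" passed by the caller is passed through unchanged, which matches the rule that explicitly specified kwargs are never overwritten.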

recno

The current record number

For writers this is the number of records sent to the writer, and output_recno is the number of records successfully written to the file or string.

record

The current record as an uncompressed byte or text string.

record_format

The name of the record format.

This is a string like “smi”, “fps”, or “sdf”, without any compression suffix.

register(**kwargs)

(Do not use) Used by the reader or writer to add property callback handlers.

Part of the internal API, and subject to change.

row

The row columns, when reading a CSV file.

save(**kwargs)

(Do not use) Used by the reader or writer to specify final values.

Part of the internal API, and subject to change.

start_position

The position when the location was attached to the file

The start_position and end_position, along with position, are meant for progress information and are not necessarily the number of bytes in the current file.

For input files this is typically 0.

The position is measured in position_units units, typically “bytes” for input readers, and “records” for output writers.
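As a rough illustration of how start_position, position, and end_position fit together for progress reporting, the following sketch computes a completion fraction. The helper progress_fraction() is hypothetical. It treats None as “not supported” and clamps the result, because end_position may be out of date if the file changes during processing.

```python
def progress_fraction(position, start_position, end_position):
    """Return the fraction completed in [0.0, 1.0], or None if unknown."""
    if position is None or start_position is None or end_position is None:
        return None  # the reader or writer does not support these properties
    span = end_position - start_position
    if span <= 0:
        return None  # nothing meaningful to report
    fraction = (position - start_position) / span
    # Clamp, since the position can pass a stale end_position.
    return min(1.0, max(0.0, fraction))


print(progress_fraction(500, 0, 1000))
print(progress_fraction(None, 0, 1000))
print(progress_fraction(1200, 0, 1000))
```

A fraction of 1.0 does not guarantee that all records have been read, since position may equal end_position while more records are still available.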

where()

Return a human-readable description of the current reader or writer state.

The description will contain the filename, line number, record number, and up to the first 40 characters of the first line of the record, if those properties are available.

class chemfp.io.ProgressBar(tqdm, start_position, get_position)

Bases: object

Use a tqdm progress bar to track location progress.

The progress shown depends on the capabilities of the Location.

  • If position information is available, use it, along with optional units and end position information.
  • Otherwise, display the record or iteration progress.

See Location.progress_bar() for details about how to configure and create a ProgressBar. This class should not be called directly.

__call__(it)

Wrap an iterator and update the progress bar after yielding each item.

Example of use:

with location.progress_bar() as progress_bar:
    writer.write_fingerprints(progress_bar(reader))

__enter__()

Context manager to close the progress bar upon completion.

__exit__(type, value, traceback)

Close the progress bar when the context ends.

close()

Close the progress bar.

update(ignored=0)

Update the progress bar to the current progress.

Use this to update the progress bar manually when not being used as an iterator wrapper.

Example of use:

with location.progress_bar() as progress_bar:
  for i, mol in enumerate(mol_reader):
    if i % 100 == 0:
      # only update after every 100 molecules
      progress_bar.update()

Note: The function takes an optional ‘ignored’ parameter for API compatibility with the tqdm progress bar update() method. The value is ignored.