chemfp.csv_readers module

This module contains CSV file readers and methods to work with CSV dialects.

The CSV readers are based on Python’s ‘csv’ module, described at https://docs.python.org/3/library/csv.html . I found it hard to configure a new dialect using that module, and prefer using the get_dialect() function defined here.

The read_csv_rows() function returns a CSVRowReader which iterates over the rows of a CSV file, giving the column values for each row.

The read_csv_ids_and_molecules_with_parser() function returns a CSVIdAndMoleculeReader which iterates over (id, molecule) pairs from columns in a CSV file. You probably want to use the “read_csv_ids_and_molecules()” function for a given toolkit wrapper rather than use this function directly.

The read_csv_ids_and_fingerprints() function returns a CSVFingerprintReader which extracts pre-computed fingerprints from the CSV file and iterates over them as (id, fingerprint) pairs.

exception chemfp.csv_readers.CSVConfigurationError

Bases: _csv.Error, TypeError

Exception raised due to a CSV configuration error

exception chemfp.csv_readers.CSVDialectError

Bases: _csv.Error

Exception raised when a requested named dialect is not known

exception chemfp.csv_readers.CSVUnicodeDecodeError(encoding, object, start, end, reason, bytes_read)

Bases: UnicodeDecodeError

A subclass of UnicodeDecodeError used to get the error position in the CSV file

The normal UnicodeDecodeError only reports the location relative to the text block. The CSVUnicodeDecodeError includes the total number of bytes read before raising the exception (as bytes_read). Use the following to compute the start and end locations relative to the entire file:

start_in_file = err.bytes_read - len(err.object) + err.start
end_in_file = err.bytes_read - len(err.object) + err.end

end_in_file

Return the end position relative to the start of the file

start_in_file

Return the start position relative to the start of the file
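The offset arithmetic above can be tried out with a plain UnicodeDecodeError. The following is only a sketch using the standard library: the file_offsets() helper and the block-by-block decoding loop are illustrative assumptions, not part of chemfp, and bytes_read is tracked by hand here rather than read from the exception.

```python
def file_offsets(err, bytes_read):
    """Convert block-relative error positions to file-relative ones.

    Hypothetical helper: 'bytes_read' is the number of bytes consumed up
    to and including the block which failed to decode.
    """
    block_start = bytes_read - len(err.object)
    return block_start + err.start, block_start + err.end


# Simulate a reader which decodes a byte stream in fixed-size blocks.
# (Simplification: a real reader must also handle multi-byte characters
# which are split across block boundaries.)
data = b"id,name\n" + b"1,caf\xe9\n"  # 0xE9 is Latin-1, not valid UTF-8 here
block_size = 8
bytes_read = 0
offsets = None
for i in range(0, len(data), block_size):
    block = data[i:i + block_size]
    bytes_read += len(block)
    try:
        block.decode("utf8")
    except UnicodeDecodeError as err:
        # err.start and err.end are relative to 'block', not to 'data'
        offsets = file_offsets(err, bytes_read)
        break

print(offsets)
```

With a CSVUnicodeDecodeError the same values should be available directly as err.start_in_file and err.end_in_file.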

chemfp.csv_readers.get_dialect_name(dialect)

Given a csv.Dialect or CSVDialect, try to figure out its name.

The default supported names are ‘csv’, ‘tsv’, as well as the dialects registered in csv.list_dialects(), of which ‘unix’ is the only relevant one.

class chemfp.csv_readers.CSVDialect(*, delimiter=',', quotechar='"', escapechar=None, doublequote=True, skipinitialspace=False, quoting=0, lineterminator='\r\n', strict=False)

Bases: csv.Dialect

This is a subclass of csv.Dialect

For details about how the configuration attributes work see https://docs.python.org/3/library/csv.html .

classmethod from_dialect(dialect)

Return a new CSVDialect given a dialect name or Dialect-like object.

Raise a CSVDialectError if a named dialect or alias is unknown.

A Dialect-like object must have all of the expected CSVDialect attributes (‘delimiter’, ‘quotechar’, and so on.)

get_dialect_name()

Return the dialect name, if known, otherwise return None

chemfp.csv_readers.get_dialect(dialect='csv', **kwargs)

Create a new CSVDialect given a base dialect and optional modifiers.

If a configuration value is not specified in the keyword arguments then use the corresponding attribute of the dialect.

The dialect may be a string with the dialect name, or a Dialect-like object with the required CSV attributes. If the named dialect is unknown then raise a CSVDialectError exception. Raise an AttributeError if the Dialect-like object does not contain a required attribute.

Raise a TypeError if an unsupported keyword argument is passed in.

Parameters:
  • dialect (a string or Dialect instance) – the default dialect properties for the returned dialect
  • kwargs (keyword arguments) – override the default configuration properties
Returns:

a CSVDialect
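The "base dialect plus keyword overrides" idea can be sketched with only Python's csv module. This is an illustration of the approach, not chemfp's implementation: derive_dialect() is a hypothetical helper, it uses the stdlib dialect names (such as "excel") rather than chemfp's "csv" and "tsv" aliases, and it returns a csv.Dialect subclass rather than a CSVDialect.

```python
import csv
import io

# The configuration attributes shared by all csv dialects
_FIELDS = ("delimiter", "quotechar", "escapechar", "doublequote",
           "skipinitialspace", "quoting", "lineterminator", "strict")


def derive_dialect(base="excel", **kwargs):
    """Hypothetical helper: copy 'base' and apply keyword overrides."""
    unknown = set(kwargs) - set(_FIELDS)
    if unknown:
        raise TypeError(f"unsupported keyword arguments: {sorted(unknown)}")
    if isinstance(base, str):
        base = csv.get_dialect(base)  # raises csv.Error if the name is unknown
    # Missing attributes on a Dialect-like object raise AttributeError here
    settings = {name: getattr(base, name) for name in _FIELDS}
    settings.update(kwargs)
    return type("DerivedDialect", (csv.Dialect,), settings)


# Use the "excel" settings, but with a semicolon delimiter
semicolon = derive_dialect("excel", delimiter=";")
rows = list(csv.reader(io.StringIO("id;smiles\nmethane;C\n"), dialect=semicolon))
print(rows)
```

With chemfp the equivalent call would presumably be get_dialect("csv", delimiter=";"), which returns a CSVDialect.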

class chemfp.csv_readers.CSVRowReader(dialect, has_header, titles, row_iter, close, location)

Bases: object

Information about the CSV row reader

  • dialect - a CSVDialect describing the CSV reader configuration
  • has_header - True if the CSV reader was configured to read header titles
  • titles - a list of title names, or None if there was no header
  • location - a chemfp.io.Location instance
  • closed - True if the reader has been closed

The CSVRowReader is also a context manager which closes the CSV file when done.

close()

Close the reader

If the reader wasn’t previously closed then close it.

chemfp.csv_readers.read_csv_rows(source=None, dialect=None, has_header=True, compression='auto', location=None, encoding='utf8', encoding_errors='strict')

Read rows from a CSV file

Read from source, which may be a filename, a file-like object, or None (the default) to read from stdin.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source
  • dialect (None, a string name, or a Dialect instance) – the CSV dialect
  • has_header (bool) – True if the first record contains titles, False if it does not
  • compression (string or None) – file compression format
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string) – the name of the file’s character encoding
  • encoding_errors (string) – the method used to handle decoding errors
Returns:

a CSVRowReader iterating rows
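The dialect inference and header handling described above can be sketched with the standard library. Both helper functions below are hypothetical and only illustrate the rules in the text. In particular, stripping a “.gz” or “.zst” suffix before looking at the extension, and returning None for an unrecognized extension, are assumptions about how such inference might work.

```python
import csv
import io
import os


def infer_dialect_name(filename):
    """Hypothetical helper: guess "csv" or "tsv" from the filename."""
    name = filename.lower()
    # Assumption: ignore a compression suffix before checking the extension
    for suffix in (".gz", ".zst"):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    ext = os.path.splitext(name)[1]
    return {".csv": "csv", ".tsv": "tsv"}.get(ext)


def open_rows(text_file, dialect_name, has_header=True):
    """Return (titles, row iterator); titles is None if there is no header."""
    delimiter = "\t" if dialect_name == "tsv" else ","
    reader = csv.reader(text_file, delimiter=delimiter)
    titles = next(reader, None) if has_header else None
    return titles, reader


dialect_name = infer_dialect_name("compounds.tsv.gz")
titles, row_iter = open_rows(io.StringIO("id\tsmiles\nmethane\tC\n"), dialect_name)
remaining_rows = list(row_iter)
print(dialect_name, titles, remaining_rows)
```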

exception chemfp.csv_readers.CSVColumnError(argname, titles, row, column, title, record_id)

Bases: IndexError, chemfp.ChemFPError

Base class for CSV column error exceptions

  • argname - the internal/parameter name for the missing column
  • titles - the list of column titles, or None if no titles
  • row - the list of row values if processing rows, or None
  • column - the missing column index, starting from 1
  • title - the missing column title, or None
  • record_id - the record id for this row, if applicable and available

exception chemfp.csv_readers.CSVColumnDecodeError(argname, titles, row, column, title, record_id, decoder_name, decode_err)

Bases: chemfp.csv_readers.CSVColumnError

Exception raised if the fingerprint column could not be decoded

exception chemfp.csv_readers.CSVColumnIndexError(argname, titles, column)

Bases: chemfp.csv_readers.CSVColumnError

Exception raised if the column specified by index is larger than the number of header titles

exception chemfp.csv_readers.CSVColumnTitleError(argname, titles, title)

Bases: chemfp.csv_readers.CSVColumnError

Exception raised if the column specified by name is not found in the header titles
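To show when the index and title errors apply, here is a minimal sketch of resolving a column specifier against the header titles. The resolve_column() function is hypothetical, and it raises a plain IndexError (the documented base class of CSVColumnError) rather than the chemfp exceptions.

```python
def resolve_column(column, titles):
    """Hypothetical helper: return the 0-based list index for a column.

    'column' is either a 1-based integer position or a title string.
    'titles' is the list of header titles, or None if there is no header.
    """
    if isinstance(column, int):
        # Compare CSVColumnIndexError: index beyond the number of titles
        if column < 1 or (titles is not None and column > len(titles)):
            raise IndexError(f"column {column} is out of range")
        return column - 1
    if titles is None:
        raise IndexError(f"no header titles, cannot find column {column!r}")
    try:
        return titles.index(column)
    except ValueError:
        # Compare CSVColumnTitleError: title not found in the header
        raise IndexError(f"no column titled {column!r}") from None


header = ["id", "smiles"]
by_position = resolve_column(2, header)
by_title = resolve_column("smiles", header)
print(by_position, by_title)
```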

class chemfp.csv_readers.CSVIdAndMoleculeReader(metadata, structure_reader, close, location, dialect, has_header, titles, id_column, mol_column)

Bases: chemfp.base_toolkit.IdAndMoleculeReader

Read structures from columns in a CSV file and iterate over the (id, toolkit molecule) pairs

The additional attributes beyond the IdAndMoleculeReader are:

  • dialect - a CSVDialect describing the CSV reader configuration
  • has_header - True if the CSV reader was configured to read header titles
  • titles - a list of title names, or None if there was no header
  • id_column - the column index for the id, or None if it comes from the molecule record
  • mol_column - the column index for the molecule record

Note: the id_column and mol_column values start with 1, so a value of 1 means the first column, 2 means the second, and so on.

The CSVIdAndMoleculeReader is also a context manager which closes the CSV file when done.

close()

Close the reader

If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set self.closed to True.

id_title

Get the id column title, or None if no titles or id column is None

mol_title

Get the molecule record column title, or None if there are no columns

chemfp.csv_readers.read_csv_ids_and_molecules_with_parser(source, parse_id_and_mol, *, id_column=1, mol_column=2, dialect=None, has_header=True, compression='auto', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict', record_format=None, record_args=None)

Read ids and molecules from column(s) of a CSV file using a molecule parser function.

Read from source, which may be a filename, a file-like object, or None to read from stdin.

The required parse_id_and_mol is a function which parses the molecule record (as a string) and returns the 2-element tuple containing the record id and molecule. The id is only used if id_column is None. If the molecule is None then the record will be skipped. If the parser raises a ParseError exception then the current location will be attached to the exception and re-raised. The toolkit “make_id_and_molecule_parser()” function returns an appropriate function.

Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

The csv_errors parameter describes how to handle CSV parsing failures. The default is to stop parsing if a CSV row does not contain enough columns.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

The record_format and record_args are used to set the “record_format” and “args” values of the returned reader’s FormatMetadata metadata object.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source
  • parse_id_and_mol (it must take a string and return an (id, molecule) pair) – the function used to parse a molecule record
  • id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
  • mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record
  • dialect (None, a string name, or a Dialect instance) – the CSV dialect
  • has_header (bool) – True if the first record contains titles, False if it does not
  • compression (string or None) – file compression format
  • csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string) – the name of the file’s character encoding
  • encoding_errors (string) – the method used to handle decoding errors
  • record_format (string) – the molecular structure format name
  • record_args (None or a dictionary of format reader or writer args) – the format reader or writer arguments
Returns:

a CSVIdAndMoleculeReader iterating (id, molecule) pairs
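The interaction between parse_id_and_mol, id_column, and mol_column can be sketched without a chemistry toolkit. Everything below is an assumption made for illustration: parse_record() is a toy parser which treats the record as "SMILES optional-name" and returns the SMILES string in place of a toolkit molecule, and iter_ids_and_molecules() only handles integer column positions and comma-separated input.

```python
import csv
import io


def parse_record(record):
    """Toy parser: return (id, molecule), or (None, None) for an empty record."""
    fields = record.split(None, 1)
    if not fields:
        return None, None
    smiles = fields[0]
    name = fields[1] if len(fields) > 1 else None
    return name, smiles  # the "molecule" is just the SMILES string here


def iter_ids_and_molecules(text_file, parse_id_and_mol,
                           id_column=1, mol_column=2, has_header=True):
    reader = csv.reader(text_file)
    if has_header:
        next(reader, None)  # skip the title row
    for row in reader:
        # Columns are numbered from 1, Python lists from 0
        record_id, mol = parse_id_and_mol(row[mol_column - 1])
        if mol is None:
            continue  # records without a molecule are skipped
        if id_column is not None:
            record_id = row[id_column - 1]  # the CSV id column takes priority
        yield record_id, mol


content = "id,smiles\nmethane,C\nempty,\nethane,CC ethane-title\n"
pairs = list(iter_ids_and_molecules(io.StringIO(content), parse_record))
pairs_from_record = list(iter_ids_and_molecules(
    io.StringIO(content), parse_record, id_column=None))
print(pairs)
print(pairs_from_record)
```

The second call shows the id_column=None case, where the id comes from the parsed record instead of a CSV column.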

class chemfp.csv_readers.CSVFingerprintReader(metadata, id_fp_iterator, location, close, dialect, has_header, titles, id_column, fp_column)

Bases: chemfp.FingerprintIterator

Read fingerprints from columns in a CSV file and iterate over the (id, fingerprint) pairs

The additional attributes beyond the FingerprintIterator are:

  • dialect - a CSVDialect describing the CSV reader configuration
  • has_header - True if the CSV reader was configured to read header titles
  • titles - a list of title names, or None if there was no header
  • id_column - the column index for the id
  • fp_column - the column index for the fingerprint record

The CSVFingerprintReader is also a context manager which closes the CSV file when done.

fp_title

Get the fingerprint column title, or None if no titles

id_title

Get the id column title, or None if no titles

chemfp.csv_readers.read_csv_ids_and_fingerprints(source, *, id_column=1, fp_column=2, decoder='hex', decoder_name=None, dialect=None, has_header=True, compression='auto', errors='strict', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Read ids and fingerprints from columns of a CSV file using a fingerprint decoder function.

Read from source, which may be a filename, a file-like object, or None to read from stdin.

Use id_column and fp_column to specify the columns containing the record identifier and fingerprint. By default the identifiers come from column 1 (the first column) and the fingerprints from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line.

Use decoder to describe how to decode the fingerprint. This is either a named decoder (see chemfp.encodings.get_decoder()) or a function which takes a string and returns a 2-element tuple of the number of bits and byte-string fingerprint. The number of bits may be None if the fingerprint size can be inferred from the fingerprint length. When decoder_name is not None, use it as the decoder name during error reporting, otherwise use decoder if it is a string.

Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension; *.csv for comma-separated, and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect at https://docs.python.org/3/library/csv.html (though “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or a CSVDialect instance.

If has_header is True then the first line/record contains column titles, and if False then there are no column titles.

Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.

The csv_errors parameter describes how to handle CSV parsing failures. The default is to stop parsing if a CSV row does not contain enough columns.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the CSV source
  • id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
  • fp_column (integer position (starting from 1), string) – the column position or column title containing the fingerprint
  • decoder (a string or a function) – the decoder name or function used to parse the fingerprint record
  • decoder_name (a string, or None) – a label for the decoder text to use during error reporting
  • dialect (None, a string name, or a Dialect instance) – the CSV dialect
  • has_header (bool) – True if the first record contains titles, False if it does not
  • compression (string or None) – file compression format
  • csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string) – the name of the file’s character encoding
  • encoding_errors (string) – the method used to handle decoding errors
Returns:

a CSVFingerprintReader iterating (id, fingerprint) pairs
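A custom decoder function, as described above, takes the text from the fingerprint column and returns a (number of bits, byte string) pair. The sketch below uses only the standard library. The hex_decoder() and iter_ids_and_fingerprints() functions are hypothetical stand-ins that show the expected shape of a decoder; they are not the chemfp implementation and do not handle titles, dialects, or error modes.

```python
import csv
import io


def hex_decoder(text):
    """Decode a hex string; None means "infer the size from the length"."""
    return None, bytes.fromhex(text)


def iter_ids_and_fingerprints(text_file, decoder=hex_decoder,
                              id_column=1, fp_column=2, has_header=True):
    reader = csv.reader(text_file)
    if has_header:
        next(reader, None)  # skip the title row
    for row in reader:
        # Columns are numbered from 1, Python lists from 0
        num_bits, fp = decoder(row[fp_column - 1])
        yield row[id_column - 1], fp


content = "id,fp\nA,00ff\nB,1234\n"
id_fp_pairs = list(iter_ids_and_fingerprints(io.StringIO(content)))
print(id_fp_pairs)
```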