chemfp.csv_readers module¶
This module contains CSV file readers and methods to work with CSV dialects.
The CSV readers are based on Python’s ‘csv’ module, described at
https://docs.python.org/3/library/csv.html . I found it hard to
configure a new dialect using that module, and prefer using the
get_dialect()
function defined here.
The read_csv_rows()
function returns a CSVRowReader
which iterates over columns in a row.
The read_csv_ids_and_molecules_with_parser()
function returns a
CSVIdAndMoleculeReader
which iterates over (id, molecule)
pairs from columns in a CSV file. You probably want to use the
“read_csv_ids_and_molecules()” function for a given toolkit
wrapper rather than use this function directly.
The read_csv_ids_and_fingerprints()
function returns a
CSVFingerprintReader
which extracts pre-computed fingerprints
from the CSV file and iterates over them as (id, fingerprint) pairs.
-
exception
chemfp.csv_readers.
CSVConfigurationError
¶ Bases:
_csv.Error
,TypeError
Exception raised due to a CSV configuration error
-
exception
chemfp.csv_readers.
CSVDialectError
¶ Bases:
_csv.Error
Exception raised when a requested named dialect is not known
-
exception
chemfp.csv_readers.
CSVUnicodeDecodeError
(encoding, object, start, end, reason, bytes_read)¶ Bases:
UnicodeDecodeError
A subclass of UnicodeDecodeError used to get the error position in the CSV file
The normal UnicodeDecodeError only reports the location relative to the text block. The CSVUnicodeDecodeError includes the total number of bytes read before raising the exception (as bytes_read). Use the following to compute the start and end locations relative to the entire file:
start_in_file = err.bytes_read - len(err.object) + err.start
end_in_file = err.bytes_read - len(err.object) + err.end
-
end_in_file
¶ Return the end position relative to the start of the file
-
start_in_file
¶ Return the start position relative to the start of the file
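The position arithmetic can be sketched with the standard library alone. This is a minimal illustration, not chemfp's implementation: locate_decode_error() is a hypothetical helper that decodes a binary stream one line at a time, keeps its own bytes_read counter, and applies the same formula to the plain UnicodeDecodeError raised by bytes.decode().

```python
import io


def locate_decode_error(stream, encoding="utf8"):
    """Return (start_in_file, end_in_file) for the first undecodable bytes, or None."""
    bytes_read = 0
    for block in stream:  # a binary stream yields one line (bytes) at a time
        bytes_read += len(block)
        try:
            block.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to err.object (the current block),
            # so shift them by the file offset where that block begins.
            block_start = bytes_read - len(err.object)
            return block_start + err.start, block_start + err.end
    return None


# The second line contains a Latin-1 byte (0xE9) that is not valid UTF-8.
data = b"id,name\n1,caf\xe9\n"
positions = locate_decode_error(io.BytesIO(data))
```

Slicing the original bytes with the returned positions recovers the offending byte, which is the practical use of the file-relative values.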
-
-
chemfp.csv_readers.
get_dialect_name
(dialect)¶ Given a csv.Dialect or CSVDialect, try to figure out its name.
The default supported names are ‘csv’, ‘tsv’, as well as the dialects registered in csv.list_dialects(), of which ‘unix’ is the only relevant one.
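For reference, the dialects registered with Python's csv module can be inspected directly. The snippet below uses only the standard library (no chemfp calls) and shows how ‘unix’ differs from the Excel-style default.

```python
import csv

# Names registered with Python's csv module; a stock interpreter
# provides 'excel', 'excel-tab' and 'unix'.
registered = csv.list_dialects()

excel = csv.get_dialect("excel")
unix = csv.get_dialect("unix")

# Collect the settings where 'unix' departs from 'excel'.
differences = {
    name: (getattr(excel, name), getattr(unix, name))
    for name in ("delimiter", "quoting", "lineterminator")
    if getattr(excel, name) != getattr(unix, name)
}
```

Both dialects use a comma delimiter; ‘unix’ quotes every field and ends records with a bare newline.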
-
class
chemfp.csv_readers.
CSVDialect
(*, delimiter=',', quotechar='"', escapechar=None, doublequote=True, skipinitialspace=False, quoting=0, lineterminator='\r\n', strict=False)¶ Bases:
csv.Dialect
This is a subclass of csv.Dialect
For details about how the configuration attributes work see https://docs.python.org/3/library/csv.html .
-
classmethod
from_dialect
(dialect)¶ Return a new CSVDialect given a dialect name or Dialect-like object.
Raise a CSVDialectError if a named dialect or alias is unknown.
A Dialect-like object must have all of the expected CSVDialect attributes (‘delimiter’, ‘quotechar’, and so on.)
-
get_dialect_name
()¶ Return the dialect name, if known, otherwise return None
-
-
chemfp.csv_readers.
get_dialect
(dialect='csv', **kwargs)¶ Create a new CSVDialect given a base dialect and optional modifiers.
If a configuration value is not specified in the keyword arguments then use the corresponding attribute of the dialect.
The dialect may be a string with the dialect name, or a Dialect-like object with the required CSV attributes. If the named dialect is unknown then raise a CSVDialectError exception. Raise an AttributeError if the Dialect-like object does not contain a required attribute.
Raise a TypeError if an unsupported keyword argument is passed in.
Parameters: - dialect (a string or Dialect instance) – the default dialect properties for the returned dialect
- kwargs (keyword arguments) – override the default configuration properties
Returns: a CSVDialect
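The “base dialect plus keyword overrides” idea can be sketched with Python's csv module. This is an illustration under stated assumptions, not chemfp's code: derive_dialect() and _FIELDS are hypothetical names, and the base is looked up with csv.get_dialect(), so only names registered with Python's csv module (such as “excel” or “unix”) are accepted here.

```python
import csv
import io

# The configuration attributes copied from the base dialect.
_FIELDS = ("delimiter", "quotechar", "escapechar", "doublequote",
           "skipinitialspace", "quoting", "lineterminator", "strict")


def derive_dialect(base="excel", **overrides):
    """Build a csv.Dialect from a base dialect, replacing selected attributes."""
    unknown = sorted(set(overrides) - set(_FIELDS))
    if unknown:
        raise TypeError(f"unsupported dialect option(s): {unknown}")
    # A string is treated as a registered name; anything else must be Dialect-like.
    base_dialect = csv.get_dialect(base) if isinstance(base, str) else base
    # getattr() raises AttributeError if a Dialect-like object lacks a field.
    settings = {name: getattr(base_dialect, name) for name in _FIELDS}
    settings.update(overrides)
    # Instantiating the subclass makes the csv module validate the settings.
    return type("DerivedDialect", (csv.Dialect,), settings)()


semicolon = derive_dialect("excel", delimiter=";")
rows = list(csv.reader(io.StringIO("id;smiles\r\nmol1;CCO\r\n"), dialect=semicolon))
```

Unspecified values fall back to the base dialect, so only the delimiter changes in this example.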
-
class
chemfp.csv_readers.
CSVRowReader
(dialect, has_header, titles, row_iter, close, location)¶ Bases:
object
Information about the CSV row reader
- dialect – a CSVDialect describing the CSV reader configuration
- has_header – True if the CSV reader was configured to read header titles
- titles – a list of title names, or None if there was no header
- location – a chemfp.io.Location instance
- closed – True if the reader has been closed
The CSVRowReader is also a context manager which closes the CSV file when done.
-
close
()¶ Close the reader
If the reader wasn’t previously closed then close it.
-
chemfp.csv_readers.
read_csv_rows
(source=None, dialect=None, has_header=True, compression='auto', location=None, encoding='utf8', encoding_errors='strict')¶ Read rows from a CSV file
Read from source, which may be a filename, a file-like object, or None (the default) to read from stdin.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension: *.csv for comma-separated and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect (see https://docs.python.org/3/library/csv.html ; note that “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
Parameters: - source (a filename, file object, or None to read from stdin) – the CSV source
- dialect (None, a string name, or a Dialect instance) – the CSV dialect
- has_header (bool) – True if the first record contains titles, False if it does not
- compression (string or None) – file compression format
- location (a chemfp.io.Location object, or None) – object used to track parser state information
- encoding (string) – the name of the file’s character encoding
- encoding_errors (string) – the method used to handle decoding errors
Returns: a
CSVRowReader
iterating rows
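The extension-based dialect inference and the has_header behavior can be sketched with the standard library. The helper names infer_delimiter() and read_rows() are hypothetical, and treating every non-*.tsv name as comma-separated is an assumption made for this sketch; it does not describe how chemfp handles other extensions.

```python
import csv
import io


def infer_delimiter(filename):
    """Pick a delimiter from the filename, ignoring a compression suffix."""
    for suffix in (".gz", ".zst"):
        if filename.endswith(suffix):
            filename = filename[: -len(suffix)]
            break
    # Assumption: anything that is not *.tsv is read as comma-separated.
    return "\t" if filename.lower().endswith(".tsv") else ","


def read_rows(text_file, delimiter=",", has_header=True):
    """Return (titles, rows); titles is None when there is no header record."""
    reader = csv.reader(text_file, delimiter=delimiter)
    titles = next(reader, None) if has_header else None
    return titles, list(reader)


content = "id\tsmiles\nmol1\tCCO\n"
delimiter = infer_delimiter("data.tsv.gz")
titles, rows = read_rows(io.StringIO(content), delimiter)
no_titles, all_rows = read_rows(io.StringIO(content), delimiter, has_header=False)
```

With has_header=False the first record is returned as ordinary data and there are no titles.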
-
exception
chemfp.csv_readers.
CSVColumnError
(argname, titles, row, column, title, record_id)¶ Bases:
IndexError
,chemfp.ChemFPError
Base class for CSV column error exceptions
- argname – the internal/parameter name for the missing column
- titles – the list of column titles, or None if no titles
- row – the list of row values if processing rows, or None
- column – the missing column index, starting from 1
- title – the missing column title, or None
- record_id – the record id for this row, if applicable and available
-
exception
chemfp.csv_readers.
CSVColumnDecodeError
(argname, titles, row, column, title, record_id, decoder_name, decode_err)¶ Bases:
chemfp.csv_readers.CSVColumnError
Exception raised if the fingerprint column could not be decoded
-
exception
chemfp.csv_readers.
CSVColumnIndexError
(argname, titles, column)¶ Bases:
chemfp.csv_readers.CSVColumnError
Exception raised if the column specified by index is larger than the number of header titles
-
exception
chemfp.csv_readers.
CSVColumnTitleError
(argname, titles, title)¶ Bases:
chemfp.csv_readers.CSVColumnError
Exception raised if the column specified by name is not found in the header titles
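The two lookup failures described above (a position beyond the header, and a title that is not in the header) can be illustrated with a small resolver. This is a sketch only: resolve_column() is a hypothetical helper, and it raises a plain IndexError where chemfp raises its own CSVColumnError subclasses.

```python
def resolve_column(column, titles):
    """Convert a 1-based position or a header title into a 0-based list index."""
    if isinstance(column, int):
        # Positions start at 1, so 1 means the first column.
        if column < 1 or (titles is not None and column > len(titles)):
            raise IndexError(f"column {column} is out of range")
        return column - 1
    if titles is None:
        raise IndexError(f"cannot look up {column!r}: there are no header titles")
    try:
        return titles.index(column)
    except ValueError:
        raise IndexError(f"no column titled {column!r}") from None


header = ["id", "smiles", "fp"]
by_position = resolve_column(2, header)
by_title = resolve_column("fp", header)
```

Resolving once against the header and then indexing each row with the 0-based result keeps the per-row work small.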
-
class
chemfp.csv_readers.
CSVIdAndMoleculeReader
(metadata, structure_reader, close, location, dialect, has_header, titles, id_column, mol_column)¶ Bases:
chemfp.base_toolkit.IdAndMoleculeReader
Read structures from columns in a CSV file and iterate over the (id, toolkit molecule) pairs
The additional attributes beyond the IdAndMoleculeReader are:
- dialect – a CSVDialect describing the CSV reader configuration
- has_header – True if the CSV reader was configured to read header titles
- titles – a list of title names, or None if there was no header
- id_column – the column index for the id, or None if it comes from the molecule record
- mol_column – the column index for the molecule record
Note: the id_column and mol_column values start with 1, so a value of 1 means the first column, 2 means the second, and so on.
The CSVIdAndMoleculeReader is also a context manager which closes the CSV file when done.
-
close
()¶ Close the reader
If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set self.closed to True.
-
id_title
¶ Get the id column title, or None if no titles or id column is None
-
mol_title
¶ Get the molecule record column title, or None if there are no columns
-
chemfp.csv_readers.
read_csv_ids_and_molecules_with_parser
(source, parse_id_and_mol, *, id_column=1, mol_column=2, dialect=None, has_header=True, compression='auto', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict', record_format=None, record_args=None)¶ Read ids and molecules from column(s) of a CSV file using a molecule parser function.
Read from source, which may be a filename, a file-like object, or None to read from stdin.
The required parse_id_and_mol is a function which parses the molecule record (as a string) and returns a 2-element tuple containing the record id and molecule. The id is only used if id_column is None. If the molecule is None then the record will be skipped. If the parser raises a ParseError exception then the current location will be attached to the exception, which is then re-raised. The toolkit “make_id_and_molecule_parser()” function returns an appropriate function.
Use id_column and mol_column to specify the columns containing the record identifier and molecule record. By default the identifiers come from column 1 (the first column) and the molecules from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line. If id_column is None then the molecule id will come from parsing the molecule record.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension: *.csv for comma-separated and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect (see https://docs.python.org/3/library/csv.html ; note that “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
The record_format and record_args are used to set the “record_format” and “args” values of the returned reader’s FormatMetadata metadata object.
Parameters: - source (a filename, file object, or None to read from stdin) – the CSV source
- parse_id_and_mol (it must take a string and return an (id, molecule) pair) – the function used to parse a molecule record
- id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
- mol_column (integer position (starting from 1), string) – the column position or column title containing the structure record
- dialect (None, a string name, or a Dialect instance) – the CSV dialect
- has_header (bool) – True if the first record contains titles, False if it does not
- compression (string or None) – file compression format
- csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
- location (a chemfp.io.Location object, or None) – object used to track parser state information
- encoding (string) – the name of the file’s character encoding
- encoding_errors (string) – the method used to handle decoding errors
- record_format (string) – the molecular structure format name
- record_args (None or a dictionary of format reader or writer args) – the molecular structure format arguments
Returns: a
CSVIdAndMoleculeReader
iterating (id, molecule) pairs
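The interaction between id_column and the parser's return value can be sketched as follows. Everything here is illustrative: iter_ids_and_molecules() and toy_parser() are hypothetical, the “molecule” is simply the SMILES string rather than a toolkit molecule, and no location tracking or error handling is attempted.

```python
import csv
import io


def toy_parser(record):
    """Hypothetical parser: 'SMILES [name]' -> (name or None, SMILES or None)."""
    fields = record.split()
    if not fields:
        return None, None  # a None molecule means "skip this record"
    name = fields[1] if len(fields) > 1 else None
    return name, fields[0]


def iter_ids_and_molecules(text_file, parse_id_and_mol, id_column=1, mol_column=2,
                           has_header=True):
    reader = csv.reader(text_file)
    if has_header:
        next(reader, None)  # discard the title record
    for row in reader:
        parsed_id, mol = parse_id_and_mol(row[mol_column - 1])
        if mol is None:
            continue
        # The parser's id is only used when no id column was requested.
        record_id = parsed_id if id_column is None else row[id_column - 1]
        yield record_id, mol


content = "id,smiles\nA1,CCO ethanol\nA2,\nA3,c1ccccc1 benzene\n"
from_column = list(iter_ids_and_molecules(io.StringIO(content), toy_parser))
from_record = list(iter_ids_and_molecules(io.StringIO(content), toy_parser,
                                          id_column=None))
```

The row with an empty molecule field is skipped in both cases because the parser returns None for the molecule.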
-
class
chemfp.csv_readers.
CSVFingerprintReader
(metadata, id_fp_iterator, location, close, dialect, has_header, titles, id_column, fp_column)¶ Bases:
chemfp.FingerprintIterator
Read fingerprints from columns in a CSV file and iterate over the (id, fingerprint) pairs
The additional attributes beyond the FingerprintIterator are:
- dialect – a CSVDialect describing the CSV reader configuration
- has_header – True if the CSV reader was configured to read header titles
- titles – a list of title names, or None if there was no header
- id_column – the column index for the id
- fp_column – the column index for the fingerprint record
The CSVFingerprintReader is also a context manager which closes the CSV file when done.
-
fp_title
¶ Get the fingerprint column title, or None if no titles
-
id_title
¶ Get the id column title, or None if no titles
-
chemfp.csv_readers.
read_csv_ids_and_fingerprints
(source, *, id_column=1, fp_column=2, decoder='hex', decoder_name=None, dialect=None, has_header=True, compression='auto', errors='strict', csv_errors='strict', location=None, encoding='utf8', encoding_errors='strict')¶ Read ids and fingerprints from columns of a CSV file using a fingerprint decoder function.
Read from source, which may be a filename, a file-like object, or None to read from stdin.
Use id_column and fp_column to specify the columns containing the record identifier and fingerprint. By default the identifiers come from column 1 (the first column) and the fingerprints from column 2 (the second column). Columns can be specified by integer position (starting with 1), or by a string matching the title from the header line.
Use decoder to describe how to decode the fingerprint. This is either a named decoder (see chemfp.encodings.get_decoder()) or a function which takes a string and returns a 2-element tuple of the number of bits and byte-string fingerprint. The number of bits may be None if the fingerprint size can be inferred from the fingerprint length. When decoder_name is not None, use it as the decoder name during error reporting, otherwise use decoder if it is a string.
Use dialect to specify the type of CSV file. The default of None infers the dialect from the filename extension: *.csv for comma-separated and *.tsv for tab-separated. The dialect can be specified directly as “csv” or “tsv”, as a registered Python csv dialect (see https://docs.python.org/3/library/csv.html ; note that “excel” is the same as “csv” and “excel-tab” is the same as “tsv”), or as a csv.Dialect or CSVDialect instance.
If has_header is True then the first line/record contains column titles, and if False then there are no column titles.
Use compression to specify the file compression format. The default “auto” uses the filename extension. Other options are “gz” and “zst”, or the empty string “” to mean no compression.
The csv_errors parameter describes how to handle failures in CSV parsing. The default is to stop parsing if a CSV row does not contain enough columns.
The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.
The encoding and encoding_errors are strings describing the input file character encoding, and how to handle decoding errors. See https://docs.python.org/3/library/codecs.html#standard-encodings and https://docs.python.org/3/library/codecs.html#error-handlers for details.
Parameters: - source (a filename, file object, or None to read from stdin) – the CSV source
- id_column (integer position (starting from 1), string, or None) – the column position or column title containing the identifier
- fp_column (integer position (starting from 1), string) – the column position or column title containing the fingerprint
- decoder (a string or a function) – the decoder name or function used to parse the fingerprint record
- decoder_name (a string, or None) – a label for the decoder text to use during error reporting
- dialect (None, a string name, or a Dialect instance) – the CSV dialect
- has_header (bool) – True if the first record contains titles, False if it does not
- compression (string or None) – file compression format
- csv_errors (one of "strict", "report", or "ignore") – specify how to handle CSV errors
- location (a chemfp.io.Location object, or None) – object used to track parser state information
- encoding (string) – the name of the file’s character encoding
- encoding_errors (string) – the method used to handle decoding errors
Returns: a
CSVFingerprintReader
iterating (id, fingerprint) pairs
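The decoder contract described above (a function that takes a string and returns the number of bits and the fingerprint bytes) can be illustrated with a standard-library sketch. The names hex_decoder() and iter_ids_and_fingerprints() are hypothetical stand-ins, not chemfp's decoder or reader, and the sketch does no error reporting.

```python
import csv
import io


def hex_decoder(text):
    """Decode a hex string; return (number of bits, fingerprint bytes).

    The bit count is None because it can be inferred from the byte length.
    """
    return None, bytes.fromhex(text)


def iter_ids_and_fingerprints(text_file, decoder, id_column=1, fp_column=2,
                              delimiter=",", has_header=True):
    reader = csv.reader(text_file, delimiter=delimiter)
    if has_header:
        next(reader, None)  # discard the title record
    for row in reader:
        # Column numbers start at 1, so subtract 1 to index the row list.
        _num_bits, fingerprint = decoder(row[fp_column - 1])
        yield row[id_column - 1], fingerprint


sample = io.StringIO("id\tfp\nmol1\t0f00\nmol2\tff01\n")
pairs = list(iter_ids_and_fingerprints(sample, hex_decoder, delimiter="\t"))
```

Passing the decoder as a function keeps the row handling independent of the fingerprint encoding, which is the design the decoder parameter allows.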