chemfp API

This chapter contains the docstrings for the public portion of the chemfp API.

chemfp top-level module

The following functions and classes are in the top-level chemfp module.

chemfp.open(source, format=None, location=None)

Read fingerprints from a fingerprint file

Read fingerprints from source, using the given format. If source is a string then it is treated as a filename. If source is None then fingerprints are read from stdin. Otherwise, source must be a Python file object supporting the read and readline methods.

If format is None then the fingerprint file format and compression type are derived from the source filename, or from the name attribute of the source file object. If the source is None then stdin is assumed to contain uncompressed data in “fps” format.

The supported format strings are:

  • “fps”, “fps.gz” for fingerprints in FPS format
  • “fpb” for fingerprints in FPB format
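
As a rough illustration of how a format string could be derived from a filename when format is None, here is a minimal sketch. The guess_format helper and its fallback to "fps" for unrecognized names are assumptions made for this example, not chemfp's actual implementation.

```python
def guess_format(filename):
    """Guess a format string ("fps", "fps.gz", or "fpb") from a filename.

    Hypothetical helper for illustration; it is not part of chemfp.
    """
    name = filename.lower()
    # Check the longer extension first so "x.fps.gz" is not seen as ".gz" only.
    for extension in ("fps.gz", "fps", "fpb"):
        if name.endswith("." + extension):
            return extension
    # Assumption: treat anything unrecognized as uncompressed FPS.
    return "fps"


print(guess_format("example.fps.gz"))  # fps.gz
print(guess_format("targets.fpb"))     # fpb
```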

The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.

If the source is in FPS format then open will return a chemfp.fps_io.FPSReader, which will use the location if specified.

If the source is in FPB format then open will return a chemfp.arena.FingerprintArena and the location will not be used.

Here’s an example of printing the contents of the file:

import chemfp
from chemfp.bitops import hex_encode
reader = chemfp.open("example.fps.gz")
for id, fp in reader:
    print(id, hex_encode(fp))
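
Parameters:
  • source (a filename string, a file object, or None) – The fingerprint source.
  • format (string, or None) – The file format and optional compression.
  • location (a chemfp.io.Location instance, or None) – An optional location, only used for FPS input.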
Returns:

a chemfp.fps_io.FPSReader or chemfp.arena.FingerprintArena

chemfp.load_fingerprints(reader, metadata=None, reorder=True, alignment=None, format=None)

Load all of the fingerprints into an in-memory FingerprintArena data structure

The function reads all of the fingerprints and identifiers from reader and stores them in an in-memory chemfp.arena.FingerprintArena data structure, which supports fast similarity searches.

If reader is a string or has a read attribute then it will be passed to the chemfp.open() function and the result used as the reader. If that returns a FingerprintArena then the reorder and alignment parameters are ignored and the arena returned.

If reader is a FingerprintArena then the reorder and alignment parameters are ignored. If metadata is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is metadata.

Otherwise the reader or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.

metadata specifies the metadata for all returned arenas. If not given the default comes from the source file or from reader.metadata.

The loader may reorder the fingerprints for better search performance. To prevent ordering, use reorder=False. The reorder parameter is ignored if the reader is an arena or FPB file.

The alignment option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8 byte boundary and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.
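
The padding arithmetic can be sketched as rounding the fingerprint size up to the next multiple of the alignment. The padded_size function below is a hypothetical helper written for this illustration; it is not a chemfp function.

```python
def padded_size(num_bytes, alignment):
    """Return num_bytes rounded up to a multiple of alignment."""
    if alignment < 1:
        raise ValueError("alignment must be a positive integer")
    return (num_bytes + alignment - 1) // alignment * alignment


# A 166-bit fingerprint needs 21 bytes, which pads to 24 bytes at alignment 8.
print(padded_size(21, 8))   # 24
# A 1024-bit fingerprint is 128 bytes, already a multiple of 8.
print(padded_size(128, 8))  # 128
```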

Parameters:
  • reader (a string, file object, or (id, fingerprint) iterator) – An iterator over (id, fingerprint) pairs
  • metadata (Metadata) – The metadata for the arena, if other than reader.metadata
  • reorder (True or False) – Specify if fingerprints should be reordered for better performance
  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
  • format (None, "fps", "fps.gz", or "fpb") – The file format name if the reader is a string
Returns:

chemfp.arena.FingerprintArena

chemfp.read_molecule_fingerprints(type, source=None, format=None, id_tag=None, reader_args=None, errors="strict")

Read structures from source and return the corresponding ids and fingerprints

This returns a chemfp.FingerprintReader which can be iterated over to get the id and fingerprint for each read structure record. The fingerprint generated depends on the value of type. Structures are read from source, which can either be the structure filename, or None to read from stdin.

type contains the information about how to turn a structure into a fingerprint. It can be a string or a metadata instance. String values look like OpenBabel-FP2/1, OpenEye-Path, and OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond. Default values are used for unspecified parameters. Use a Metadata instance with type and aromaticity values set in order to pass aromaticity information to OpenEye.

If format is None then the structure file format and compression are determined by the filename’s extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise format may be “smi” or “sdf” optionally followed by ”.gz” or ”.bz2” to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.

If id_tag is None, then the record id is based on the title field for the given format. If the input format is “sdf” then id_tag specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the “> <ChEBI ID>” line. In that case, use id_tag = "ChEBI ID".

The reader_args is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.

errors specifies how to handle errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.
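
The three error-handling modes can be pictured with a small record parser. This is only a sketch of the policy described above: parse_records and its two-field line format are hypothetical and stand in for the real structure readers.

```python
import sys


def parse_records(lines, errors="strict"):
    """Yield (id, value) pairs from "id value" lines, applying an errors policy.

    Hypothetical parser for illustration only.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if len(fields) != 2:
            msg = "cannot parse line %d: %r" % (lineno, line)
            if errors == "strict":
                raise ValueError(msg)
            if errors == "report":
                sys.stderr.write(msg + "\n")
            continue  # "report" and "ignore" both skip to the next record
        yield fields[0], fields[1]


lines = ["id1 CCO", "broken", "id2 c1ccccc1"]
print(list(parse_records(lines, errors="ignore")))
```

With errors="strict" the same input raises a ValueError when the second line is reached.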

Here is an example of using fingerprints generated from a structure file:

import chemfp
from chemfp.bitops import hex_encode
fp_reader = chemfp.read_molecule_fingerprints("OpenBabel-FP4/1", "example.sdf.gz")
print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
for (id, fp) in fp_reader:
    print(id, hex_encode(fp))

See also chemfp.read_molecule_fingerprints_from_string().

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint
  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
  • format (string, or None to autodetect based on the source) – The file format and optional compression. Examples: “smi” and “sdf.gz”
  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
  • errors (one of "strict", "report", or "ignore") – specify how to handle parse errors
Returns:

a chemfp.FingerprintReader

chemfp.read_molecule_fingerprints_from_string(type, content, format, id_tag=None, reader_args=None, errors="strict")

Read structures from the content string and return the corresponding ids and fingerprints

The parameters are identical to chemfp.read_molecule_fingerprints() except that the entire content is passed through as a content string, rather than as a source filename. See that function for details.

You must specify the format! As there is no source filename, it’s not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint
  • content (string) – The structure data as a string.
  • format (string) – The file format and optional compression. Examples: “smi” and “sdf.gz”
  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
  • errors (one of "strict" (raise exception), "report" (send a message to stderr and continue processing), or "ignore" (continue processing)) – specify how to handle parse errors
Returns:

a chemfp.FingerprintReader

chemfp.open_fingerprint_writer(destination, metadata=None, format=None, alignment=8, reorder=True, tmpdir=None, max_spool_size=None, errors="strict", location=None)

Create a fingerprint writer for the given destination

The fingerprint writer is an object with methods to write fingerprints to the given destination. The output format is specified by format. If that’s None then the format is derived from the destination, falling back to “fps” if format detection fails.

The metadata, if given, is a Metadata instance, and used to fill the header of an FPS file or META block of an FPB file.

If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None for stdout. If the output format is “fpb” then destination must be a filename.

Some options only apply to FPB output. The alignment specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.
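
One way to see why ordering by popcount helps: for fingerprints with a and b bits set, the Tanimoto score cannot exceed min(a, b) / max(a, b), so targets whose popcount is too far from the query's can be skipped without comparing any bits. The sketch below applies that bound. The function names are hypothetical, and this shows the general bound rather than chemfp's actual search code.

```python
def max_tanimoto(a, b):
    """Upper bound on the Tanimoto score for popcounts a and b."""
    if a == 0 and b == 0:
        return 0.0  # assumption for this sketch: empty fingerprints score 0
    return min(a, b) / max(a, b)


def candidate_popcounts(query_popcount, popcounts, threshold):
    """Keep only the popcounts which could still reach the threshold."""
    return [p for p in popcounts if max_tanimoto(query_popcount, p) >= threshold]


# With 100 bits set in the query and a 0.8 threshold, targets with
# 40 or 200 bits set can be ruled out from their popcounts alone.
print(candidate_popcounts(100, [40, 80, 100, 125, 200], 0.8))
```

When the targets are sorted by popcount, the surviving candidates form one contiguous range, which is what makes it possible to skip the rest.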

The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn’t enough available free memory. In that case, set max_spool_size to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)

Use tmpdir to specify where to write the temporary spool files if you don’t want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.

Some options only apply to FPS output. errors specifies how to handle recoverable write errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.

The location is a Location instance. It lets the caller access state information such as the number of records that have been written.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • metadata (a Metadata instance, or None) – the fingerprint metadata
  • format (None, "fps", "fps.gz", or "fpb") – the output format
  • alignment (positive integer) – arena byte alignment for FPB files
  • reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
  • tmpdir (string or None) – the directory to use for temporary files, when max_spool_size is specified
  • max_spool_size (integer, or None) – number of bytes to store in memory before using a temporary file. If None, use memory for everything.
  • location (a Location instance, or None) – a location object used to access output state information
Returns:

a chemfp.FingerprintWriter

ChemFPError

class chemfp.ChemFPError

Base class for all of the chemfp exceptions

ParseError

class chemfp.ParseError

Exception raised by the molecule and fingerprint parsers and writers

The public attributes are:

msg

a string describing the exception

location

a chemfp.io.Location instance, or None

Metadata

class chemfp.Metadata

Store information about a set of fingerprints

The public attributes are:

num_bits

the number of bits in the fingerprint

num_bytes

the number of bytes in the fingerprint

type

the fingerprint type string

aromaticity

aromaticity model (only used with OEChem, and now deprecated)

software

software used to make the fingerprints

sources

list of sources used to make the fingerprint

date

a datetime timestamp of when the fingerprints were made

__repr__()

Return a string like Metadata(num_bits=1024, num_bytes=128, type='OpenBabel/FP2', ....)

__str__()

Show the metadata in FPS header format

copy(num_bits=None, num_bytes=None, type=None, aromaticity=None, software=None, sources=None, date=None)

Return a new Metadata instance based on the current attributes and optional new values

When called with no parameter, make a new Metadata instance with the same attributes as the current instance.

If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you created it.
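
The "None means keep the current value" rule can be sketched with a small dataclass. ExampleMetadata is a hypothetical stand-in with only a few of the attributes; it is not the chemfp.Metadata class.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ExampleMetadata:
    """Hypothetical stand-in with a few of the Metadata attributes."""
    num_bits: Optional[int] = None
    type: Optional[str] = None
    software: Optional[str] = None

    def copy(self, num_bits=None, type=None, software=None):
        # Only parameters which are not None override the current values.
        changes = {"num_bits": num_bits, "type": type, "software": software}
        return replace(self, **{k: v for k, v in changes.items() if v is not None})


original = ExampleMetadata(num_bits=1024, type="Example/1", software="demo/1.0")
updated = original.copy(software="demo/2.0")
print(updated.num_bits, updated.software)  # 1024 demo/2.0

# Passing None keeps the old value, so clear a field on the copy afterwards.
cleared = original.copy()
cleared.software = None
```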

Parameters:
  • num_bits (an integer, or None) – the number of bits in the fingerprint
  • num_bytes (an integer, or None) – the number of bytes in the fingerprint
  • type (string or None) – the fingerprint type description
  • aromaticity (None) – obsolete
  • software (string or None) – a description of the software
  • sources (list of strings, a string (interpreted as a list with one string), or None) – source filenames
  • date (a datetime instance, or None) – creation or processing date for the contents
Returns:

a new Metadata instance

FingerprintReader

class chemfp.FingerprintReader

Base class for all chemfp objects holding fingerprint records

All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.

__iter__()

iterate over the (id, fingerprint) pairs

iter_arenas(arena_size=1000)

iterate through arena_size fingerprints at a time, as subarenas

Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.

This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.

If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.
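
The batching behavior can be sketched with plain lists in place of arenas. The iter_batches helper is hypothetical and only mirrors the description above: fixed-size batches in input order, or a single batch when the size is None.

```python
from itertools import islice


def iter_batches(id_fp_pairs, batch_size=1000):
    """Yield lists of up to batch_size (id, fingerprint) pairs, in input order.

    Hypothetical helper; with batch_size=None everything goes into one batch.
    """
    iterator = iter(id_fp_pairs)
    if batch_size is None:
        yield list(iterator)
        return
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


pairs = [("id%d" % i, bytes([i])) for i in range(5)]
print([len(batch) for batch in iter_batches(pairs, batch_size=2)])  # [2, 2, 1]
```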

Parameters:
  • arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
Returns:

an iterator of chemfp.arena.FingerprintArena instances

save(destination, format=None)

Save the fingerprints to a given destination and format

The output format is specified by format. If format is None then the output format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • format (None, "fps", "fps.gz", or "fpb") – the output format
Returns:

None

get_fingerprint_type()

Get the fingerprint type object based on the metadata’s type field

This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.

This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.

Returns:

a chemfp.types.FingerprintType

FingerprintIterator

class chemfp.FingerprintIterator

A chemfp.FingerprintReader for an iterator of (id, fingerprint) pairs

This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.

A FingerprintIterator is a context manager which will close the underlying iterator if it’s given a close handler.

Like all iterators you can use next() to get the next (id, fingerprint) pair.

__init__(metadata, id_fp_iterator, location=None, close=None)

Initialize with a Metadata instance and the (id, fingerprint) iterator

The metadata is a Metadata instance. The id_fp_iterator is an iterator which returns (id, fingerprint) pairs.

The optional location is a chemfp.io.Location. The optional close callable is called (as close()) whenever self.close() is called and when the context manager exits.

__iter__()

Iterate over the (id, fingerprint) pairs

close()

Close the iterator

The call will be forwarded to the close callable passed to the constructor. If that close is None then this does nothing.
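
The combination of iterator, optional close handler, and context manager can be sketched as follows. ClosingIterator is a hypothetical stand-in written for this illustration, not the chemfp class, and it omits the location support.

```python
class ClosingIterator:
    """Hypothetical stand-in showing the iterator-plus-close-handler pattern."""

    def __init__(self, metadata, id_fp_iterator, close=None):
        self.metadata = metadata
        self._iterator = iter(id_fp_iterator)
        self._close = close

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._iterator)

    def close(self):
        # Forward to the close handler, if one was given.
        if self._close is not None:
            self._close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


events = []
pairs = [("id1", b"\x01")]
with ClosingIterator(None, pairs, close=lambda: events.append("closed")) as reader:
    print(next(reader))
print(events)
```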

Fingerprints

class chemfp.Fingerprints

A chemfp.FingerprintReader containing a metadata and a list of (id, fingerprint) pairs.

This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.

This implements a simple list-like collection of fingerprints. It supports:
  • for (id, fingerprint) in fingerprints: ...
  • id, fingerprint = fingerprints[1]
  • len(fingerprints)

More features, like slicing, will be added as needed or when requested.

__init__(metadata, id_fp_pairs)

Initialize with a Metadata instance and the (id, fingerprint) pair list

The metadata is a Metadata instance. The id_fp_pairs is a list of (id, fingerprint) pairs.

FingerprintWriter

class chemfp.FingerprintWriter

Base class for the fingerprint writers

The three fingerprint writer classes are:

Use chemfp.open_fingerprint_writer() to create a fingerprint writer; do not instantiate the writer classes directly.

All classes have the following attributes:

  • metadata - a chemfp.Metadata instance
  • closed - False when the file is open, else True

Fingerprint writers are also their own context manager, and close the writer on context exit.

write_fingerprint(id, fp)

Write a single fingerprint record with the given id and fp to the destination

Parameters:
  • id (string) – the record identifier
  • fp (byte string) – the fingerprint

write_fingerprints(id_fp_pairs)

Write a sequence of (id, fingerprint) pairs to the destination

Parameters:
  • id_fp_pairs – An iterable of (id, fingerprint) pairs. id is a string and fingerprint is a byte string.

close()

Close the writer

This will set self.closed to True.

ChemFPProblem

class chemfp.ChemFPProblem

Information about a compatibility problem between a query and target.

Instances are generated by chemfp.check_fingerprint_problems() and chemfp.check_metadata_problems().

The public attributes are:

severity

one of “info”, “warning”, or “error”

error_level

5 for “info”, 10 for “warning”, and 20 for “error”

category

a string used as a category name. This string will not change over time.

description

a more detailed description of the error, including details of the mismatch. The description depends on query_name and target_name and may change over time.

The current category names are:
  • “num_bits mismatch” (error)
  • “num_bytes mismatch” (error)
  • “type mismatch” (warning)
  • “aromaticity mismatch” (info)
  • “software mismatch” (info)

chemfp.check_fingerprint_problems(query_fp, target_metadata, query_name="query", target_name="target")

Return a list of compatibility problems between a fingerprint and a metadata

If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the query_fp byte string and the target_metadata then it will return a list containing a ChemFPProblem instance, with a severity level “error” and category “num_bytes mismatch”.

This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like:

>>> import chemfp
>>> problems = chemfp.check_fingerprint_problems(b"A"*64, chemfp.Metadata(num_bytes=128))
>>> problems[0].description
'query contains 64 bytes but target has 128 byte fingerprints'

You can change the error message with the query_name and target_name parameters:

>>> problems = chemfp.check_fingerprint_problems(b"z"*64, chemfp.Metadata(num_bytes=128),
...      query_name="input", target_name="database")
>>> problems[0].description
'input contains 64 bytes but database has 128 byte fingerprints'

Parameters:
  • query_fp (byte string) – a fingerprint (usually the query fingerprint)
  • target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
  • query_name (string) – the text used to describe the fingerprint, in case of problem
  • target_name (string) – the text used to describe the metadata, in case of problem
Returns:

a list of ChemFPProblem instances

chemfp.check_metadata_problems(query_metadata, target_metadata, query_name="query", target_name="target")

Return a list of compatibility problems between two metadata instances.

If there are no problems then this returns an empty list. Otherwise it returns a list of ChemFPProblem instances, with a severity level ranging from “info” to “error”.

Bit length and byte length mismatches produce an “error”. Fingerprint type and aromaticity mismatches produce a “warning”. Software version mismatches produce an “info”.
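
A caller might use the numeric error_level to decide whether to stop or merely warn. The sketch below uses a namedtuple as a hypothetical stand-in with the same attribute names as ChemFPProblem; the descriptions and the has_errors helper are invented for the example.

```python
from collections import namedtuple

# Hypothetical stand-in with the same public attribute names as ChemFPProblem.
Problem = namedtuple("Problem", "severity error_level category description")


def has_errors(problems, min_level=20):
    """Return True if any problem is at or above min_level (20 is "error")."""
    return any(problem.error_level >= min_level for problem in problems)


problems = [
    Problem("info", 5, "software mismatch", "different software versions"),
    Problem("error", 20, "num_bytes mismatch", "query and target sizes differ"),
]

# Report the most severe problems first.
for problem in sorted(problems, key=lambda p: p.error_level, reverse=True):
    print(problem.severity, "-", problem.category)
print(has_errors(problems))
```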

This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like:

>>> import chemfp
>>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
>>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
>>> problems = chemfp.check_metadata_problems(m1, m2)
>>> len(problems)
2
>>> print(problems[1].description)
query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'

You can change the error message with the query_name and target_name parameters:

>>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
>>> print(problems[1].description)
input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'

Parameters:
  • query_metadata (Metadata instance) – the metadata to check (usually the query metadata)
  • target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
  • query_name (string) – the text used to describe the query metadata, in case of problem
  • target_name (string) – the text used to describe the target metadata, in case of problem
Returns:

a list of ChemFPProblem instances

chemfp.count_tanimoto_hits(queries, targets, threshold=0.7, arena_size=100)

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
  print(query_id, "has", count, "neighbors with at least 0.9 similarity")
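
To make the counting concrete without needing any fingerprint files, here is a pure-Python sketch of the same idea on byte-string fingerprints. The tanimoto and count_hits functions are written for this illustration and are far slower than chemfp's search code; scoring two empty fingerprints as 0 is an assumption of the sketch.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption for this sketch: two empty fingerprints score 0
    return bin(a & b).count("1") / union


def count_hits(queries, targets, threshold=0.7):
    """Yield (query_id, count) for targets scoring at least threshold."""
    for query_id, query_fp in queries:
        count = sum(1 for _, fp in targets if tanimoto(query_fp, fp) >= threshold)
        yield query_id, count


targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\xf0")]
print(list(count_hits([("q1", b"\x0f")], targets, threshold=0.7)))
```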

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.count_tanimoto_hits_fp() or chemfp.search.count_tanimoto_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator of (query_id, count) pairs, one for each query

chemfp.count_tanimoto_hits_symmetric(fingerprints, threshold=0.7)

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
    print(fp_id, "has", count, "neighbors with at least 0.6 similarity")
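
The symmetric case can be sketched in pure Python by scoring each unordered pair once and crediting both members. The functions below are written for this illustration, assume unique ids, and do not use the popcount ordering that an arena provides.

```python
from itertools import combinations


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption for this sketch: two empty fingerprints score 0
    return bin(a & b).count("1") / union


def count_hits_symmetric(fingerprints, threshold=0.7):
    """Return {id: count} of other fingerprints scoring at least threshold."""
    counts = {fp_id: 0 for fp_id, _ in fingerprints}
    # Each unordered pair is scored once and credited to both members;
    # a fingerprint is never compared with itself.
    for (id1, fp1), (id2, fp2) in combinations(fingerprints, 2):
        if tanimoto(fp1, fp2) >= threshold:
            counts[id1] += 1
            counts[id2] += 1
    return counts


fingerprints = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0")]
print(count_hits_symmetric(fingerprints, threshold=0.6))
```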

You may also be interested in chemfp.search.count_tanimoto_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint

chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7, arena_size=100)

Find all targets within threshold of each query term

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.8):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tanimoto_search_fp() or chemfp.search.threshold_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

chemfp.threshold_tanimoto_search_symmetric(fingerprints, threshold=0.7)

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75):
    print(fp_id, "has", len(hits), "neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print("   %s  %.2f" % (other_id, score))

You may also be interested in the chemfp.search.threshold_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.7, arena_size=100)

Find the k-nearest targets within threshold of each query term

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.

Example:

# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.8):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
    if hits:
        target_id, score = hits[-1]
        print("    The least similar is", target_id, "with score", score)
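
The selection rule, keep only hits at or above the threshold and then take the k highest, can be sketched in pure Python with heapq.nlargest. The knearest function here is a hypothetical single-query helper written for this illustration, not chemfp's implementation.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumption for this sketch: two empty fingerprints score 0
    return bin(a & b).count("1") / union


def knearest(query_fp, targets, k=3, threshold=0.7):
    """Return up to k (target_id, score) pairs, highest score first."""
    scored = ((target_id, tanimoto(query_fp, fp)) for target_id, fp in targets)
    hits = [(target_id, score) for target_id, score in scored if score >= threshold]
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\x03"), ("t4", b"\xf0")]
print(knearest(b"\x0f", targets, k=2, threshold=0.5))
```

Three targets reach the 0.5 threshold here, but only the two best are returned because k=2.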

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tanimoto_search_fp() or chemfp.search.knearest_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.

chemfp.knearest_tanimoto_search_symmetric(fingerprints, k=3, threshold=0.7)

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5):
    print(fp_id, "has", len(hits), "neighbors, with scores", end=" ")
    print(", ".join("%.2f" % x for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.count_tversky_hits(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tversky_hits(
          queries, targets, threshold=0.9, alpha=0.5, beta=0.5):
  print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity")
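
The example above describes alpha=0.5, beta=0.5 as Dice similarity. The pure-Python sketch below is not chemfp's implementation. It assumes the usual definition of the Tversky index, c / (c + alpha*a + beta*b), where c is the number of bits in common, a the number of bits set only in the query, and b the number set only in the target. The helper names popcount and tversky are made up for this illustration.

```python
def popcount(data):
    """Count the on-bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte-string fingerprints (illustrative only)."""
    if len(query_fp) != len(target_fp):
        raise ValueError("fingerprints must have the same length")
    common = popcount(bytes(q & t for q, t in zip(query_fp, target_fp)))
    query_only = popcount(query_fp) - common
    target_only = popcount(target_fp) - common
    denominator = common + alpha * query_only + beta * target_only
    if denominator == 0:
        return 0.0  # convention chosen for this sketch when no bits are set
    return common / denominator


query = bytes([0b11110000])
target = bytes([0b00111100])
tanimoto = tversky(query, target, alpha=1.0, beta=1.0)  # alpha=beta=1.0 is Tanimoto
dice = tversky(query, target, alpha=0.5, beta=0.5)      # alpha=beta=0.5 is Dice
```

With two bits in common and two bits unique to each side, the Tanimoto score is 2/6 and the Dice score is 2/4.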

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
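
As a rough illustration of batching by arena_size, here is a sketch which groups any iterable into lists of at most arena_size items. The helper iter_batches is hypothetical. chemfp builds query arenas rather than plain lists, so this shows only the grouping behavior, including the arena_size=None case.

```python
from itertools import islice


def iter_batches(items, arena_size=100):
    """Yield lists of at most arena_size items; None means one single batch."""
    iterator = iter(items)
    if arena_size is None:
        batch = list(iterator)
        if batch:
            yield batch
        return
    if arena_size < 1:
        raise ValueError("arena_size must be a positive integer or None")
    while True:
        batch = list(islice(iterator, arena_size))
        if not batch:
            return
        yield batch


batch_sizes = [len(batch) for batch in iter_batches(range(250), arena_size=100)]
single = [len(batch) for batch in iter_batches(range(250), arena_size=None)]
```

Each batch must be complete before it can be searched, which is why a smaller batch size gives results sooner while a larger one reduces the per-batch overhead.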

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. With a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.count_tversky_hits_fp() or chemfp.search.count_tversky_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator of the (query_id, count) pairs, one for each query

chemfp.count_tversky_hits_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.
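
The "never matches itself" rule can be pictured with a small brute-force sketch. This is not how chemfp searches an arena. It uses plain Tanimoto scoring (the alpha=1.0, beta=1.0 defaults), and the function names are hypothetical. A fingerprint is skipped by position, not by value, so two different records with identical fingerprints still count each other.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    common = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return common / union if union else 0.0


def count_hits_symmetric(fingerprints, threshold=0.7):
    """fingerprints is a list of (id, byte string) pairs; returns (id, count) pairs."""
    counts = []
    for i, (fp_id, fp) in enumerate(fingerprints):
        count = sum(
            1
            for j, (_, other_fp) in enumerate(fingerprints)
            if i != j and tanimoto(fp, other_fp) >= threshold
        )
        counts.append((fp_id, count))
    return counts


fps = [("a", b"\x0f"), ("b", b"\x0f"), ("c", b"\x07"), ("d", b"\xf0")]
counts = count_hits_symmetric(fps, threshold=0.7)
```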

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tversky_hits_symmetric(
        arena, threshold=0.6, alpha=0.5, beta=0.5):
    print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity")

You may also be interested in chemfp.search.count_tversky_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint

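chemfp.threshold_tversky_search(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)

Find all targets within threshold of each query term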

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tversky_search(
           queries, targets, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. With a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tversky_search_fp() or chemfp.search.threshold_tversky_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

chemfp.threshold_tversky_search_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric(
           arena, threshold=0.75, alpha=0.5, beta=0.5):
    print(fp_id, "has", len(hits), "Dice neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print("   %s  %.2f" % (other_id, score))

You may also be interested in the chemfp.search.threshold_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

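chemfp.knearest_tversky_search(queries, targets, k=3, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)

Find the k-nearest targets within threshold of each query term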

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.
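
The combination of k and threshold can be sketched on a list of already-scored targets. The helper knearest below is hypothetical and works on plain (target_id, score) pairs, not on fingerprints. It shows that the threshold filter comes first, that the result is sorted from highest score to lowest, and that a target tied with the last kept score may be left out.

```python
import heapq


def knearest(scored_targets, k=3, threshold=0.7):
    """Keep at most k (target_id, score) pairs with score >= threshold, best first."""
    above = [(target_id, score) for target_id, score in scored_targets
             if score >= threshold]
    return heapq.nlargest(k, above, key=lambda pair: pair[1])


scored = [("t1", 0.95), ("t2", 0.60), ("t3", 0.82),
          ("t4", 0.82), ("t5", 0.99), ("t6", 0.71)]
hits = knearest(scored, k=3, threshold=0.8)
hit_scores = [score for (target_id, score) in hits]
```

Here t3 and t4 both score 0.82 but only one of them fits in the top three, so the other is omitted.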

Example:

# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tversky_search(
          queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    if hits:
        target_id, score = hits[-1]
        print("    The least similar is", target_id, "with score", score)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target but it will only process one batch and not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but if you have an FPS file then that takes extra time to load. With a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tversky_search_fp() or chemfp.search.knearest_tversky_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.

chemfp.knearest_tversky_search_symmetric(fingerprints, k=3, threshold=0.7, alpha=1.0, beta=1.0)

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric(
        arena, k=5, threshold=0.5, alpha=0.5, beta=0.5):
    print(fp_id, "has", len(hits), "neighbors, with Dice scores", end=" ")
    print(", ".join("%.2f" % x for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.get_fingerprint_families()

Return a list of available fingerprint families

Returns:

a list of chemfp.types.FingerprintFamily instances

chemfp.get_fingerprint_family(family_name)

Return the named fingerprint family, or raise a ValueError if not available

Given a family_name like OpenBabel-FP2 or OpenEye-MACCS166 return the corresponding chemfp.types.FingerprintFamily.

Parameters:
  • family_name (string) – the family name
Returns:

a chemfp.types.FingerprintFamily instance

chemfp.get_fingerprint_family_names(include_unavailable=False)

Return a set of fingerprint family name strings

The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings.

If include_unavailable is True then this will return a set of all of the fingerprint family names, including those which could not be loaded.

The set contains both the versioned and unversioned family names, so both OpenBabel-FP2/1 and OpenBabel-FP2 may be returned.
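
The versioned and unversioned names differ only in the trailing "/version" part. A minimal sketch, assuming the name layout used in the examples in this chapter (split_family_name is a hypothetical helper, not part of chemfp):

```python
def split_family_name(name):
    """Split 'OpenBabel-FP2/1' into ('OpenBabel-FP2', '1'); version is None if absent."""
    base_name, sep, version = name.partition("/")
    return base_name, (version if sep else None)


# A made-up result set of the kind the function above might return.
names = {"OpenBabel-FP2/1", "OpenBabel-FP2", "OpenEye-MACCS166"}
unversioned = sorted(n for n in names if split_family_name(n)[1] is None)
```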

Parameters:
  • include_unavailable (True or False) – Should unavailable family names be included in the result set?
Returns:

a set of strings

chemfp.get_fingerprint_type(type, fingerprint_kwargs=None)

Get the fingerprint type based on its type string and optional keyword arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the fingerprint_kwargs dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the fingerprint_kwargs takes precedence.

For example:

>>> fptype = get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

Use get_fingerprint_type_from_text_settings() if your fingerprint parameter values are all string-encoded, eg, from the command-line or a configuration file.
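
The precedence rule can be pictured as a dictionary update. The sketch below is an assumption about the behavior described above, not chemfp's parser. merge_type_parameters is a hypothetical helper which leaves the values from the type string as undecoded strings, and does not fill in defaults or validate names.

```python
def merge_type_parameters(type_string, fingerprint_kwargs=None):
    """Split 'Name key=value ...' and let fingerprint_kwargs override the string."""
    name, *terms = type_string.split()
    parameters = {}
    for term in terms:
        key, sep, value = term.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % (term,))
        parameters[key] = value  # still string-encoded at this point
    # Values given in the dictionary replace those from the type string.
    parameters.update(fingerprint_kwargs or {})
    return name, parameters


name, params = merge_type_parameters(
    "RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
```

The fpSize from the dictionary replaces the one from the type string, while minPath is kept.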

Parameters:
  • type (string) – a fingerprint type string
  • fingerprint_kwargs (a dictionary of string names and Python types for values) – fingerprint type parameters
Returns:

a chemfp.types.FingerprintType

chemfp.get_fingerprint_type_from_text_settings(type, settings=None)

Get the fingerprint type based on its type string and optional settings arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the settings dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the type string and the settings dictionary then the settings take precedence.

For example:

>>> fptype = get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3",
...                                                  {"fpSize": "4096"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

This function is for string settings from a configuration file or command-line. Use get_fingerprint_type() if your fingerprint parameters are Python values.

Parameters:
  • type (string) – a fingerprint type string
  • settings (a dictionary of string names and string-encoded values) – fingerprint type settings
Returns:

a chemfp.types.FingerprintType

chemfp.has_fingerprint_family(family_name)

Test if the fingerprint family is available

Return True if the fingerprint family_name is available, otherwise False. The family_name may be versioned or unversioned, like “OpenBabel-FP2/1” or “OpenEye-MACCS166”.

Parameters:
  • family_name (string) – the family name
Returns:

True or False

chemfp.get_max_threads()

Return the maximum number of threads available.

WARNING: this likely doesn’t do what you think it does. Do not use!

If OpenMP is not available then this will return 1. Otherwise it returns the maximum number of threads available, as reported by omp_get_num_threads().

chemfp.get_num_threads()

Return the number of OpenMP threads to use in searches

Initially this is the value returned by omp_get_max_threads(), which is generally 4 unless you set the environment variable OMP_NUM_THREADS to some other value.

It may be any value in the range 1 to get_max_threads(), inclusive.

Returns:

the current number of OpenMP threads to use

chemfp.set_num_threads(num_threads)

Set the number of OpenMP threads to use in searches

If num_threads is less than one then it is treated as one, and a value greater than get_max_threads() is treated as get_max_threads().
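
In other words the requested value is clamped to the range 1 to get_max_threads(). A sketch of that rule, with the maximum passed in explicitly so the example does not depend on OpenMP (clamp_num_threads is a hypothetical name):

```python
def clamp_num_threads(num_threads, max_threads):
    """Clamp num_threads into the range 1..max_threads, inclusive."""
    return max(1, min(num_threads, max_threads))


# 8 is an arbitrary stand-in for the value of get_max_threads().
examples = [clamp_num_threads(n, 8) for n in (-3, 0, 4, 100)]
```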

Parameters:
  • num_threads (int) – the new number of OpenMP threads to use

chemfp.get_toolkit(toolkit_name)

Return the named toolkit, if available, or raise a ValueError

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” and the named toolkit is available, then it will return chemfp.openbabel_toolkit, chemfp.openeye_toolkit, or chemfp.rdkit_toolkit, respectively:

>>> import chemfp
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
     ...
ValueError: Unable to get toolkit 'rdkit': No module named rdkit

Parameters:
  • toolkit_name (string) – the toolkit name
Returns:

the chemfp toolkit

Raises:

ValueError if toolkit_name is unknown or the toolkit does not exist

chemfp.get_toolkit_names()

Return a set of available toolkit names

The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:

>>> import chemfp
>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])

Returns:

a set of toolkit names, as strings

chemfp.has_toolkit(toolkit_name)

Return True if the named toolkit is available, otherwise False

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False.

>>> import chemfp
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False

The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached.
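
A "test once, then cache" check can be written in a few lines. The sketch below is one way to do it with the standard library. It is an assumption about the general technique, not a description of how chemfp does it, and module_is_available is a hypothetical name.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def module_is_available(module_name):
    """Try to import the module once; later calls reuse the cached answer."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


has_json = module_is_available("json")  # a standard library module
has_missing = module_is_available("no_such_toolkit_module_xyz")
module_is_available("json")             # answered from the cache
cache_hits = module_is_available.cache_info().hits
```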

Parameters:
  • toolkit_name (string) – the toolkit name
Returns:

True or False

chemfp.types - fingerprint families and types

A “fingerprint type” is an object which knows how to convert a molecule into a fingerprint. A “fingerprint family” is an object which uses a set of parameters to make a specific fingerprint type.

>>> import chemfp
>>> fpfamily = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fpfamily.get_defaults()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
>>>
>>> fptype = fpfamily()  # create the default fingerprint type
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>>
>>> fptype = fpfamily(fpSize=1024)   # use a non-default value
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
>>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
>>> fptype.compute_fingerprint(mol)
'\x04\x00\x00\x00\x00\x00\x10\x00\x00\x00  ... x00\x00\x00\x00\x00'

Fingerprint family class

FingerprintFamily

class chemfp.types.FingerprintFamily

A FingerprintFamily is used to create a FingerprintType or get information about its parameters

Two reasons to use a FingerprintFamily (instead of using chemfp.get_fingerprint_type() or chemfp.get_fingerprint_type_from_text_settings()) are:

  • figure out the default arguments;
  • given a text settings or parameter dictionary, use the keys of the default arguments to remove any other parameters before creating a FingerprintType (otherwise the creation function will raise an exception)
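
The second use might look like the following sketch. The defaults dictionary is copied from the RDKit-Fingerprint example earlier in this section. In real code it would come from get_defaults(), and the filtered dictionary would then be passed to from_kwargs(). The helper keep_known_parameters and the "radius" key are made up for the illustration.

```python
# Defaults copied from the RDKit-Fingerprint example earlier in this section.
defaults = {"maxPath": 7, "fpSize": 2048, "nBitsPerHash": 2, "minPath": 1, "useHs": 1}


def keep_known_parameters(parameters, defaults):
    """Drop every parameter whose name is not one of the family's default keys."""
    return {key: value for key, value in parameters.items() if key in defaults}


# "radius" stands in for a parameter which belongs to some other family.
user_parameters = {"fpSize": 1024, "radius": 2}
filtered = keep_known_parameters(user_parameters, defaults)
```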

All fingerprint families have the following attributes:

  • name - the type name, including version
  • toolkit - the toolkit API for the underlying chemistry toolkit, or None

__repr__()

Return a string like ‘FingerprintFamily(<RDKit-Fingerprint/2>)’

name

Read-only attribute.

The full fingerprint name, including the version

base_name

Read-only attribute.

The base fingerprint name, without the version

version

Read-only attribute.

The fingerprint version

toolkit

Read-only attribute.

The toolkit used to implement this fingerprint, or None

__call__(**fingerprint_kwargs)

Create a fingerprint type; keyword arguments can override the defaults

The argument values are native Python values, not string-encoded values:

>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family(fpSize=1024)
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'

The function will raise an exception for unknown arguments.

Parameters:
  • fingerprint_kwargs – the fingerprint parameters
Returns:

an object implementing the chemfp.types.FingerprintType API

from_kwargs(fingerprint_kwargs=None)

Create a fingerprint type; items in the fingerprint_kwargs dictionary can override the defaults

The dictionary values are native Python values, not string-encoded values:

>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family.from_kwargs({"fpSize": 1024})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'

The function will raise an exception for unknown arguments.

Parameters:
  • fingerprint_kwargs (a dictionary where the values are Python objects) – the fingerprint parameters
Returns:

an object implementing the chemfp.types.FingerprintType API

from_text_settings(settings=None)

Create a fingerprint type; settings is a dictionary with string-encoded values that can override the defaults

The dictionary values are string-encoded values, not native Python values. This function exists to help handle command-line arguments and settings files:

>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> fptype = family.from_text_settings()
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'
>>> fptype = family.from_text_settings({"fpSize": "1024"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'

The function will raise an exception for unknown arguments.
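
To make the difference between text settings and native values concrete, here is a sketch of the decoding step. It assumes that every parameter is an integer, which holds for the RDKit-Fingerprint defaults shown in this section but not for fingerprint types in general. decode_text_settings is a hypothetical function, not chemfp's decoder.

```python
# Defaults copied from the RDKit-Fingerprint examples in this section.
DEFAULTS = {"maxPath": 7, "fpSize": 2048, "nBitsPerHash": 2, "minPath": 1, "useHs": 1}


def decode_text_settings(settings=None, defaults=DEFAULTS):
    """Return the defaults updated with the decoded string settings."""
    kwargs = dict(defaults)
    for name, text in (settings or {}).items():
        if name not in defaults:
            raise ValueError("Unknown fingerprint parameter %r" % (name,))
        # Assumption: integer-valued parameters only, as in DEFAULTS.
        kwargs[name] = int(text)
    return kwargs


decoded = decode_text_settings({"fpSize": "128", "maxPath": "5"})
```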

Parameters:
  • settings (a dictionary where the values are string-encoded) – the fingerprint text settings
Returns:

an object implementing the chemfp.types.FingerprintType API

get_kwargs_from_text_settings(settings=None)

Convert a dictionary of string-encoded fingerprint parameters into native Python values

String-encoded values (“text settings”) can come from the command-line, a configuration file, a web request, or other text sources. The fingerprint types need actual Python values. This method converts the first to the second:

>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> family.get_kwargs_from_text_settings()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
>>> family.get_kwargs_from_text_settings({"fpSize": "128", "maxPath": "5"})
{'maxPath': 5, 'fpSize': 128, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

Parameters:
  • settings (a dictionary where the values are string-encoded) – the fingerprint text settings
Returns:

a dictionary of (decoded) fingerprint parameters

get_defaults()

Return the default parameters as a dictionary

The dictionary values are native Python objects:

>>> import chemfp
>>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
>>> family.get_defaults()
{'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

Returns:

a dictionary of fingerprint parameters

Base fingerprint type

FingerprintType

class chemfp.types.FingerprintType

The base class for all fingerprint types

A fingerprint type has the following public attributes:

name

the fingerprint name, including the version

base_name

the fingerprint name, without the version

version

the fingerprint version

toolkit

the toolkit API for the underlying chemistry toolkit, or None

software

a string which characterizes the toolkit, including version information

num_bits

the number of bits in this fingerprint type

fingerprint_kwargs

a dictionary of the fingerprint arguments

The built-in fingerprint types are:

get_type()

Get the full type string (name and parameters) for this fingerprint type

Returns:

a canonical fingerprint type string, including its parameters

get_metadata(sources=None)

Return a Metadata appropriate for the given fingerprint type.

This is most commonly used to make a chemfp.Metadata that can be passed into a chemfp.FingerprintWriter.

If sources is a string or a list of strings then it will be passed to the newly created Metadata instance. It should contain filenames or other descriptions of the fingerprint sources.

Parameters:
  • sources (None, a string, or list of strings) – fingerprint source filenames or other description
Returns:

a chemfp.Metadata

make_fingerprinter()

Make a ‘fingerprinter’; a callable which takes a molecule and returns a fingerprint

Returns:

a function object which takes a molecule and returns a fingerprint

read_molecule_fingerprints(source, format=None, id_tag=None, reader_args=None, errors="strict", location=None)

Read fingerprints from a structure source as a FingerprintIterator

Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. Use the fingerprint type to compute the fingerprint. For SD files, use id_tag to get the record id from the given SD tag instead of the title line.

The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a Location instance. If None then a default Location will be created.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a Location object, or None) – object used to track parser state information
Returns:

a chemfp.FingerprintIterator which iterates over the (id, fingerprint) pairs

read_molecule_fingerprints_from_string(content, format=None, id_tag=None, reader_args=None, errors="strict", location=None)

Read fingerprints from structure records in a string, as a FingerprintIterator

Iterate through the format structure records in content. Use the fingerprint type to compute the fingerprint. For SD files, use id_tag to get the record id from the given SD tag instead of the title line.

The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a Location instance. If None then a default Location will be created.

Parameters:
  • content – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a Location object, or None) – object used to track parser state information
Returns:

a chemfp.FingerprintIterator which iterates over the (id, fingerprint) pairs

parse_molecule_fingerprint(content, format, reader_args=None, errors="strict")

Parse the first molecule record of the content then compute and return the fingerprint

Read the first molecule from content, which contains records in the given format. Compute and return its fingerprint.

The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for the fingerprint, and “ignore” returns None for the fingerprint without any extra message.

Parameters:
  • content – the string containing at least one structure record
  • format (a format name string, or Format object) – the input structure format
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

the fingerprint as a byte string

parse_id_and_molecule_fingerprint(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first molecule record of the content then compute and return the id and fingerprint

Read the first molecule from content, which contains records in the given format. Compute its fingerprint and get the molecule id. For an SD record use id_tag to get the record id from the given SD tag instead of from the title line.

Return the id and fingerprint as the (id, fingerprint) pair.

The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for values it cannot compute, and “ignore” is like “report” but without the error message. For “report” and “ignore”, if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).
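
The three error modes and the two partial results can be sketched with stand-in parsing and fingerprinting functions. Nothing below uses a chemistry toolkit. toy_parse, toy_fingerprint and parse_id_and_fingerprint are all hypothetical, and exist only to show when (None, None) and (id, None) are returned.

```python
import sys


def parse_id_and_fingerprint(record, parse, fingerprint, errors="strict"):
    """parse(record) -> (id, mol) and fingerprint(mol) -> bytes are caller-supplied."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("errors must be 'strict', 'report', or 'ignore'")
    record_id = None
    try:
        record_id, mol = parse(record)
        return record_id, fingerprint(mol)
    except Exception as err:
        if errors == "strict":
            raise
        if errors == "report":
            sys.stderr.write("ERROR: %s\n" % (err,))
        # record_id is still None if parsing failed, else the parsed id
        return record_id, None


def toy_parse(record):
    """Treat the record as 'SMILES id'; an empty record cannot be parsed."""
    smiles, _, record_id = record.partition(" ")
    if not smiles:
        raise ValueError("empty record")
    return record_id, smiles


def toy_fingerprint(mol):
    """A fake one-byte fingerprint; '?' simulates a fingerprinting failure."""
    if "?" in mol:
        raise ValueError("cannot fingerprint %r" % (mol,))
    return bytes([len(mol)])


ok = parse_id_and_fingerprint("CCO ethanol", toy_parse, toy_fingerprint)
bad_parse = parse_id_and_fingerprint("", toy_parse, toy_fingerprint, errors="ignore")
bad_fp = parse_id_and_fingerprint("C?C odd", toy_parse, toy_fingerprint, errors="ignore")
```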

Parameters:
  • content – the string containing at least one structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a pair of (id string, fingerprint byte string)

make_id_and_molecule_fingerprint_parser(format, id_tag=None, reader_args=None, errors="strict")

Make a function which parses a molecule from a record and returns the id and computed fingerprint

This is a very specialized function, designed for performance, but it doesn’t appear to give any advantage. You likely don’t need it.

Return a function which parses a content string containing structure records in the given format to get a molecule. Use the molecule to compute the fingerprint and get its id. For an SD record use id_tag to get the record id from the given SD tag instead of from the title line.

The new function will return the (id, fingerprint) pair.

The reader_args dictionary parameters depend on the toolkit and format. For details see the docstring for self.toolkit.read_molecules.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and returns None for values it cannot compute, and “ignore” is like “report” but without the error message. For “report” and “ignore”, if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function which takes a content string and returns an (id, fingerprint) pair

compute_fingerprint(mol)

Compute and return the fingerprint byte string for the toolkit molecule

Parameters:
  • mol – a toolkit molecule
Returns:

the fingerprint as a byte string

compute_fingerprints(mols)

Compute and return the fingerprint for each toolkit molecule in an iterator

This function is a slightly optimized version of:

for mol in mols:
    yield self.compute_fingerprint(mol)

Parameters:
  • mols – an iterable of toolkit molecules
Returns:

a generator of fingerprints, one per molecule

get_fingerprint_family()

Return the fingerprint family for this fingerprint type

Returns:

a FingerprintFamily

Open Babel fingerprints

Open Babel implements four fingerprint families and chemfp implements two fingerprint families using the Open Babel toolkit. These are:

  • OpenBabel-FP2 - Indexes linear fragments up to 7 atoms.
  • OpenBabel-FP3 - SMARTS patterns specified in the file patterns.txt
  • OpenBabel-FP4 - SMARTS patterns specified in the file SMARTS_InteLigand.txt
  • OpenBabel-MACCS - SMARTS patterns specified in the file MACCS.txt, which implements nearly all of the 166 MACCS keys
  • RDMACCS-OpenBabel - a chemfp implementation of nearly all of the MACCS keys
  • ChemFP-Substruct-OpenBabel - an experimental chemfp implementation of the PubChem keys

Most people use FP2 and MACCS.

Note: chemfp-2.0 implements both RDMACCS-OpenBabel/1 and RDMACCS-OpenBabel/2. Version 1 did not have a definition for key 44.

OpenBabelFP2FingerprintType_v1

class chemfp.openbabel_types.OpenBabelFP2FingerprintType_v1

OpenBabel FP2 fingerprint based on path enumeration

See http://openbabel.org/wiki/FP2

This is a Daylight-like path enumeration fingerprint with 1021 bits.

The OpenBabel-FP2/1 FingerprintType has no parameters.

OpenBabelFP3FingerprintType_v1

class chemfp.openbabel_types.OpenBabelFP3FingerprintType_v1

OpenBabel FP3 fingerprint

See http://openbabel.org/wiki/FP3

55 bit fingerprints based on a set of SMARTS patterns defining functional groups.

The OpenBabel-FP3/1 FingerprintType has no parameters.

OpenBabelFP4FingerprintType_v1

class chemfp.openbabel_types.OpenBabelFP4FingerprintType_v1

OpenBabel FP4 fingerprint

See http://openbabel.org/wiki/FP4

307 bit fingerprints based on a set of SMARTS patterns defining functional groups.

The OpenBabel-FP4/1 FingerprintType has no parameters.

OpenBabelMACCSFingerprintType_v1

class chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v1

Open Babel’s implementation of the 166 MACCS keys

WARNING: This implementation contains serious bugs! All of the ring sizes are wrong.

See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .

The OpenBabel-MACCS/1 FingerprintType has no parameters.

Note: this version is only available in older (pre-2012) versions of Open Babel.

OpenBabelMACCSFingerprintType_v2

class chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v2

Open Babel’s implementation of the 166 MACCS keys

See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt .

Note: Open Babel added support for key 44 on 20 October 2014. This should have been version 3. However, I didn’t notice until 1 May 2017 that there was no chemfp test for it. Since everyone has been using it as v2, and very few people used the older version, I won’t change the version number.

The OpenBabel-MACCS/2 FingerprintType has no parameters.

SubstructOpenBabelFingerprinter_v1

class chemfp.openbabel_patterns.SubstructOpenBabelFingerprinter_v1

chemfp’s Substruct fingerprint implementation for Open Babel, version 1

WARNING: these fingerprints have not been validated.

The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.

The ChemFP-Substruct-OpenBabel/1 FingerprintType has no parameters.

RDMACCSOpenBabelFingerprinter_v1

class chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v1

chemfp’s RDMACCS fingerprint implementation for Open Babel, version 1

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version does not define key 44.

The RDMACCS-OpenBabel/1 FingerprintType has no parameters.

RDMACCSOpenBabelFingerprinter_v2

class chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v2

chemfp’s RDMACCS fingerprint implementation for Open Babel, version 2

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version defines key 44.

The RDMACCS-OpenBabel/2 FingerprintType has no parameters.

OpenEye fingerprints

OpenEye’s OEGraphSim library implements four bitstring-based fingerprint families, and chemfp implements two fingerprint families based on OEChem. These are:

  • OpenEye-Path - exhaustive enumeration of all linear fragments up to a given size
  • OpenEye-Circular - exhaustive enumeration of all circular fragments grown radially from each heavy atom up to a given radius
  • OpenEye-Tree - exhaustive enumeration of all trees up to a given size
  • OpenEye-MACCS166 - an implementation of the 166 MACCS keys
  • RDMACCS-OpenEye - a chemfp implementation of the 166 MACCS keys
  • ChemFP-Substruct-OpenEye - an experimental chemfp implementation of the PubChem keys

Note: chemfp-2.0 implements both RDMACCS-OpenEye/1 and RDMACCS-OpenEye/2. Version 1 did not have a definition for key 44.

OpenEyeCircularFingerprintType_v2

class chemfp.openeye_types.OpenEyeCircularFingerprintType_v2

OEGraphSim fingerprint based on circular fingerprints around heavy atoms, version 2

See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-circular

The OpenEye-Circular/2 FingerprintType parameters are:

  • numbits - the number of bits in the fingerprint (default: 4096)
  • minradius - the minimum radius (default: 0)
  • maxradius - the maximum radius (default: 5)
  • atype - the atom type (default: “Default”)
  • btype - the bond type (default: “Default”)

The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.

The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.
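As an illustration of the ‘|’ separated syntax, the following sketch splits and validates such a string. The helper split_type_string is hypothetical and is not part of chemfp or OEGraphSim; the field names are copied from the lists above, and the sketch does not handle the “Default” value.

```python
# Field names copied from the lists above; the helper itself is hypothetical.
ATYPE_FIELDS = {
    "Aromaticity", "AtomicNumber", "Chiral", "EqHBondAcceptor",
    "EqHBondDonor", "EqHalogen", "FormalCharge", "HCount", "HvyDegree",
    "Hybridization", "InRing", "EqAromatic",
}
BTYPE_FIELDS = {"BondOrder", "Chiral", "InRing"}

def split_type_string(value, allowed):
    """Split a '|' separated type string into a list of validated field names."""
    if value in (0, "0"):
        return []                           # 0 means "no fields"
    fields = [field.strip() for field in value.split("|")]
    unknown = [field for field in fields if field not in allowed]
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(unknown))
    return fields

print(split_type_string("AtomicNumber|Aromaticity|FormalCharge", ATYPE_FIELDS))
print(split_type_string("BondOrder|InRing", BTYPE_FIELDS))
```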

OpenEyeMACCSFingerprintType_v2

class chemfp.openeye_types.OpenEyeMACCSFingerprintType_v2

OEGraphSim implementation of the 166 MACCS keys, version 2

See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs .

The OpenEye-MACCS166/2 FingerprintType has no parameters.

This corresponds to GraphSim version ‘2.0.0’.

OpenEyeMACCSFingerprintType_v3

class chemfp.openeye_types.OpenEyeMACCSFingerprintType_v3

OEGraphSim implementation of the 166 MACCS keys, version 3

See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs .

The OpenEye-MACCS166/3 FingerprintType has no parameters.

This corresponds to GraphSim version ‘2.2.0’, with fixes for bits 91 and 92.

OpenEyePathFingerprintType_v2

class chemfp.openeye_types.OpenEyePathFingerprintType_v2

OEGraphSim fingerprint based on path-based enumeration, version 2

See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-path

The OpenEye-Path/2 FingerprintType parameters are:

  • numbits - the number of bits in the fingerprint (default: 4096)
  • minbonds - the minimum number of bonds (default: 0)
  • maxbonds - the maximum number of bonds (default: 5)
  • atype - the atom type (default: “Default”)
  • btype - the bond type (default: “Default”)

The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.

The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.

OpenEyeTreeFingerprintType_v2

class chemfp.openeye_types.OpenEyeTreeFingerprintType_v2

OEGraphSim fingerprint based on tree fingerprints, version 2

See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-tree

The OpenEye-Tree/2 FingerprintType parameters are:

  • numbits - the number of bits in the fingerprint (default: 4096)
  • minbonds - minimum number of bonds in the tree
  • maxbonds - maximum number of bonds in the tree
  • atype - the atom type (default: “Default”)
  • btype - the bond type (default: “Default”)

The atype is either 0 or a ‘|’ separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic.

The btype is either 0 or a ‘|’ separated string containing one or more of the following: BondOrder, Chiral, InRing.

SubstructOpenEyeFingerprinter_v1

class chemfp.openeye_patterns.SubstructOpenEyeFingerprinter_v1

chemfp’s Substruct fingerprint implementation for OEChem, version 1

WARNING: these fingerprints have not been validated.

The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.

The ChemFP-Substruct-OpenEye/1 FingerprintType has no parameters.

RDMACCSOpenEyeFingerprinter_v1

class chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v1

chemfp’s RDMACCS fingerprint implementation for OEChem, version 1

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version does not define key 44.

The RDMACCS-OpenEye/1 FingerprintType has no parameters.

RDMACCSOpenEyeFingerprinter_v2

class chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v2

chemfp’s RDMACCS fingerprint implementation for OEChem, version 2

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version defines key 44.

The RDMACCS-OpenEye/2 FingerprintType has no parameters.

RDKit fingerprints

RDKit implements six fingerprint families, and chemfp implements two fingerprint families based on RDKit. These are:

  • RDKit-Fingerprint - exhaustive enumeration of linear and branched trees
  • RDKit-MACCS166 - The RDKit implementation of the MACCS keys
  • RDKit-Morgan - ECFP-like circular fingerprints
  • RDKit-AtomPair - atom pair fingerprints
  • RDKit-Torsion - topological-torsion fingerprints
  • RDKit-Pattern - substructure screen fingerprint
  • RDMACCS-RDKit - a chemfp implementation of the 166 MACCS keys
  • ChemFP-Substruct-RDKit - an experimental chemfp implementation of the PubChem keys

Note: chemfp-2.0 implements both RDMACCS-RDKit/1 and RDMACCS-RDKit/2. Version 1 did not have a definition for key 44.

RDKitFingerprintType_v1

class chemfp.rdkit_types.RDKitFingerprintType_v1

RDKit’s Daylight-like fingerprint based on linear path and branched tree enumeration, version 1

See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint

The RDKit-Fingerprint/1 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • minPath - minimum number of bonds (default: 1)
  • maxPath - maximum number of bonds (default: 7)
  • nBitsPerHash - number of bits to set for each path hash (default: 2)
  • useHs - include information about the number of hydrogens on each atom? (default: True)

Note: this version is only available in older (pre-2014) versions of RDKit

RDKitFingerprintType_v2

class chemfp.rdkit_types.RDKitFingerprintType_v2

RDKit’s Daylight-like fingerprint based on linear path and branched tree enumeration, version 2

See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint

The RDKit-Fingerprint/2 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • minPath - minimum number of bonds (default: 1)
  • maxPath - maximum number of bonds (default: 7)
  • nBitsPerHash - number of bits to set for each path hash (default: 2)
  • useHs - include information about the number of hydrogens on each atom? (default: True)

RDKitMACCSFingerprintType_v1

class chemfp.rdkit_types.RDKitMACCSFingerprintType_v1

RDKit’s implementation of the 166 MACCS keys, version 1

See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint

The RDKit-MACCS166/1 fingerprints have no parameters.

This version of RDKit does not support MACCS key 44 (“OTHER”).

RDKitMACCSFingerprintType_v2

class chemfp.rdkit_types.RDKitMACCSFingerprintType_v2

RDKit’s implementation of the 166 MACCS keys, version 2

See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint

The RDKit-MACCS166/2 fingerprints have no parameters. RDKit added this version in late 2014.

RDKitMorganFingerprintType_v1

class chemfp.rdkit_types.RDKitMorganFingerprintType_v1

RDKit Morgan (ECFP-like) fingerprints, version 1

See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprintAsBitVect

The RDKit-Morgan/1 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • radius - radius for the Morgan algorithm (default: 2)
  • useFeatures - use chemical-feature invariants (default: 0)
  • useChirality - use chirality information (default: 0)
  • useBondTypes - include bond type information (default: 1)

RDKitAtomPairFingerprint_v1

class chemfp.rdkit_types.RDKitAtomPairFingerprint_v1

RDKit atom pair fingerprints, version 1

See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetHashedAtomPairFingerprintAsBitVect

The RDKit-AtomPair/1 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • minLength - minimum bond count for a pair (default: 1)
  • maxLength - maximum bond count for a pair (default: 30)

Note: this version is only available in older (pre-2012) versions of RDKit

RDKitAtomPairFingerprint_v2

class chemfp.rdkit_types.RDKitAtomPairFingerprint_v2

RDKit atom pair fingerprints, version 2

See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetHashedAtomPairFingerprintAsBitVect

The RDKit-AtomPair/2 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • minLength - minimum bond count for a pair (default: 1)
  • maxLength - maximum bond count for a pair (default: 30)

RDKitTorsionFingerprintType_v1

class chemfp.rdkit_types.RDKitTorsionFingerprintType_v1

RDKit torsion fingerprints, version 1

See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html

An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan; “Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors” JCICS 27, 82-85 (1987).

The RDKit-Torsion/1 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • targetSize - number of bonds per torsion (default: 4)

Note: this version is only available in older (pre-2014) versions of RDKit

RDKitTorsionFingerprintType_v2

class chemfp.rdkit_types.RDKitTorsionFingerprintType_v2

RDKit torsion fingerprints, version 2

See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html

An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan; “Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors” JCICS 27, 82-85 (1987).

The RDKit-Torsion/2 FingerprintType parameters are:

  • fpSize - number of bits in the fingerprint (default: 2048)
  • targetSize - number of bonds per torsion (default: 4)

RDKitPatternFingerprint_v1

class chemfp.rdkit_types.RDKitPatternFingerprint_v1

RDKit’s experimental substructure screen fingerprint, version 1

See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint

The RDKit-Pattern/1 fingerprint has no parameters.

RDKitPatternFingerprint_v2

class chemfp.rdkit_types.RDKitPatternFingerprint_v2

RDKit’s experimental substructure screen fingerprint, version 2

See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint

The RDKit-Pattern/2 fingerprint has no parameters.

RDKitPatternFingerprint_v3

class chemfp.rdkit_types.RDKitPatternFingerprint_v3

RDKit’s experimental substructure screen fingerprint, version 3

See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint

The RDKit-Pattern/3 fingerprint has no parameters. This version was released 2017.03.1.

RDKitAvalonFingerprintType_v1

class chemfp.rdkit_types.RDKitAvalonFingerprintType_v1

Avalon fingerprints

The Avalon Cheminformatics toolkit is available from https://sourceforge.net/projects/avalontoolkit/ . It is not part of the core RDKit distribution. Instead, RDKit has a compile-time option to download and include it as part of the build process.

The Avalon fingerprints are described in the supplemental information for “QSAR - How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets”, Peter Gedeck, Bernhard Rohde, and Christian Bartels, J. Chem. Inf. Model., 2006, 46 (5), pp 1924-1936, DOI: 10.1021/ci050413p. The supplemental information is available from http://pubs.acs.org/doi/suppl/10.1021/ci050413p

It uses a set of feature classes which “have been fine-tuned to provide good screen-out for the set of substructure queries encounted at Novartis while limiting redundancy.” The classes are ATOM_COUNT, ATOM_SYMBOL_PATH, AUGMENTED_ATOM, AUGMENTED_BOND, HCOUNT_PAIR, HCOUNT_PATH, RING_PATH, BOND_PATH, HCOUNT_CLASS_PATH, ATOM_CLASS_PATH, RING_PATTERN, RING_SIZE_COUNTS, DEGREE_PATHS, CLASS_SPIDERS, FEATURE_PAIRS and ALL_PATTERNS.

SubstructRDKitFingerprintType_v1

class chemfp.rdkit_patterns.SubstructRDKitFingerprintType_v1

chemfp’s Substruct fingerprint implementation for RDKit, version 1

WARNING: these fingerprints have not been validated.

The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits.

The ChemFP-Substruct-RDKit/1 FingerprintType has no parameters.

RDMACCSRDKitFingerprinter_v1

class chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v1

chemfp’s RDMACCS fingerprint implementation for RDKit, version 1

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version does not define key 44.

The RDMACCS-RDKit/1 FingerprintType has no parameters.

RDMACCSRDKitFingerprinter_v2

class chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v2

chemfp’s RDMACCS fingerprint implementation for RDKit, version 2

The RDMACCS keys are MACCS-166-like fingerprints based on RDKit’s MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits.

This version defines key 44.

The RDMACCS-RDKit/2 FingerprintType has no parameters.

chemfp.arena module

There should be no reason for you to import this module yourself. It contains the FingerprintArena implementation. FingerprintArena instances are returned as part of the public API but should not be constructed directly.

FingerprintArena

class chemfp.arena.FingerprintArena

Store fingerprints in a contiguous block of memory for fast searches

A fingerprint arena implements the chemfp.FingerprintReader API.

A fingerprint arena stores all of the fingerprints in a contiguous block of memory, so the per-molecule overhead is very low.

The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If self.popcount_indices is a non-empty string then the string contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the sublinear search methods.
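The popcount ordering can be sketched in pure Python. This is an illustration only: the offsets list below is a hypothetical stand-in for the information kept in popcount_indices, whose actual encoding is not described here, and order_by_popcount is not a chemfp function.

```python
def popcount(fp):
    """Number of set bits in a byte string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)

def order_by_popcount(records, num_bits):
    """Sort (id, fp) records by popcount and record where each popcount starts."""
    ordered = sorted(records, key=lambda record: popcount(record[1]))
    # After the prefix sum, offsets[p] is the index of the first fingerprint
    # with popcount >= p, so the fingerprints with exactly p bits set are
    # ordered[offsets[p]:offsets[p + 1]].
    offsets = [0] * (num_bits + 2)
    for _, fp in ordered:
        offsets[popcount(fp) + 1] += 1
    for p in range(1, num_bits + 2):
        offsets[p] += offsets[p - 1]
    return ordered, offsets

records = [("a", b"\x07"), ("b", b"\x00"), ("c", b"\x01"), ("d", b"\x03")]
ordered, offsets = order_by_popcount(records, num_bits=8)
print([record_id for record_id, _ in ordered])
print(ordered[offsets[2]:offsets[3]])   # the records with exactly 2 bits set
```

The offsets help because the Tanimoto score of two fingerprints can never exceed the ratio of the smaller popcount to the larger one, so a threshold search can skip every popcount range that cannot reach the threshold.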

The public attributes are:

metadata

chemfp.Metadata about the fingerprints

ids

list of identifiers, in index order

Other attributes, which might be subject to change, and which I won’t fully explain, are:
  • arena - a contiguous block of memory, which contains the fingerprints
  • start_padding - number of bytes to the first fingerprint in the block
  • end_padding - number of bytes after the last fingerprint in the block
  • storage_size - number of bytes used to store a fingerprint
  • num_bytes - number of bytes in each fingerprint (must be <= storage_size)
  • num_bits - number of bits in each fingerprint
  • alignment - the fingerprint alignment
  • start - the index for the first fingerprint in the arena/subarena
  • end - the index for the last fingerprint in the arena/subarena
  • arena_ids - all of the identifiers for the parent arena

The FingerprintArena is its own context manager, but it does nothing on context exit. This is a bug when the FingerprintArena uses a memory-mapped FPB file because there is currently no explicit way to close the file. Only the garbage collector is able to do that.

__len__()

Number of fingerprint records in the FingerprintArena

__getitem__(i)

Return the (id, fingerprint) pair at index i

__iter__()

Iterate over the (id, fingerprint) contents of the arena

get_fingerprint_type()

Get the fingerprint type object based on the metadata’s type field

This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.

This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.

Returns: a chemfp.types.FingerprintType
get_fingerprint(i)

Return the fingerprint at index i

Raises an IndexError if index i is out of range.

get_by_id(id)

Given the record identifier, return the (id, fingerprint) pair

If the id is not present then return None.

get_index_by_id(id)

Given the record identifier, return the record index

If the id is not present then return None.

get_fingerprint_by_id(id)

Given the record identifier, return its fingerprint

If the id is not present then return None.

save(destination, format=None)

Save the fingerprints to a given destination and format

The output format is given by format. If format is None then the output format is determined from the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • format (None, "fps", "fps.gz", or "fpb") – the output format
Returns:

None
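The format fallback rules can be summarized with a short sketch. The function guess_output_format is hypothetical and only mirrors the description above. Treating file objects and None the same as an unrecognized extension, and matching extensions case-sensitively, are assumptions of the sketch rather than documented behavior.

```python
def guess_output_format(destination, format=None):
    """Hypothetical sketch of the format fallback rules described for save()."""
    if format is not None:
        return format                        # an explicit format always wins
    if isinstance(destination, str):
        # Check the longest extension first so ".fps.gz" is recognized
        for extension in (".fps.gz", ".fps", ".fpb"):
            if destination.endswith(extension):
                return extension[1:]
    # Unrecognized extension, file object, or None: fall back to "fps"
    return "fps"

for name in ("targets.fpb", "targets.fps.gz", "targets.txt", None):
    print(name, "->", guess_output_format(name))
```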

iter_arenas(arena_size = 1000)

Base class for all chemfp objects holding fingerprint records

All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.

copy(indices=None, reorder=None)

Create a new arena using either all or some of the fingerprints in this arena

By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or “sub-arena” of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids.

The indices parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged.

If indices are specified then the default reorder value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If reorder is False then the new arena will preserve the order given by the indices.

If indices are not specified, then the default is to preserve the order type of the original arena. Use reorder=True to always reorder the fingerprints in the new arena by popcount, and reorder=False to always leave them in the current ordering.

>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18]
(b'9425031', b'9425015', b'9425040', b'9425033')
>>> len(arena)
19
>>> new_arena = arena.copy(indices=[1, 5, 10, 18])
>>> len(new_arena)
4
>>> new_arena.ids
[b'9425031', b'9425015', b'9425040', b'9425033']
>>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False)
>>> new_arena.ids
[b'9425033', b'9425040', b'9425015', b'9425031']
Parameters:
  • indices (iterable containing integers, or None) – indices of the records to copy into the new arena
  • reorder (True to reorder, False to leave in input order, None for default action) – describes how to order the fingerprints
count_tanimoto_hits_fp(query_fp, threshold=0.7)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

integer count
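For readers who want to see what is being counted, here is a pure-Python sketch of a Tanimoto threshold count over byte string fingerprints. It is a slow reference illustration, not chemfp's implementation; the helper names are hypothetical, and returning 0.0 when neither fingerprint has any bits set is an assumption of the sketch.

```python
def popcount(fp):
    """Number of set bits in a byte string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte string fingerprints."""
    intersect = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = popcount(fp1) + popcount(fp2) - intersect
    return intersect / union if union else 0.0   # assumed convention for empty fingerprints

def count_tanimoto_hits(query_fp, target_fps, threshold=0.7):
    """Count the targets which are at least threshold similar to the query."""
    return sum(1 for fp in target_fps if tanimoto(query_fp, fp) >= threshold)

query_fp = b"\x0f"
target_fps = [b"\x0f", b"\x07", b"\x01", b"\xf0"]
print(count_tanimoto_hits(query_fp, target_fps, threshold=0.7))
```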

threshold_tanimoto_search_fp(query_fp, threshold=0.7)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

knearest_tanimoto_search_fp(query_fp, k=3, threshold=0.7)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • k (integer) – the maximum number of nearest neighbors to return (default: 3)
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult
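The selection logic, threshold first and then the top k by score, can be sketched with the standard library. This is an illustration only: knearest_tanimoto is a hypothetical helper that returns a plain list of (id, score) pairs instead of a SearchResult.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte string fingerprints."""
    intersect = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return intersect / union if union else 0.0

def knearest_tanimoto(query_fp, targets, k=3, threshold=0.7):
    """targets is a list of (id, fp) pairs; return up to k (id, score) hits."""
    hits = [(target_id, tanimoto(query_fp, fp)) for target_id, fp in targets]
    hits = [hit for hit in hits if hit[1] >= threshold]
    # nlargest returns the selected hits sorted from highest score to lowest
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])

targets = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\x01"), ("d", b"\x1f")]
for target_id, score in knearest_tanimoto(b"\x0f", targets, k=2, threshold=0.7):
    print(target_id, score)
```

Note that fewer than k hits are returned when fewer than k targets reach the threshold.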

count_tversky_hits_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

integer count
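The role of alpha and beta may be easier to see in code. The sketch below uses the standard Tversky index; applying alpha to the query-only bits and beta to the target-only bits is an assumption here, as is the 0.0 result for a zero denominator. With alpha = beta = 1.0 the score equals the Tanimoto similarity, and with alpha = beta = 0.5 it equals the Dice similarity.

```python
def popcount(fp):
    """Number of set bits in a byte string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)

def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte string fingerprints."""
    common = sum(bin(a & b).count("1") for a, b in zip(query_fp, target_fp))
    only_query = popcount(query_fp) - common     # bits set only in the query
    only_target = popcount(target_fp) - common   # bits set only in the target
    denominator = alpha * only_query + beta * only_target + common
    return common / denominator if denominator else 0.0

query_fp, target_fp = b"\x0f", b"\x07"
print(tversky(query_fp, target_fp))                        # Tanimoto
print(tversky(query_fp, target_fp, alpha=0.5, beta=0.5))   # Dice
print(tversky(query_fp, target_fp, alpha=0.0, beta=1.0))   # ignores query-only bits
```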

threshold_tversky_search_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

knearest_tversky_search_fp(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • k (integer) – the maximum number of nearest neighbors to return (default: 3)
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

chemfp.search module

The following functions and classes are in the chemfp.search module.

There are three main classes of functions. The ones ending with *_fp use a query fingerprint to search a target arena. The ones ending with *_arena use a query arena to search a target arena. The ones ending with *_symmetric use an arena to search itself, except that a fingerprint is not tested against itself.

These functions share the same name with very similar functions in the top-level chemfp module. My apologies for any confusion. The top-level functions are designed to work with both arenas and iterators as the target. They give a simple search API, and automatically process in blocks, to give a balanced trade-off between performance and response time for the first results.

The functions in this module only work with an arena as the target. By default they search the entire arena before returning. If you want to process portions of the arena then you need to specify the range yourself.

chemfp.search.count_tanimoto_hits_fp(query_fp, target_arena, threshold=0.7)

Count the number of hits in target_arena at least threshold similar to the query_fp

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(chemfp.search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.1))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena – the target arena
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

an integer count

chemfp.search.count_tanimoto_hits_arena(query_arena, target_arena, threshold=0.7)

For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
counts = chemfp.search.count_tanimoto_hits_arena(queries, targets, threshold=0.1)
print(counts[:10])

The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.

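Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – the query fingerprints
  • target_arena (a chemfp.arena.FingerprintArena) – the target fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.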
Returns:

an array of counts
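Because only len() and index lookup are promised for the result, code that needs a regular list can copy the values using just those two operations. A minimal sketch follows; counts_to_list is a hypothetical helper, and the sample containers stand in for whatever the search function returns.

```python
import array
import ctypes

def counts_to_list(counts):
    """Copy any sized, indexable counts object into a plain Python list."""
    return [counts[i] for i in range(len(counts))]

# Three container types the result could plausibly be; each behaves the same here
ctypes_counts = (ctypes.c_long * 3)(4, 0, 7)
array_counts = array.array("i", [4, 0, 7])
list_counts = [4, 0, 7]

for counts in (ctypes_counts, array_counts, list_counts):
    print(counts_to_list(counts))
```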

chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.7, batch_size=100)

For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it

A fingerprint never matches itself.

The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. I can’t detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it’s useful to keep as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("targets.fps")
counts = chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.2)
print(counts[:10])

The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.

Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

an array of counts
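What a symmetric count computes can be shown with a slow pure-Python reference. The function symmetric_counts is a hypothetical illustration, not chemfp's implementation: each pair is compared once, a hit is credited to both fingerprints, and a fingerprint is never compared with itself.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte string fingerprints."""
    intersect = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return intersect / union if union else 0.0

def symmetric_counts(fps, threshold=0.7):
    """For each fingerprint, count the other fingerprints at least threshold similar."""
    counts = [0] * len(fps)
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):    # upper triangle only, so i != j
            if tanimoto(fps[i], fps[j]) >= threshold:
                counts[i] += 1               # j is a hit for i ...
                counts[j] += 1               # ... and, by symmetry, i is a hit for j
    return counts

fps = [b"\x0f", b"\x07", b"\x0f", b"\xf0"]
print(symmetric_counts(fps, threshold=0.7))
```

The first and third fingerprints are identical but still count each other, because they are different records.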

chemfp.search.partial_count_tanimoto_hits_symmetric(counts, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None)

Compute a portion of the symmetric Tanimoto counts

For most cases, use chemfp.search.count_tanimoto_hits_symmetric() instead of this function!

This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.

counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.

The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end and using symmetry to fill in the lower half.

You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

import chemfp
import chemfp.search
from chemfp import futures
import array

chemfp.set_num_threads(1)  # Globally disable OpenMP

arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
n = len(arena)
counts = array.array("i", [0]*n)

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for row in range(0, n, 10):
        executor.submit(chemfp.search.partial_count_tanimoto_hits_symmetric,
                        counts, arena, threshold=0.2,
                        query_start=row, query_end=min(row+10, n))

print(counts)
Parameters:
  • counts (a contiguous block of integers) – the accumulated Tanimoto counts
  • arena (a chemfp.arena.FingerprintArena) – the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • query_start (an integer) – the query start row
  • query_end (an integer, or None to mean the last query row) – the query end row
  • target_start (an integer) – the target start row
  • target_end (an integer, or None to mean the last target row) – the target end row
Returns:

None

chemfp.search.count_tversky_hits_fp(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)

Count the number of hits in target_arena at least threshold similar to the query_fp (Tversky)

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(chemfp.search.count_tversky_hits_fp(query_fp, targets, threshold=0.1))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

an integer count
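
The Tversky functions take alpha and beta but do not spell out what they weight. The sketch below uses the textbook Tversky index, common / (alpha*only_query + beta*only_target + common), on plain byte strings. It assumes that alpha weights the bits found only in the query and beta the bits found only in the target, and that 0/0 is scored as 0.0. The helper names are hypothetical, and chemfp's own handling of edge cases may differ.

```python
def popcount(fp):
    """Number of on bits in a byte string."""
    return sum(bin(byte).count("1") for byte in fp)

def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Textbook Tversky index for two equal-length byte strings."""
    common = sum(bin(q & t).count("1") for q, t in zip(query_fp, target_fp))
    only_query = popcount(query_fp) - common
    only_target = popcount(target_fp) - common
    denominator = alpha * only_query + beta * only_target + common
    # Assumption: score 0/0 as 0.0.
    return common / denominator if denominator else 0.0

query, target = b"\x0f", b"\x07"
print(tversky(query, target))                        # 0.75, the Tanimoto similarity
print(tversky(query, target, alpha=0.5, beta=0.5))   # the Dice similarity, 6/7
print(tversky(query, target, alpha=0.0, beta=1.0))   # 1.0, target bits are a subset of the query
```

With alpha=beta=1.0 the formula reduces to Tanimoto, and with alpha=beta=0.5 it reduces to Dice.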

chemfp.search.count_tversky_hits_arena(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0)

For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
counts = chemfp.search.count_tversky_hits_arena(queries, targets, threshold=0.1,
              alpha=0.5, beta=0.5)
print(counts[:10])

The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.

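Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.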
Returns:

an array of counts

chemfp.search.count_tversky_hits_symmetric(arena, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100)

For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it

A fingerprint never matches itself.

The computation can take a long time. Python won’t check for a ^C until the function finishes, which can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. I can’t detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it’s useful to keep as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("targets.fps")
counts = chemfp.search.count_tversky_hits_symmetric(
      arena, threshold=0.2, alpha=0.5, beta=0.5)
print(counts[:10])

The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.

Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

an array of counts

chemfp.search.partial_count_tversky_hits_symmetric(counts, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None)

Compute a portion of the symmetric Tversky counts

For most cases, use chemfp.search.count_tversky_hits_symmetric() instead of this function!

This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.

counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.

The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end, and using symmetry to fill in the lower half.

You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

import chemfp
import chemfp.search
from chemfp import futures
import array

chemfp.set_num_threads(1)  # Globally disable OpenMP

arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
n = len(arena)
counts = array.array("i", [0]*n)

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for row in range(0, n, 10):
        executor.submit(chemfp.search.partial_count_tversky_hits_symmetric,
                        counts, arena, threshold=0.2, alpha=0.5, beta=0.5,
                        query_start=row, query_end=min(row+10, n))

print(counts)
Parameters:
  • counts (a contiguous block of integers) – the accumulated Tversky counts
  • arena (a chemfp.arena.FingerprintArena) – the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • query_start (an integer) – the query start row
  • query_end (an integer, or None to mean the last query row) – the query end row
  • target_start (an integer) – the target start row
  • target_end (an integer, or None to mean the last target row) – the target end row
Returns:

None

chemfp.search.threshold_tanimoto_search_fp(query_fp, target_arena, threshold=0.7)

Search for fingerprint hits in target_arena which are at least threshold similar to query_fp

The hits in the returned chemfp.search.SearchResult are in arbitrary order.

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(list(chemfp.search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.15)))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResult

chemfp.search.threshold_tanimoto_search_arena(query_arena, target_arena, threshold=0.7)

Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena

The hits in the returned chemfp.search.SearchResults are in arbitrary order.

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
results = chemfp.search.threshold_tanimoto_search_arena(queries, targets, threshold=0.5)
for query_id, query_hits in zip(queries.ids, results):
    if len(query_hits) > 0:
        print(query_id, "->", ", ".join(query_hits.get_ids()))
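Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.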
Returns:

a chemfp.search.SearchResults

chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.7, include_lower_triangle=True, batch_size=100)

Search for the hits in the arena at least threshold similar to the fingerprints in the arena

When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.

The hits in the returned chemfp.search.SearchResults are in arbitrary order.

The computation can take a long time. Python won’t check for a ^C until the function finishes, which can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("queries.fps")
full_result = chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.2)
upper_triangle = chemfp.search.threshold_tanimoto_search_symmetric(
          arena, threshold=0.2, include_lower_triangle=False)
assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

a chemfp.search.SearchResults

chemfp.search.partial_threshold_tanimoto_search_symmetric(results, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)

Compute a portion of the symmetric Tanimoto search results

For most cases, use chemfp.search.threshold_tanimoto_search_symmetric() instead of this function!

This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.

results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.

The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.

It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.

You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

import chemfp
import chemfp.search
from chemfp import futures

chemfp.set_num_threads(1)

arena = chemfp.load_fingerprints("targets.fps")
n = len(arena)
results = chemfp.search.SearchResults(n, n, arena.ids)

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for row in range(0, n, 10):
        executor.submit(chemfp.search.partial_threshold_tanimoto_search_symmetric,
                        results, arena, threshold=0.2,
                        query_start=row, query_end=min(row+10, n))

chemfp.search.fill_lower_triangle(results)

The hits in the chemfp.search.SearchResults are in arbitrary order.

Parameters:
  • results (a chemfp.search.SearchResults instance) – the intermediate search results
  • arena (a chemfp.arena.FingerprintArena) – the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • query_start (an integer) – the query start row
  • query_end (an integer, or None to mean the last query row) – the query end row
  • target_start (an integer) – the target start row
  • target_end (an integer, or None to mean the last target row) – the target end row
  • results_offset (an integer) – use results[results_offset] as the base for the results
Returns:

None

chemfp.search.fill_lower_triangle(results)

Duplicate each entry of results to its transpose

This is used after the symmetric threshold search to turn the upper-triangle results into a full matrix.
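
As a mental model, every upper-triangle hit (i, j, score) gets a mirror entry (j, i, score). The sketch below shows that on plain lists of (target index, score) pairs rather than on a SearchResults instance. The name fill_lower_triangle_sketch and the data layout are assumptions made for illustration.

```python
def fill_lower_triangle_sketch(rows):
    """Copy every upper-triangle hit (i, j, score) to its transpose (j, i, score).

    rows[i] is a list of (target index, score) pairs with target index > i.
    """
    # Snapshot the upper-triangle hits first so that newly added mirror
    # entries are not themselves mirrored.
    upper = [(i, j, score) for i, row in enumerate(rows) for j, score in row]
    for i, j, score in upper:
        rows[j].append((i, score))

rows = [[(1, 0.75), (3, 0.75)], [(3, 0.5)], [], []]
fill_lower_triangle_sketch(rows)
print(rows)
print(sum(len(row) for row in rows))   # 6, twice the 3 upper-triangle hits
```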

Parameters:
  • results (a chemfp.search.SearchResults) – search results

chemfp.search.threshold_tversky_search_fp(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)

Search for fingerprint hits in target_arena which are at least threshold similar to query_fp

The hits in the returned chemfp.search.SearchResult are in arbitrary order.

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(list(chemfp.search.threshold_tversky_search_fp(
           query_fp, targets, threshold=0.15, alpha=0.5, beta=0.5)))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResult

chemfp.search.threshold_tversky_search_arena(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0)

Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena

The hits in the returned chemfp.search.SearchResults are in arbitrary order.

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
results = chemfp.search.threshold_tversky_search_arena(
              queries, targets, threshold=0.5, alpha=0.5, beta=0.5)
for query_id, query_hits in zip(queries.ids, results):
    if len(query_hits) > 0:
        print(query_id, "->", ", ".join(query_hits.get_ids()))
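Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.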
Returns:

a chemfp.search.SearchResults

chemfp.search.threshold_tversky_search_symmetric(arena, threshold=0.7, alpha=1.0, beta=1.0, include_lower_triangle=True, batch_size=100)

Search for the hits in the arena at least threshold similar to the fingerprints in the arena

When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.

The hits in the returned chemfp.search.SearchResults are in arbitrary order.

The computation can take a long time. Python won’t check for a ^C until the function finishes, which can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("queries.fps")
full_result = chemfp.search.threshold_tversky_search_symmetric(
      arena, threshold=0.2, alpha=0.5, beta=0.5)
upper_triangle = chemfp.search.threshold_tversky_search_symmetric(
          arena, threshold=0.2, alpha=0.5, beta=0.5, include_lower_triangle=False)
assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

a chemfp.search.SearchResults

chemfp.search.partial_threshold_tversky_search_symmetric(results, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)

Compute a portion of the symmetric Tversky search results

For most cases, use chemfp.search.threshold_tversky_search_symmetric() instead of this function!

This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.

results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.

The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.

It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.

You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:

import chemfp
import chemfp.search
from chemfp import futures

chemfp.set_num_threads(1)

arena = chemfp.load_fingerprints("targets.fps")
n = len(arena)
results = chemfp.search.SearchResults(n, n, arena.ids)

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    for row in range(0, n, 10):
        executor.submit(chemfp.search.partial_threshold_tversky_search_symmetric,
                        results, arena, threshold=0.2, alpha=0.5, beta=0.5,
                        query_start=row, query_end=min(row+10, n))

chemfp.search.fill_lower_triangle(results)

The hits in the chemfp.search.SearchResults are in arbitrary order.

Parameters:
  • results (a chemfp.search.SearchResults instance) – the intermediate search results
  • arena (a chemfp.arena.FingerprintArena) – the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • query_start (an integer) – the query start row
  • query_end (an integer, or None to mean the last query row) – the query end row
  • target_start (an integer) – the target start row
  • target_end (an integer, or None to mean the last target row) – the target end row
  • results_offset (an integer) – use results[results_offset] as the base for the results
Returns:

None

chemfp.search.knearest_tanimoto_search_fp(query_fp, target_arena, k=3, threshold=0.7)

Search for k-nearest hits in target_arena which are at least threshold similar to query_fp

The hits in the chemfp.search.SearchResult are ordered by decreasing similarity score.

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(list(chemfp.search.knearest_tanimoto_search_fp(query_fp, targets, k=3, threshold=0.0)))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResult
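
As a mental model, a k-nearest search scores every target, discards the ones below threshold, and keeps at most k of the rest, best first. Here is a pure-Python sketch of that behavior on a list of byte strings. The helpers tanimoto and knearest_sketch are hypothetical, the order of tied scores is whatever heapq.nlargest produces, and the sketch says nothing about how chemfp implements the search.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings (0.0 if no bits are set)."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0

def knearest_sketch(query_fp, target_fps, k=3, threshold=0.7):
    """Return up to k (target index, score) pairs, highest score first."""
    scored = ((i, tanimoto(query_fp, fp)) for i, fp in enumerate(target_fps))
    passing = [(i, score) for i, score in scored if score >= threshold]
    # nlargest returns its result sorted from largest to smallest key.
    return heapq.nlargest(k, passing, key=lambda hit: hit[1])

targets = [b"\x0f", b"\x07", b"\xf0", b"\x03", b"\x1f"]
print(knearest_sketch(b"\x0f", targets, k=2, threshold=0.5))
# [(0, 1.0), (4, 0.8)]
```

Fewer than k hits come back when not enough targets reach the threshold, which is why the examples check the length of each result.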

chemfp.search.knearest_tanimoto_search_arena(query_arena, target_arena, k=3, threshold=0.7)

Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena

The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
results = chemfp.search.knearest_tanimoto_search_arena(queries, targets, k=3, threshold=0.5)
for query_id, query_hits in zip(queries.ids, results):
    if len(query_hits) >= 2:
        print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResults

chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.7, batch_size=100)

Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena

The hits in the SearchResults are ordered by decreasing similarity score.

The computation can take a long time. Python won’t check for a ^C until the function finishes, which can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("queries.fps")
results = chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.8)
for (query_id, hits) in zip(arena.ids, results):
    print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in  hits.get_ids_and_scores()))
Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

a chemfp.search.SearchResults

chemfp.search.knearest_tversky_search_fp(query_fp, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0)

Search for k-nearest hits in target_arena which are at least threshold similar to query_fp

The hits in the chemfp.search.SearchResult are ordered by decreasing similarity score.

Example:

query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
targets = chemfp.load_fingerprints("targets.fps")
print(list(chemfp.search.knearest_tversky_search_fp(
        query_fp, targets, k=3, threshold=0.0, alpha=0.5, beta=0.5)))
Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResult

chemfp.search.knearest_tversky_search_arena(query_arena, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0)

Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena

The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.

Example:

queries = chemfp.load_fingerprints("queries.fps")
targets = chemfp.load_fingerprints("targets.fps")
results = chemfp.search.knearest_tversky_search_arena(
      queries, targets, k=3, threshold=0.5, alpha=0.5, beta=0.5)
for query_id, query_hits in zip(queries.ids, results):
    if len(query_hits) >= 2:
        print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

a chemfp.search.SearchResults

chemfp.search.knearest_tversky_search_symmetric(arena, k=3, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100)

Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena

The hits in the SearchResults are ordered by decreasing similarity score.

The computation can take a long time. Python won’t check for a ^C until the function finishes, which can be irritating. Instead, the function processes only batch_size rows at a time before checking for a ^C.

Note: the batch_size may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter.

Example:

arena = chemfp.load_fingerprints("queries.fps")
results = chemfp.search.knearest_tversky_search_symmetric(
         arena, k=3, threshold=0.8, alpha=0.5, beta=0.5)
for (query_id, hits) in zip(arena.ids, results):
    print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in  hits.get_ids_and_scores()))
Parameters:
  • arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
  • k (positive integer) – the number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • batch_size (integer) – the number of rows to process before checking for a ^C
Returns:

a chemfp.search.SearchResults

chemfp.search.contains_fp(query_fp, target_arena)

Find the target fingerprints which contain the query fingerprint bits as a subset

A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResult containing all of the target fingerprints in target_arena that contain the query_fp.

The SearchResult scores are all 0.0.

There is currently no direct way to limit the arena search range. Instead, create a subarena by using Python’s slice notation on the arena, then search the subarena.
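
The containment test itself is a bitwise check: a target contains the query when ANDing the two gives back the query. The sketch below shows the test and the slice-then-search idea on a plain list of byte strings. The helpers contains and contains_fp_sketch are hypothetical stand-ins, not chemfp functions.

```python
def contains(query_fp, target_fp):
    """True if every on bit of query_fp is also on in target_fp."""
    return all((q & t) == q for q, t in zip(query_fp, target_fp))

def contains_fp_sketch(query_fp, target_fps):
    """Return (target index, 0.0) pairs for the targets that contain query_fp."""
    return [(i, 0.0) for i, fp in enumerate(target_fps) if contains(query_fp, fp)]

targets = [b"\x0f\x00", b"\x07\x01", b"\xff\xff", b"\x00\x00"]
query = b"\x03\x01"
print(contains_fp_sketch(query, targets))       # [(1, 0.0), (2, 0.0)]
print(contains_fp_sketch(query, targets[2:]))   # [(0, 0.0)]
```

In this sketch the indices from the sliced search are relative to the slice, so they need the slice offset added back to refer to the full list.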

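Parameters:
  • query_fp (a byte string) – the query fingerprint
  • target_arena (a chemfp.arena.FingerprintArena) – the target arena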
Returns:

a SearchResult instance

chemfp.search.contains_arena(query_arena, target_arena)

Find the target fingerprints which contain the query fingerprints as a subset

A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a chemfp.search.SearchResults where SearchResults[i] contains all of the target fingerprints in target_arena that contain the fingerprint for entry query_arena[i].

The SearchResult scores are all 0.0.

There is currently no direct way to limit the arena search range, though you can create and search a subarena by using Python’s slice notation.

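Parameters:
  • query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
  • target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.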
Returns:

a chemfp.search.SearchResults instance, of the same size as query_arena

SearchResults

class chemfp.search.SearchResults

Search results for a list of query fingerprints against a target arena

This acts like a list of SearchResult elements, with the ability to iterate over each search result, look them up by index, and get the number of rows.

In addition, there are helper methods to iterate over each hit and to get the hit indices, scores, and identifiers directly as Python lists, sort the list contents, and more.

__len__()

The number of rows in the SearchResults

__iter__()

Iterate over each SearchResult

__getitem__(i)

Get the i-th SearchResult

shape

Read-only attribute: the tuple (number of rows, number of columns).

The number of columns is the size of the target arena.

iter_indices()

For each search result, yield the list of target indices

iter_ids()

For each search result, yield the list of target identifiers

iter_scores()

For each search result, yield the list of target scores

iter_indices_and_scores()

For each search result, yield the list of (target index, score) tuples

iter_ids_and_scores()

For each search result, yield the list of (target id, score) tuples

clear_all()

Remove all hits from all of the search results

count_all(min_score=None, max_score=None, interval="[]")

Count the number of hits with a score between min_score and max_score

Using the default parameters this returns the number of hits in the result.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.
  • max_score (a float, or None for +infinity) – the maximum score in the range.
  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns:

an integer count
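
The four interval strings map directly onto a choice of <= or < at each end, with None meaning that the end is unbounded. Here is a small sketch of that logic applied to rows of (target index, score) pairs. The names in_interval and count_all_sketch are made up for illustration and are not part of chemfp.

```python
import operator

def in_interval(score, min_score=None, max_score=None, interval="[]"):
    """True if score lies in the interval; None means that end is unbounded."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    lower_ok = operator.le if interval[0] == "[" else operator.lt
    upper_ok = operator.le if interval[1] == "]" else operator.lt
    if min_score is not None and not lower_ok(min_score, score):
        return False
    if max_score is not None and not upper_ok(score, max_score):
        return False
    return True

def count_all_sketch(rows, min_score=None, max_score=None, interval="[]"):
    """Count the hits in every row whose score falls in the interval."""
    return sum(1 for row in rows for _, score in row
               if in_interval(score, min_score, max_score, interval))

rows = [[(1, 0.5), (3, 0.75)], [(0, 1.0)], []]
print(count_all_sketch(rows))                   # 3
print(count_all_sketch(rows, 0.5, 1.0, "()"))   # 1
print(count_all_sketch(rows, 0.5, 1.0, "[)"))   # 2
```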

cumulative_score_all(min_score=None, max_score=None, interval="[]")

The sum of all scores in all rows which are between min_score and max_score

Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.
  • max_score (a float, or None for +infinity) – the maximum score in the range.
  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns:

a floating point value

reorder_all(order="decreasing-score")

Reorder the hits for all of the rows based on the requested order.

The available orderings are:

  • increasing-score - sort by increasing score
  • decreasing-score - sort by decreasing score
  • increasing-index - sort by increasing target index
  • decreasing-index - sort by decreasing target index
  • move-closest-first - move the hit with the highest score to the first position
  • reverse - reverse the current ordering
Parameters:
  • order (string) – the name of the ordering to use

to_csr(dtype=None)

Return the results as a SciPy compressed sparse row matrix.

The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.

By default the scores are stored with the dtype “float64”.

This method requires that SciPy (and NumPy) be installed.
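
For readers new to the compressed sparse row layout, the sketch below builds the three CSR arrays by hand from rows of (target index, score) pairs. It is only meant to show what ends up in the matrix. The function name to_csr_arrays is hypothetical, and to_csr() may well build its matrix differently.

```python
def to_csr_arrays(rows):
    """Build the (data, indices, indptr) triple of the CSR layout."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for target_index, score in row:
            indices.append(target_index)   # column of this hit
            data.append(score)             # value stored at (row, column)
        indptr.append(len(indices))        # where the next row starts
    return data, indices, indptr

rows = [[(1, 0.75), (3, 0.8)], [], [(0, 1.0)]]
data, indices, indptr = to_csr_arrays(rows)
print(data)      # [0.75, 0.8, 1.0]
print(indices)   # [1, 3, 0]
print(indptr)    # [0, 2, 2, 3]

# With SciPy installed, the same triple can be handed to
#   scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 4))
# where the shape is (number of rows, number of columns).
```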

Parameters:
  • dtype (string or NumPy type) – a NumPy numeric data type

SearchResult

class chemfp.search.SearchResult

Search results for a query fingerprint against a target arena.

The result contains a list of hits. Each hit has a target index, a score, and an optional target id. The hits can be reordered based on score or index.

__len__()

The number of hits

__iter__()

Iterate through the pairs of (target index, score) using the current ordering

clear()

Remove all hits from this result

get_indices()

The list of target indices, in the current ordering.

get_ids()

The list of target identifiers (if available), in the current ordering

iter_ids()

Iterate over target identifiers (if available), in the current ordering

get_scores()

The list of target scores, in the current ordering

get_ids_and_scores()

The list of (target identifier, target score) pairs, in the current ordering

Raises a TypeError if the target IDs are not available.

get_indices_and_scores()

The list of (target index, score) pairs, in the current ordering

reorder(ordering="decreasing-score")

Reorder the hits based on the requested ordering.

The available orderings are:
  • increasing-score - sort by increasing score
  • decreasing-score - sort by decreasing score
  • increasing-index - sort by increasing target index
  • decreasing-index - sort by decreasing target index
  • move-closest-first - move the hit with the highest score to the first position
  • reverse - reverse the current ordering
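
The orderings can be pictured as operations on a list of (target index, score) pairs. The following sketch is an illustration only. The name reorder_sketch is made up, tied scores keep their existing relative order here, and move-closest-first is assumed to swap the best hit with the first one, which the description above does not actually specify.

```python
def reorder_sketch(hits, ordering="decreasing-score"):
    """Reorder a list of (target index, score) pairs in place."""
    if ordering == "increasing-score":
        hits.sort(key=lambda hit: hit[1])
    elif ordering == "decreasing-score":
        hits.sort(key=lambda hit: hit[1], reverse=True)
    elif ordering == "increasing-index":
        hits.sort(key=lambda hit: hit[0])
    elif ordering == "decreasing-index":
        hits.sort(key=lambda hit: hit[0], reverse=True)
    elif ordering == "move-closest-first":
        if hits:
            best = max(range(len(hits)), key=lambda i: hits[i][1])
            # Assumption: swap the best hit into position 0 (an O(n) operation).
            hits[0], hits[best] = hits[best], hits[0]
    elif ordering == "reverse":
        hits.reverse()
    else:
        raise ValueError("Unknown ordering: %r" % (ordering,))

hits = [(5, 0.5), (2, 0.9), (7, 0.7)]
reorder_sketch(hits, "move-closest-first")
print(hits)   # [(2, 0.9), (5, 0.5), (7, 0.7)]
reorder_sketch(hits, "increasing-score")
print(hits)   # [(5, 0.5), (7, 0.7), (2, 0.9)]
```
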
Parameters:
  • ordering (string) – the name of the ordering to use

count(min_score=None, max_score=None, interval="[]")

Count the number of hits with a score between min_score and max_score

Using the default parameters this returns the number of hits in the result.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.
  • max_score (a float, or None for +infinity) – the maximum score in the range.
  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns:

an integer count

cumulative_score(min_score=None, max_score=None, interval="[]")

The sum of the scores which are between min_score and max_score

Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.

The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.

The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.

Parameters:
  • min_score (a float, or None for -infinity) – the minimum score in the range.
  • max_score (a float, or None for +infinity) – the maximum score in the range.
  • interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns:

a floating point value
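
The interval rules shared by count() and cumulative_score() can be sketched in plain Python. This is only an illustration of the documented semantics, not chemfp's implementation; the helper names in_interval, count_hits, and sum_scores are hypothetical and operate on an ordinary list of scores.

```python
import math

def in_interval(score, min_score=None, max_score=None, interval="[]"):
    """Return True if score falls inside the described interval."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    # None means "unbounded" on that side.
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    lower_ok = (score >= lo) if interval[0] == "[" else (score > lo)
    upper_ok = (score <= hi) if interval[1] == "]" else (score < hi)
    return lower_ok and upper_ok

def count_hits(scores, min_score=None, max_score=None, interval="[]"):
    """Count the scores inside the interval."""
    return sum(1 for s in scores if in_interval(s, min_score, max_score, interval))

def sum_scores(scores, min_score=None, max_score=None, interval="[]"):
    """Sum the scores inside the interval."""
    return math.fsum(s for s in scores if in_interval(s, min_score, max_score, interval))

scores = [0.25, 0.5, 0.75, 1.0]
print(count_hits(scores))                   # every score is counted by default
print(count_hits(scores, 0.5, 1.0, "[)"))   # 0.5 is included, 1.0 is excluded
print(sum_scores(scores, 0.5, 1.0, "(]"))   # 0.5 is excluded, 1.0 is included
```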

chemfp.bitops module

The following functions from the chemfp.bitops module provide low-level bit operations on byte and hex fingerprints.

chemfp.bitops.byte_contains(sub_fp, super_fp)

Return 1 if the on bits of sub_fp are also 1 bits in super_fp, that is, if super_fp contains sub_fp.

chemfp.bitops.byte_contains_bit(fp, bit_index)

Return True if the given bit position is on, otherwise False

chemfp.bitops.byte_difference(fp1, fp2)

Return the absolute difference (xor) between the two byte strings, fp1 ^ fp2

chemfp.bitops.byte_from_bitlist(fp[, num_bits=1024])

Convert a list of bit positions into a byte fingerprint, including modulo folding

chemfp.bitops.byte_hex_tanimoto(fp1, fp2)

Compute the Tanimoto similarity between the byte fingerprint fp1 and the hex fingerprint fp2. Return a float between 0.0 and 1.0, or raise a ValueError if fp2 is not a hex fingerprint

chemfp.bitops.byte_hex_tversky(fp1, fp2, alpha=1.0, beta=1.0)

Compute the Tversky index between the byte fingerprint fp1 and the hex fingerprint fp2. Return a float between 0.0 and 1.0, or raise a ValueError if fp2 is not a hex fingerprint

chemfp.bitops.byte_intersect(fp1, fp2)

Return the intersection of the two byte strings, fp1 & fp2

chemfp.bitops.byte_intersect_popcount(fp1, fp2)

Return the number of bits set in the intersection of the two byte fingerprints fp1 and fp2

chemfp.bitops.byte_popcount(fp)

Return the number of bits set in the byte fingerprint fp

chemfp.bitops.byte_tanimoto(fp1, fp2)

Compute the Tanimoto similarity between the two byte fingerprints fp1 and fp2

chemfp.bitops.byte_to_bitlist(bitlist)

Return a sorted list of the on-bit positions in the byte fingerprint

chemfp.bitops.byte_tversky(fp1, fp2, alpha=1.0, beta=1.0)

Compute the Tversky index between the two byte fingerprints fp1 and fp2
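
As a rough guide to what the byte similarity functions compute, here is a pure-Python sketch based on the standard Tanimoto and Tversky definitions. The function names are hypothetical, and returning 0.0 when the denominator is zero is an assumption rather than something this chapter states.

```python
def popcount(fp):
    """Number of on bits in a byte fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)

def intersect_popcount(fp1, fp2):
    """Number of on bits common to both fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))

def tanimoto(fp1, fp2):
    """Tanimoto = |A & B| / |A | B|."""
    c = intersect_popcount(fp1, fp2)
    union = popcount(fp1) + popcount(fp2) - c
    # Assumption: two empty fingerprints score 0.0.
    return c / union if union else 0.0

def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Tversky = c / (alpha*(a - c) + beta*(b - c) + c)."""
    c = intersect_popcount(fp1, fp2)
    a, b = popcount(fp1), popcount(fp2)
    denominator = alpha * (a - c) + beta * (b - c) + c
    # Assumption: a zero denominator scores 0.0.
    return c / denominator if denominator else 0.0

fp1 = bytes([0b00001111])
fp2 = bytes([0b00111100])
print(tanimoto(fp1, fp2))
print(tversky(fp1, fp2, alpha=1.0, beta=0.0))
```

With alpha and beta both 1.0 the Tversky index reduces to the Tanimoto similarity, which is why the two functions agree for the default arguments.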

chemfp.bitops.byte_union(fp1, fp2)

Return the union of the two byte strings, fp1 | fp2

chemfp.bitops.hex_contains(sub_fp, super_fp)

Return 1 if the on bits of sub_fp are also on bits in super_fp, otherwise 0. Return -1 if either string is not a hex fingerprint

chemfp.bitops.hex_contains_bit(fp, bit_index)

Return True if the given bit position is on, otherwise False.

This function does not validate that the hex fingerprint is actually in hex.

chemfp.bitops.hex_difference(fp1, fp2)

Return the absolute difference (xor) between the two hex strings, fp1 ^ fp2. Raises a ValueError for non-hex fingerprints.

chemfp.bitops.hex_from_bitlist(fp[, num_bits=1024])

Convert a list of bit positions into a hex fingerprint, including modulo folding

chemfp.bitops.hex_intersect(fp1, fp2)

Return the intersection of the two hex strings, fp1 & fp2. Raises a ValueError for non-hex fingerprints.

chemfp.bitops.hex_intersect_popcount(fp1, fp2)

Return the number of bits set in the intersection of the two hex fingerprints fp1 and fp2, or raise a ValueError if either string is a non-hex string

chemfp.bitops.hex_isvalid(s)

Return 1 if the string s is a valid hex fingerprint, otherwise 0

chemfp.bitops.hex_popcount(fp)

Return the number of bits set in a hex fingerprint fp, or -1 for non-hex strings

chemfp.bitops.hex_tanimoto(fp1, fp2)

Compute the Tanimoto similarity between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint

chemfp.bitops.hex_tversky(fp1, fp2, alpha=1.0, beta=1.0)

Compute the Tversky index between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint

chemfp.bitops.hex_to_bitlist(bitlist)

Return a sorted list of the on-bit positions in the hex fingerprint

chemfp.bitops.hex_union(fp1, fp2)

Return the union of the two hex strings, fp1 | fp2. Raises a ValueError for non-hex fingerprints.

chemfp.bitops.hex_encode(s)

Encode the byte string or ASCII string to hex. Returns a text string.

chemfp.bitops.hex_encode_as_bytes(s)

Encode the byte string or ASCII string to hex. Returns a byte string.

chemfp.bitops.hex_decode(s)

Decode the hex-encoded value to a byte string
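
For readers without chemfp at hand, the standard library's binascii module gives a feel for these conversions. This sketch assumes the chemfp functions behave like hexlify and unhexlify for byte-string input; it does not call chemfp itself.

```python
import binascii

fp = b"Hi"

# Byte string -> hex, as a byte string and as a text string.
hex_bytes = binascii.hexlify(fp)
hex_text = hex_bytes.decode("ascii")

# Hex -> byte string.
decoded = binascii.unhexlify(hex_text)

print(hex_bytes, hex_text, decoded)
```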

chemfp.encodings

Decode different fingerprint representations into chemfp form. (Currently only decoders are available. Future releases may include encoders.)

The chemfp fingerprints are stored as byte strings, with the bytes in least-significant bit order (bit #0 is stored in the first/left-most byte) and with the bits in most-significant bit order (bit #0 is stored in the first/right-most bit of the first byte).
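
The byte and bit ordering described above means that bit i lives in byte i // 8 under the mask 1 << (i % 8). A minimal sketch, using hypothetical helper names that are not part of chemfp:

```python
def set_bit(fp, bit_index):
    """Return a copy of the byte fingerprint with the given bit turned on."""
    byte_index, offset = divmod(bit_index, 8)
    buf = bytearray(fp)
    buf[byte_index] |= 1 << offset
    return bytes(buf)

def bit_is_on(fp, bit_index):
    """Return True if the given bit is on in the byte fingerprint."""
    byte_index, offset = divmod(bit_index, 8)
    return bool(fp[byte_index] & (1 << offset))

fp = bytes(2)          # 16 bits, all off
fp = set_bit(fp, 0)    # bit 0: lowest bit of the first byte
fp = set_bit(fp, 15)   # bit 15: highest bit of the second byte
print(fp)
```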

Other systems use different encodings. These include:
  • the ‘0’ and ‘1’ characters, as in ‘00111101’
  • hex encoding, like ‘3d’
  • base64 encoding, like ‘SGVsbG8h’
  • CACTVS’s variation of base64 encoding

plus variations of different LSB and MSB orders.

This module decodes most of the fingerprint encodings I have come across. The fingerprint decoders return a 2-tuple of the bit length and the chemfp fingerprint. The bit length is None unless the bit length is known exactly, which currently is only the case for the binary and CACTVS fingerprints. (The hex and other decoders must round the fingerprints up to a multiple of 8 bits.)

chemfp.encodings.from_binary_lsb(text)

Convert a string like ‘00010101’ (bit 0 here is off) into ‘\xa8’

The encoding characters ‘0’ and ‘1’ are in LSB order, so bit 0 is the left-most field. The result is a 2-tuple of the fingerprint length and the decoded chemfp fingerprint.

>>> from_binary_lsb('00010101')
(8, b'\xa8')
>>> from_binary_lsb('11101')
(5, b'\x17')
>>> from_binary_lsb('00000000000000010000000000000')
(29, b'\x00\x80\x00\x00')
>>>
chemfp.encodings.from_binary_msb(text)

Convert a string like ‘10101000’ (bit 0 here is off) into ‘\xa8’

The encoding characters ‘0’ and ‘1’ are in MSB order, so bit 0 is the right-most field.

>>> from_binary_msb(b'10101000')
(8, b'\xa8')
>>> from_binary_msb(b'00010101')
(8, b'\x15')
>>> from_binary_msb(b'00111')
(5, b'\x07')
>>> from_binary_msb(b'00000000000001000000000000000')
(29, b'\x00\x80\x00\x00')
>>>
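
The two binary decoders differ only in which end of the text holds bit 0. The following pure-Python sketch reproduces the documented outputs; decode_binary_lsb and decode_binary_msb are hypothetical names, and this is not chemfp's implementation.

```python
def decode_binary_lsb(text):
    """Decode '0'/'1' characters where bit 0 is the left-most character."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    num_bits = len(text)
    buf = bytearray((num_bits + 7) // 8)  # round up to whole bytes
    for i, ch in enumerate(text):
        if ch == "1":
            buf[i // 8] |= 1 << (i % 8)
        elif ch != "0":
            raise ValueError("not a binary digit: %r" % (ch,))
    return num_bits, bytes(buf)

def decode_binary_msb(text):
    """Decode '0'/'1' characters where bit 0 is the right-most character."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    # Reversing the text turns MSB order into LSB order.
    return decode_binary_lsb(text[::-1])

print(decode_binary_lsb("00010101"))
print(decode_binary_msb(b"10101000"))
```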
chemfp.encodings.from_base64(text)

Decode a base64 encoded fingerprint string

The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.

>>> from_base64("SGk=")
(None, b'Hi')
>>> from binascii import hexlify
>>> hexlify(from_base64("SGk=")[1])
b'4869'
>>>
chemfp.encodings.from_hex(text)

Decode a hex encoded fingerprint string

The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.

>>> from_hex(b'10f2')
(None, b'\x10\xf2')
>>>

Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.

chemfp.encodings.from_hex_msb(text)

Decode a hex encoded fingerprint string where the bits and bytes are in MSB order

>>> from_hex_msb(b'10f2')
(None, b'\xf2\x10')
>>>

Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.

chemfp.encodings.from_hex_lsb(text)

Decode a hex encoded fingerprint string where the bits and bytes are in LSB order

>>> from_hex_lsb(b'102f')
(None, b'\x08\xf4')
>>>

Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character.
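
To make the difference between the three hex decoders concrete, here is a pure-Python sketch that reproduces the examples above. The names are hypothetical and the bit-reversal rule for the LSB variant is inferred from the documented example, so treat it as an illustration rather than chemfp's implementation.

```python
import binascii

def decode_hex(text):
    """Hex already in chemfp form: decode each pair of digits to one byte."""
    return None, binascii.unhexlify(text)

def decode_hex_msb(text):
    """Bytes in MSB order: decode, then reverse the byte order."""
    return None, binascii.unhexlify(text)[::-1]

# Each hex digit with its four bits reversed, e.g. 0b0001 -> 0b1000.
_REVERSED_NIBBLE = [int(format(n, "04b")[::-1], 2) for n in range(16)]

def decode_hex_lsb(text):
    """Bits and bytes in LSB order: the first digit of each pair holds the
    low four bits, and the bits within each digit are reversed."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    if len(text) % 2:
        raise ValueError("hex string length must be a multiple of 2")
    out = bytearray()
    for i in range(0, len(text), 2):
        low = _REVERSED_NIBBLE[int(text[i], 16)]
        high = _REVERSED_NIBBLE[int(text[i + 1], 16)]
        out.append(low | (high << 4))
    return None, bytes(out)

print(decode_hex(b"10f2"), decode_hex_msb(b"10f2"), decode_hex_lsb(b"102f"))
```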

chemfp.encodings.from_cactvs(text)

Decode an 881-bit CACTVS-encoded fingerprint used by PubChem

>>> from_cactvs(b"AAADceB7sQAEAAAAAAAAAAAAAAAAAWAAAAAwAAAAAAAAAAABwAAAHwIYAAAADA" +
...             b"rBniwygJJqAACqAyVyVACSBAAhhwIa+CC4ZtgIYCLB0/CUpAhgmADIyYcAgAAO" +
...             b"AAAAAAABAAAAAAAAAAIAAAAAAAAAAA==")
(881, b'\x07\xde\x8d\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x06\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x80\x03\x00\x00\xf8@\x18\x00\x00\x000P\x83y4L\x01IV\x00\x00U\xc0\xa4N*\x00I \x00\x84\xe1@X\x1f\x04\x1df\x1b\x10\x06D\x83\xcb\x0f)%\x10\x06\x19\x00\x13\x93\xe1\x00\x01\x00p\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00\x00')
>>>
For format details, see
ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
chemfp.encodings.from_daylight(text)

Decode a Daylight ASCII fingerprint

>>> from_daylight(b"I5Z2MLZgOKRcR...1")
(None, b'PyDaylight')

See the implementation for format details.

chemfp.encodings.from_on_bit_positions(text, num_bits=1024, separator=" ")

Decode from a list of integers describing the location of the on bits

>>> from_on_bit_positions("1 4 9 63", num_bits=32)
(32, b'\x12\x02\x00\x80')
>>> from_on_bit_positions("1,4,9,63", num_bits=64, separator=",")
(64, b'\x12\x02\x00\x00\x00\x00\x00\x80')

The text contains a sequence of non-negative integer values separated by the separator text. Bit positions are folded modulo num_bits.

This is often used to convert sparse fingerprints into a dense fingerprint.

Note: if you have a list of bit positions as integer values then you probably want to use chemfp.bitops.byte_from_bitlist().
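
The modulo folding can be sketched in a few lines of plain Python. The function name decode_on_bit_positions is hypothetical; the sketch reproduces the two documented examples but is not chemfp's implementation.

```python
def decode_on_bit_positions(text, num_bits=1024, separator=" "):
    """Fold the listed bit positions into a dense num_bits-long fingerprint."""
    if num_bits <= 0:
        raise ValueError("num_bits must be positive")
    buf = bytearray((num_bits + 7) // 8)
    for field in text.split(separator):
        position = int(field)
        if position < 0:
            raise ValueError("bit positions must be non-negative")
        position %= num_bits               # modulo folding
        buf[position // 8] |= 1 << (position % 8)
    return num_bits, bytes(buf)

# Position 63 folds to 63 % 32 == 31 when num_bits is 32.
print(decode_on_bit_positions("1 4 9 63", num_bits=32))
print(decode_on_bit_positions("1,4,9,63", num_bits=64, separator=","))
```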

chemfp.fps_io module

This module is part of the private API. Do not import it directly.

The function chemfp.open() returns an FPSReader if the source is an FPS file. The function chemfp.open_fingerprint_writer() returns an FPSWriter if the destination is an FPS file.

FPSReader

class chemfp.fps_io.FPSReader

FPS file reader

This class implements the chemfp.FingerprintReader API. It is also its own context manager, which automatically closes the file when the manager exits.

The public attributes are:

metadata

a chemfp.Metadata instance with information about the fingerprint type

location

a chemfp.io.Location instance with parser location and state information

closed

False if the file is open, else True

The FPSReader.location only tracks the “lineno” variable.

__iter__()

Iterate through the (id, fp) pairs

iter_arenas(arena_size=1000)

Iterate through arena_size fingerprints at a time, as subarenas

Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.

This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.

If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.

Parameters:arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
Returns:an iterator of chemfp.arena.FingerprintArena instances
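
The batching pattern behind iter_arenas can be illustrated with ordinary lists standing in for arenas. This is a sketch of the idea only: iter_batches is a hypothetical helper, and real arenas are chemfp.arena.FingerprintArena instances, not lists.

```python
from itertools import islice

def iter_batches(id_fp_pairs, batch_size=1000):
    """Yield lists of up to batch_size (id, fp) pairs, in input order.

    A batch_size of None yields one batch holding everything.
    """
    iterator = iter(id_fp_pairs)
    if batch_size is None:
        yield list(iterator)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be positive or None")
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

pairs = [("id%d" % i, bytes([i])) for i in range(5)]
for batch in iter_batches(pairs, batch_size=2):
    print([record_id for record_id, fp in batch])
```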
save(destination, format=None)

Save the fingerprints to a given destination and format

The output format is determined by format. If format is None then the output format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps” or “fps.gz” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • format (None, "fps", "fps.gz", or "fpb") – the output format
Returns:

None

get_fingerprint_type()

Get the fingerprint type object based on the metadata’s type field

This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.

This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.

Returns:a chemfp.types.FingerprintType
close()

Close the file

count_tanimoto_hits_fp(query_fp, threshold=0.7)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the reader which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

integer count

count_tanimoto_hits_arena(queries, threshold=0.7)

Count the fingerprints which are sufficiently similar to each query fingerprint

Returns a list containing a count for each query fingerprint in the queries arena. The count is the number of fingerprints in the reader which are at least threshold similar to the query fingerprint.

The order of results is the same as the order of the queries.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

list of integer counts, one for each query

count_tversky_hits_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the reader which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

integer count

threshold_tanimoto_search_fp(query_fp, threshold=0.7)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

threshold_tanimoto_search_arena(queries, threshold=0.7)

Find the fingerprints which are sufficiently similar to each of the query fingerprints

For each fingerprint in the queries arena, find all of the fingerprints in this reader which are at least threshold similar. The hits are returned as a SearchResults, where the hits in each SearchResult are in arbitrary order.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResults

threshold_tversky_search_fp(query_fp, threshold=0.7)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

knearest_tanimoto_search_fp(query_fp, k=3, threshold=0.7)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult
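
The "threshold first, then top k" logic can be sketched as a linear scan over (id, fingerprint) pairs. This is an illustration using hypothetical helper names; it returns a plain list of (id, score) pairs rather than a SearchResult, and treats a zero denominator as a score of 0.0 by assumption.

```python
import heapq

def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte fingerprints."""
    common = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return common / union if union else 0.0  # assumption: 0/0 scores 0.0

def knearest_scan(query_fp, id_fp_pairs, k=3, threshold=0.7):
    """Return up to k (id, score) pairs with score >= threshold, best first."""
    scored = ((target_id, tanimoto(query_fp, fp)) for target_id, fp in id_fp_pairs)
    passing = [hit for hit in scored if hit[1] >= threshold]
    return heapq.nlargest(k, passing, key=lambda hit: hit[1])

query = bytes([0b00001111])
targets = [
    ("A", bytes([0b00001111])),  # identical to the query
    ("B", bytes([0b00000111])),  # 3 of 4 bits in common
    ("C", bytes([0b00011111])),  # 4 common bits out of 5 in the union
    ("D", bytes([0b11110000])),  # nothing in common
]
print(knearest_scan(query, targets, k=2, threshold=0.7))
```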

knearest_tanimoto_search_arena(queries, k=3, threshold=0.7)

Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints

For each fingerprint in the queries arena, find the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResults, where the hits in each SearchResult are sorted by similarity score.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResults

knearest_tversky_search_fp(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this reader which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

FPSWriter

class chemfp.fps_io.FPSWriter

Write fingerprints in FPS format.

This is a subclass of chemfp.FingerprintWriter.

Instances have the following attributes:

An FPSWriter is its own context manager, and will close the output file on context exit.

The Location instance supports the “recno”, “output_recno”, and “lineno” properties.

write_fingerprint(id, fp)

Write a single fingerprint record with the given id and fp

Parameters:
  • id (string) – the record identifier
  • fp (bytes) – the fingerprint
write_fingerprints(id_fp_pairs)

Write a sequence of fingerprint records

Parameters:id_fp_pairs – An iterable of (id, fingerprint) pairs.
close()

Close the writer

This will set self.closed to True.

chemfp.fpb_io module

This module is part of the private API. Do not import it directly.

The function chemfp.open_fingerprint_writer() returns an OrderedFPBWriter if the destination is an FPB file and reorder is True, or an InputOrderFPBWriter if reorder is False.

OrderedFPBWriter

class chemfp.fpb_io.OrderedFPBWriter

Fingerprint writer for FPB files where the fingerprints are reordered (sorted by popcount) rather than kept in input order

This is a subclass of chemfp.FingerprintWriter.

Instances have the following public attributes:

metadata

a chemfp.Metadata instance

closed

False when the file is open, else True

Other attributes (like “alignment”, “include_hash”, “include_popc”, “max_spool_size”, and “tmpdir”) are undocumented and subject to change in the future. Let me know if they are useful.

An OrderedFPBWriter is also its own context manager, and will close the writer on context exit.

write_fingerprint

class chemfp.fpb_io.write_fingerprint

Write a single fingerprint record with the given id and fp to the destination

Parameters:
  • id (string) – the record identifier
  • fp (bytes) – the fingerprint

write_fingerprints

class chemfp.fpb_io.write_fingerprints

Write a sequence of (id, fingerprint) pairs to the destination

Parameters:id_fp_pairs – An iterable of (id, fingerprint) pairs.

close

class chemfp.fpb_io.close

Close the output writer

InputOrderFPBWriter

class chemfp.fpb_io.InputOrderFPBWriter

Fingerprint writer for FPB files which preserves the input fingerprint order

This is a subclass of chemfp.FingerprintWriter.

Instances have the following public attributes:

metadata

a chemfp.Metadata instance

closed

False when the file is open, else True

Other attributes (like “alignment”, “include_hash”, “include_popc”, “max_spool_size”, and “tmpdir”) are undocumented and subject to change in the future. Let me know if they are useful.

An InputOrderFPBWriter is also its own context manager, and will close the writer on context exit.

write_fingerprint

class chemfp.fpb_io.write_fingerprint

Write a single fingerprint record with the given id and fp to the destination

Parameters:
  • id (string) – the record identifier
  • fp (bytes) – the fingerprint

write_fingerprints

class chemfp.fpb_io.write_fingerprints

Write a sequence of (id, fingerprint) pairs to the destination

Parameters:id_fp_pairs – An iterable of (id, fingerprint) pairs.

close

class chemfp.fpb_io.close

Close the output writer

This will set self.closed to True.

chemfp toolkit API

Open Babel, OEChem and RDKit have different ways to read and write molecules. The chemfp toolkit API is a common wrapper API for structure I/O. The chemfp functions work with native toolkit molecules; chemfp does not have a common molecule API. (For that, use Cinfony.)

While the API is the same across openbabel_toolkit, openeye_toolkit, rdkit_toolkit, and text_toolkit, there are some differences in how they work. For example, each of the toolkits has its own set of reader and writer arguments. The details are available in the documentation, and this chapter acts as a pointer to the specific toolkit documentation.

name

chemfp.toolkit.name

The string “openbabel”, “openeye”, “rdkit”, or “text”.

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

software

chemfp.toolkit.software

A string like “OpenBabel/2.4.1”, “OEChem/20170208”, “RDKit/2016.09.3” or “chemfp/3.1”.

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

is_licensed

chemfp.toolkit.is_licensed()

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Check if the toolkit is licensed.

get_formats

chemfp.toolkit.get_formats(include_unavailable=False)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Return a list of structure formats.

get_input_formats

chemfp.toolkit.get_input_formats()

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Return a list of input structure formats.

get_output_formats

chemfp.toolkit.get_output_formats()

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Return a list of output structure formats.

get_format

chemfp.toolkit.get_format(format)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get a named format.

get_input_format

chemfp.toolkit.get_input_format(format)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get a named input format.

get_output_format

chemfp.toolkit.get_output_format(format)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get a named output format.

get_input_format_from_source

chemfp.toolkit.get_input_format_from_source(source=None, format=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get a format given an input source.

get_output_format_from_destination

chemfp.toolkit.get_output_format_from_destination(destination=None, format=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get a format given an output destination.

read_molecules

chemfp.toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Read molecules from a structure file.

read_molecules_from_string

chemfp.toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Read molecules from structure data stored in a string.

read_ids_and_molecules

chemfp.toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Read ids and molecules from a structure file.

read_ids_and_molecules_from_string

chemfp.toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Read ids and molecules from structure data stored in a string.

make_id_and_molecule_parser

chemfp.toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict")

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Make a specialized function which returns the id and molecule given a structure record.

parse_molecule

chemfp.toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Parse a structure record into a molecule.

parse_id_and_molecule

chemfp.toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Parse a structure record into an id and molecule.

create_string

chemfp.toolkit.create_string(mol, format, id=None, writer_args=None, errors="strict")

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Convert a molecule into a Unicode string containing a structure record.

create_bytes

chemfp.toolkit.create_bytes(mol, format, id=None, writer_args=None, errors="strict")

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Convert a molecule into a byte string containing a structure record.

open_molecule_writer

chemfp.toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Create an output molecule writer, for writing to a file.

open_molecule_writer_to_string

chemfp.toolkit.open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Create an output molecule writer, for writing to a Unicode string.

open_molecule_writer_to_bytes

chemfp.toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Create an output molecule writer, for writing to a byte string.

copy_molecule

chemfp.toolkit.copy_molecule(mol)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Make a copy of a toolkit molecule.

add_tag

chemfp.toolkit.add_tag(mol, tag, value)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Add an SD tag to the molecule.

get_tag

chemfp.toolkit.get_tag(mol, tag)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get an SD tag for a molecule.

get_tag_pairs

chemfp.toolkit.get_tag_pairs(mol)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get the list of tag name and tag value pairs.

get_id

chemfp.toolkit.get_id(mol)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Get the molecule id.

set_id

chemfp.toolkit.set_id(mol, id)

[openbabel_toolkit] [openeye_toolkit] [rdkit_toolkit] [text_toolkit]

Set the molecule id.

chemfp.base_toolkit

The chemfp.base_toolkit module contains a few objects which are shared by the different toolkits. There should be no reason for you to import the module yourself.

FormatMetadata

The metadata attribute of the toolkit readers and writers is a FormatMetadata instance. It contains information about the structure file.

Note that this is not the same as the fingerprint chemfp.Metadata instance, which contains information about the fingerprint file.

FormatMetadata

class chemfp.base_toolkit.FormatMetadata

Information about the reader or writer

The public attributes are:

filename

the source or destination filename, the string “<string>” for string-based I/O, or None if not known

record_format

the normalized record format name. All SMILES formats are “smi”, and this does not contain compression information

args

the final reader_args or writer_args, after all processing, and as used by the reader and writer

__repr__()

Return a string like ‘FormatMeta(filename=”cmpds.sdf.gz”, record_format=”sdf”, args={})’

BaseMoleculeReader

class chemfp.base_toolkit.BaseMoleculeReader

Base class for the toolkit readers

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the reader is open, otherwise True

Readers are iterators, so iter(reader) returns itself. next(reader) returns either a single object or a pair of objects, depending on the reader.

Readers are also context managers, and call self.close() during exit.
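
The reader protocol described here (an iterator that is also its own context manager, with a closed flag) can be sketched with a small stand-in class. SketchReader is hypothetical and reads from an in-memory list; it only illustrates the protocol and is not one of the chemfp.base_toolkit classes.

```python
class SketchReader:
    """Hypothetical reader: iterates over records and closes itself on exit."""

    def __init__(self, records):
        self._records = iter(records)
        self.closed = False      # False while open, True once closed

    def __iter__(self):
        return self              # readers are their own iterators

    def __next__(self):
        if self.closed:
            raise StopIteration
        return next(self._records)

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False             # do not suppress exceptions

with SketchReader(["record 1", "record 2"]) as reader:
    same_object = iter(reader) is reader
    records = list(reader)

print(records, reader.closed)
```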

chemfp.base_toolkit.close()

Close the reader

If the reader wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set self.closed to True.

class chemfp.base_toolkit.MoleculeReader

Read structures from a file and iterate over the toolkit molecules

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the reader is open, otherwise True

Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time.

class chemfp.base_toolkit.IdAndMoleculeReader

Read structures from a file and iterate over the (id, toolkit molecule) pairs

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the reader is open, otherwise True

Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time.

class chemfp.base_toolkit.RecordReader

Read and iterate over records as strings

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the reader is open, otherwise True

class chemfp.base_toolkit.IdAndRecordReader

Read records from file and iterate over the (id, record string) pairs

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the reader is open, otherwise True

Toolkit writers

The chemfp.toolkit.open_molecule_writer() function returns a chemfp.base_toolkit.MoleculeWriter, and chemfp.toolkit.open_molecule_writer_to_string() returns a chemfp.base_toolkit.MoleculeStringWriter. The two classes implement the chemfp.base_toolkit.BaseMoleculeWriter API, and MoleculeStringWriter also implements getvalue().

BaseMoleculeWriter

class chemfp.base_toolkit.BaseMoleculeWriter

The base molecule writer API, implemented by MoleculeWriter and MoleculeStringWriter

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the writer is open, otherwise True

The writer is a context manager, which calls self.close() when the manager exits.

write_molecule(mol)

Write a toolkit molecule

Parameters:mol (a toolkit molecule) – the molecule to write
write_molecules(mols)

Write a sequence of molecules

Parameters:mols (a toolkit molecule iterator) – the molecules to write
write_id_and_molecule(id, mol)

Write an identifier and toolkit molecule

If id is None then the output uses the molecule’s own id/title. Specifying the id may modify the molecule’s id/title, depending on the format and toolkit.

Parameters:
  • id (string, or None) – the identifier to use for the molecule
  • mol (a toolkit molecule) – the molecule to write
write_ids_and_molecules(ids_and_mols)

Write a sequence of (id, molecule) pairs

This function works well with chemfp.toolkit.read_ids_and_molecules(), for example, to convert an SD file to SMILES file, and use an alternate id_tag to specify an alternative identifier.

Parameters:ids_and_mols (a (id string, toolkit molecule) iterator) – the (id, molecule) pairs to write
close()

Close the writer

If the writer wasn’t previously closed then close it. This will set the location properties to their final values, close any files that the writer may have opened, and set self.closed to True.

class chemfp.base_toolkit.MoleculeWriter

A BaseMoleculeWriter which writes molecules to a file.

The public attributes are:

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the writer is open, otherwise True

The writer is a context manager, which calls self.close() when the manager exits.

class chemfp.base_toolkit.MoleculeStringWriter

A BaseMoleculeWriter which writes molecules to a string.

This class implements the chemfp.base_toolkit.BaseMoleculeWriter API.

metadata

a chemfp.base_toolkit.FormatMetadata instance

location

a chemfp.io.Location instance

closed

False if the writer is open, otherwise True

The writer is a context manager, which calls self.close() when the manager exits.

getvalue()

Get the string containing all of the written records.

This function can also be called after the writer is closed.

Returns:a string
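
The writer lifecycle described above (a context manager that calls close(), a closed attribute, and a getvalue() that still works after the writer is closed) can be illustrated with a small stand-in class. This is a minimal sketch and not chemfp's implementation: SketchStringWriter and write_record() are hypothetical names, and the records are plain strings rather than toolkit molecules.

```python
import io


class SketchStringWriter:
    """Hypothetical stand-in that mimics the documented writer lifecycle."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._value = None   # cached output, filled in by close()
        self.closed = False  # False while the writer is open, True once closed

    def write_record(self, record):
        if self.closed:
            raise ValueError("writer is closed")
        self._buffer.write(record + "\n")

    def getvalue(self):
        # Usable both before and after close(), because close() caches the text.
        if self._value is not None:
            return self._value
        return self._buffer.getvalue()

    def close(self):
        # Closing twice is harmless; only the first call does any work.
        if not self.closed:
            self._value = self._buffer.getvalue()
            self._buffer.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


with SketchStringWriter() as writer:
    writer.write_record("c1ccccc1 benzene")
    writer.write_record("CCO ethanol")

# The context manager has closed the writer, but the output is still available.
output = writer.getvalue()
```

The cache in close() is the design point: a bare io.StringIO raises ValueError if getvalue() is called after close(), so a writer that promises getvalue() after closing has to keep its own copy.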

Format

class chemfp.base_toolkit.Format

Information about a toolkit format.

Use chemfp.toolkit.get_format() and related functions to return a Format instance.

The public properties are:

__repr__()

Return a string like ‘Format(“openeye/sdf.gz”)’

prefix

Read-only attribute.

Return the prefix to turn an unqualified parameter into a fully qualified parameter

Returns:a string like “rdkit.smi” or “openbabel.sdf”
is_input_format

Read-only attribute.

Return True if this toolkit can read molecules in this format

is_output_format

Read-only attribute.

Return True if this toolkit can write molecules in this format

is_available

Read-only attribute.

Return True if this version of the toolkit understands this format

For example, if your version of RDKit does not support InChI then this would return False for the “inchi” and “inchikey” formats.

supports_io

Read-only attribute.

Return True if this format supports reading or writing records

This will return False for formats like “smistring” and “inchikeystring” because those are not record-based formats.

Note: I don’t like this name. I may change it to is_record_format. Let me know if you have ideas, or if changing the name will be a problem.

get_reader_args_from_text_settings(reader_settings)

Process the reader_settings and return the reader_args for this format.

This function exists to help convert string settings, eg, from the command-line or a configuration, into usable reader_args.

Setting names may be fully-qualified names like “rdkit.sdf.sanitize”, partially qualified names like “rdkit.*.sanitize” or “openeye.smi.delimiter”, or unqualified names like “delimiter”. The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format.

The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects. For example:

>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_reader_args_from_text_settings({"rdkit.*.sanitize": "true", "delimiter": "to-eol"})
{'delimiter': 'to-eol', 'sanitize': True}
Parameters:reader_settings (a dictionary with string keys and values) – the reader settings
Returns:a dictionary of unqualified argument names as keys and processed Python values as values
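
The two steps described above, selecting the settings whose qualifiers apply to this toolkit and format and then converting the string values, can be sketched in plain Python. This is an illustration under stated assumptions, not chemfp's code: args_from_text_settings(), to_bool(), and the CONVERTERS table are hypothetical, the accepted boolean spellings are a guess, and the sketch does not model the priority rules used when several names refer to the same argument.

```python
def to_bool(text):
    """Convert a text setting into a bool (the accepted spellings are an assumption)."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes", "on"):
        return True
    if lowered in ("false", "0", "no", "off"):
        return False
    raise ValueError("not a boolean: %r" % (text,))


# Hypothetical converter table: unqualified argument name -> str-to-value function.
CONVERTERS = {"sanitize": to_bool, "has_header": to_bool, "delimiter": str}


def args_from_text_settings(settings, toolkit, format_name):
    """Keep the settings that apply to toolkit/format_name and convert their values."""
    result = {}
    for name, text in settings.items():
        *qualifiers, arg = name.split(".")
        if len(qualifiers) == 2:
            # "toolkit.format.arg", where "*" matches any format
            applies = qualifiers[0] == toolkit and qualifiers[1] in ("*", format_name)
        elif len(qualifiers) == 1:
            # "format.arg"
            applies = qualifiers[0] == format_name
        else:
            # Unqualified names always apply; longer names never do.
            applies = not qualifiers
        if applies and arg in CONVERTERS:
            result[arg] = CONVERTERS[arg](text)
    return result


reader_args = args_from_text_settings(
    {"rdkit.*.sanitize": "true", "delimiter": "to-eol", "openeye.smi.delimiter": "tab"},
    toolkit="rdkit",
    format_name="smi",
)
```

Here the "openeye.smi.delimiter" entry is dropped because its toolkit qualifier does not match, which is what lets one settings dictionary serve several toolkits.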
get_writer_args_from_text_settings(writer_settings)

Process writer_settings and return the writer_args for this format.

This function exists to help convert string settings, eg, from the command-line or a configuration, into usable writer_args.

Setting names may be fully-qualified names like “rdkit.sdf.kekulize”, partially qualified names like “rdkit.*.delimiter” or “openeye.smi.delimiter”, or unqualified names like “delimiter”. The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format.

The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects. For example:

>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_writer_args_from_text_settings({"rdkit.*.kekuleSmiles": "true", "canonical": "false"})
{'kekuleSmiles': True, 'canonical': False}
Parameters:writer_settings (a dictionary with string keys and values) – the writer settings
Returns:a dictionary of unqualified argument names as keys and processed Python values as values
get_default_reader_args()

Return a dictionary of the default reader arguments

The keys are unqualified (ie, without dots).

>>> from chemfp import openbabel_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_default_reader_args()
{'has_header': False, 'delimiter': None, 'options': None}
Returns:a dictionary of string keys and Python objects for values
get_default_writer_args()

Return a dictionary of the default writer arguments

The keys are unqualified (ie, without dots).

>>> from chemfp import openbabel_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_default_writer_args()
{'explicit_hydrogens': False, 'isomeric': True, 'delimiter': None,
'options': None, 'canonicalization': 'default'}
Returns:a dictionary of string keys and Python objects for values
get_unqualified_reader_args(reader_args)

Convert possibly qualified reader args into unqualified reader args for this format

The reader_args dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored.

The get_unqualified_reader_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified reader args dictionary for this format.

>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"})
{'delimiter': 'tab', 'has_header': False, 'sanitize': False}
>>> fmt = T.get_format("can")
>>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"})
{'delimiter': 'tab', 'has_header': False, 'sanitize': True}
Parameters:reader_args (a dictionary) – reader arguments, which can contain qualified and unqualified arguments
Returns:a dictionary of reader arguments, containing only unqualified arguments appropriate for this format.
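
To make the idea of qualifier resolution concrete, here is a minimal pure-Python sketch that reproduces the two results in the example above. It is not chemfp's algorithm: unqualified_args() and specificity() are hypothetical, the rule that a more specific name wins over a less specific one is an assumption, and the default of None for delimiter is a placeholder (the has_header and sanitize defaults are taken from the example output).

```python
# Hypothetical defaults; "delimiter": None is a placeholder value.
DEFAULT_READER_ARGS = {"delimiter": None, "has_header": False, "sanitize": True}


def specificity(qualifiers, toolkit, format_name):
    """Return a rank for qualifiers that apply here, or None if they do not apply."""
    if not qualifiers:
        return 0  # "delimiter"
    if len(qualifiers) == 1:
        return 1 if qualifiers[0] == format_name else None  # "smi.delimiter"
    if len(qualifiers) == 2 and qualifiers[0] == toolkit:
        if qualifiers[1] == "*":
            return 2  # "rdkit.*.delimiter"
        if qualifiers[1] == format_name:
            return 3  # "rdkit.smi.delimiter"
    return None


def unqualified_args(reader_args, toolkit, format_name, defaults=DEFAULT_READER_ARGS):
    """Resolve qualified names into an unqualified dictionary for one format."""
    result = dict(defaults)
    best_rank = {}
    for name, value in reader_args.items():
        *qualifiers, arg = name.split(".")
        rank = specificity(qualifiers, toolkit, format_name)
        if rank is None or arg not in defaults:
            continue  # irrelevant for this toolkit/format, so it is dropped
        if rank >= best_rank.get(arg, -1):
            best_rank[arg] = rank
            result[arg] = value
    return result


args = {"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"}
smi_args = unqualified_args(args, "rdkit", "smi")
can_args = unqualified_args(args, "rdkit", "can")
```

For the "can" format the "smi.sanitize" entry does not apply, so sanitize keeps its default, and the unknown "X" entry is removed in both cases.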
get_unqualified_writer_args(writer_args)

Convert possibly qualified writer args into unqualified writer args for this format

The writer_args dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored.

The get_unqualified_writer_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified writer args dictionary for this format.

>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"})
{'isomericSmiles': True, 'delimiter': 'tab', 'kekuleSmiles': True, 'allBondsExplicit': False, 'canonical': True}
>>> fmt = T.get_format("can")
>>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"})
{'isomericSmiles': False, 'delimiter': 'tab', 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
Parameters:writer_args (a dictionary) – writer arguments, which can contain qualified and unqualified arguments
Returns:a dictionary of writer arguments, containing only unqualified arguments appropriate for this format.

chemfp.openbabel_toolkit module

The chemfp toolkit layer for Open Babel.

name

chemfp.openbabel_toolkit.name

The string “openbabel”.

software

chemfp.openbabel_toolkit.software

A string like “OpenBabel/2.4.1”, where the second part of the string comes from OBReleaseVersion.

is_licensed (openbabel_toolkit)

chemfp.openbabel_toolkit.is_licensed()

Return True - Open Babel is always licensed

Returns:True

get_formats (openbabel_toolkit)

chemfp.openbabel_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that Open Babel supports

If include_unavailable is True then also include Open Babel formats which aren’t available to this specific version of Open Babel.

Parameters:include_unavailable (True or False) – include unavailable formats?
Returns:a list of chemfp.base_toolkit.Format objects

get_input_formats (openbabel_toolkit)

chemfp.openbabel_toolkit.get_input_formats()

Get the list of supported Open Babel input formats

Returns:a list of chemfp.base_toolkit.Format objects

get_output_formats (openbabel_toolkit)

chemfp.openbabel_toolkit.get_output_formats()

Get the list of supported Open Babel output formats

Returns:a list of chemfp.base_toolkit.Format objects

get_format (openbabel_toolkit)

chemfp.openbabel_toolkit.get_format(format_name)

Get the named format, or raise a ValueError

This will raise a ValueError if Open Babel does not implement the format format_name or that format is not available.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format (openbabel_toolkit)

chemfp.openbabel_toolkit.get_input_format(format_name)

Get the named input format, or raise a ValueError

This will raise a ValueError if Open Babel does not implement the format format_name or that format is not an input format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_output_format (openbabel_toolkit)

chemfp.openbabel_toolkit.get_output_format(format_name)

Get the named output format, or raise a ValueError

This will raise a ValueError if Open Babel does not implement the format format_name or that format is not an output format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format_from_source (openbabel_toolkit)

chemfp.openbabel_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object
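
As a rough illustration of filename-based auto-detection with a fallback to uncompressed SMILES, the following sketch splits off a compression suffix and then a format extension. It is only a sketch: guess_format(), the COMPRESSIONS table, and the KNOWN_FORMATS set are hypothetical, and the real function returns a Format object rather than a tuple.

```python
import os

# Assumed lookup tables for the sketch; the real lists depend on the toolkit.
COMPRESSIONS = {".gz": "gz", ".bz2": "bz2"}
KNOWN_FORMATS = {"smi", "can", "sdf", "mol", "inchi", "pdb"}


def guess_format(filename):
    """Return (format_name, compression) guessed from a filename.

    Falls back to ("smi", ""), meaning uncompressed SMILES, when the
    filename gives no usable hint.
    """
    if filename is None:
        return ("smi", "")  # for example, reading from stdin
    base, ext = os.path.splitext(filename.lower())
    compression = COMPRESSIONS.get(ext, "")
    if compression:
        # Strip the compression suffix and look at the extension before it.
        base, ext = os.path.splitext(base)
    format_name = ext[1:]
    if format_name not in KNOWN_FORMATS:
        return ("smi", "")
    return (format_name, compression)


detected = [guess_format(name) for name in ("drugs.sdf.gz", "drugs.smi", "README", None)]
```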

get_output_format_from_destination (openbabel_toolkit)

chemfp.openbabel_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object

read_molecules (openbabel_toolkit)

chemfp.openbabel_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads OBMol molecules from a structure file

Iterate through the structure records in source, which are parsed using the given format. If format is None then auto-detect the format based on the source. For SD files, id_tag names the SD tag that holds the record id, used instead of the title line. (read_molecules() returns only molecules, so it ignores id_tag; the parameter exists to make it easier to switch between reader functions.)

Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.
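
The consequence of reusing one molecule instance is easy to miss, so here is a toolkit-free sketch of the pitfall. SketchMol, read_reusing_instance(), and copy_mol() are hypothetical stand-ins for an OBMol, the reader, and a copy function such as chemfp.openbabel_toolkit.copy_molecule().

```python
class SketchMol:
    """Hypothetical mutable molecule stand-in with a title and a clear() method."""

    def __init__(self):
        self.title = ""

    def clear(self):
        self.title = ""


def read_reusing_instance(titles):
    """Yield the same SketchMol each time, cleared and refilled for each record."""
    mol = SketchMol()
    for title in titles:
        mol.clear()
        mol.title = title
        yield mol


def copy_mol(mol):
    """Return an independent copy of the stand-in molecule."""
    new_mol = SketchMol()
    new_mol.title = mol.title
    return new_mol


titles = ["benzene", "ethanol", "water"]

# Keeping the yielded objects directly: every list entry is the same instance,
# so all of them show the last record that was read.
kept = list(read_reusing_instance(titles))
kept_titles = [mol.title for mol in kept]

# Copying each molecule before the reader advances preserves every record.
copied = [copy_mol(mol) for mol in read_reusing_instance(titles)]
copied_titles = [mol.title for mol in copied]
```

In this sketch kept_titles ends up as three copies of "water", while copied_titles matches the input order.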

The reader_args dictionary parameters depend on the format. Every Open Babel format supports an “options” entry, which is passed to SetOptions(). See that documentation for details. Some formats support additional parameters:

  • SMILES and InChI
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • has_header - True or False
  • SDF
    • implementation - if “openbabel” or None, use the Open Babel record parser; if “chemfp”, use chemfp’s own record parser, which has better location tracking

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
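
The three error modes can be pictured with a generic record loop. This is a sketch of the policy only, not chemfp's reader: parse_records() is a hypothetical helper, the parse callable stands in for a toolkit record parser, and the message format written to stderr is invented for the example.

```python
import sys


def parse_records(records, parse, errors="strict"):
    """Yield parse(record) for each record, handling failures according to errors."""
    # Because this is a generator, the check runs when iteration starts.
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            value = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                sys.stderr.write("record %d: %s\n" % (recno, err))
            continue  # "report" and "ignore" both move on to the next record
        yield value


# int() stands in for a record parser that raises ValueError on bad input.
records = ["1", "not-a-number", "3"]
ignored = list(parse_records(records, int, errors="ignore"))
reported = list(parse_records(records, int, errors="report"))
```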

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See chemfp.openbabel_toolkit.read_ids_and_molecules() if you want (id, OBMol) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating OBMol molecules

read_molecules_from_string (openbabel_toolkit)

chemfp.openbabel_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads OBMol molecules from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.openbabel_toolkit.read_molecules() for details about the other parameters. See chemfp.openbabel_toolkit.read_ids_and_molecules_from_string() if you want to read (id, OBMol) pairs instead of just molecules.

Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating OBMol molecules

read_ids_and_molecules (openbabel_toolkit)

chemfp.openbabel_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads (id, OBMol molecule) pairs from a structure file

See chemfp.openbabel_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, OBMol) pairs instead of just the molecules.

Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, OBMol) pairs

read_ids_and_molecules_from_string (openbabel_toolkit)

chemfp.openbabel_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads (id, OBMol) pairs from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.openbabel_toolkit.read_molecules() for details about the other parameters. See chemfp.openbabel_toolkit.read_molecules_from_string() if you just want to read the OBMol molecules instead of (id, OBMol) pairs.

Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, OBMol) pairs

make_id_and_molecule_parser (openbabel_toolkit)

chemfp.openbabel_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict")

Create a specialized function which takes a record and returns an (id, OBMol) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OBMol for successive calls, so make a copy if you want to keep it around. However, I haven’t really noticed much of a performance difference between this and chemfp.openbabel_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.openbabel_toolkit.read_molecules() for details about the other parameters.

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function of the form parser(record string) -> (id, OBMol)
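
The "validate once, parse many times" design can be sketched as a closure factory. This is not chemfp's code: make_parser() is hypothetical, it handles only a simplified SMILES record of the form "SMILES id", it returns the SMILES string where the real parser returns an OBMol, and returning (None, None) for a bad record in the non-strict modes is an assumption.

```python
def make_parser(format_name, errors="strict"):
    """Validate the arguments once and return a specialized record parser."""
    # One-time validation; the returned function never repeats these checks.
    if format_name != "smi":
        raise ValueError("unsupported format in this sketch: %r" % (format_name,))
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    strict = errors == "strict"

    def parse_smiles_record(record):
        # A record is "SMILES", optionally followed by whitespace and an id.
        fields = record.split(None, 1)
        if not fields:
            if strict:
                raise ValueError("empty record")
            return (None, None)
        smiles = fields[0]
        record_id = fields[1].strip() if len(fields) == 2 else None
        return (record_id, smiles)

    return parse_smiles_record


parser = make_parser("smi")
pairs = [parser(line) for line in ("c1ccccc1 benzene\n", "CCO ethanol\n", "O\n")]
```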

parse_molecule (openbabel_toolkit)

chemfp.openbabel_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from the content string and return an OBMol molecule.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.openbabel_toolkit.read_molecules() for details about the other parameters. See chemfp.openbabel_toolkit.parse_id_and_molecule() if you want the (id, OBMol) pair instead of just the molecule.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an OBMol molecule

parse_id_and_molecule (openbabel_toolkit)

chemfp.openbabel_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from content and return the (id, OBMol) pair.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.openbabel_toolkit.read_molecules() for details about the other parameters. See chemfp.openbabel_toolkit.parse_molecule() if you just want the OBMol molecule and not the (id, OBMol) pair.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an (id, OBMol molecule) pair

create_string (openbabel_toolkit)

chemfp.openbabel_toolkit.create_string(mol, format, id=None, writer_args=None, errors="strict")

Convert an OBMol into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.

Parameters:
  • mol (an Open Babel molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a Unicode string

create_bytes (openbabel_toolkit)

chemfp.openbabel_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors="strict")

Convert an OBMol into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.

Parameters:
  • mol (an Open Babel molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a byte string

open_molecule_writer (openbabel_toolkit)

chemfp.openbabel_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return a MoleculeWriter which can write Open Babel molecules to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write an OBMol molecule, an OBMol molecule iterator, or an (id, OBMol molecule) pair iterator to a file.

Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

The writer_args dictionary parameters depend on the format. Every format supports an options entry, which is passed to Open Babel’s SetOptions(). See the Open Babel documentation for details. Some formats support additional parameters:

  • SMILES
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • isomeric - True to write isomeric SMILES, False or default is non-isomeric
    • canonicalization - True, “default”, or None uses Open Babel’s own canonicalization algorithm; False or “none” to use no canonicalization; “universal” generates a universal SMILES; “anticanonical” generates a SMILES with randomly assigned atom classes; “inchified” uses InChI-fied SMILES
  • InChI and InChIKey
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • include_id - True or default to include the id as the second column; False has no id column
  • SDF
    • always_v3000 - True to always write V3000 files; False or default to write V3000 files only if needed.
    • include_atom_class - True to include atom class; False or default does not
    • include_hcount - True to include hcount; False or default does not

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeWriter expecting Open Babel molecules

open_molecule_writer_to_string (openbabel_toolkit)

chemfp.openbabel_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write Open Babel molecule records to a string.

See chemfp.openbabel_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting Open Babel molecules

open_molecule_writer_to_bytes (openbabel_toolkit)

chemfp.openbabel_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write Open Babel molecule records to a byte string

See chemfp.openbabel_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting Open Babel molecules

copy_molecule (openbabel_toolkit)

chemfp.openbabel_toolkit.copy_molecule(mol)

Return a new OBMol molecule which is a copy of the given Open Babel molecule

Parameters:mol (an Open Babel molecule) – the molecule to copy
Returns:a new OBMol instance

add_tag (openbabel_toolkit)

chemfp.openbabel_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the Open Babel molecule

Raises a KeyError if the tag is a special internal Open Babel name.

Parameters:
  • mol (an Open Babel molecule) – the molecule
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None

get_tag (openbabel_toolkit)

chemfp.openbabel_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

Parameters:
  • mol (an Open Babel molecule) – the molecule
  • tag (string) – the SD tag name
Returns:

a string, or None

get_tag_pairs (openbabel_toolkit)

chemfp.openbabel_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the molecule

Parameters:mol (an Open Babel molecule) – the molecule
Returns:a list of (string name, string value) pairs

get_id (openbabel_toolkit)

chemfp.openbabel_toolkit.get_id(mol)

Get the molecule’s id using Open Babel’s GetTitle()

Parameters:mol (an Open Babel molecule) – the molecule
Returns:a string

set_id (openbabel_toolkit)

chemfp.openbabel_toolkit.set_id(mol, id)

Set the molecule’s id using Open Babel’s SetTitle()

Parameters:
  • mol (an Open Babel molecule) – the molecule
  • id (string) – the new id
Returns:

None

chemfp.openeye_toolkit module

The chemfp toolkit layer for OpenEye.

name

chemfp.openeye_toolkit.name

The string “openeye”.

software

chemfp.openeye_toolkit.software

A string like “OEChem/20170208”, where the second part of the string comes from OEChemGetVersion().

is_licensed (openeye_toolkit)

chemfp.openeye_toolkit.is_licensed()

Return True if the OEChem toolkit license is valid, otherwise False.

This does not check if the OEGraphSim license is valid. I haven’t yet figured out how I want to handle that distinction. In the meantime you’ll need to use the OEChem API yourself.

Returns:True or False

get_formats (openeye_toolkit)

chemfp.openeye_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that OEChem supports

If include_unavailable is True then also include OEChem formats which aren’t available to this specific version of OEChem.

Parameters:include_unavailable (True or False) – include unavailable formats?
Returns:a list of chemfp.base_toolkit.Format objects

get_input_formats (openeye_toolkit)

chemfp.openeye_toolkit.get_input_formats()

Get the list of supported OEChem input formats

Returns:a list of chemfp.base_toolkit.Format objects

get_output_formats (openeye_toolkit)

chemfp.openeye_toolkit.get_output_formats()

Get the list of supported OEChem output formats

Returns:a list of chemfp.base_toolkit.Format objects

get_format (openeye_toolkit)

chemfp.openeye_toolkit.get_format(format)

Get the named format, or raise a ValueError

This will raise a ValueError if OEChem does not implement the format format_name or that format is not available.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format (openeye_toolkit)

chemfp.openeye_toolkit.get_input_format(format)

Get the named input format, or raise a ValueError

This will raise a ValueError if OEChem does not implement the format format_name or that format is not an input format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_output_format (openeye_toolkit)

chemfp.openeye_toolkit.get_output_format(format)

Get the named output format, or raise a ValueError

This will raise a ValueError if OEChem does not implement the format format_name or that format is not an output format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format_from_source (openeye_toolkit)

chemfp.openeye_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object

get_output_format_from_destination (openeye_toolkit)

chemfp.openeye_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object

read_molecules (openeye_toolkit)

chemfp.openeye_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads OEGraphMol molecules from a structure file

Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)

Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.
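
The consequence of the note above is easy to miss, so here is a small self-contained sketch of the pitfall. It does not use OEChem: ReusingReader and the local copy_molecule are stand-ins written for this example, with a dictionary playing the role of the molecule. With the real toolkit the loop would iterate over chemfp.openeye_toolkit.read_molecules() and copy with chemfp.openeye_toolkit.copy_molecule().

```python
class ReusingReader(object):
    """Stand-in reader that clears and reuses one object, as the note describes."""

    def __init__(self, titles):
        self._titles = titles
        self._mol = {}  # a dict plays the role of the molecule

    def __iter__(self):
        for title in self._titles:
            self._mol.clear()            # wipe the previous record
            self._mol["title"] = title   # fill in the next one
            yield self._mol              # the same object every time


def copy_molecule(mol):
    """Stand-in for chemfp.openeye_toolkit.copy_molecule()."""
    return dict(mol)


titles = ["first", "second", "third"]

# Pitfall: every list element is the same object, so after the loop
# they all show the contents of the last record.
kept_by_reference = list(ReusingReader(titles))

# Fix: copy each molecule before the reader moves on.
kept_by_copy = [copy_molecule(mol) for mol in ReusingReader(titles)]

print([mol["title"] for mol in kept_by_reference])
print([mol["title"] for mol in kept_by_copy])
```

The first list repeats the last title three times; the second preserves all three records.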

The reader_args dictionary parameters depend on the format. Every OEChem format supports:

  • aromaticity - one of “default”, “openeye”, “daylight”, “tripos”, “mdl”, “mmff”, or None
  • flavor - a number, string-encoded number, or flavor string

A “flavor string” is a “|” or “,” separated list of format-specific flavor terms. It can be as simple as “Default”, or a more complex string like “Default|-ENDM|DELPHI”, which for the PDB reader starts with the default settings, removes the ENDM flavor, and adds the DELPHI flavor (that is, the CHARGE and RADIUS flavors).

The supported input flavor terms for each format are:

  • SMILES - Canon, Strict, Default
  • sdf - Default
  • skc - Default
  • mol2, mol2h - M2H, Default
  • mmod - FormalCrg, Default
  • pdb - ALL, ALTLOC, BondOrder, CHARGE, Connect, DATA, DELPHI, END, ENDM, FORMALCHARGE, FormalCrg, ImplicitH, RADIUS, Rings, SecStruct, TER, TerMask, Default
  • xyz - BondOrder, Connect, FormalCrg, ImplicitH, Rings, Default
  • cdx - SuperAtoms, Default
  • oeb - Default

You can also pass in a numeric value like 123 or a numeric string like “0”.
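
As a rough sketch of how a flavor specification might be turned into a bit mask, the following plain-Python function handles the three accepted forms: a number, a string-encoded number, and a flavor string. The bit values in PDB_FLAGS are made up for this example (the real values are defined by OEChem), the contents of "Default" are hypothetical, and the case-insensitive matching of term names is an assumption.

```python
import re

# Hypothetical bit values for a few PDB input flavor terms.
# The real values come from OEChem and are NOT reproduced here.
PDB_FLAGS = {
    "CHARGE": 1,
    "RADIUS": 2,
    "ENDM": 4,
    "TER": 8,
}
# The text says DELPHI adds the CHARGE and RADIUS flavors.
PDB_FLAGS["DELPHI"] = PDB_FLAGS["CHARGE"] | PDB_FLAGS["RADIUS"]
# Hypothetical default, chosen only so the example has something to remove.
PDB_FLAGS["Default"] = PDB_FLAGS["ENDM"] | PDB_FLAGS["TER"]


def parse_flavor(flavor, flags):
    """Convert a number, numeric string, or flavor string into a bit mask."""
    if isinstance(flavor, int):
        return flavor
    text = flavor.strip()
    if text.isdigit():
        return int(text)
    # Assumption: term names are matched without regard to case.
    lookup = dict((name.lower(), bits) for name, bits in flags.items())
    value = 0
    for term in re.split(r"[|,]", text):
        term = term.strip()
        remove = term.startswith("-")
        name = (term[1:] if remove else term).lower()
        if name not in lookup:
            raise ValueError("unknown flavor term: %r" % (term,))
        if remove:
            value &= ~lookup[name]   # a leading "-" clears the term's bits
        else:
            value |= lookup[name]    # otherwise the term's bits are set
    return value


mask = parse_flavor("Default|-ENDM|DELPHI", PDB_FLAGS)
```

Terms are applied left to right, so “Default|-ENDM|DELPHI” first sets the default bits, then clears ENDM, then sets CHARGE and RADIUS.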

In addition, the SMILES record readers have limited support for the “delimiter” reader_arg:

  • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None

Note: the first whitespace after the SMILES string will always be treated as a delimiter.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
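
The three error modes can be pictured with a generic record loop. This is a sketch of the behavior described above, not chemfp code: read_records, its parse callback, and the log parameter are names invented for this example, and the wording of the report message is made up.

```python
import sys


def read_records(records, parse, errors="strict", log=None):
    """Yield parse(record) for each record, handling failures per `errors`."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    if log is None:
        log = sys.stderr
    for recno, record in enumerate(records, 1):
        try:
            value = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise                      # stop at the first bad record
            if errors == "report":
                # hypothetical message format
                log.write("ERROR: %s, record %d. Skipping.\n" % (err, recno))
            continue                       # "report" and "ignore" both move on
        yield value


# int() stands in for a structure parser: "x" is the malformed record.
good_values = list(read_records(["1", "x", "3"], int, errors="ignore"))
```

Here good_values contains only the two records that parsed. With errors="report" the result is the same, plus one message on stderr; with errors="strict" the ValueError from the second record propagates to the caller.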

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See chemfp.openeye_toolkit.read_ids_and_molecules() if you want (id, OEGraphMol) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating OEGraphMol molecules

read_molecules_from_string (openeye_toolkit)

chemfp.openeye_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads molecules from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.openeye_toolkit.read_molecules() for details about the other parameters. See chemfp.openeye_toolkit.read_ids_and_molecules_from_string() if you want to read (id, OEGraphMol) pairs instead of just molecules.

Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating OEGraphMol molecules

read_ids_and_molecules (openeye_toolkit)

chemfp.openeye_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads (id, OEGraphMol molecule) pairs from a structure file

See chemfp.openeye_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, OEGraphMol) pairs instead of just the molecules.

Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, OEGraphMol) pairs

read_ids_and_molecules_from_string (openeye_toolkit)

chemfp.openeye_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads (id, OEGraphMol) pairs from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.openeye_toolkit.read_molecules() for details about the other parameters. See chemfp.openeye_toolkit.read_molecules_from_string() if you just want to read the OEGraphMol molecules instead of (id, OEGraphMol) pairs.

Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, OEGraphMol) pairs

make_id_and_molecule_parser (openeye_toolkit)

chemfp.openeye_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict")

Create a specialized function which takes a record and returns an (id, OEGraphMol) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OEGraphMol for successive calls, so make a copy if you want to keep it around. However, I haven’t really noticed much of a performance difference between this and chemfp.openeye_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.openeye_toolkit.read_molecules() for details about the other parameters.

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function of the form parser(record string) -> (id, OEGraphMol)

parse_molecule (openeye_toolkit)

chemfp.openeye_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from the content string and return an OEGraphMol molecule.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.openeye_toolkit.read_molecules() for details about the other parameters. See chemfp.openeye_toolkit.parse_id_and_molecule() if you want the (id, OEGraphMol) pair instead of just the molecule.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an OEGraphMol molecule

parse_id_and_molecule (openeye_toolkit)

chemfp.openeye_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from content and return the (id, OEGraphMol) pair.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.openeye_toolkit.read_molecules() for details about the other parameters.

See chemfp.openeye_toolkit.parse_molecule() if you just want the OEGraphMol molecule and not the (id, OEGraphMol) pair.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an (id, OEGraphMol molecule) pair

create_string (openeye_toolkit)

chemfp.openeye_toolkit.create_string(mol, format, id=None, writer_args=None, errors="strict")

Convert an OEChem molecule into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.
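
One way a function could "briefly modify" a molecule is to swap in the alternate title, generate the record, and then restore the original title. The sketch below shows that pattern together with a lock that serializes callers. Whether chemfp works this way internally is an assumption. SketchMol, format_record, and record_with_id are names invented for this example; only the GetTitle()/SetTitle() pair comes from OEChem.

```python
import threading


class SketchMol(object):
    """Stand-in exposing the GetTitle()/SetTitle() pair of an OEChem molecule."""

    def __init__(self, title):
        self._title = title

    def GetTitle(self):
        return self._title

    def SetTitle(self, title):
        self._title = title


def format_record(mol):
    """Stand-in for the toolkit's record writer."""
    return "<structure> %s\n" % (mol.GetTitle(),)


# Only protects threads that go through record_with_id().
_title_lock = threading.Lock()


def record_with_id(mol, id=None):
    """Create a record, optionally using `id` in place of the molecule's title."""
    if id is None:
        return format_record(mol)
    with _title_lock:
        old_title = mol.GetTitle()
        mol.SetTitle(id)                 # the molecule is modified here ...
        try:
            return format_record(mol)
        finally:
            mol.SetTitle(old_title)      # ... and restored here, even on error


mol = SketchMol("original")
record = record_with_id(mol, "alternate")
```

After the call the molecule has its original title again, but another thread reading the title during the call could have seen the alternate one. If several threads share a molecule, either serialize the calls as above or give each thread its own copy.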

Parameters:
  • mol (an OEChem molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a Unicode string

create_bytes (openeye_toolkit)

chemfp.openeye_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors="strict")

Convert an OEChem molecule into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.

Parameters:
  • mol (an OEChem molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a byte string

open_molecule_writer (openeye_toolkit)

chemfp.openeye_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return a MoleculeWriter which can write OEChem molecules to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write an OEChem molecule, an OEChem molecule iterator, or an (id, OEChem molecule) pair iterator to a file.

Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

The writer_args dictionary parameters depend on the format. Every OEChem format supports:

  • aromaticity - one of “default”, “openeye”, “daylight”, “tripos”, “mdl”, “mmff”, or None
  • flavor - a number, string-encoded number, or flavor string

A “flavor string” is a “|” or “,” separated list of format-specific flavor terms. It can be as simple as “Default”, or a more complex string like “DEFAULT|-AtomStereo|-BondStereo|Canonical” to generate a canonical SMILES string without stereo information.

The supported output flavor terms for each format are:

  • SMILES - AtomMaps, AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens, ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms
  • sdf - CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity, Default
  • mol2, mol2h - AtomNames, AtomTypeNames, BondTypeNames, Hydrogens, OrderAtoms, Substructure, Default
  • sln - Default
  • pdb - BONDS, BOTH, CHARGE, CurrentResidues, DELPHI, ELEMENT, FORMALCHARGE, FormalCrg, HETBONDS, NoResidues, OEResidues, ORDERS, OrderAtoms, RADIUS, TER, Default
  • xyz - Charges, Symbols, Default
  • cdx - Default
  • mopac - CHARGES, XYZ, Default
  • mf - Title, Default
  • oeb - Default
  • inchi, inchikey - Chiral, FixedHLayer, Hydrogens, ReconnectedMetals, Stereo, RelativeStereo, RacemicStereo, Default

You can also pass in a numeric value like 123 or a numeric string like “0”.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeWriter expecting OEChem molecules

open_molecule_writer_to_string (openeye_toolkit)

chemfp.openeye_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write OEChem molecule records to a Unicode string.

See chemfp.openeye_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output string as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting OEChem molecules

open_molecule_writer_to_bytes (openeye_toolkit)

chemfp.openeye_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write OEChem molecule records to a byte string.

See chemfp.openeye_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting OEChem molecules

copy_molecule (openeye_toolkit)

chemfp.openeye_toolkit.copy_molecule(mol)

Return a new OEGraphMol which is a copy of the given OEChem molecule

Parameters: mol (an OEChem molecule) – the molecule to copy
Returns: a new OEGraphMol instance

add_tag (openeye_toolkit)

chemfp.openeye_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the OEChem molecule

Parameters:
  • mol (an OEChem molecule) – the molecule
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None

get_tag (openeye_toolkit)

chemfp.openeye_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

Parameters:
  • mol (an OEChem molecule) – the molecule
  • tag (string) – the SD tag name
Returns:

a string, or None

get_tag_pairs (openeye_toolkit)

chemfp.openeye_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the molecule

Parameters: mol (an OEChem molecule) – the molecule
Returns: a list of (string name, string value) pairs

get_id (openeye_toolkit)

chemfp.openeye_toolkit.get_id(mol)

Get the molecule’s id using OEChem’s GetTitle()

Parameters: mol (an OEChem molecule) – the molecule
Returns: a string

set_id (openeye_toolkit)

chemfp.openeye_toolkit.set_id(mol, id)

Set the molecule’s id using OEChem’s SetTitle()

Parameters:
  • mol (an OEChem molecule) – the molecule
  • id (string) – the new id
Returns:

None
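
The id and tag helpers above take the molecule as their first argument, so they combine naturally. The sketch below writes a small function against that interface and passes the toolkit in as a parameter. To keep the example runnable without OEChem it uses SketchToolkit, a stand-in invented here that mimics the documented call signatures with a dictionary as the molecule. The relabel function and the "original_id" tag name are also hypothetical.

```python
def relabel(toolkit, mol, new_id, old_id_tag="original_id"):
    """Give `mol` a new id and keep the previous id in an SD tag."""
    old_id = toolkit.get_id(mol)
    toolkit.add_tag(mol, old_id_tag, old_id)
    toolkit.set_id(mol, new_id)
    return toolkit.get_tag(mol, old_id_tag)


class SketchToolkit(object):
    """Stand-in with the same function names as the toolkit helpers above."""

    @staticmethod
    def get_id(mol):
        return mol["title"]

    @staticmethod
    def set_id(mol, id):
        mol["title"] = id

    @staticmethod
    def add_tag(mol, tag, value):
        mol["tags"].append((tag, value))

    @staticmethod
    def get_tag(mol, tag):
        for name, value in mol["tags"]:
            if name == tag:
                return value
        return None   # missing tags give None, as documented for get_tag

    @staticmethod
    def get_tag_pairs(mol):
        return list(mol["tags"])


mol = {"title": "CHEMBL1", "tags": []}
saved_id = relabel(SketchToolkit, mol, "cmpd-0001")
```

With the real toolkit the call would pass the chemfp.openeye_toolkit module and an OEChem molecule in place of the stand-ins.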

chemfp.rdkit_toolkit module

The chemfp toolkit layer for RDKit.

name

chemfp.rdkit_toolkit.name

The string “rdkit”.

software

chemfp.rdkit_toolkit.software

A string like “RDKit/2016.09.3”, where the second part of the string comes from rdkit.rdBase.rdkitVersion.

is_licensed (rdkit_toolkit)

chemfp.rdkit_toolkit.is_licensed()

Return True - RDKit is always licensed

Returns: True

get_formats (rdkit_toolkit)

chemfp.rdkit_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that RDKit supports

If include_unavailable is True then also include RDKit formats which aren’t available to this specific version of RDKit, such as the InChI formats if your RDKit installation wasn’t compiled with InChI support.

Parameters: include_unavailable (True or False) – include unavailable formats?
Returns: a list of Format objects

get_input_formats (rdkit_toolkit)

chemfp.rdkit_toolkit.get_input_formats()

Get the list of supported RDKit input formats

Returns:a list of chemfp.base_toolkit.Format objects

get_output_formats (rdkit_toolkit)

chemfp.rdkit_toolkit.get_output_formats()

Get the list of supported RDKit output formats

Returns:a list of chemfp.base_toolkit.Format objects

get_format (rdkit_toolkit)

chemfp.rdkit_toolkit.get_format(format)

Get the named format, or raise a ValueError

This will raise a ValueError if RDKit does not implement the named format or if that format is not available.

Parameters: format (a string) – the format name
Returns: a chemfp.base_toolkit.Format object

get_input_format (rdkit_toolkit)

chemfp.rdkit_toolkit.get_input_format(format)

Get the named input format, or raise a ValueError

This will raise a ValueError if RDKit does not implement the named format or if that format is not an input format.

Parameters: format (a string) – the format name
Returns: a chemfp.base_toolkit.Format object

get_output_format (rdkit_toolkit)

chemfp.rdkit_toolkit.get_output_format(format)

Get the named output format, or raise a ValueError

This will raise a ValueError if RDKit does not implement the named format or if that format is not an output format.

Parameters: format (a string) – the format name
Returns: a chemfp.base_toolkit.Format object

get_input_format_from_source (rdkit_toolkit)

chemfp.rdkit_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (a filename (as a string), a file object, or None to read from stdin) – the structure data source.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object

get_output_format_from_destination (rdkit_toolkit)

chemfp.rdkit_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (a filename (as a string), a file object, or None to write to stdout) – the structure data destination.
  • format (a Format(-like) object, string, or None) – format information, if known.
Returns:

a chemfp.base_toolkit.Format object

read_molecules (rdkit_toolkit)

chemfp.rdkit_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads RDKit molecules from a structure file

Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)

Note: the reader returns a new RDKit molecule each time.

The reader_args dictionary parameters depend on the format. These include:

  • SMILES
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • has_header - True or False
    • sanitize - True or default sanitizes; False for unsanitized processing
  • InChI
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • sanitize - True or default sanitizes; False for unsanitized processing
    • removeHs - True or default removes explicit hydrogens; False leaves them in the structure
    • logLevel - an integer log level
    • treatWarningAsError - True raises an exception on error; False or default keeps processing
  • SDF
    • sanitize - True or default sanitizes; False for unsanitized processing
    • removeHs - True or default removes explicit hydrogens; False leaves them in the structure
    • strictParsing - True or default for strict parsing; False for lenient parsing
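
The reader_args listed above are passed as an ordinary dictionary. The sketch below builds such a dictionary and checks its keys against the list on the caller's side. The table is transcribed from the list above. The check_reader_args helper and the use of "smi", "inchi", and "sdf" as table keys are choices made for this example, and how chemfp itself treats unrecognized keys is not described here.

```python
# Reader argument names per format, transcribed from the list above.
RDKIT_READER_ARGS = {
    "smi": {"delimiter", "has_header", "sanitize"},
    "inchi": {"delimiter", "sanitize", "removeHs", "logLevel",
              "treatWarningAsError"},
    "sdf": {"sanitize", "removeHs", "strictParsing"},
}


def check_reader_args(format_name, reader_args):
    """Return a copy of reader_args, or raise ValueError on an unlisted key."""
    allowed = RDKIT_READER_ARGS[format_name]
    unknown = sorted(set(reader_args) - allowed)
    if unknown:
        raise ValueError("reader_args not listed for %r: %s"
                         % (format_name, ", ".join(unknown)))
    return dict(reader_args)


# Keep explicit hydrogens and parse leniently when reading an SD file.
sdf_args = check_reader_args("sdf", {"removeHs": False, "strictParsing": False})

# A tab-delimited SMILES file with a header line.
smi_args = check_reader_args("smi", {"delimiter": "tab", "has_header": True})
```

A dictionary like sdf_args would then be passed as the reader_args argument of chemfp.rdkit_toolkit.read_molecules().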

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See chemfp.rdkit_toolkit.read_ids_and_molecules() if you want (id, molecule) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating RDKit molecules

read_molecules_from_string (rdkit_toolkit)

chemfp.rdkit_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads RDKit molecules from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.rdkit_toolkit.read_molecules() for details about the other parameters. See chemfp.rdkit_toolkit.read_ids_and_molecules_from_string() if you want to read (id, RDKit molecule) pairs instead of just molecules.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.MoleculeReader iterating RDKit molecules

read_ids_and_molecules (rdkit_toolkit)

chemfp.rdkit_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads (id, RDKit molecule) pairs from a structure file

See chemfp.rdkit_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, RDKit molecule) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, RDKit molecule) pairs

read_ids_and_molecules_from_string (rdkit_toolkit)

chemfp.rdkit_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads (id, RDKit molecule) pairs from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.rdkit_toolkit.read_molecules() for details about the other parameters. See chemfp.rdkit_toolkit.read_molecules_from_string() if you just want to read the RDKit molecules instead of (id, molecule) pairs.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, RDKit molecule) pairs

make_id_and_molecule_parser (rdkit_toolkit)

chemfp.rdkit_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict")

Create a specialized function which takes a record and returns an (id, RDKit molecule) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and chemfp.rdkit_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.rdkit_toolkit.read_molecules() for details about the other parameters.

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function of the form parser(record string) -> (id, RDKit molecule)

parse_molecule (rdkit_toolkit)

chemfp.rdkit_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from the content string and return an RDKit molecule.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.rdkit_toolkit.read_molecules() for details about the other parameters. See chemfp.rdkit_toolkit.parse_id_and_molecule() if you want the (id, RDKit molecule) pair instead of just the molecule.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an RDKit molecule

parse_id_and_molecule (rdkit_toolkit)

chemfp.rdkit_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from content and return the (id, RDKit molecule) pair.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.rdkit_toolkit.read_molecules() for details about the other parameters.

See chemfp.rdkit_toolkit.parse_molecule() if you just want the RDKit molecule and not the (id, RDKit molecule) pair.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

an (id, RDKit molecule) pair

create_string (rdkit_toolkit)

chemfp.rdkit_toolkit.create_string(mol, format, id=None, writer_args=None, errors="strict")

Convert an RDKit molecule into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.

Parameters:
  • mol (an RDKit molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a Unicode string
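
The thread-safety warning above applies when an alternate id is passed in. One possible workaround, sketched here with only the standard library, is to serialize such calls behind a lock. The helper name locked_call is hypothetical and is not part of chemfp; the lock only helps if every thread that touches the same molecule goes through it.

```python
import threading

# A single module-level lock shared by all callers (an assumption of this sketch).
_create_lock = threading.Lock()

def locked_call(func, *args, **kwargs):
    """Call func(*args, **kwargs) while holding the shared lock."""
    with _create_lock:
        return func(*args, **kwargs)
```

For example, locked_call(chemfp.rdkit_toolkit.create_string, mol, "smi", id="new-id") would run the conversion while no other locked call is in progress.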

create_bytes (rdkit_toolkit)

chemfp.rdkit_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors="strict")

Convert an RDKit molecule into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own title. Warning: this may briefly modify the molecule, so may not be thread-safe.

Parameters:
  • mol (an RDKit molecule) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a byte string

open_molecule_writer (rdkit_toolkit)

chemfp.rdkit_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return a MoleculeWriter which can write RDKit molecules to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write an RDKit molecule, an RDKit molecule iterator, or an (id, RDKit molecule) pair iterator to a file.

Molecules are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

The writer_args dictionary parameters depend on the format. These include:

  • SMILES
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • isomericSmiles - True to generate isomeric SMILES
    • kekuleSmiles - True to generate SMILES in Kekule form
    • canonical - True to generate a canonical SMILES
    • allBondsExplicit - True to write explicit ‘-’ and ‘:’ bonds, even if they can be inferred; default is False
  • InChI and InChIKey
    • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
    • include_id - True or default to include the id as the second column; False has no id column
    • options - an options string passed to the underlying InChI library
    • logLevel - an integer log level
    • treatWarningAsError - True raises an exception on error; False or default keeps processing
  • SDF
    • includeStereo - True includes stereo information; False or default does not
    • kekulize - True or default creates the connection table with bonds in Kekule form

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
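
The three errors modes can be pictured with the following standard-library sketch. It is not chemfp's implementation; process_records and its parse argument are hypothetical names used only to illustrate the documented "strict", "report", and "ignore" behaviors.

```python
import sys

def process_records(records, parse, errors="strict"):
    """Yield parse(record) for each record, applying an errors policy.

    Hypothetical helper which mimics the documented error modes.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            result = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                sys.stderr.write("record %d: %s\n" % (recno, err))
            continue  # "report" and "ignore" both go to the next record
        yield result
```

With errors="ignore" or errors="report" the bad record is skipped and processing continues; with errors="strict" the exception reaches the caller.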

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeWriter expecting RDKit molecules

open_molecule_writer_to_string (rdkit_toolkit)

chemfp.rdkit_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write molecule records in the given format to a string.

See chemfp.rdkit_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting RDKit molecules

open_molecule_writer_to_bytes (rdkit_toolkit)

chemfp.rdkit_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write molecule records in the given format to a byte string.

See chemfp.rdkit_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting RDKit molecules

copy_molecule (rdkit_toolkit)

chemfp.rdkit_toolkit.copy_molecule(mol)

Return a new RDKit molecule which is a copy of the given molecule

Parameters:mol (an RDKit molecule) – the molecule to copy
Returns:a new RDKit Mol instance

add_tag (rdkit_toolkit)

chemfp.rdkit_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the RDKit molecule

Parameters:
  • mol (an RDKit molecule) – the molecule
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None

get_tag (rdkit_toolkit)

chemfp.rdkit_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

Parameters:
  • mol (an RDKit molecule) – the molecule
  • tag (string) – the SD tag name
Returns:

a string, or None

get_tag_pairs (rdkit_toolkit)

chemfp.rdkit_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the molecule

Parameters:mol (an RDKit molecule) – the molecule
Returns:a list of (string name, string value) pairs

get_id (rdkit_toolkit)

chemfp.rdkit_toolkit.get_id(mol)

Get the molecule’s id from RDKit’s _Name property

Parameters:mol (an RDKit molecule) – the molecule
Returns:a string

set_id (rdkit_toolkit)

chemfp.rdkit_toolkit.set_id(mol, id)

Set the molecule’s id as RDKit’s _Name property

Parameters:
  • mol (an RDKit molecule) – the molecule
  • id (string) – the new id
Returns:

None

chemfp.text_toolkit module

The text_toolkit implements the chemfp toolkit API but where the “molecules” are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations.

The TextRecord is a base class. The actual records depend on the format, and will be one of:

The text toolkit will let you “convert” between the different SMILES formats, but it doesn’t actually change the SMILES string. The SMILES records have the attributes id, record and smiles.

The toolkit also knows a bit about the SD format. The SDF records have the attributes id, id_bytes and record, and there are methods to get SD tag values and add a tag to the end of the tag data block.

The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord.

The record types also have the attributes encoding and encoding_errors which affect how the record bytes are parsed.

name

chemfp.text_toolkit.name

The string “text”

software

chemfp.text_toolkit.software

A string like “chemfp/3.0”.

is_licensed (text_toolkit)

chemfp.text_toolkit.is_licensed()

Return True - chemfp’s text toolkit is always licensed

Returns:True

get_formats (text_toolkit)

chemfp.text_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that chemfp’s text toolkit supports

This version of chemfp will always support the structure formats available to chemfp so ‘include_unavailable’ does not affect anything. (It may affect other toolkits.)

Parameters:include_unavailable (True or False) – include unavailable formats?
Returns:a list of chemfp.base_toolkit.Format objects

get_input_formats (text_toolkit)

chemfp.text_toolkit.get_input_formats()

Get the list of supported chemfp text toolkit input formats

Returns:a list of chemfp.base_toolkit.Format objects

get_output_formats (text_toolkit)

chemfp.text_toolkit.get_output_formats()

Get the list of supported chemfp text toolkit output formats

Returns:a list of chemfp.base_toolkit.Format objects

get_format (text_toolkit)

chemfp.text_toolkit.get_format(format_name)

Get the named format, or raise a ValueError

This will raise a ValueError for unknown format names.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format (text_toolkit)

chemfp.text_toolkit.get_input_format(format_name)

Get the named input format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an input format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_output_format (text_toolkit)

chemfp.text_toolkit.get_output_format(format_name)

Get the named output format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an output format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object

get_input_format_from_source (text_toolkit)

chemfp.text_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
  • format (A Format(-like) object, string, or None) – Format information, if known.
Returns:

a chemfp.base_toolkit.Format object
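
The fallback rule described above can be sketched as follows. This is a minimal illustration, not chemfp's detection code: guess_format is a hypothetical helper, and the extension tables are assumptions limited to format names mentioned in this chapter.

```python
import os

# Assumed extension tables for illustration; the toolkit's real list may differ.
KNOWN_FORMATS = {"smi", "can", "usm", "sdf"}
KNOWN_COMPRESSIONS = {"gz"}

def guess_format(filename, default=("smi", "")):
    """Return a (format name, compression) pair guessed from a filename."""
    if filename is None:
        return default  # for example stdin, where there is no name to inspect
    base = os.path.basename(filename).lower()
    compression = ""
    root, ext = os.path.splitext(base)
    if ext[1:] in KNOWN_COMPRESSIONS:
        compression = ext[1:]
        root, ext = os.path.splitext(root)
    if ext[1:] in KNOWN_FORMATS:
        return (ext[1:], compression)
    return default  # auto-detection failed: assume uncompressed SMILES
```

Here guess_format("data.sdf.gz") gives the name "sdf" with "gz" compression, while an unrecognized name such as "notes.txt" falls back to uncompressed SMILES.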

get_output_format_from_destination (text_toolkit)

chemfp.text_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (A filename (as a string), a file object, or None to write to stdout) – The structure data destination.
  • format (A Format(-like) object, string, or None) – format information, if known.
Returns:

A chemfp.base_toolkit.Format object

read_molecules (text_toolkit)

chemfp.text_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads TextRecord instances from a structure file

Iterate through the format structure records in source. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)

Only the SMILES formats use the reader_args dictionary. The supported parameters are:

  • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
  • has_header - True or False

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See read_ids_and_molecules() if you want (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules

read_molecules_from_string (text_toolkit)

chemfp.text_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads TextRecord instances from a string containing structure records

content is a string containing 0 or more records in the format format. See read_molecules() for details about the other parameters. See read_ids_and_molecules_from_string() if you want to read (id, TextRecord) pairs instead of just molecules.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules

read_ids_and_molecules (text_toolkit)

chemfp.text_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return an iterator that reads (id, TextRecord) pairs from a structure file

See chemfp.text_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

read_ids_and_molecules_from_string (text_toolkit)

chemfp.text_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None)

Return an iterator that reads (id, TextRecord) pairs from a string containing structure records

content is a string containing 0 or more records in the format format. See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.read_molecules_from_string() if you just want to read the TextRecord molecules instead of (id, TextRecord) pairs.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

make_id_and_molecule_parser (text_toolkit)

chemfp.text_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict")

Create a specialized function which takes a record and returns an (id, TextRecord) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and chemfp.text_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.text_toolkit.read_molecules() for details about the other parameters. The specific TextRecord subclass returned depends on the format.

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function of the form parser(record string) -> (id, text_record)

parse_molecule (text_toolkit)

chemfp.text_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from the content string and return a TextRecord.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_id_and_molecule() if you want the (id, TextRecord) pair instead of just the text record.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a TextRecord

parse_id_and_molecule (text_toolkit)

chemfp.text_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict")

Parse the first structure record from content and return the (id, TextRecord) pair.

content is a string containing a single structure record in format format. (Additional records are ignored). See chemfp.text_toolkit.read_molecules() for details about the other parameters.

See chemfp.text_toolkit.parse_molecule() if you just want the TextRecord and not the (id, TextRecord) pair.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

an (id, TextRecord molecule) pair

create_string (text_toolkit)

chemfp.text_toolkit.create_string(mol, format, id=None, writer_args=None, errors="strict")

Convert a TextRecord into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a Unicode string

create_bytes (text_toolkit)

chemfp.text_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors="strict")

Convert a TextRecord into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a byte string

open_molecule_writer (text_toolkit)

chemfp.text_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict")

Return a MoleculeWriter which can write TextRecord instances to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write a TextRecord, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file.

TextRecords are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

That said, the text toolkit doesn’t know how to convert between SMILES and SDF formats, and will raise an exception if you try.

The writer_args is only used for the “smi”, “can”, and “usm” output formats. The only supported parameter is:

  • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.MoleculeWriter expecting TextRecord instances

open_molecule_writer_to_string (text_toolkit)

chemfp.text_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write TextRecord instances to a string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

open_molecule_writer_to_bytes (text_toolkit)

chemfp.text_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None)

Return a MoleculeStringWriter which can write TextRecord instances to a byte string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

copy_molecule (text_toolkit)

chemfp.text_toolkit.copy_molecule(mol)

Return a new TextRecord which is a copy of the given TextRecord

Parameters:mol (a TextRecord) – the text record
Returns:a new TextRecord

add_tag (text_toolkit)

chemfp.text_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the TextRecord

If the mol is in “sdf” format then this will modify mol.record to append the new tag and value to the end of the tag block. The other tags will not be modified, including tags with the same tag name.

Parameters:
  • mol (a TextRecord) – the text record
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None
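
The append behavior described for “sdf” records can be pictured with a plain string operation. This is a simplified sketch, not the toolkit's code: append_sd_tag is a hypothetical helper, and it assumes "\n" newlines and a record which ends with a "$$$$" line.

```python
def append_sd_tag(record, tag, value):
    """Return a copy of an SD record with a new tag at the end of the tag block.

    Existing data items are left untouched, including any with the same name.
    """
    terminator = "$$$$\n"
    if not record.endswith(terminator):
        raise ValueError("record does not end with a $$$$ line")
    body = record[:-len(terminator)]
    # A data item is a "> <NAME>" header, the value, and a blank line.
    return body + "> <%s>\n%s\n\n" % (tag, value) + terminator
```

Appending a second "MW" item with this helper leaves the first one in place, which mirrors the note above about tags with the same name.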

get_tag (text_toolkit)

chemfp.text_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

If the mol is in “sdf” format then this will return the corresponding tag value from mol.record, or None if the tag does not exist.

If the record is in any other format then it will return None.

Parameters:
  • mol (a TextRecord) – the molecule
  • tag (string) – the SD tag name
Returns:

a string, or None

get_tag_pairs (text_toolkit)

chemfp.text_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the TextRecord

If the mol is in “sdf” format then this will return the list of (tag, value) pairs in mol.record, where the tag and value are strings.

If the record is in any other format then it will return an empty list.

Parameters:mol (a TextRecord) – the molecule
Returns:a list of (tag name, tag value) pairs
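
To make the (tag, value) extraction concrete, here is a small sketch of pulling the pairs out of one SD record string. It is an illustration under simplifying assumptions ("\n" newlines, headers of the form "> <NAME>"), and sd_tag_pairs is a hypothetical name rather than a chemfp function.

```python
import re

# Matches a data header such as "> <MW>" and captures the tag name.
_DATA_HEADER = re.compile(r">[^<]*<([^<>]*)>")

def sd_tag_pairs(record):
    """Return a list of (tag name, value) pairs from one SD record string."""
    lines = record.split("\n")
    if "M  END" not in lines:
        return []  # not an SD record (for example, a SMILES line)
    i = lines.index("M  END") + 1
    pairs = []
    while i < len(lines):
        line = lines[i]
        i += 1
        if line == "$$$$":
            break  # end of record
        match = _DATA_HEADER.match(line)
        if match is None:
            continue
        value_lines = []
        # The value runs up to the next blank line, so a value of
        # "$$$$" is read as data instead of ending the record.
        while i < len(lines) and lines[i] != "":
            value_lines.append(lines[i])
            i += 1
        pairs.append((match.group(1), "\n".join(value_lines)))
    return pairs
```

Multi-line values are joined with "\n" in this sketch, and a record with no connection table end marker yields an empty list.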

get_id (text_toolkit)

chemfp.text_toolkit.get_id(mol)

Get the molecule’s id from the TextRecord’s id field

This is the toolkit-portable way to get mol.id.

Parameters:mol (a TextRecord) – the molecule
Returns:a string

set_id (text_toolkit)

chemfp.text_toolkit.set_id(mol, id)

Set the TextRecord’s id to the new id

This is the toolkit-portable way to write mol.id = id.

Note: this does not modify mol.record. Use chemfp.text_toolkit.create_string() or similar text_toolkit functions to get the record text with a new identifier.

Parameters:
  • mol (a TextRecord) – the molecule
  • id (string) – the new id
Returns:

None

read_sdf_records (text_toolkit)

chemfp.text_toolkit.read_sdf_records(source=None, reader_args=None, compression=None, errors="strict", location=None, block_size=327680)

Return an iterator that reads each record from an SD file as a string.

Iterate through the records in source, which must be in SD format. If compression is None or “auto” then auto-detect the compression type based on source, and default to uncompressed when it can’t be determined. Use “gz” when the input is gzip compressed, and “none” or “” if uncompressed.

The reader_args parameter is currently unused. It exists for future compatibility.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The block_size parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn’t need to change this parameter, but if you do, please let me know.

Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
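
The block-oriented strategy described above can be sketched in a few lines. This is a simplified model, not the actual parser: iter_sdf_records is a hypothetical name, it works on text and so counts characters rather than bytes, it does not handle a tag value of “$$$$”, and it applies the size limit to any pending partial record.

```python
import io

def iter_sdf_records(infile, block_size=327680):
    """Yield each SD record from a text file object as a string."""
    terminator = "\n$$$$\n"
    limit = max(327680, 5 * block_size)
    leftover = ""
    while True:
        block = infile.read(block_size)
        if not block:
            break
        # Prepend the unfinished tail of the previous block.
        pieces = (leftover + block).split(terminator)
        leftover = pieces.pop()  # text after the last complete record
        for piece in pieces:
            yield piece + terminator
        if len(leftover) > limit:
            raise ValueError(
                "no complete SD record found within %d characters" % limit)
    if leftover.strip():
        raise ValueError("input ends with an incomplete SD record")

# Two tiny records, read with a deliberately small block size so that
# each record spans more than one block.
data = "a\nM  END\n$$$$\nb\nM  END\n$$$$\n"
records = list(iter_sdf_records(io.StringIO(data), block_size=7))
```

The small block size in the usage lines forces the carry-over path, where the tail of one block is joined to the start of the next before searching for the record terminator.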

The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value ‘$$$$’. It is not a full validator and it does not know chemistry.

WARNING: the parser does not yet handle the MS Windows newline convention.

See read_sdf_ids_and_records() if you want (id, record) pairs, and read_sdf_ids_and_values() if you want (id, tag data) pairs. See read_sdf_ids_and_records_from_string() to read from a string instead of a file or file-like object.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

read_sdf_ids_and_records (text_toolkit)

chemfp.text_toolkit.read_sdf_ids_and_records(source=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)

Return an iterator that reads the (id, record string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use id_tag to get the record id from the given SD tag instead.

See read_sdf_ids_and_values() if you want to read an identifier and tag value, or two tag values, instead of returning the full record.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, record string) pairs

read_sdf_ids_and_values (text_toolkit)

chemfp.text_toolkit.read_sdf_ids_and_values(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)

Return an iterator that reads the (id, tag value string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs.

By default this uses the title line for both the id and tag value strings. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • value_tag (string, or None to use the record title) – SD tag containing the value
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, value string) pairs

read_sdf_records_from_string (text_toolkit)

chemfp.text_toolkit.read_sdf_records_from_string(content, reader_args=None, compression=None, errors="strict", location=None, block_size=327680)

Return an iterator that reads each record from a string containing SD records

See read_sdf_records() for the parameter details. The main difference is that this function reads from content, which is a string containing 0 or more SDF records.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported.

See read_sdf_ids_and_records_from_string() to read (id, record) pairs and read_sdf_ids_and_values_from_string() to read (id, tag value) pairs.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

read_sdf_ids_and_records_from_string (text_toolkit)

chemfp.text_toolkit.read_sdf_ids_and_records_from_string(content=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)

Return an iterator that reads the (id, record) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use id_tag to use a given tag value instead. See read_sdf_records() for details about the other parameters.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, record string) pairs

read_sdf_ids_and_values_from_string (text_toolkit)

chemfp.text_toolkit.read_sdf_ids_and_values_from_string(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680)

Return an iterator that reads the (id, value) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value.

See read_sdf_records() for details about the other parameters.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • value_tag (string, or None to use the record title) – SD tag containing the value
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, value) pairs

get_sdf_tag (text_toolkit)

chemfp.text_toolkit.get_sdf_tag(sdf_record, tag)

Return the value for a named tag in an SDF record string

Get the value for the tag named tag from the string sdf_record containing an SD record.

Parameters:
  • sdf_record (string) – an SD record
  • tag (string) – a tag name
Returns:

the corresponding tag value as a string, or None
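As a rough picture of what a tag lookup involves, here is a pure-Python sketch that does not use chemfp. It assumes the usual SD layout, where a data header line starts with ">" and contains the tag name in angle brackets, the value lines follow, and a blank line ends the value. The name sketch_get_sdf_tag and the sample record are made up for illustration.

```python
RECORD = (
    "ethanol\n"
    "  sketch\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    ">  <NAME>\n"
    "ethanol\n"
    "\n"
    ">  <FORMULA>\n"
    "C2H6O\n"
    "\n"
    "$$$$\n"
)

def sketch_get_sdf_tag(sdf_record, tag):
    """Return the value of the first matching tag, or None if it is missing."""
    lines = sdf_record.split("\n")
    wanted = "<" + tag + ">"
    for i, line in enumerate(lines):
        if line.startswith(">") and wanted in line:
            value_lines = []
            for value_line in lines[i + 1:]:
                if value_line == "":
                    break  # a blank line ends the tag value
                value_lines.append(value_line)
            return "\n".join(value_lines)
    return None

name = sketch_get_sdf_tag(RECORD, "NAME")
missing = sketch_get_sdf_tag(RECORD, "MISSING")
```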

add_sdf_tag (text_toolkit)

chemfp.text_toolkit.add_sdf_tag(sdf_record, tag, value)

Add an SD tag value to an SD record string

This will append the new tag and value to the end of the tag data block in the sdf_record string.

Parameters:
  • sdf_record (string) – an SD record
  • tag (string) – a tag name
  • value (string) – the new tag value
Returns:

a new SD record string with the new tag and value
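The append behavior can be sketched without chemfp as follows. This is an illustration only: the helper name sketch_add_sdf_tag, the exact spacing of the data header, and the ValueError for a missing terminator are assumptions.

```python
def sketch_add_sdf_tag(sdf_record, tag, value):
    """Return a new record with the tag inserted just before the final $$$$."""
    end = sdf_record.rfind("$$$$")
    if end == -1:
        raise ValueError("record has no $$$$ terminator")
    # Data header, value, and the blank line which ends the value.
    new_item = ">  <%s>\n%s\n\n" % (tag, value)
    return sdf_record[:end] + new_item + sdf_record[end:]

original = "ethanol\n\n\nM  END\n>  <NAME>\nethanol\n\n$$$$\n"
updated = sketch_add_sdf_tag(original, "MW", "46.07")
```

Strings are immutable in Python, so the original record is left unchanged and the caller works with the returned copy.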

get_sdf_tag_pairs (text_toolkit)

chemfp.text_toolkit.get_sdf_tag_pairs(sdf_record)

Return the (tag, value) entries in the SDF record string

Parse the sdf_record and return the tag data as a list of (tag, value) pairs. The type of the returned strings will be the same as the type of the input sdf_record string.

Parameters:sdf_record (string) – an SDF record
Returns:a list of (tag, value) pairs

get_sdf_id (text_toolkit)

chemfp.text_toolkit.get_sdf_id(sdf_record)

Return the id for the SDF record string

The id is the first line of the sdf_record. A future version of this function may support an id_tag parameter. Let me know if that would be useful.

The returned id string will have the same type as the input sdf_record.

Parameters:sdf_record (string) – an SD record
Returns:the first line of the SD record

set_sdf_id (text_toolkit)

chemfp.text_toolkit.set_sdf_id(sdf_record, id)

Set the id of the SDF record string to a new value

Set the first line of sdf_record to the new id, which must not contain a newline.

The sdf_record and the id must have the same string type.

Parameters:
  • sdf_record (string) – an SDF record
  • id (string) – the new id
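The two id functions amount to reading and replacing the first line of the record. A minimal sketch, assuming nothing about chemfp's internals (the helper names and the choice of exceptions are hypothetical), which accepts either text or byte strings:

```python
def sketch_get_sdf_id(sdf_record):
    """Return the first line of the record, with the same type as the input."""
    newline = b"\n" if isinstance(sdf_record, bytes) else "\n"
    return sdf_record.split(newline, 1)[0]

def sketch_set_sdf_id(sdf_record, new_id):
    """Return a copy of the record whose first line is new_id."""
    if type(sdf_record) is not type(new_id):
        raise TypeError("record and id must have the same string type")
    newline = b"\n" if isinstance(sdf_record, bytes) else "\n"
    if newline in new_id:
        raise ValueError("id must not contain a newline")
    # Keep everything after the first line, including its newline.
    _, sep, rest = sdf_record.partition(newline)
    return new_id + sep + rest

renamed = sketch_set_sdf_id("old\n  sketch\n\nM  END\n$$$$\n", "new")
renamed_bytes = sketch_set_sdf_id(b"old\nM  END\n$$$$\n", b"new")
```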

chemfp._text_toolkit module (private)

As you might have inferred from the leading “_” in “_text_toolkit”, this is not a public module. There is no reason for you to import it directly, the module name is subject to change, and even the location of the classes is subject to change. The only reason I bring it up is that the chemfp.text_toolkit returns class instances from this module, so you might well wonder about them.

TextRecord

class chemfp._text_toolkit.TextRecord

Base class for the text_toolkit ‘molecules’, which work with the records as text.

The chemfp.text_toolkit implements the toolkit API, but it doesn’t know chemistry. Instead of returning real molecule objects, with atoms and bonds, it returns TextRecord subclass instances that hold the record as a text string.

As an implementation detail (which means it’s subject to change) there is a subclass for each of the supported formats.

  • SDFRecord - holds “sdf” records
  • SmiRecord - holds “smi” records (the full line from a “smi” SMILES file)
  • CanRecord - holds “can” records (the full line from a “can” SMILES file)
  • UsmRecord - holds “usm” records (the full line from a “usm” SMILES file)
  • SmiStringRecord - holds “smistring” records (only the “smistring” SMILES string; no id)
  • CanStringRecord - holds “canstring” records (only the “canstring” SMILES string; no id)
  • UsmStringRecord - holds “usmstring” records (only the “usmstring” SMILES string; no id)

All of the classes have the following attributes:

id

The record identifier as a Unicode string, or None if there is no identifier

id_bytes

The record identifier as a byte string, or None if there is no identifier

record

The record, as a string. For the smistring, canstring, and usmstring formats, this is only the SMILES string.

record_format

One of “sdf”, “smi”, “can”, “usm”, “smistring”, “canstring”, or “usmstring”.

The SMILES classes have an attribute:

smiles

The SMILES string component of the record.

add_tag(tag, value)

Add an SD tag value to the TextRecord

This method does nothing if the record is not an “sdf” record.

Parameters:
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None

get_tag(tag)

Get the named SD tag value, or None if it doesn’t exist or is not an “sdf” record.

Parameters:tag (byte or Unicode string) – the SD tag name
Returns:a Unicode string, or None

get_tag_as_bytes(tag)

Get the named SD tag value, or None if it doesn’t exist or is not an “sdf” record.

Parameters:tag (byte string) – the SD tag name
Returns:a byte string, or None

get_tag_pairs()

Get a list of all SD tag (name, value) pairs for the TextRecord using Unicode strings

This function returns an empty list if the record is not an “sdf” record.

Returns:a list of (Unicode string name, Unicode string value) pairs

get_tag_pairs_as_bytes()

Get a list of all SD tag (name, value) pairs for the TextRecord using byte strings

This function returns an empty list if the record is not an “sdf” record.

Returns:a list of (byte string name, byte string value) pairs

copy()

Return a new record which is a copy of the given record

SDFRecord

class chemfp._text_toolkit.SDFRecord

Holds an SDF record. See chemfp._text_toolkit.TextRecord for API details

SmiRecord

class chemfp._text_toolkit.SmiRecord

Holds an “smi” record. See chemfp._text_toolkit.TextRecord for API details

CanRecord

class chemfp._text_toolkit.CanRecord

Holds a “can” record. See chemfp._text_toolkit.TextRecord for API details

UsmRecord

class chemfp._text_toolkit.UsmRecord

Holds a “usm” record. See chemfp._text_toolkit.TextRecord for API details

SmiStringRecord

class chemfp._text_toolkit.SmiStringRecord

Holds an “smistring” record. See chemfp._text_toolkit.TextRecord for API details

CanStringRecord

class chemfp._text_toolkit.CanStringRecord

Holds a “canstring” record. See chemfp._text_toolkit.TextRecord for API details

UsmStringRecord

class chemfp._text_toolkit.UsmStringRecord

Holds a “usmstring” record. See chemfp._text_toolkit.TextRecord for API details

chemfp.io module

This module implements a single public class, Location, which tracks parser state information, including the location of the current record in the file. The other functions and classes are undocumented, should not be used, and may change in future releases.

Location

class chemfp.io.Location

Get location and other internal reader and writer state information

A Location instance gives a way to access information like the current record number, line number, and molecule object:

>>> import chemfp
>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166",
...                        "ChEBI_lite.sdf.gz", id_tag="ChEBI ID") as reader:
...   for id, fp in reader:
...     if id == "CHEBI:3499":
...         print("Record starts at line", reader.location.lineno)
...         print("Record byte range:", reader.location.offsets)
...         print("Number of atoms:", reader.location.mol.GetNumAtoms())
...         break
...
[08:18:12]  S group MUL ignored on line 103
Record starts at line 3599
Record byte range: (138171, 141791)
Number of atoms: 36

The supported properties are:

  • filename - a string describing the source or destination
  • lineno - the line number for the start of the current record
  • mol - the toolkit molecule for the current record
  • offsets - the (start, end) byte positions for the current record
  • output_recno - the number of records written successfully
  • recno - the current record number
  • record - the record as a text string
  • record_format - the record format, like “sdf” or “can”

Most of the readers and writers do not support all of the properties. Unsupported properties return a None. The filename is a read/write attribute and the other attributes are read-only.

If you don’t pass a location to the readers and writers then they will create a new one based on the source or destination, respectively. You can also pass in your own Location, created as Location(filename) if you have an actual filename, or Location.from_source(source) or Location.from_destination(destination) if you have a more generic source or destination.

__init__(filename=None)

Use filename as the location’s filename

from_source(cls, source)

Create a Location instance based on the source

If source is a string then it’s used as the filename. If source is None then the location filename is “<stdin>”. If source is a file object then its name attribute is used as the filename, or None if there is no attribute.
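The filename rules above can be restated as a short stand-alone function. This is only a paraphrase of the documented behavior; the name sketch_source_filename is hypothetical and the code does not use chemfp.

```python
import io

def sketch_source_filename(source):
    """Mirror the documented from_source rules for choosing a filename."""
    if source is None:
        return "<stdin>"  # None means the data comes from stdin
    if isinstance(source, str):
        return source  # a string is used as the filename
    # Otherwise assume a file object; use its name attribute if it has one.
    return getattr(source, "name", None)

from_string = sketch_source_filename("ChEBI_lite.sdf.gz")
from_stdin = sketch_source_filename(None)
from_memory = sketch_source_filename(io.StringIO("no name attribute"))
```

An in-memory io.StringIO object has no name attribute, so the last call falls through to None.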

from_destination(cls, destination)

Create a Location instance based on the destination

If destination is a string then it’s used as the filename. If destination is None then the location filename is “<stdout>”. If destination is a file object then its name attribute is used as the filename, or None if there is no attribute.

__repr__()

Return a string like ‘Location(“<stdout>”)’

first_line

Read-only attribute.

The first line of the current record

filename

Read/write attribute.

A string which describes the source or destination. This is usually the source or destination filename but can be a string like “<stdin>” or “<stdout>”.

mol

Read-only attribute.

The molecule object for the current record

offsets

Read-only attribute.

The (start, end) byte offsets, starting from 0

start is the record start byte position and end is one byte past the last byte of the record.
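Because end is one past the last byte, an offsets pair can be used directly as a Python slice, and end - start is the record length. The sketch below computes such half-open ranges for a small in-memory byte string; the helper record_offsets is hypothetical and simply splits on the "$$$$" terminator line.

```python
def record_offsets(data, terminator=b"$$$$\n"):
    """Return the half-open (start, end) byte range of each record in data."""
    offsets = []
    start = 0
    while True:
        pos = data.find(terminator, start)
        if pos == -1:
            break
        end = pos + len(terminator)  # one byte past the record's last byte
        offsets.append((start, end))
        start = end
    return offsets

data = b"first\n$$$$\nsecond\n$$$$\n"
ranges = record_offsets(data)
start, end = ranges[1]
second = data[start:end]  # slicing with the pair recovers the whole record
```

The same pair works with a file opened in binary mode: seek to start, then read end - start bytes.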

output_recno

Read-only attribute.

The number of records actually written to the file or string.

The value recno - output_recno is the number of records sent to the writer but which had an error and could not be written to the output.
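The relationship between the two counters can be shown with a toy writer. Everything here is invented for illustration, including the class name SketchWriter and the rule that an id containing a newline cannot be written; real writers fail for their own reasons.

```python
class SketchWriter:
    """Toy writer that tracks records received and records actually written."""

    def __init__(self):
        self.recno = 0         # records sent to the writer
        self.output_recno = 0  # records successfully written
        self.lines = []

    def write_id(self, id):
        self.recno += 1
        if "\n" in id:
            return  # treated as an error in this sketch; nothing is written
        self.lines.append(id)
        self.output_recno += 1

writer = SketchWriter()
for id in ["CHEBI:1", "bad\nid", "CHEBI:3"]:
    writer.write_id(id)

# Records which were sent to the writer but could not be written.
failed = writer.recno - writer.output_recno
```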

recno

Read-only attribute.

The current record number

For writers this is the number of records sent to the writer, and output_recno is the number of records successfully written to the file or string.

record

Read-only attribute.

The current record as an uncompressed text string

record_format

Read-only attribute.

The record format name

where()

Return a human-readable description of the current reader or writer state.

The description will contain the filename, line number, record number, and up to the first 40 characters of the first line of the record, if those properties are available.
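One way to assemble such a description is sketched below. The wording of the output is an assumption, since the exact format of the string returned by where() is not specified here, and sketch_where is a hypothetical stand-alone function rather than the Location method.

```python
def sketch_where(filename=None, lineno=None, recno=None, first_line=None):
    """Describe the current state using only the properties that are available."""
    parts = []
    if filename is not None:
        parts.append("file %r" % (filename,))
    if lineno is not None:
        parts.append("line %d" % (lineno,))
    if recno is not None:
        parts.append("record #%d" % (recno,))
    if first_line is not None:
        # Show at most the first 40 characters of the record's first line.
        parts.append("first line starts: %r" % (first_line[:40],))
    return ", ".join(parts)

description = sketch_where("ChEBI_lite.sdf.gz", 3599, 12, "x" * 50)
partial = sketch_where(filename="<stdin>")
```

Properties which are None are left out of the description, which matches the "if those properties are available" condition.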