Top-level API

The following functions and classes are in the top-level chemfp module.

chemfp.cdk

This is a special object which forwards any use to the chemfp.cdk_toolkit. It imports the underlying module as-needed so may raise an ImportError. It is designed to be used as chemfp.cdk, like the following:

import chemfp
fp = chemfp.cdk.pubchem.from_smiles("CCO")

Please do not import “cdk” directly into your module as you are likely to get confused with CDK’s own “cdk” module. Instead, use one of the following:

from chemfp import cdk_toolkit
from chemfp import cdk_toolkit as T

chemfp.openeye

This is a special object which forwards any use to the chemfp.openeye_toolkit. It imports the underlying toolkit module as-needed so may raise an ImportError. It is designed to be used as chemfp.openeye, like the following:

import chemfp
fp = chemfp.openeye.circular.from_smiles("CCO")

Please do not import “openeye” directly into your module as you are likely to get confused with OpenEye’s own “openeye” module. Instead, use one of the following:

from chemfp import openeye_toolkit
from chemfp import openeye_toolkit as T

chemfp.openbabel

This is a special object which forwards to the chemfp.openbabel_toolkit. It imports the underlying toolkit module as-needed so may raise an ImportError. It is designed to be used as chemfp.openbabel, like the following:

import chemfp
fp = chemfp.openbabel.fp2.from_smiles("CCO")

Please do not import “openbabel” directly into your module as you are likely to get confused with Open Babel’s own “openbabel” modules. Instead, use one of the following:

from chemfp import openbabel_toolkit
from chemfp import openbabel_toolkit as T

chemfp.rdkit

This is a special object which forwards to the chemfp.rdkit_toolkit. It imports the underlying toolkit module as-needed so may raise an ImportError. It is designed to be used as chemfp.rdkit, like the following:

import chemfp
fp = chemfp.rdkit.morgan(fpSize=128).from_smiles("CCO")

Please do not import “rdkit” directly into your module as you are likely to get confused with RDKit’s own “rdkit” module. Instead, use one of the following:

from chemfp import rdkit_toolkit
from chemfp import rdkit_toolkit as T
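
The forwarding behavior of the four objects above can be pictured with a small lazy-import proxy. This is only a sketch of the general technique, not chemfp's implementation: the LazyModuleProxy class and the module names passed to it are hypothetical, and the standard-library json module stands in for a toolkit module.

```python
import importlib


class LazyModuleProxy:
    """Forward attribute access to a module that is imported on first use."""

    def __init__(self, module_name):
        self._module_name = module_name
        self._module = None

    def __getattr__(self, name):
        # __getattr__ is only called for names not found on the proxy itself.
        if self._module is None:
            # May raise ImportError if the underlying module is not installed.
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


# "json" stands in for a toolkit module; nothing is imported until first use.
toolkit = LazyModuleProxy("json")
encoded = toolkit.dumps({"smiles": "CCO"})

# A missing module (hypothetical name) only fails when the proxy is used.
missing = LazyModuleProxy("hypothetical_missing_toolkit")
try:
    missing.from_smiles
    import_failed = False
except ImportError:
    import_failed = True
```

Because the import is deferred, creating the proxy never fails; the ImportError appears at the first attribute access, which matches the "may raise an ImportError" wording above.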

chemfp.__version__

A string describing this version of chemfp. For example, “4.1b1”.

chemfp.__version_info__

A 3-element tuple of integers containing the (major version, minor version, micro version) of this version of chemfp. For example, (4, 1, 0).

chemfp.SOFTWARE

The value of the string used in output file metadata to describe this version of chemfp. For example, “chemfp/4.1 (base license)”.

exception chemfp.ChemFPError

Bases: Exception

Base class for all of the chemfp exceptions

exception chemfp.ParseError(msg, location=None)

Bases: chemfp.ChemFPError, ValueError

Exception raised by the molecule and fingerprint parsers and writers

The public attributes are:

msg

a string or object describing the exception

location

a chemfp.io.Location instance, or None
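
Because ParseError derives from both ChemFPError and ValueError, a handler written for either base class will catch it. The sketch below uses local stand-in classes that mirror only the documented bases and public attributes; it is not chemfp's own code, and the error message is made up for illustration.

```python
class ChemFPError(Exception):
    """Stand-in for the documented base class."""


class ParseError(ChemFPError, ValueError):
    """Stand-in mirroring the documented bases and public attributes."""

    def __init__(self, msg, location=None):
        super().__init__(msg)
        self.msg = msg
        self.location = location


def caught_by(handler_type):
    """Raise a ParseError and return its msg if handler_type catches it."""
    try:
        raise ParseError("unable to parse record", location=None)
    except handler_type as err:
        return err.msg


# The same exception is caught by either base class.
via_chemfp = caught_by(ChemFPError)
via_value_error = caught_by(ValueError)
```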

exception chemfp.EncodingError

Bases: chemfp.ChemFPError, ValueError

Exception raised when the encoding or the encoding_error is unsupported or unknown

chemfp.set_default_progressbar(progressbar)

Configure the default progress bar

This must be an object implementing the tqdm class behavior or one of the following values:

  • False - do not use a progress bar
  • None or True - use the default progress bar

(False is mapped to the internal “disabled_tqdm” object.)

chemfp.get_default_progressbar()

Return the current default progress bar, or None for the default behavior

chemfp.read_molecule_fingerprints(type, source=None, format=None, id_tag=None, reader_args=None, errors='strict')

Read structures from source and return the corresponding ids and fingerprints

This returns a chemfp.FingerprintReader which can be iterated over to get the id and fingerprint for each structure record read. The fingerprint generated depends on the value of type. Structures are read from source, which can be a structure filename, a file object, or None to read from stdin.

type contains the information about how to turn a structure into a fingerprint. It can be a string or a metadata instance. String values look like OpenBabel-FP2/1, OpenEye-Path, and OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond. Default values are used for unspecified parameters. Use a Metadata instance with type and aromaticity values set in order to pass aromaticity information to OpenEye.

If format is None then the structure file format and compression are determined by the filename’s extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise format may be “smi” or “sdf” optionally followed by “.gz” or “.bz2” to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.
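
The extension-based detection described above can be sketched as follows. This is a simplified illustration that only knows the "smi" and "sdf" formats and the ".gz" and ".bz2" suffixes named in the text; the function name guess_structure_format is hypothetical, and treating every unrecognized name as uncompressed SMILES is an assumption of this sketch.

```python
def guess_structure_format(filename):
    """Return (format, compression) guessed from the filename extension(s)."""
    name = filename.lower()
    compression = ""
    for ext in (".gz", ".bz2"):
        if name.endswith(ext):
            compression = ext[1:]
            name = name[: -len(ext)]
            break
    if name.endswith(".sdf"):
        return "sdf", compression
    if name.endswith(".smi"):
        return "smi", compression
    # Assumption: fall back to uncompressed SMILES when nothing matches.
    return "smi", ""


detected = [guess_structure_format(name)
            for name in ("example.sdf.gz", "ligands.smi", "data")]
```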

If id_tag is None, then the record id is based on the title field for the given format. If the input format is “sdf” then id_tag specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the “> <ChEBI ID>” line. In that case, use id_tag = "ChEBI ID".
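
As a rough picture of what id_tag selects, the sketch below pulls the first line of a tag's value out of one SD record held as a string. The helper name get_sd_tag_id and the sample record are made up for illustration; real SD parsing is done by the underlying toolkit and handles more cases than this.

```python
def get_sd_tag_id(record, id_tag):
    """Return the first line of the named SD tag's value, or None if absent."""
    lines = record.splitlines()
    header = "<" + id_tag + ">"
    for i, line in enumerate(lines):
        if line.startswith(">") and header in line:
            # The value starts on the next line; only the first line is used.
            if i + 1 < len(lines):
                return lines[i + 1].strip()
            return None
    return None


# A made-up record with an empty title line, similar in shape to the ChEBI case.
record = """\

  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ChEBI ID>
CHEBI:15377

> <ChEBI Name>
water
second line

$$$$
"""

record_id = get_sd_tag_id(record, "ChEBI ID")
```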

The reader_args is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.

errors specifies how to handle errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record.
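
The three errors settings can be illustrated with a toy reader. The sketch below parses a made-up two-column "value id" text format rather than real structure data, and the function name read_records is hypothetical; it only shows how "strict", "report", and "ignore" differ in control flow.

```python
import sys


def read_records(lines, errors="strict"):
    """Yield (id, value) pairs from 'value id' lines, applying the errors policy."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if len(fields) != 2:
            msg = "Line %d: expected 2 fields, found %d" % (lineno, len(fields))
            if errors == "strict":
                raise ValueError(msg)      # stop at the first problem
            if errors == "report":
                sys.stderr.write(msg + "\n")  # tell the user, then skip
            continue                       # "report" and "ignore" both skip
        yield fields[1], fields[0]


lines = ["CCO ethanol", "malformed", "C methane"]
kept = list(read_records(lines, errors="ignore"))
```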

Here is an example of using fingerprints generated from a structure file:

import chemfp
from chemfp.bitops import hex_encode
fp_reader = chemfp.read_molecule_fingerprints("OpenBabel-FP4/1", "example.sdf.gz")
print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
for (id, fp) in fp_reader:
    print(id, hex_encode(fp))

See also chemfp.read_molecule_fingerprints_from_string().

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint
  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
  • format (string, or None to autodetect based on the source) – The file format and optional compression. Examples: “smi” and “sdf.gz”
  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
  • errors (one of "strict", "report", or "ignore") – specify how to handle parse errors
Returns:

a chemfp.FingerprintReader

chemfp.read_molecule_fingerprints_from_string(type, content, format, *, id_tag=None, reader_args=None, errors='strict')

Read structures from the content string and return the corresponding ids and fingerprints

The parameters are identical to chemfp.read_molecule_fingerprints() except that the entire content is passed through as a content string, rather than as a source filename. See that function for details.

You must specify the format! As there is no source filename, it’s not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.

Parameters:
  • type (string or Metadata) – information about how to convert the input structure into a fingerprint
  • content (string) – The structure data as a string.
  • format (string) – The file format and optional compression. Examples: “smi” and “sdf.gz”
  • id_tag (string, or None to use the default title for the given format) – The tag containing the record id. Example: “ChEBI ID”. Only valid for SD files.
  • reader_args (dict, or None to use the default arguments) – additional parameters for the structure reader
  • errors (one of "strict" (raise exception), "report" (send a message to stderr and continue processing), or "ignore" (continue processing)) – specify how to handle parse errors
Returns:

a chemfp.FingerprintReader

chemfp.open(source, format=None, location=None, allow_mmap=True)

Read fingerprints from a fingerprint file

Read fingerprints from source, using the given format. If source is a string then it is treated as a filename. If source is None then fingerprints are read from stdin. Otherwise, source must be a Python file object supporting the read and readline methods.

If format is None then the fingerprint file format and compression type are derived from the source filename, or from the name attribute of the source file object. If the source is None then stdin is assumed to contain uncompressed data in “fps” format.

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format

The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.

If the source is in FPS format then open will return a chemfp.fps_io.FPSReader, which will use the location if specified.

If the source is in FPB format then open will return a chemfp.arena.FingerprintArena and the location will not be used. If allow_mmap is True then chemfp may use mmap to read uncompressed FPB files. If False then chemfp will read the file’s contents into memory, which may give better performance if the FPB file is on a networked file system, at the expense of higher memory use.

Here’s an example of printing the contents of the file:

import chemfp
from chemfp.bitops import hex_encode
reader = chemfp.open("example.fps.gz")
for id, fp in reader:
    print(id, hex_encode(fp))

Parameters:
  • source (A filename string, a file object, or None) – The fingerprint source.
  • format (string, or None) – The file format and optional compression.
  • location (a Location instance, or None) – a location object used to access parser state information
  • allow_mmap (boolean) – if True, use mmap to open uncompressed FPB files, otherwise read the contents
Returns:

a chemfp.fps_io.FPSReader or chemfp.arena.FingerprintArena

chemfp.open_from_string(content, format='fps', *, location=None)

Read fingerprints from a content string containing fingerprints in the given format

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format

If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.

The optional location is a chemfp.io.Location instance. It will only be used if the source is in FPS format.
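
To make the shape of an uncompressed FPS text string concrete, here is a minimal parsing sketch. It assumes the usual FPS layout of '#'-prefixed header lines followed by records made of a hex-encoded fingerprint, a tab, and the id. The function name parse_fps_text is hypothetical, and the sketch skips the validation a real reader performs.

```python
def parse_fps_text(content):
    """Split FPS text into a metadata dict and a list of (id, fingerprint bytes)."""
    metadata = {}
    records = []
    for line in content.splitlines():
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; the "#FPS1" line has no "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        hex_fp, _, rest = line.partition("\t")
        record_id = rest.split("\t")[0]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


content = "#FPS1\n#num_bits=16\n0f00\tfirst\nff01\tsecond\n"
metadata, records = parse_fps_text(content)
```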

Parameters:
  • content (byte or text string) – The fingerprint data as a string.
  • format (string) – The file format and optional compression. Unicode strings may not be compressed.
  • location (a Location instance, or None) – a location object used to access parser state information
Returns:

a chemfp.fps_io.FPSReader or chemfp.arena.FingerprintArena

chemfp.open_fingerprint_writer(destination, metadata=None, format=None, *, alignment=8, reorder=True, level=None, include_metadata=True, tmpdir=None, max_spool_size=None, errors='strict', location=None)

Create a fingerprint writer for the given destination

The fingerprint writer is an object with methods to write fingerprints to the given destination. The output format is given by format. If that’s None then the format is inferred from the destination, falling back to “fps” if format detection fails.

The metadata, if given, is a Metadata instance, and used to fill the header of an FPS file or META block of an FPB file.

If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None for stdout. If the output format is “fpb” then destination must be a filename or seekable file object. A fingerprint writer with compressed FPB output is not supported; use arena.save() instead, or post-process the file.

Use level to change the compression level. The default is 9 for gzip and 3 for zstd. Use “min”, “default”, or “max” as aliases for the minimum, default, and maximum values for each range.

By default the metadata is included in the FPS output. Set include_metadata to False to disable writing the metadata.

Some options only apply to FPB output. The alignment specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.

The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn’t enough available free memory. In that case, set max_spool_size to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)

Use tmpdir to specify where to write the temporary spool files if you don’t want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.
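
The idea behind max_spool_size and tmpdir can be illustrated with the standard library's tempfile.SpooledTemporaryFile, which keeps data in memory until a size limit is reached and then moves it to a temporary file. This is an analogy only: whether chemfp uses this class internally is not stated here, and the function name spool_fingerprints is hypothetical.

```python
import tempfile


def spool_fingerprints(fingerprints, max_spool_size, tmpdir=None):
    """Collect fingerprint bytes, spilling to a temporary file past the size limit."""
    with tempfile.SpooledTemporaryFile(max_size=max_spool_size, dir=tmpdir) as spool:
        for fp in fingerprints:
            spool.write(fp)  # rolls over to disk once max_size is exceeded
        spool.seek(0)
        return spool.read()


# Four bytes are written with a 3-byte limit, so the data spills to disk,
# but the caller sees the same bytes either way.
data = spool_fingerprints([b"\x01\x02", b"\x03\x04"], max_spool_size=3)
```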

Some options only apply to FPS output. errors specifies how to handle recoverable write errors. The value “strict” raises an exception if there are any detected errors. The value “report” sends an error message to stderr and skips to the next record. The value “ignore” skips to the next record. If include_metadata is false then the FPS metadata (the initial lines starting with ‘#’) are not included.

The location is a Location instance. It lets the caller access state information such as the number of records that have been written.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • metadata (a Metadata instance, or None) – the fingerprint metadata
  • format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
  • alignment (positive integer) – arena byte alignment for FPB files
  • reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
  • level (an integer, the strings "min", "default" or "max", or None for default) – the compression level for compressed output
  • include_metadata (a boolean) – if True, include the header metadata in the FPS output
  • tmpdir (string or None) – the directory to use for temporary files, when max_spool_size is specified
  • max_spool_size (integer, or None) – number of bytes to store in memory before using a temporary file. If None, use memory for everything.
  • location (a Location instance, or None) – a location object used to access output state information
Returns:

a chemfp.FingerprintWriter

chemfp.load_fingerprints(reader, metadata=None, reorder=True, alignment=None, format=None, allow_mmap=True, *, progress=False)

Load all of the fingerprints into an in-memory FingerprintArena data structure

The function reads all of the fingerprints and identifiers from reader and stores them into an in-memory chemfp.arena.FingerprintArena data structure which supports fast similarity searches.

If reader is a string, the None object, or has a read attribute then it, the format, and allow_mmap will be passed to the chemfp.open() function and the result used as the reader. If that returns a FingerprintArena then the reorder and alignment parameters are ignored and the arena returned.

If reader is a FingerprintArena then the reorder and alignment parameters are ignored. If metadata is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is metadata.

Otherwise the reader or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.

metadata specifies the metadata for all returned arenas. If not given the default comes from the source file or from reader.metadata.

The loader may reorder the fingerprints for better search performance. To prevent ordering, use reorder=False. The reorder parameter is ignored if the reader is an arena or FPB file.
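
Why popcount ordering helps can be sketched in pure Python. The Tanimoto score of two fingerprints can never exceed min(a, b) / max(a, b), where a and b are their popcounts, so for a given query and threshold only a contiguous range of a popcount-sorted list needs to be examined. The function names below are hypothetical and the code illustrates the bound only; it says nothing about how chemfp lays out or searches an arena.

```python
import bisect
import math


def popcount(fp):
    """Number of set bits in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def reorder_by_popcount(id_fp_pairs):
    """Sort (id, fingerprint) pairs by increasing popcount (stable sort)."""
    return sorted(id_fp_pairs, key=lambda pair: popcount(pair[1]))


def candidate_range(sorted_popcounts, query_popcount, threshold):
    """Return the (start, end) slice whose popcounts could reach the threshold."""
    if threshold <= 0.0:
        return 0, len(sorted_popcounts)
    # The small epsilon keeps the range slightly conservative under rounding.
    low = math.ceil(query_popcount * threshold - 1e-9)
    high = math.floor(query_popcount / threshold + 1e-9)
    start = bisect.bisect_left(sorted_popcounts, low)
    end = bisect.bisect_right(sorted_popcounts, high)
    return start, end


pairs = [("a", b"\xff\x00"), ("b", b"\x01\x00"), ("c", b"\x0f\x00"), ("d", b"\xff\xff")]
ordered = reorder_by_popcount(pairs)
popcounts = [popcount(fp) for _, fp in ordered]
```

With a query popcount of 8, a threshold of 0.5 still leaves three of the four fingerprints as candidates, while a threshold of 0.9 leaves only the one with popcount 8.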

The alignment option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte boundary and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.
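
The arithmetic behind the alignment option is simple rounding up. The sketch below shows it for a 166-bit fingerprint, which needs 21 bytes and is padded to 24 bytes with an alignment of 8. The function names are hypothetical and the sketch covers only the size calculation, not the arena's actual memory layout.

```python
def padded_size(num_bytes, alignment):
    """Round num_bytes up to the next multiple of alignment."""
    return -(-num_bytes // alignment) * alignment


def fingerprint_offsets(num_fingerprints, num_bytes, alignment):
    """Byte offset of each fingerprint when stored back to back with padding."""
    size = padded_size(num_bytes, alignment)
    return [i * size for i in range(num_fingerprints)]


# 166 bits -> 21 bytes of data, padded to 24 bytes per fingerprint.
storage = padded_size(21, 8)
offsets = fingerprint_offsets(3, 21, 8)
```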

The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.

Parameters:
  • reader (a string, file object, or (id, fingerprint) iterator) – An iterator over (id, fingerprint) pairs
  • metadata (Metadata) – The metadata for the arena, if other than reader.metadata
  • reorder (True or False) – Specify if fingerprints should be reordered for better performance
  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
  • format (None, "fps", "fps.gz", "fps.zst", "fpb", "fpb.gz" or "fpb.zst") – The file format name if the reader is a string
  • allow_mmap (True or False) – Allow chemfp to use mmap on FPB files, instead of reading the file’s contents into memory
  • progress (True, False, or a callable) – Enable or disable progress bars, optionally specifying the progress bar constructor
Returns:

chemfp.arena.FingerprintArena

chemfp.load_fingerprints_from_string(content, format='fps', *, reorder=True, alignment=None, progress=False)

Load the fingerprints from the content string, in the given format

The supported format strings are:

  • “fps”, “fps.gz”, or “fps.zst” for fingerprints in FPS format
  • “fpb”, “fpb.gz” or “fpb.zst” for fingerprints in FPB format

If the format is ‘fps’ and not compressed then the content may be a text string. Otherwise content must be a byte string.

If the content is not in FPB format then by default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set reorder to False to preserve the input fingerprint order.

If the content is not in FPB format then alignment specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8-byte boundary and use storage space which is a multiple of 8 bytes long. The default value of None determines the best alignment based on the fingerprint size and available popcount methods.

The progress keyword argument, if True, enables a progress bar when reading from an FPS file. The default, False, shows no progress. If neither True nor False then it should be a callable which accepts the tqdm parameters and returns a tqdm-like instance.

Parameters:
  • content (byte or text string) – The fingerprint data as a string.
  • format (string) – The file format and optional compression. Unicode strings may not be compressed.
  • reorder (True or False) – True reorders the fingerprints by popcount, False leaves them in input order
  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
  • progress (True, False, or a callable) – Enable or disable progress bars, optionally specifying the progress bar constructor
Returns:

chemfp.arena.FingerprintArena

chemfp.count_tanimoto_hits(queries, targets, threshold=0.7, arena_size=100)

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
    print(query_id, "has", count, "neighbors with at least 0.9 similarity")
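
For readers who want to see what "at least threshold similar" means, here is a slow pure-Python sketch of a Tanimoto hit count over byte-string fingerprints. It is a reference illustration only, not how chemfp computes its results. The function names are hypothetical, and returning 0.0 when both fingerprints have no bits set is an assumption of this sketch.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    if either == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return both / either


def count_hits(queries, targets, threshold):
    """Yield (query_id, count) for targets scoring at least threshold."""
    targets = list(targets)  # the targets are scanned once per query
    for query_id, query_fp in queries:
        count = sum(1 for _, target_fp in targets
                    if tanimoto(query_fp, target_fp) >= threshold)
        yield query_id, count


queries = [("q1", b"\x0f")]
targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\xf0")]
# Scores against q1 are 1.0, 0.75 and 0.0.
counts = list(count_hits(queries, targets, threshold=0.7))
```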

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.
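
The batching described above can be pictured with a small generator that groups an input stream into lists of at most arena_size items. This is a generic sketch; the name iter_batches is hypothetical and chemfp builds arenas rather than plain lists.

```python
from itertools import islice


def iter_batches(records, arena_size=100):
    """Yield lists of up to arena_size records; None means one single batch."""
    if arena_size is None:
        batch = list(records)
        if batch:
            yield batch
        return
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, arena_size))
        if not batch:
            return
        yield batch


small_batches = list(iter_batches(range(5), arena_size=2))
single_batch = list(iter_batches(range(5), arena_size=None))
```

Smaller batches hold fewer queries in memory and produce the first results sooner, while a single large batch avoids the per-batch overhead.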

Note: a chemfp.fps_io.FPSReader may be used as the targets, but it can only process one batch of queries because the reader is not reset for the next batch. Searching a chemfp.arena.FingerprintArena is faster, but loading an arena from an FPS file takes extra time. With only a few queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.count_tanimoto_hits_fp() or chemfp.search.count_tanimoto_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator of (query_id, count) pairs, one for each query

Find all targets within threshold of each query term

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.id_threshold_tanimoto_search(queries, targets, threshold=0.8):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as the targets, but it can only process one batch of queries because the reader is not reset for the next batch. Searching a chemfp.arena.FingerprintArena is faster, but loading an arena from an FPS file takes extra time. With only a few queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tanimoto_search_fp() or chemfp.search.threshold_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

Find the k-nearest targets within threshold of each query term

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.

Example:

import chemfp
# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.id_knearest_tanimoto_search(queries, targets, k=3, threshold=0.8):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
    if hits:
        target_id, score = hits[-1]
        print("    The least similar is", target_id, "with score", score)
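
The k-nearest selection with a threshold can be sketched in pure Python with heapq.nlargest. This is an illustration of the documented result shape, not chemfp's algorithm. The function names are hypothetical, k and threshold are required arguments here so that no defaults are implied, and scoring two empty fingerprints as 0.0 is an assumption.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0  # assumption for empty inputs


def knearest_search(queries, targets, k, threshold):
    """Yield (query_id, hits) with up to k (target_id, score) pairs, best first."""
    targets = list(targets)
    for query_id, query_fp in queries:
        passing = []
        for target_id, target_fp in targets:
            score = tanimoto(query_fp, target_fp)
            if score >= threshold:
                passing.append((score, target_id))
        # nlargest returns the k highest scores in decreasing order.
        best = heapq.nlargest(k, passing, key=lambda pair: pair[0])
        yield query_id, [(target_id, score) for score, target_id in best]


queries = [("q1", b"\x0f")]
targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\x03"), ("t4", b"\xf0")]
results = list(knearest_search(queries, targets, k=2, threshold=0.4))
```

Asking for more neighbors than pass the threshold simply returns fewer hits, which is why the example above checks len(hits) before reading hits[-1].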

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as the targets, but it can only process one batch of queries because the reader is not reset for the next batch. Searching a chemfp.arena.FingerprintArena is faster, but loading an arena from an FPS file takes extra time. With only a few queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tanimoto_search_fp() or chemfp.search.knearest_tanimoto_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.

chemfp.count_tanimoto_hits_symmetric(fingerprints, threshold=0.7)

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
    print(fp_id, "has", count, "neighbors with at least 0.6 similarity")
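
A symmetric count compares every fingerprint with every other one in the same collection and never with itself. Because the Tanimoto score is symmetric, each pair only needs to be scored once. The sketch below shows that idea in plain Python; the function names are hypothetical and it makes no use of popcount indices, which the real arena relies on.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0  # assumption for empty inputs


def count_hits_symmetric(fingerprints, threshold):
    """Return (fp_id, count) pairs, counting other fingerprints at or above threshold."""
    fingerprints = list(fingerprints)
    counts = [0] * len(fingerprints)
    for i in range(len(fingerprints)):
        # Start at i + 1: skips the self-comparison and scores each pair once.
        for j in range(i + 1, len(fingerprints)):
            if tanimoto(fingerprints[i][1], fingerprints[j][1]) >= threshold:
                counts[i] += 1
                counts[j] += 1
    return [(fp_id, count) for (fp_id, _), count in zip(fingerprints, counts)]


arena_like = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0")]
symmetric_counts = count_hits_symmetric(arena_like, threshold=0.6)
```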

You may also be interested in chemfp.search.count_tanimoto_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint

chemfp.threshold_tanimoto_search_symmetric(fingerprints, threshold=0.7)

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75):
    print(fp_id, "has", len(hits), "neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print("   %s  %.2f" % (other_id, score))

You may also be interested in the chemfp.search.threshold_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.knearest_tanimoto_search_symmetric(fingerprints, k=3, threshold=0.0)

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5):
    print(fp_id, "has", len(hits), "neighbors, with scores", end=" ")
    print(", ".join("%.2f" % x for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tanimoto_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.count_tversky_hits(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100)

Count the number of targets within threshold of each query term

For each query in queries, count the number of targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, count) pairs.

Example:

import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, count) in chemfp.count_tversky_hits(
        queries, targets, threshold=0.9, alpha=0.5, beta=0.5):
    print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity")
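
The example above calls the result a Dice similarity because the Tversky index with alpha = beta = 0.5 reduces to Dice, just as alpha = beta = 1.0 reduces to Tanimoto. The pure-Python sketch below shows the formula. The function name is hypothetical, and two points are assumptions of this sketch: alpha weights the bits found only in the query and beta those found only in the target, and a zero denominator scores 0.0.

```python
def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte-string fingerprints."""
    both = only_query = only_target = 0
    for q, t in zip(query_fp, target_fp):
        both += bin(q & t).count("1")
        only_query += bin(q & ~t & 0xFF).count("1")
        only_target += bin(t & ~q & 0xFF).count("1")
    denominator = both + alpha * only_query + beta * only_target
    if denominator == 0:
        return 0.0  # assumption: nothing to compare scores 0.0
    return both / denominator


query_fp, target_fp = b"\x0f", b"\x07"   # 3 shared bits, 1 query-only bit
as_tanimoto = tversky(query_fp, target_fp, alpha=1.0, beta=1.0)  # 3 / 4
as_dice = tversky(query_fp, target_fp, alpha=0.5, beta=0.5)      # 3 / 3.5 = 6 / 7
```

Unequal weights make the measure asymmetric: with alpha=0.0 and beta=1.0 the query-only bit is ignored, so this pair scores 1.0.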

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as the targets, but it can only process one batch of queries because the reader is not reset for the next batch. Searching a chemfp.arena.FingerprintArena is faster, but loading an arena from an FPS file takes extra time. With only a few queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.count_tversky_hits_fp() or chemfp.search.count_tversky_hits_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (a positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator of (query_id, count) pairs, one for each query

Find all targets within threshold of each query term

For each query in queries, find all the targets in targets which are at least threshold similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

Example:

import chemfp
queries = chemfp.open("queries.fps")
targets = chemfp.load_fingerprints("targets.fps.gz")
for (query_id, hits) in chemfp.threshold_tversky_search(
           queries, targets, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    non_identical = [target_id for (target_id, score) in hits if score != 1.0]
    print("  The non-identical hits are:", non_identical)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. With a small number of queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.threshold_tversky_search_fp() or chemfp.search.threshold_tversky_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. ‘hits’ contains a list of (target_id, score) pairs.

Find the k-nearest targets within threshold of each query term

For each query in queries, find the k-nearest of all the targets in targets which are at least threshold similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted.

This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary.

Example:

import chemfp

# Use the first 5 fingerprints as the queries
queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5))
targets = chemfp.load_fingerprints("pubchem_subset.fps")

# Find the 3 nearest hits with a similarity of at least 0.8
for (query_id, hits) in chemfp.knearest_tversky_search(
          queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5):
    print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity")
    if hits:
        target_id, score = hits[-1]
        print("    The least similar is", target_id, "with score", score)

Internally, queries are processed in batches with arena_size elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use arena_size=None to process the input as a single batch.

Note: a chemfp.fps_io.FPSReader may be used as a target, but it will only process one batch and will not reset for the next batch. It’s faster to search a chemfp.arena.FingerprintArena, but loading an arena from an FPS file takes extra time. With a small number of queries, loading the arena may take longer than searching the FPSReader directly.

If you know the targets are in an arena then you may want to use chemfp.search.knearest_tversky_search_fp() or chemfp.search.knearest_tversky_search_arena().

Parameters:
  • queries (any fingerprint container) – The query fingerprints.
  • targets (chemfp.arena.FingerprintArena or the slower chemfp.fps_io.FPSReader) – The target fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
  • arena_size (positive integer, or None) – The number of queries to process in a batch
Returns:

An iterator containing (query_id, hits) pairs, one for each query. The hits are a list of (target_id, score) pairs, sorted by score.
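
The k-nearest rule described above (keep only hits at or above threshold, return at most k of them, highest score first) can be sketched in plain Python. This is an illustration under stated assumptions, not chemfp's search code: the helper name knearest_hits and the pre-scored input list are hypothetical.

```python
import heapq


def knearest_hits(scored_targets, k=3, threshold=0.0):
    """Return up to k (target_id, score) pairs with score >= threshold,
    ordered from highest score to lowest."""
    passing = [(target_id, score) for (target_id, score) in scored_targets
               if score >= threshold]
    return heapq.nlargest(k, passing, key=lambda pair: pair[1])


# Hypothetical scores of one query against five targets.
scored = [("T1", 0.95), ("T2", 0.62), ("T3", 0.88), ("T4", 0.81), ("T5", 0.79)]

hits = knearest_hits(scored, k=3, threshold=0.8)
top_two = knearest_hits(scored, k=2, threshold=0.8)
none_found = knearest_hits(scored, k=3, threshold=0.99)
```

Fewer than k hits are returned when not enough targets reach the threshold, which is why the example above checks that hits is non-empty before using hits[-1].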

chemfp.count_tversky_hits_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)

Find the number of other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the number of other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, count) pairs.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, count) in chemfp.count_tversky_hits_symmetric(
        arena, threshold=0.6, alpha=0.5, beta=0.5):
    print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity")

You may also be interested in chemfp.search.count_tversky_hits_symmetric().

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, count) pairs, one for each fingerprint
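
As a rough illustration of what a symmetric count computes, the sketch below compares every unordered pair once and credits both members of a matching pair, so a fingerprint is never compared with itself. It is a plain-Python assumption-laden sketch, not chemfp's algorithm: it uses Python sets of on-bit positions, the hypothetical helper names tanimoto and count_hits_symmetric, and the Tanimoto score (the Tversky index with alpha=beta=1.0, the defaults in the signature above).

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto score of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_hits_symmetric(fingerprints, threshold=0.7):
    """fingerprints maps an id to a set of on-bit positions.
    Returns a dict mapping each id to its number of neighbors."""
    counts = {fp_id: 0 for fp_id in fingerprints}
    # Each unordered pair is scored once; a match counts for both ids.
    for (id1, fp1), (id2, fp2) in combinations(fingerprints.items(), 2):
        if tanimoto(fp1, fp2) >= threshold:
            counts[id1] += 1
            counts[id2] += 1
    return counts


fps = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 3, 4, 5},
    "D": {9},
}
counts = count_hits_symmetric(fps, threshold=0.7)
```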

chemfp.threshold_tversky_search_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0)

Find the other fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the other fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hit order is arbitrary.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric(
           arena, threshold=0.75, alpha=0.5, beta=0.5):
    print(fp_id, "has", len(hits), "Dice neighbors:")
    for (other_id, score) in hits.get_ids_and_scores():
        print("   %s  %.2f" % (other_id, score))

You may also be interested in the chemfp.search.threshold_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

chemfp.knearest_tversky_search_symmetric(fingerprints, k=3, threshold=0.0, alpha=1.0, beta=1.0)

Find the k-nearest fingerprints within threshold of each fingerprint

For each fingerprint in the fingerprints arena, find the nearest k fingerprints in the same arena which are at least threshold similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

This function returns an iterator of (fingerprint_id, SearchResult) pairs. The chemfp.search.SearchResult hits are ordered from highest score to lowest, with ties broken arbitrarily.

Example:

import chemfp
arena = chemfp.load_fingerprints("targets.fps.gz")
for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric(
        arena, k=5, threshold=0.5, alpha=0.5, beta=0.5):
    # end=" " keeps the scores on the same line, separated from the label
    print(fp_id, "has", len(hits), "neighbors, with Dice scores", end=" ")
    print(", ".join("%.2f" % x for x in hits.get_scores()))

You may also be interested in the chemfp.search.knearest_tversky_search_symmetric() function.

Parameters:
  • fingerprints (a FingerprintArena with precomputed popcount_indices) – The arena containing the fingerprints.
  • k (positive integer) – The maximum number of nearest neighbors to find.
  • threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:

An iterator of (fp_id, SearchResult) pairs, one for each fingerprint

exception chemfp.ChemFPProblem(severity, category, description)

Bases: chemfp.ChemFPError

Information about a compatibility problem between a query and target.

Instances are generated by chemfp.check_fingerprint_problems() and chemfp.check_metadata_problems().

The public attributes are:

severity

one of “info”, “warning”, or “error”

error_level

5 for “info”, 10 for “warning”, and 20 for “error”

category

a string used as a category name. This string will not change over time.

description

a more detailed description of the error, including details of the mismatch. The description depends on query_name and target_name and may change over time.

The current category names are:
  • “num_bits mismatch” (error)
  • “num_bytes mismatch” (error)
  • “type mismatch” (warning)
  • “aromaticity mismatch” (info)
  • “software mismatch” (info)
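
Because error_level is numeric, a caller can separate blocking problems from informational ones with a simple comparison. The sketch below uses a hypothetical namedtuple stand-in named Problem that only mirrors the public attributes listed above. It is not the ChemFPProblem class, and the helper split_problems is likewise an assumption for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in with the same public attribute names as above.
Problem = namedtuple("Problem", "severity error_level category description")


def split_problems(problems, min_error_level=20):
    """Return (fatal, non_fatal) lists, split on error_level."""
    fatal = [p for p in problems if p.error_level >= min_error_level]
    non_fatal = [p for p in problems if p.error_level < min_error_level]
    return fatal, non_fatal


problems = [
    Problem("error", 20, "num_bytes mismatch",
            "query contains 64 bytes but target has 128 byte fingerprints"),
    Problem("warning", 10, "type mismatch",
            "query has fingerprints of type 'Example/1' but target has "
            "fingerprints of type 'Counter-Example/1'"),
]

fatal, non_fatal = split_problems(problems)
```

Testing on category rather than description is the safer choice, since the category strings are documented as stable while the descriptions may change.
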
chemfp.check_fingerprint_problems(query_fp, target_metadata, query_name='query', target_name='target')

Return a list of compatibility problems between a fingerprint and a metadata

If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the query_fp byte string and the target_metadata then it will return a list containing a ChemFPProblem instance, with a severity level “error” and category “num_bytes mismatch”.

This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like:

>>> import chemfp
>>> problems = chemfp.check_fingerprint_problems(b"A"*64, chemfp.Metadata(num_bytes=128))
>>> problems[0].description
'query contains 64 bytes but target has 128 byte fingerprints'

You can change the error message with the query_name and target_name parameters:

>>> import chemfp
>>> problems = chemfp.check_fingerprint_problems(b"z"*64, chemfp.Metadata(num_bytes=128),
...      query_name="input", target_name="database")
>>> problems[0].description
'input contains 64 bytes but database has 128 byte fingerprints'
Parameters:
  • query_fp (byte string) – a fingerprint (usually the query fingerprint)
  • target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
  • query_name (string) – the text used to describe the fingerprint, in case of problem
  • target_name (string) – the text used to describe the metadata, in case of problem
Returns:

a list of ChemFPProblem instances

chemfp.check_metadata_problems(query_metadata, target_metadata, query_name='query', target_name='target')

Return a list of compatibility problems between two metadata instances.

If there are no problems then this returns an empty list. Otherwise it returns a list of ChemFPProblem instances, with a severity level ranging from “info” to “error”.

Bit length and byte length mismatches produce an “error”. Fingerprint type and aromaticity mismatches produce a “warning”. Software version mismatches produce an “info”.

This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like:

>>> import chemfp
>>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
>>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
>>> problems = chemfp.check_metadata_problems(m1, m2)
>>> len(problems)
2
>>> print(problems[1].description)
query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'

You can change the error message with the query_name and target_name parameters:

>>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
>>> print(problems[1].description)
input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'
Parameters:
  • query_metadata (Metadata instance) – the metadata to check (usually the query metadata)
  • target_metadata (Metadata instance) – the metadata to check against (usually the target metadata)
  • query_name (string) – the text used to describe the query metadata, in case of problem
  • target_name (string) – the text used to describe the target metadata, in case of problem
Returns:

a list of ChemFPProblem instances

class chemfp.Metadata(num_bits=None, num_bytes=None, type=None, aromaticity=None, software=None, sources=None, date=None)

Bases: object

Store information about a set of fingerprints

The public attributes are:

num_bits

the number of bits in the fingerprint

num_bytes

the number of bytes in the fingerprint

type

the fingerprint type string

aromaticity

aromaticity model (only used with OEChem, and now deprecated)

software

software used to make the fingerprints

sources

list of sources used to make the fingerprint

date

a datetime timestamp of when the fingerprints were made

copy(num_bits=None, num_bytes=None, type=None, aromaticity=None, software=None, sources=None, date=None)

Return a new Metadata instance based on the current attributes and optional new values

When called with no parameters, make a new Metadata instance with the same attributes as the current instance.

If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you created it.

Parameters:
  • num_bits (an integer, or None) – the number of bits in the fingerprint
  • num_bytes (an integer, or None) – the number of bytes in the fingerprint
  • type (string or None) – the fingerprint type description
  • aromaticity (None) – obsolete
  • software (string or None) – a description of the software
  • sources (list of strings, a string (interpreted as a list with one string), or None) – source filenames
  • date (a datetime instance, or None) – creation or processing date for the contents
Returns:

a new Metadata instance
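
The "None means keep the current value" rule can be sketched with a small stand-in class. SimpleMetadata below is hypothetical and carries only two of the fields. It illustrates the documented semantics and is not the chemfp.Metadata implementation.

```python
class SimpleMetadata:
    """Hypothetical stand-in with two of the Metadata fields."""

    def __init__(self, num_bits=None, type=None):
        self.num_bits = num_bits
        self.type = type

    def copy(self, num_bits=None, type=None):
        # A None argument means "keep the current value".
        return SimpleMetadata(
            num_bits=self.num_bits if num_bits is None else num_bits,
            type=self.type if type is None else type,
        )


original = SimpleMetadata(num_bits=1024, type="Example/1")

# Override one field; the other is carried over.
renamed = original.copy(type="Counter-Example/1")

# copy() cannot reset a field to None, so assign on the new object instead.
untyped = original.copy()
untyped.type = None
```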

class chemfp.FingerprintReader(metadata)

Bases: object

Base class for all chemfp objects holding fingerprint records

All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.

get_fingerprint_type()

Get the fingerprint type object based on the metadata’s type field

This uses self.metadata.type to get the fingerprint type string then calls chemfp.get_fingerprint_type() to get and return a chemfp.types.FingerprintType instance.

This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn’t available.

Returns: a chemfp.types.FingerprintType
iter_arenas(arena_size=1000)

Iterate through arena_size fingerprints at a time, as subarenas

Iterate through arena_size fingerprints at a time, returned as chemfp.arena.FingerprintArena instances. The arenas are in input order and not reordered by popcount.

This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.

If arena_size is None then this makes an iterator which returns a single arena containing all of the fingerprints.

Parameters: arena_size (positive integer, or None) – The number of fingerprints to put into each arena.
Returns: an iterator of chemfp.arena.FingerprintArena instances
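
The batching behavior described for arena_size can be sketched with a plain-Python generator. This is only an illustration of the chunking pattern: the helper iter_batches is hypothetical and yields lists of records, whereas iter_arenas() returns FingerprintArena instances.

```python
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield lists of up to batch_size records, in input order.
    batch_size=None yields a single list containing every record."""
    iterator = iter(records)
    if batch_size is None:
        yield list(iterator)
        return
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Seven placeholder (id, fingerprint) records.
records = [("id%d" % i, b"\x00") for i in range(7)]

sizes = [len(batch) for batch in iter_batches(records, batch_size=3)]
single = [len(batch) for batch in iter_batches(records, batch_size=None)]
```

Only one batch is held in memory at a time, which is the memory versus throughput trade-off the text describes.
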
load(*, reorder=True, alignment=None, progress=False)

Load all of the fingerprints into an arena and return the arena

Parameters:
  • reorder (True or False) – Specify if fingerprints should be reordered for better performance
  • alignment (a positive integer, or None) – Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
  • progress (True, False, or a callable) – Enable or disable progress bars, optionally specifying the progress bar constructor
Returns:

a chemfp.arena.FingerprintArena instance

save(destination, format=None, level=None)

Save the fingerprints to a given destination and format

The output format is specified by format. If format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
  • level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files
Returns:

None
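
The format selection rules above can be sketched as a small function. The helper name guess_output_format is hypothetical, and the fallback to "fps" when destination is a file object or None is an assumption of this sketch; only the fallback for unrecognized extensions is stated above.

```python
def guess_output_format(destination, format=None):
    """Pick an output format name following the rules described above."""
    if format is not None:
        # An explicit format always wins.
        return format
    if isinstance(destination, str):
        for extension in ("fps.gz", "fps.zst", "fps", "fpb"):
            if destination.endswith("." + extension):
                return extension
    # Unrecognized extension. Also used here (by assumption) for
    # file objects and None.
    return "fps"


compressed = guess_output_format("out.fps.gz")
binary = guess_output_format("out.fpb")
fallback = guess_output_format("out.txt")
explicit = guess_output_format("out.txt", format="fpb")
```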

class chemfp.FingerprintIterator(metadata, id_fp_iterator, location=None, close=None)

Bases: chemfp.FingerprintReader

A chemfp.FingerprintReader for an iterator of (id, fingerprint) pairs

This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.

The attributes are:

  • metadata - a Metadata describing the fingerprints
  • location - a Location describing file processing
  • closed - False if the underlying file is open, otherwise True

A FingerprintIterator is a context manager which will close the underlying iterator if it’s given a close handler.

Like all iterators you can use next() to get the next (id, fingerprint) pair.

close()

Close the iterator.

The call will be forwarded to the close callable passed to the constructor. If that close is None then this does nothing.

class chemfp.Fingerprints(metadata, id_fp_pairs)

Bases: chemfp.FingerprintReader

A chemfp.FingerprintReader containing a metadata and a list of (id, fingerprint) pairs.

This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.

This implements a simple list-like collection of fingerprints. It supports:

  • iteration: for (id, fingerprint) in fingerprints: …
  • indexing: id, fingerprint = fingerprints[1]
  • length: len(fingerprints)

More features, like slicing, will be added as needed or when requested.

class chemfp.FingerprintWriter

Bases: object

Base class for the fingerprint writers

The three fingerprint writer classes are:

If the chemfp_converters package is available then its FlushFingerprintWriter will be used to write fingerprints in flush format.

Use chemfp.open_fingerprint_writer() to create a fingerprint writer; do not instantiate the writer classes directly.

All classes have the following attributes:

  • metadata - a chemfp.Metadata instance
  • format - a string describing the base format type (without compression); either ‘fps’ or ‘fpb’
  • closed - False when the file is open, else True

Fingerprint writers are also their own context manager, and close the writer on context exit.

close()

Close the writer

This will set self.closed to True.

format = None
write_fingerprint(id, fp)

Write a single fingerprint record with the given id and fp to the destination

Parameters:
  • id (string) – the record identifier
  • fp (byte string) – the fingerprint
write_fingerprints(id_fp_pairs)

Write a sequence of (id, fingerprint) pairs to the destination

Parameters: id_fp_pairs – An iterable of (id, fingerprint) pairs. id is a string and fingerprint is a byte string.
chemfp.get_num_threads()

Return the number of OpenMP threads to use in searches

Initially this is the value returned by omp_get_max_threads(), which is typically the number of available processor cores (for example, 4 on a four-core machine) unless you set the environment variable OMP_NUM_THREADS to some other value.

It may be any value in the range 1 to get_max_threads(), inclusive.

Returns: the current number of OpenMP threads to use
chemfp.set_num_threads(num_threads)

Set the number of OpenMP threads to use in searches

If num_threads is less than one then it is treated as one, and a value greater than get_max_threads() is treated as get_max_threads().

Parameters: num_threads (int) – the new number of OpenMP threads to use
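
The clamping rule described above amounts to the following arithmetic. The helper clamp_num_threads is hypothetical and takes the maximum as an explicit argument; it only illustrates the documented range handling, not how chemfp stores the value.

```python
def clamp_num_threads(num_threads, max_threads):
    """Clamp num_threads to the range 1..max_threads, inclusive."""
    return max(1, min(num_threads, max_threads))


too_small = clamp_num_threads(0, 8)
in_range = clamp_num_threads(4, 8)
too_large = clamp_num_threads(64, 8)
```
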
chemfp.get_max_threads()

Return the maximum number of threads available.

WARNING: this likely doesn’t do what you think it does. Do not use!

If OpenMP is not available then this will return 1. Otherwise it returns the maximum number of threads available, as reported by omp_get_num_threads().

chemfp.has_toolkit(toolkit_name)

Return True if the named toolkit is available, otherwise False

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False.

>>> import chemfp
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False

The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached.

Parameters: toolkit_name (string) – the toolkit name
Returns: True or False
chemfp.get_toolkit(toolkit_name)

Return the named toolkit, if available, or raise a ValueError

If toolkit_name is one of “openbabel”, “openeye”, or “rdkit” and the named toolkit is available, then it will return chemfp.openbabel_toolkit, chemfp.openeye_toolkit, or chemfp.rdkit_toolkit, respectively:

>>> import chemfp
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
     ...
ValueError: Unable to get toolkit 'rdkit': No module named rdkit
Parameters: toolkit_name (string) – the toolkit name
Returns: the chemfp toolkit
Raises: ValueError if toolkit_name is unknown or the toolkit does not exist
chemfp.get_toolkit_names()

Return a set of available toolkit names

The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:

>>> import chemfp
>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
Returns: a set of toolkit names, as strings
chemfp.get_fingerprint_family(family_name)

Return the named fingerprint family, or raise a ValueError if not available

Given a family_name like OpenBabel-FP2 or OpenEye-MACCS166 return the corresponding chemfp.types.FingerprintFamily.

Parameters: family_name (string) – the family name
Returns: a chemfp.types.FingerprintFamily instance
chemfp.get_fingerprint_families(toolkit_name=None)

Return a list of available fingerprint families

Parameters: toolkit_name (string) – restrict fingerprints to the named toolkit
Returns: a list of chemfp.types.FingerprintFamily instances
chemfp.has_fingerprint_family(family_name)

Test if the fingerprint family is available

Return True if the fingerprint family_name is available, otherwise False. The family_name may be versioned or unversioned, like “OpenBabel-FP2/1” or “OpenEye-MACCS166”.

Parameters: family_name (string) – the family name
Returns: True or False
chemfp.get_fingerprint_family_names(include_unavailable=False, toolkit_name=None)

Return a set of fingerprint family name strings

The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings.

If include_unavailable is True then this will return a set of all of the fingerprint family names, including those which could not be loaded.

The set contains both the versioned and unversioned family names, so both OpenBabel-FP2/1 and OpenBabel-FP2 may be returned.

Parameters: include_unavailable (True or False) – Should unavailable family names be included in the result set?
Returns: a set of strings
chemfp.get_fingerprint_type(type, fingerprint_kwargs=None)

Get the fingerprint type based on its type string and optional keyword arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the fingerprint_kwargs dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the fingerprint_kwargs takes precedence.

For example:

>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

Use get_fingerprint_type_from_text_settings() if your fingerprint parameter values are all string-encoded, eg, from the command-line or a configuration file.

Parameters:
  • type (string) – a fingerprint type string
  • fingerprint_kwargs (a dictionary of string names and Python types for values) – fingerprint type parameters
Returns:

a chemfp.types.FingerprintType
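
The precedence rule (a value in fingerprint_kwargs overrides the same parameter in the type string) can be sketched as a dictionary merge. The helper merge_type_parameters is hypothetical. It does not resolve versions or fill in default parameters the way the real function does, and it converts values to strings purely to keep the sketch simple.

```python
def merge_type_parameters(type_string, fingerprint_kwargs=None):
    """Split 'Name key=value ...' and let fingerprint_kwargs override."""
    name, *terms = type_string.split()
    # Parameters given in the type string come first ...
    params = dict(term.split("=", 1) for term in terms)
    # ... then anything in fingerprint_kwargs replaces them.
    for key, value in (fingerprint_kwargs or {}).items():
        params[key] = str(value)
    return name, params


name, params = merge_type_parameters(
    "RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096})
```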

chemfp.get_fingerprint_type_from_text_settings(type, settings=None)

Get the fingerprint type based on its type string and optional settings arguments

Given a fingerprint type string like OpenBabel-FP2, or RDKit-Fingerprint/1 fpSize=1024, return the corresponding chemfp.types.FingerprintType.

The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the settings dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the type string and the settings dictionary then the settings take precedence.

For example:

>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3",
...                                                         {"fpSize": "4096"})
>>> fptype.get_type()
'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

This function is for string settings from a configuration file or command-line. Use get_fingerprint_type() if your fingerprint parameters are Python values.

Parameters:
  • type (string) – a fingerprint type string
  • settings (a dictionary of string names and string-encoded values) – fingerprint type parameters
Returns:

a chemfp.types.FingerprintType

chemfp.simsearch(*, targets, query=None, query_fp=None, query_id=None, queries=None, NxN=None, query_format=None, target_format=None, type=None, k=None, threshold=None, alpha=None, beta=None, include_lower_triangle=True, ordering=None, progress=True)

High-level API for similarity searches in targets.

Several different search types are supported:

  • If query_fp is a byte string then use it as the query fingerprint to search targets and create a SearchResult.
  • If query_id is not None then get the corresponding fingerprint in targets (or raise a KeyError) and use it to search targets and create a SearchResult.
  • If query is not None then parse it as a molecule record in query_format format (default: ‘smi’) and create a SearchResult.
  • If queries is not None, use it as queries for an NxM search of targets and create a SearchResults.
  • If NxN is true then do an NxN search of the targets and create a SearchResults.

The function returns a SimsearchInfo instance with information about what happened. Its result attribute stores the SearchResult or SearchResults.

If queries or targets is not a fingerprint arena then use load_fingerprints() to load the arena. Use query_format or target_format to specify the format type.

If k is not None then do a k-nearest search, otherwise do a threshold search. If threshold is None then the threshold is 0.0. If both are None then the defaults are k=3, threshold=0.0.

If alpha and beta are both None or both 1.0 then use a Tanimoto search, otherwise do a Tversky search with the given values of alpha and beta. If beta is None then beta is set to alpha.

For NxN threshold search, if include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When False, only compute the upper triangle.

If ordering is not None then the hits will be reordered as specified. The available orderings are:

  • increasing-score - sort by increasing score
  • decreasing-score - sort by decreasing score
  • increasing-score-plus - sort by increasing score, break ties by increasing index
  • decreasing-score-plus - sort by decreasing score, break ties by increasing index
  • increasing-index - sort by increasing target index
  • decreasing-index - sort by decreasing target index
  • move-closest-first - move the hit with the highest score to the first position
  • reverse - reverse the current ordering
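
The sort-based orderings can be sketched with Python sort keys. This is an illustration only: hits are modeled as hypothetical (target_index, score) tuples, the names SORT_KEYS and reorder_hits are assumptions, and move-closest-first is left out because its exact effect on the remaining hits is not described here.

```python
# Sort keys for hits modeled as (target_index, score) tuples.
SORT_KEYS = {
    "increasing-score": lambda hit: hit[1],
    "decreasing-score": lambda hit: -hit[1],
    "increasing-score-plus": lambda hit: (hit[1], hit[0]),
    "decreasing-score-plus": lambda hit: (-hit[1], hit[0]),
    "increasing-index": lambda hit: hit[0],
    "decreasing-index": lambda hit: -hit[0],
}


def reorder_hits(hits, ordering):
    """Return a new list of hits in the requested order."""
    if ordering == "reverse":
        return list(reversed(hits))
    return sorted(hits, key=SORT_KEYS[ordering])


hits = [(7, 0.8), (2, 0.9), (5, 0.95), (1, 0.9)]

by_score_plus = reorder_hits(hits, "decreasing-score-plus")
by_index = reorder_hits(hits, "increasing-index")
reversed_hits = reorder_hits(hits, "reverse")
```

The "-plus" variants differ from the plain score orderings only in how ties are resolved: equal scores are placed in increasing index order. Python's sort is stable, so the plain orderings in this sketch keep the input order for ties, which may not match what chemfp does.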

If progress is True then use a progress bar to show FPS load progress, and NxN and NxM search progress. If False then no progress bar is used. It may also be a callable used to create the progress bar.

chemfp.convert2fps(source, destination, *, type, input_format=None, output_format=None, reader_args=None, id_tag=None, errors='ignore', fingerprint_kwargs=None, id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True)

Convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in toolkit- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This can be a chemfp fingerprint type string or fingerprint type object. If it is a string then it is combined with fingerprint_kwargs to get the fingerprint type object.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

By default, use progress bars while processing each file. Use progress=False to disable them.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

chemfp.rdkit2fps(source, destination, *, type='RDKit-Morgan', input_format=None, output_format=None, reader_args=None, id_tag=None, errors='ignore', id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True, bitFlags=None, branchedPaths=None, fpSize=None, fromAtoms=None, includeChirality=None, includeRedundantEnvironments=None, isQuery=None, isomeric=None, kekulize=None, maxLength=None, maxPath=None, minLength=None, minPath=None, min_radius=None, nBitsPerEntry=None, nBitsPerHash=None, radius=None, rings=None, targetSize=None, use2D=None, useBondOrder=None, useBondTypes=None, useChirality=None, useFeatures=None, useHs=None)

Use RDKit to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in RDKit- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “morgan” or a chemfp type name. Additional fingerprint-specific values may be passed as function call arguments.

Most short-hand names are available as attributes of the rdkit2fps function, eg, rdkit2fps.morgan or rdkit2fps.maccs.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.
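As a rough illustration of how these arguments fit together, the following sketch collects the keyword arguments for a Morgan conversion and forwards them to chemfp.rdkit2fps(). The helper names and the file names in the commented call are hypothetical; only the parameter names come from the signature above. The chemfp import is done lazily so the option builder can be used without a toolkit installed.

```python
def morgan_conversion_options(radius=2, fp_size=2048):
    """Build keyword arguments for chemfp.rdkit2fps() (hypothetical helper)."""
    return {
        "type": "morgan",      # short-hand fingerprint type name
        "radius": radius,      # fingerprint-specific argument
        "fpSize": fp_size,     # fingerprint-specific argument
        "errors": "report",    # one of "ignore", "report", or "strict"
        "overwrite": False,    # do nothing if the destination file exists
        "progress": False,     # no progress bar
    }


def convert_structures(source, destination, **options):
    """Forward to chemfp.rdkit2fps(); needs chemfp and RDKit installed."""
    import chemfp  # imported lazily; may raise ImportError
    return chemfp.rdkit2fps(source, destination, **options)


# Hypothetical file names, shown as a comment because they do not exist here:
# info = convert_structures("compounds.smi.gz", "compounds.fps",
#                           **morgan_conversion_options(radius=3))
# print(info)  # the returned ConversionInfo instance
```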

chemfp.oe2fps(source, destination, *, type='OpenEye-Path', input_format=None, output_format=None, reader_args=None, id_tag=None, errors='ignore', id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True, atype=None, btype=None, maxbonds=None, maxradius=None, minbonds=None, minradius=None, numbits=None)

Use OEChem and OEGraphSim to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a file-like object (if the toolkit supports it), a filename, or a list of filenames. If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in OEChem- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “circular” or a chemfp type name. Additional fingerprint-specific values may be passed as function call arguments.

Most short-hand names are available as attributes of the oe2fps function, eg, oe2fps.circular or oe2fps.maccs.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

chemfp.ob2fps(source, destination, *, type='OpenBabel-FP2', input_format=None, output_format=None, reader_args=None, id_tag=None, errors='ignore', id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True, nBits=None)

Use Open Babel to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to Open Babel). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in Open Babel- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “FP2” or a chemfp type name. Additional fingerprint-specific values may be passed as function call arguments.

Most short-hand names are available as attributes of the ob2fps function, eg, ob2fps.fp2 or ob2fps.maccs.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

chemfp.cdk2fps(source, destination, *, type='CDK-Daylight', input_format=None, output_format=None, reader_args=None, id_tag=None, errors='ignore', id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True, hashPseudoAtoms=None, pathLimit=None, perceiveStereochemistry=None, searchDepth=None, size=None, implementation=None)

Use the CDK to convert a structure file or files to a fingerprint file

Use source to specify the input, which may be None for stdin, a filename, or a list of filenames. (Chemfp does not support passing Python file-like objects to the CDK). If input_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed SMILES. Use reader_args to pass in CDK- and format-specific configuration.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

Use type to specify the fingerprint type. This may be a short-hand name like “daylight” or a chemfp type name. Additional fingerprint-specific values may be passed as function call arguments.

Most short-hand names are available as attributes of the cdk2fps function, eg, cdk2fps.daylight or cdk2fps.ecfp2.

If the input is an SD file then id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier.

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the input processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

chemfp.sdf2fps(source, destination, *, id_tag=None, fp_tag=None, input_format=None, output_format=None, metadata=None, pubchem=False, decoder=None, errors='report', id_prefix=None, id_template=None, id_cleanup=True, overwrite=True, reorder=True, tmpdir=None, max_spool_size=None, progress=True)

Extract and save fingerprints from tag data in an SD file

Use source to specify the input, which may be None for stdin, a file-like object, a filename, or a list of filenames. If input_format is not specified then the filename extension (if available) is used to determine the compression type, defaulting to uncompressed. Possible values for input_format include “sdf”, “sdf.gz”, and “sdf.zst”.

Use destination to specify the output, which may be None for stdout, a file-like object, or a filename. If output_format is not specified then the format type is based on the filename extension(s), including compression. The default format is uncompressed FPS.

The id_tag specifies the tag containing the identifier. If None, use the record’s title as the identifier. The fp_tag specifies the tag containing the encoded fingerprint. The decoder describes how to decode the fingerprints. It may be one of “binary”, “binary-msb”, “hex”, “hex-lsb”, “hex-msb”, “base64”, “cactvs”, or “daylight”, or a callable object which takes the fingerprint string and returns the (number of bits, fingerprint byte string), or raises a ValueError on failures.
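A custom decoder only has to follow the contract described above. The sketch below is a minimal example of such a callable for plain hex-encoded fingerprints; the function name is made up for illustration, and the file names and the "FP" tag in the commented call are placeholders, not values from chemfp.

```python
def decode_hex_fingerprint(text):
    """Decode a hex string into (number of bits, fingerprint byte string).

    Raises ValueError on failure, as the decoder contract requires.
    """
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("empty fingerprint field")
    try:
        fp = bytes.fromhex(cleaned)
    except ValueError as err:
        raise ValueError(f"not a hex-encoded fingerprint: {err}") from None
    return len(fp) * 8, fp


# Hypothetical use with sdf2fps (placeholder file and tag names):
# import chemfp
# chemfp.sdf2fps("input.sdf", "output.fps", fp_tag="FP",
#                decoder=decode_hex_fingerprint)
```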

Handle structure processing errors based on the value of errors, which may be “ignore”, “report”, or “strict”.

If metadata is not None then it is used to generate the metadata output in the output file.

If pubchem is true and metadata is None, then a new Metadata will be used, with software as “CACTVS/unknown”, type as “CACTVS-E_SCREEN/1.0 extended=2”, num_bits as 881, and sources containing any source terms which are filenames.

The pubchem option also sets fp_tag to “PUBCHEM_CACTVS_SUBSKEYS” and decoder to “cactvs”, but only if those values aren’t otherwise specified.

If destination is a string and overwrite is false then do not generate fingerprints if the file destination exists.

If progress is True then use a progress bar to show the SDF processing progress, based on the number of sources and the file size (if available). If False then no progress bar is used. It may also be a callable used to create the progress bar.

The values of reorder, tmpdir, max_spool_size are passed to open_fingerprint_writer().

This function returns a ConversionInfo instance with information about the conversion.

chemfp.maxmin(candidates, *, references=None, initial_pick=None, candidates_format=None, references_format=None, num_picks=1000, threshold=1.0, all_equal=False, randomize=True, seed=-1, include_scores=True, progress=True)

Use the MaxMin algorithm to pick diverse fingerprints from candidates

The MaxMin algorithm iteratively picks fingerprints from a set of candidates such that the newly picked fingerprint has the smallest Tanimoto similarity compared to any previously picked fingerprint, and optionally also the smallest Tanimoto similarity to the reference fingerprints.

This process is repeated until num_picks fingerprints have been picked, or until every remaining candidate is more than threshold similar to the picked fingerprints, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates, from most dissimilar to least. For example, to select all fingerprints with a maximum Tanimoto score of 0.2, use num_picks = None and threshold = 0.2.
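The picking loop can be illustrated with a small pure-Python sketch. This is not chemfp's implementation: fingerprints are plain Python integers used as bit sets, the first pick is simply passed in by index rather than found with heapsweep, and treating a score equal to threshold as still pickable is an assumption.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def maxmin_sketch(candidates, num_picks=None, threshold=1.0, initial_pick=0):
    """Return [(index, score)], where score is the highest similarity
    between the pick and any earlier pick (None for the first pick)."""
    if not candidates:
        return []
    if num_picks is None:
        num_picks = len(candidates)  # None means "pick everything"
    picks = [(initial_pick, None)]
    # nearest[i] = highest similarity between candidate i and any pick so far
    nearest = {
        i: tanimoto(fp, candidates[initial_pick])
        for i, fp in enumerate(candidates)
        if i != initial_pick
    }
    while nearest and len(picks) < num_picks:
        best = min(nearest, key=nearest.get)  # most dissimilar candidate
        score = nearest.pop(best)
        if score > threshold:
            break  # everything left is too similar to the picks
        picks.append((best, score))
        for i in nearest:
            nearest[i] = max(nearest[i], tanimoto(candidates[i], candidates[best]))
    return picks
```

Each pick costs one pass over the remaining candidates, so the sketch is quadratic when picking everything; it is meant to show the selection rule, not to be fast.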

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If initial_pick and references are not specified then the initial pick is selected using the heapsweep algorithm, which finds a fingerprint with the smallest maximum Tanimoto to any other fingerprint. Use initial_pick to specify the initial pick, either as a string (which is treated as a candidate id) or as an integer (which is treated as a fingerprint index).

If references is not None then any picked candidate fingerprint must also be dissimilar from all of the fingerprints in the reference fingerprints. The model behind the terms is that you want to pick diverse fingerprints from a vendor catalog which are also diverse from your in-house reference compounds. If references is not a FingerprintArena then it is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.

If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tie-breaking.

The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

The function returns a MaxMinInfo object with information about what happened. Its picker attribute contains the MaxMinPicker used. If include_scores is true then its result attribute is a PicksAndScores instance, otherwise it is picker.picks.

If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. It may also be a callable used to create the progress bar.

chemfp.heapsweep(candidates, *, candidates_format=None, num_picks=1, threshold=1.0, all_equal=False, randomize=True, seed=-1, include_scores=True, progress=True)

Use the heapsweep algorithm to pick diverse fingerprints from candidates

The heapsweep algorithm picks fingerprints ordered by their respective maximum Tanimoto score to the rest of the arena, from smallest to largest. It uses a heap to keep track of the current score for each fingerprint (a lower bound to the global maximum score), and a flag specifying if the score is also the upper bound.

For each sweep, if the smallest heap entry is an upper bound, then pick it. Otherwise, find the similarity between the corresponding fingerprint and all other fingerprints in the arena. This sets the global maximum score for the heap entry, and may update the minimum score for the rest of the fingerprints. Update the heap and try again.

This process is repeated until num_picks fingerprints have been picked, or until the maximum score for the remaining candidates is greater than threshold, or until no candidates are left. A num_picks value of None is an alias for len(candidates) and will select all candidates.

If all_equal is True then additional fingerprints will be picked if they have the same score as pick number num_picks.
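The lazy evaluation described above can be sketched in pure Python with the standard heapq module. This is an illustration under simplifying assumptions, not chemfp's code: fingerprints are Python integers, there is no shuffling or all_equal handling, and outdated heap entries are skipped when popped instead of being updated in place.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def heapsweep_sketch(fps, num_picks=1, threshold=1.0):
    """Return [(index, max score)] ordered from smallest maximum score."""
    n = len(fps)
    if num_picks is None:
        num_picks = n
    lower = [0.0] * n    # best known lower bound on each maximum score
    exact = [False] * n  # True once lower[i] is the true maximum
    heap = [(0.0, i) for i in range(n)]  # a sorted list is a valid heap
    picks = []
    while heap and len(picks) < num_picks:
        score, i = heapq.heappop(heap)
        if score < lower[i]:
            continue  # outdated entry; a newer one is still in the heap
        if exact[i]:
            if score > threshold:
                break
            picks.append((i, score))
            continue
        # Sweep: compare fingerprint i against every other fingerprint.
        for j in range(n):
            if j == i:
                continue
            s = tanimoto(fps[i], fps[j])
            if s > lower[i]:
                lower[i] = s
            if s > lower[j]:
                lower[j] = s  # tighter lower bound for j
                heapq.heappush(heap, (s, j))
        exact[i] = True
        heapq.heappush(heap, (lower[i], i))
    return picks
```

A fingerprint is only swept when its lower bound reaches the top of the heap, so fingerprints with many close neighbors are often never swept at all.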

The default num_picks = 1 and all_equal = False selects a fingerprint with the smallest maximum similarity. This is used as the initial pick for MaxMinPicker.from_candidates(). Use num_picks = 1 and all_equal = True to select all fingerprints with the smallest maximum similarity.

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If randomize is True (the default), the candidates are shuffled before the heapsweep algorithm starts. Shuffling should only affect the ordering of fingerprints with identical diversity scores. It is True by default so the first picked fingerprint is the same as MaxMinPicker.from_candidates(). Setting it to False should generally be slightly faster.

The shuffle and heapsweep methods depend on a (shared) RNG, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

The function returns a HeapSweepInfo object with information about what happened. Its picker attribute contains the HeapSweepPicker used. If include_scores is true then its result attribute is a PicksAndScores() instance, otherwise it is picker.picks.

If progress is True then a progress bar will be used to show any FPS file load progress and show the number of current picks, relative to num_picks. If False then no progress bar is used. It may also be a callable used to create the progress bar.

chemfp.spherex(candidates, *, references=None, initial_picks=None, candidates_format=None, references_format=None, num_picks=1000, threshold=0.4, ranks=None, dise=False, dise_type=None, dise_references=None, dise_references_format=None, randomize=None, seed=-1, num_threads=-1, include_counts=False, include_neighbors=False, progress=True)

Use sphere picking to select diverse fingerprints from candidates

Sphere picking iteratively picks a fingerprint from a set of candidates such that its similarity to every previously picked fingerprint is below threshold. The process is repeated until num_picks fingerprints are selected or no pickable fingerprints are available.

Several variations of “picks a fingerprint” are supported. If directed sphere exclusion is NOT used, then:

1) The default (randomize = None), or if randomize = True, select the next available candidate at random.

2) If randomize = False, select the next candidate which has the smallest index in the arena. This biases the picks towards fingerprints with fewer bits set, which are likely fingerprints with lower complexity. It doesn’t appear to be that useful.

Directed sphere exclusion (see the DISE paper by Gobbi and Lee) requires a rank for each fingerprint. The next pick is chosen from one of the fingerprints with the smallest rank. There are three ways to specify the ranks:
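The two undirected pick methods can be sketched in pure Python as follows. This is only an illustration, not chemfp's implementation: fingerprints are Python integers, the function and parameter names are chosen for the example, and the random source is Python's random.Random rather than chemfp's RNG.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def sphere_exclusion_sketch(fps, threshold=0.4, num_picks=None,
                            randomize=True, seed=None):
    """Return the picked indices, in pick order."""
    rng = random.Random(seed)
    available = list(range(len(fps)))
    picks = []
    while available and (num_picks is None or len(picks) < num_picks):
        # Method 1 picks at random; method 2 takes the smallest index.
        pos = rng.randrange(len(available)) if randomize else 0
        center = available.pop(pos)
        picks.append(center)
        # Exclude everything inside the sphere around the new pick.
        available = [i for i in available
                     if tanimoto(fps[center], fps[i]) < threshold]
    return picks
```

Whichever method is used, no two picks end up at least threshold similar to each other, because every pick removes its whole neighborhood from the pool.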

A) They can be passed in directly as the ranks array, which must be a list of integers between 0 and 2**64-1.

B) If dise is True then the structures from the DISE paper are used. This requires a chemistry toolkit to generate the reference fingerprints. Use dise_type to specify the fingerprint type to use instead of the one from the candidates.

C) The reference fingerprints for the DISE algorithm may be passed as dise_references. This may be an arena or a fingerprint filename. Use dise_references_format to specify the file format instead of using the extension.

If initial ranks are specified, then there are two additional ways to pick a fingerprint:

3) The default (randomize = None), or if randomize = False, selects the candidate with the smallest rank, breaking ties by selecting the candidate with the smallest index in the arena.

4) If randomize = True, select randomly from all of the candidates with the smallest rank. NOTE: this method uses a linear search, which may cause quadratic behavior if many fingerprints have the same rank.
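Method 3 can be sketched in a few lines of pure Python. The sketch assumes the ranks are already available as a list of integers (option A) and uses Python integers as fingerprints; it is an illustration of the selection order, not chemfp's implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def directed_sphere_exclusion_sketch(fps, ranks, threshold=0.4):
    """Pick by smallest rank, breaking ties by smallest index."""
    order = sorted(range(len(fps)), key=lambda i: (ranks[i], i))
    picks = []
    for i in order:
        # A candidate is still pickable only if it lies outside the
        # sphere of every fingerprint picked so far.
        if all(tanimoto(fps[i], fps[p]) < threshold for p in picks):
            picks.append(i)
    return picks
```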

The fingerprints are selected from candidates. If it is not a FingerprintArena then the value is passed to load_fingerprints(), along with values of candidates_format and progress to load the arena.

If references is not None then any candidate fingerprints which are at least threshold similar to the reference fingerprints are removed before picking starts. If references is not a FingerprintArena then the value is passed to load_fingerprints(), along with the values of references_format and progress to load the arena.

If references is not specified then optionally use initial_picks to specify the initial picks. This may be a candidate id string or integer index into the candidate array, or a list of id strings or integer indices. The list may be in any order and may contain duplicates. (The neighbor sphere will be empty for any duplicates.)

Initial picks are not necessary. If initial_picks is None then the specified picking method is used.

Some of the pick methods use a random number generator, which requires an initial seed. If seed is -1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.

Sphere picking in the candidates may be multi-threaded. The default num_threads of -1 uses chemfp.get_num_threads() threads, which depends on the number of CPU cores in your system and is likely too small. My tests suggest 30 threads or higher is more effective. The values of 0 and 1 both mean single-threaded.

The function returns a SpherexInfo object with information about what happened. The picker attribute is the SphereExclusionPicker used. By default the result element is a Picks() instance. If include_counts is true then it is the PicksAndCounts() returned from calling the picker’s pick_n_with_counts(). If include_neighbors is True then the result is the PicksAndNeighbors() returned from calling pick_n_with_neighbors(). include_counts and include_neighbors cannot both be true.

If progress is True then a progress bar will be used to show any FPS file load progress. If False then no progress bar is used. It may also be a callable used to create the progress bar. The sphere picker search does not currently support progress bars.

chemfp.butina(fingerprints=None, *, fingerprints_format=None, matrix=None, matrix_format=None, NxN_threshold=0.7, butina_threshold=0.0, seed=-1, tiebreaker='randomize', false_singletons='follow-neighbor', num_clusters=None, rescore=True, progress=True, debug=0)

Use the Butina algorithm[1] to cluster fingerprints and/or a similarity matrix.

At least one of fingerprints or matrix must be specified.

fingerprints may be an arena or filename (use fingerprints_format if the format cannot be inferred by the filename extension). matrix may be the results of a chemfp NxN symmetric search or an npz filename containing a saved NxN search (the only supported matrix_format is “npz”).

If matrix is None then butina will compute the NxN similarity matrix of the fingerprints with threshold NxN_threshold. Otherwise it will use the pre-computed matrix in matrix.

The butina_threshold specifies the threshold for the Butina algorithm. It is 0.0 by default, which makes clustering depend on the NxN_threshold. This is useful when testing different Butina threshold values because the NxN matrix can be computed once, at the lowest reasonable value, and then reused with different, higher butina_threshold values.

If tiebreaker is “randomize” (the default) then the next picked center will be chosen at random from the available picks. (These are ranked by the total number of neighbors.) If “first” or “last” then the first or last neighbor, in arena index order, is picked.

Use seed to initialize the random number generator. If -1 (the default), butina will use Python’s RNG to get the initial seed. Otherwise this must be an integer between 0 and 2**64-1.

A “false singleton” is a fingerprint with neighbors within butina_threshold similarity but where all of its neighbors were assigned to another centroid. There are three options for how to handle false_singletons. The default, “follow-neighbor”, assigns the false singleton to the same centroid as its first nearest neighbor. (If there are ties, the first neighbor in the chemfp search is used. A future version of butina may switch to a randomly selected neighbor.) Use “keep” to keep the false singleton as its own centroid. If fingerprints are available then use “nearest-center” to assign false singletons to the nearest cluster centroid. [2]
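The interaction between the thresholds and the false singleton handling can be illustrated with a small pure-Python sketch. It is not chemfp's implementation. It assumes a symmetric neighbor table, here a dict mapping each index to a list of (neighbor index, score) pairs from a search at the NxN threshold, uses the “first” tiebreaker, treats a score equal to butina_threshold as a neighbor, and supports only the “follow-neighbor” and “keep” options.

```python
def butina_sketch(neighbors, butina_threshold=0.0,
                  false_singletons="follow-neighbor"):
    """Return a dict mapping each fingerprint index to its cluster center."""
    # Keep only neighbors at or above the Butina threshold, nearest first.
    hits = {
        i: sorted(((s, j) for j, s in nbrs if s >= butina_threshold),
                  key=lambda pair: (-pair[0], pair[1]))
        for i, nbrs in neighbors.items()
    }
    # Rank by total number of neighbors; ties go to the smallest index.
    order = sorted(hits, key=lambda i: (-len(hits[i]), i))
    center_of = {}
    pending = []  # false singletons waiting to follow a neighbor
    for i in order:
        if i in center_of:
            continue  # already a member of an earlier cluster
        free = [j for _, j in hits[i] if j not in center_of]
        if hits[i] and not free and false_singletons == "follow-neighbor":
            pending.append(i)  # has neighbors, but all were taken
            continue
        center_of[i] = i  # new center (possibly a singleton)
        for j in free:
            center_of[j] = i
    for i in pending:
        nearest = hits[i][0][1]  # first nearest neighbor
        center_of[i] = center_of[nearest]
    return center_of
```

Because the filtering happens inside the function, one neighbor table computed at a low NxN threshold can be clustered repeatedly with different, higher butina_threshold values.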

Use num_clusters to reduce the number of clusters to the specified number. The method takes the smallest cluster and assigns all of its members, one-by-one, to one of the remaining clusters. The fingerprint is assigned to the same cluster as one of its nearest neighbors, so long as that fingerprint isn’t part of the smallest cluster. The process iterates until enough clusters are pruned. This option requires fingerprints.

By default if a fingerprint is reassigned to a new cluster then its similarity score is re-computed relative to the new cluster center. If rescore is False then the original score will be preserved.

Use progress to enable progress bars. By default it is True.

The debug option writes debug information to stderr. The two settings are 1 and 2. This will likely be removed after the Butina implementation is better validated.

[1] Butina, JCICS 39.4, pp 747–750 (1999) doi:10.1021/ci9803381 (While Taylor, JCICS 35.1, pp 59–67 (1995) doi:10.1021/ci00023a009 describes a similar algorithm, it is not applied to clustering.)

[2] Blomberg, Cosgrove, and Kenny, JCAMD 23, pp 513–525 (2009) doi:10.1007/s10822-009-9264-5 though chemfp’s implementation does not yet support a minimum required center threshold.