chemfp.arena module

Algorithms and data structures for working with a FingerprintArena.

This is an internal chemfp module. It should not be imported by programs which use the public API. (Let me know if anything else should be part of the public API.)

This module contains class definitions for objects which are returned as part of the public API. A FingerprintArena stores fingerprints in a contiguous block of memory, along with their associated ids. A FingerprintList provides a list-like view of the fingerprints.

class chemfp.arena.FingerprintArena(metadata, alignment, start_padding, end_padding, storage_size, arena, popcount_indices, arena_ids, start=0, end=None, id_lookup=None, num_bits=None, num_bytes=None, license_key=b'')

Bases: chemfp.FingerprintReader

Store fingerprints in a contiguous block of memory for fast searches

A FingerprintArena implements the chemfp.FingerprintReader API.

The fingerprints are stored in a contiguous block of memory, so the per-molecule overhead is very low. The block is named arena. The first fingerprint starts at the offset start_padding and each fingerprint takes storage_size bytes, so fingerprint i is located at:

self.arena[self.start_padding +   i   * self.storage_size:
           self.start_padding + (i+1) * self.storage_size ]
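As a rough illustration of that layout, here is a pure-Python sketch which builds a padded block by hand and slices out fingerprint i. It does not use chemfp. The helper names make_block and fingerprint_at, and the padding values, are made up for this example.

```python
def make_block(fingerprints, start_padding, storage_size, end_padding):
    """Pack equal-length fingerprints into one padded byte block (illustrative only)."""
    block = bytearray(b"\x00" * start_padding)
    for fp in fingerprints:
        # Zero-pad each fingerprint up to storage_size bytes.
        block += fp.ljust(storage_size, b"\x00")
    block += b"\x00" * end_padding
    return bytes(block)


def fingerprint_at(block, i, start_padding, storage_size, num_bytes):
    """Return the num_bytes of fingerprint i, without the storage padding."""
    offset = start_padding + i * storage_size
    return block[offset:offset + storage_size][:num_bytes]


fps = [b"\x01\x02\x03", b"\xff\x00\x10", b"\x00\x00\x00"]
block = make_block(fps, start_padding=4, storage_size=8, end_padding=5)
second = fingerprint_at(block, 1, start_padding=4, storage_size=8, num_bytes=3)
```

Because every record has the same storage_size, finding fingerprint i is a single multiplication and slice, with no per-record bookkeeping.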

The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If self.popcount_indices is a non-empty string then the string contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the BitBound search algorithm.
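The popcount ordering helps because the Tanimoto score between a query with q bits set and a target with p bits set can be no larger than min(p, q) / max(p, q). For a threshold t, only targets with ceil(q*t) <= p <= floor(q/t) can match. The sketch below computes that range. The function names are hypothetical, and it says nothing about how chemfp encodes popcount_indices.

```python
import math


def bitbound_popcount_range(query_popcount, threshold, num_bits):
    """Inclusive (lo, hi) range of target popcounts which could reach the threshold.

    Assumes query_popcount > 0 and 0.0 < threshold <= 1.0.
    """
    lo = math.ceil(query_popcount * threshold)
    hi = min(num_bits, math.floor(query_popcount / threshold))
    return lo, hi


def best_possible_tanimoto(p, q):
    """Upper bound on the Tanimoto score for fingerprints with popcounts p and q."""
    return min(p, q) / max(p, q)


lo, hi = bitbound_popcount_range(query_popcount=10, threshold=0.7, num_bits=166)
```

With the fingerprints sorted by popcount, a search can skip every target outside the (lo, hi) range without computing a single similarity score.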

The public attributes are:

  • metadata - a chemfp.Metadata with information about the fingerprints.
  • ids - list of identifiers, in index order
  • fingerprints - a FingerprintList list-like view of the fingerprints,
    in index order

Other attributes, which might be subject to change, and which I won’t fully explain, are:

  • arena - a contiguous block of memory, which contains the fingerprints
  • start_padding - number of bytes to the first fingerprint in the block
  • end_padding - number of bytes after the last fingerprint in the block
  • storage_size - number of bytes used to store a fingerprint
  • num_bytes - number of bytes in each fingerprint (must be <= storage_size)
  • num_bits - number of bits in each fingerprint
  • alignment - the fingerprint alignment
  • start - the index for the first fingerprint in the arena/subarena
  • end - one more than the index of the last fingerprint in the arena/subarena
  • arena_ids - all of the identifiers for the parent arena

The FingerprintArena is its own context manager, but it does nothing on context exit. The derived FPBFingerprintArena may use a memory-mapped FPB file, which will be closed by the context manager or by an explicit call to close().

alignment = None

the fingerprint alignment

arena = None

a contiguous block of memory, which contains the fingerprints.

arena_ids = None

list of identifiers for the parent arena. You likely want to use ids, which contains the ids for this arena.

close()

Close any resources associated with this arena

If the arena uses a memory-mapped file (e.g., an FPB file) then this will close the file.

copy(indices=None, reorder=None, metadata=None, ids=None)

Create a new arena using either all or some of the fingerprints in this arena

By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or “sub-arena”, of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids.

The indices parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged.

If indices are specified then the default reorder value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If reorder is False then the new arena will preserve the order given by the indices.

If indices are not specified, then the default is to preserve the order type of the original arena. Use reorder=True to always reorder the fingerprints in the new arena by popcount, and reorder=False to always leave them in the current ordering.

>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_queries.fps")
>>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18]
(b'9425031', b'9425015', b'9425040', b'9425033')
>>> len(arena)
19
>>> new_arena = arena.copy(indices=[1, 5, 10, 18])
>>> len(new_arena)
4
>>> new_arena.ids
[b'9425031', b'9425015', b'9425040', b'9425033']
>>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False)
>>> new_arena.ids
[b'9425033', b'9425040', b'9425015', b'9425031']

If metadata is not None then it will be the metadata of the new copy.

Use ids to specify the identifiers for the new copy. This is especially useful as a way to preserve the initial fingerprint index in the original arena.
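To make the interaction between indices, reorder, and ids more concrete, here is a toy model on plain Python lists. It is not chemfp's implementation. The function copy_subset is hypothetical, and it assumes that the replacement ids line up with indices before any popcount reordering takes place.

```python
def popcount(fp):
    """Number of on-bits in a byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def copy_subset(fps, ids, indices, reorder=None, new_ids=None):
    """Toy model of copying selected records, optionally sorted by popcount."""
    if new_ids is None:
        selected_ids = [ids[i] for i in indices]
    else:
        # Assumption: one replacement id per entry in indices, in the same order.
        selected_ids = list(new_ids)
    records = [(fps[i], id_) for i, id_ in zip(indices, selected_ids)]
    if reorder is None or reorder:
        # With indices given, the default (None) reorders by popcount.
        records.sort(key=lambda record: popcount(record[0]))
    return [fp for fp, _ in records], [id_ for _, id_ in records]


fps = [b"\x07", b"\x01", b"\x0f", b"\x03"]  # popcounts 3, 1, 4, 2
ids = ["a", "b", "c", "d"]
indices = [0, 2, 3]

sorted_fps, sorted_ids = copy_subset(fps, ids, indices)
_, input_order_ids = copy_subset(fps, ids, indices, reorder=False)
_, original_positions = copy_subset(fps, ids, indices, new_ids=indices)
```

Passing the indices themselves as the new ids means each record in the reordered copy still records where it came from in the original.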

Parameters:
  • indices (iterable containing integers, or None) – indices of the records to copy into the new arena
  • reorder (True to reorder, False to leave in input order, None for default action) – describes how to order the fingerprints
  • metadata (a chemfp.Metadata or None) – the metadata to use in the new copy
  • ids (a list of values, or None to keep the original identifiers) – replacement identifiers to use in the copy
Returns:

a new FingerprintArena

count_tanimoto_hits_arena(queries, threshold=0.7)

Count the fingerprints which are sufficiently similar to each query fingerprint

Returns a list containing a count for each query fingerprint in the queries arena. The count is the number of fingerprints in the arena which are at least threshold similar to the query fingerprint.

The order of results is the same as the order of the queries.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

list of integer counts, one for each query
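The following pure-Python sketch shows what the counts mean, using a brute-force scan over byte strings. It does not use chemfp and makes no attempt to be fast. The names tanimoto and count_hits are hypothetical, and scoring two all-zero fingerprints as 0.0 is an assumption of this sketch.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Assumption: two fingerprints with no bits set score 0.0.
    return both / either if either else 0.0


def count_hits(queries, targets, threshold=0.7):
    """One count per query, in query order."""
    return [
        sum(1 for target in targets if tanimoto(query, target) >= threshold)
        for query in queries
    ]


targets = [b"\x0f", b"\x07", b"\xf0", b"\x0e"]
queries = [b"\x0f", b"\xf0"]
counts = count_hits(queries, targets, threshold=0.7)
```

The first query matches three targets (scores 1.0, 0.75, and 0.75) and the second matches only itself.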

count_tanimoto_hits_fp(query_fp, threshold=0.7)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

integer count

count_tversky_hits_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0)

Count the fingerprints which are sufficiently similar to the query fingerprint

Return the number of fingerprints in the arena which are at least threshold similar to the query fingerprint query_fp.

Parameters:
  • query_fp (byte string) – query fingerprint
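  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
  • alpha (float) – the Tversky alpha parameter (default: 1.0)
  • beta (float) – the Tversky beta parameter (default: 1.0)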
Returns:

integer count
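For reference, the Tversky index of bit sets A and B is |A∩B| / (|A∩B| + alpha*|A−B| + beta*|B−A|). It reduces to Tanimoto when alpha = beta = 1.0 and to Dice when alpha = beta = 0.5. The sketch below is a brute-force illustration, not chemfp code. It assumes alpha weights the bits found only in the query and beta the bits found only in the target.

```python
def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte strings."""
    pairs = list(zip(query_fp, target_fp))
    both = sum(bin(q & t).count("1") for q, t in pairs)
    only_query = sum(bin(q & ~t).count("1") for q, t in pairs)
    only_target = sum(bin(t & ~q).count("1") for q, t in pairs)
    denominator = alpha * only_query + beta * only_target + both
    return both / denominator if denominator else 0.0


def count_tversky_hits(query_fp, targets, threshold=0.7, alpha=1.0, beta=1.0):
    """Number of targets scoring at least threshold against the query."""
    return sum(
        1 for target in targets
        if tversky(query_fp, target, alpha, beta) >= threshold
    )


targets = [b"\x0f", b"\x07", b"\xf0"]
symmetric = count_tversky_hits(b"\x0f", targets, threshold=0.8)
ignore_query_only = count_tversky_hits(b"\x0f", targets, threshold=0.8,
                                       alpha=0.0, beta=1.0)
```

Setting alpha to 0.0 ignores bits which are only in the query, so the target b"\x07", which is a subset of the query, becomes a perfect match.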

end = None

if a subarena, one more than the index of the last fingerprint relative to the start of the parent arena. Will be the number of total fingerprints if this is not a subarena.

end_padding = None

the number of bytes after the last fingerprint in the block

fingerprints = None

list-like view of the fingerprints

get_bit_counts()

Count the number of on bits for each position in the fingerprint

This function returns an array.array of length num_bits integers. Use get_bit_counts_as_numpy() to return a NumPy array.
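A brute-force version of the same calculation, on a plain list of byte strings, might look like the sketch below. It is only meant to show what the result contains. The function name bit_counts is made up, and the bit numbering (bit 0 is the least-significant bit of the first byte) is an assumption of this sketch.

```python
from array import array


def bit_counts(fps, num_bits):
    """Count how many fingerprints have each bit position set."""
    counts = array("i", [0] * num_bits)  # signed integers, like the documented result
    for fp in fps:
        for bit in range(num_bits):
            # Assumption: bit 0 is the least-significant bit of byte 0.
            if (fp[bit // 8] >> (bit % 8)) & 1:
                counts[bit] += 1
    return counts


fps = [b"\x01\x80", b"\x03\x00"]
counts = bit_counts(fps, num_bits=16)
```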

Returns:

an array.array of length num_bits with 4-byte signed integers

get_bit_counts_as_numpy()

Count the number of on bits for each position in the fingerprint

This function returns a NumPy array of length num_bits integers. Use get_bit_counts() to return an array.array.

Returns:

a NumPy array of length num_bits and type int32

get_by_id(id)

Given the record identifier, return the (id, fingerprint) pair.

If the id is not present then return None.

get_fingerprint(i)

Return the fingerprint at index i

Raises an IndexError if index i is out of range.

get_fingerprint_by_id(id)

Given the record identifier, return its fingerprint

If the id is not present then return None.

get_index_by_id(id)

Given the record identifier, return the record index.

If the id is not present then return None.

ids

Return the identifiers in this arena or subarena.

iter_arenas(arena_size=1000)

Base class for all chemfp objects holding fingerprint records

All FingerprintReader instances have a metadata attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record.

knearest_tanimoto_search_arena(queries, k=3, threshold=0.0)

Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints

For each fingerprint in the queries arena, find the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResults, where the hits in each SearchResult are sorted by similarity score.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • k (positive integer) – maximum number of neighbors to find
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
Returns:

a SearchResults

knearest_tanimoto_search_fp(query_fp, k=3, threshold=0.0)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • k (positive integer) – maximum number of neighbors to find
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
Returns:

a SearchResult
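The "threshold first, then top k" behaviour can be sketched in a few lines of pure Python. This is a brute-force illustration rather than chemfp's search code. The names tanimoto and knearest are hypothetical, and the hits are returned as plain (id, score) tuples instead of a SearchResult.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte strings."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def knearest(query_fp, targets, k=3, threshold=0.0):
    """targets is a list of (id, fp) pairs; return up to k (id, score) hits, best first."""
    scored = ((tanimoto(query_fp, fp), id_) for id_, fp in targets)
    hits = [(score, id_) for score, id_ in scored if score >= threshold]
    # nlargest returns its results sorted from highest score to lowest.
    best = heapq.nlargest(k, hits, key=lambda hit: hit[0])
    return [(id_, score) for score, id_ in best]


targets = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0"), ("d", b"\x1f")]
top2 = knearest(b"\x0f", targets, k=2)
top3 = knearest(b"\x0f", targets, k=3, threshold=0.5)
```

A higher threshold can leave fewer than k hits, for example knearest(b"\x0f", targets, k=3, threshold=0.9) returns only the exact match.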

knearest_tversky_search_fp(query_fp, k=3, threshold=0.0, alpha=1.0, beta=1.0)

Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint, and of those, select the top k hits. The hits are returned as a SearchResult, sorted from highest score to lowest.

Parameters:
  • query_fp (byte string) – query fingerprint
  • k (positive integer) – maximum number of neighbors to find
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.0)
  • alpha (float) – the Tversky alpha parameter (default: 1.0)
  • beta (float) – the Tversky beta parameter (default: 1.0)
Returns:

a SearchResult

metadata = None

a chemfp.Metadata with information about the fingerprints.

num_bits = None

the number of bits in each fingerprint

num_bytes = None

the number of bytes in each fingerprint (must be <= storage_size)

popcount_indices = None

encoded byte string containing the fingerprint index for the first fingerprint with a given popcount p.

random_choice(rng=None)

Return a randomly selected (id, fp) pair

If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().

Parameters:
  • rng (None, int, or a random.Random()) – method to use for random sampling
Returns:

a 2-element tuple of identifier string and fingerprint bytes

sample(num_samples, reorder=True, rng=None)

Return a new arena containing num_samples randomly selected fingerprints, without replacement

If num_samples is an integer then it must be between 0 and the size of the arena. If num_samples is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include.

By default the new arena is sorted by popcount. Set reorder to False to return the fingerprints in random order.

If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().

Parameters:
  • num_samples (int or float) – number of fingerprints to select
  • reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
  • rng (None, int, or a random.Random()) – method to use for random sampling
Returns:

a FingerprintArena
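The handling of num_samples and rng described above can be sketched with the standard library alone. The helper names resolve_rng and sample_indices are hypothetical. How a float proportion is rounded to a whole number of records is an assumption here (this sketch truncates).

```python
import random


def resolve_rng(rng=None):
    """Map the rng argument to an object which has a sample() method."""
    if rng is None:
        return random  # the module-level random.sample()
    if isinstance(rng, int):
        return random.Random(rng)  # an integer is used as the seed
    return rng  # anything else must supply its own sample()


def sample_indices(num_records, num_samples, rng=None):
    """Pick record indices at random, without replacement."""
    if isinstance(num_samples, float):
        if not 0.0 <= num_samples <= 1.0:
            raise ValueError("a float num_samples must be between 0.0 and 1.0")
        # Assumption: truncate the proportion to a whole number of records.
        num_samples = int(num_samples * num_records)
    elif not 0 <= num_samples <= num_records:
        raise ValueError("an integer num_samples must be between 0 and the arena size")
    return resolve_rng(rng).sample(range(num_records), num_samples)


picked = sample_indices(20, 0.25, rng=42)
picked_again = sample_indices(20, 0.25, rng=42)
three = sample_indices(20, 3, rng=random.Random(7))
```

Passing the same integer seed gives the same selection each time, which is useful for reproducible benchmarks.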

save(destination, format=None, level=None)

Save the fingerprints to a given destination and format

The output format is determined by the format parameter. If the format is None then the format depends on the destination file extension. If the extension isn’t recognized then the fingerprints will be saved in “fps” format.

If the output format is “fps”, “fps.gz”, or “fps.zst” then destination may be a filename, a file object, or None; None writes to stdout.

If the output format is “fpb” then destination must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.

Parameters:
  • destination (a filename, file object, or None) – the output destination
  • format (None, "fps", "fps.gz", "fps.zst", or "fpb") – the output format
  • level (an integer, or "min", "default", or "max" for compressor-specific values) – compression level when writing .gz or .zst files
Returns:

None
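The format selection rules can be summarized by a small sketch. The function guess_format is hypothetical and is not part of chemfp. It assumes that file objects and None fall back to "fps" and that the extension match is case-insensitive, neither of which is stated above.

```python
def guess_format(destination, format=None):
    """Choose an output format name following the rules described for save()."""
    if format is not None:
        return format  # an explicit format always wins
    if isinstance(destination, str):
        name = destination.lower()  # assumption: case-insensitive extensions
        for extension in ("fps.gz", "fps.zst", "fps", "fpb"):
            if name.endswith("." + extension):
                return extension
    # Unrecognized extension, file object, or None: fall back to "fps".
    return "fps"


formats = [guess_format(name) for name in
           ("out.fpb", "out.fps.gz", "out.fps.zst", "out.txt")]
```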

start_padding = None

the number of bytes before the first fingerprint in the block

storage_size = None

the number of bytes used to store a fingerprint

threshold_tanimoto_search_arena(queries, threshold=0.7)

Find the fingerprints which are sufficiently similar to each of the query fingerprints

For each fingerprint in the queries arena, find all of the fingerprints in this arena which are at least threshold similar. The hits are returned as a SearchResults, where the hits in each SearchResult are in arbitrary order.

Parameters:
  • queries (a FingerprintArena) – query fingerprints
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResults

threshold_tanimoto_search_fp(query_fp, threshold=0.7)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
Returns:

a SearchResult

threshold_tversky_search_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0)

Find the fingerprints which are sufficiently similar to the query fingerprint

Find all of the fingerprints in this arena which are at least threshold similar to the query fingerprint query_fp. The hits are returned as a SearchResult, in arbitrary order.

Parameters:
  • query_fp (byte string) – query fingerprint
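  • threshold (float between 0.0 and 1.0, inclusive) – minimum similarity threshold (default: 0.7)
  • alpha (float) – the Tversky alpha parameter (default: 1.0)
  • beta (float) – the Tversky beta parameter (default: 1.0)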
Returns:

a SearchResult

to_numpy_array()

Get the fingerprint bytes in a chemfp arena as a NumPy uint8 array.

A chemfp arena stores fingerprints in a contiguous byte string. This function returns a 2D NumPy array which is a view of that string. The array has len(arena) rows and arena.storage_size columns.

The storage size may be larger than the minimum number of bytes in the fingerprint because of zero padding used to improve performance. For example, the 166-bit MACCS keys use 24 bytes of storage when only 21 bytes are needed, because then chemfp can use the fast POPCNT instruction when computing the Tanimoto.

To remove extra padding bytes, use NumPy indexing to copy the fingerprint bytes to a new array:

arr[:,0:arena.num_bytes]

The last column of this new array may contain padding bits if the number of bits in a fingerprint is not a multiple of 8.
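The byte counts in the MACCS example follow from simple arithmetic, sketched below without NumPy. The helper names are made up, and the 8-byte alignment is an assumption chosen because it reproduces the 24-byte figure given above.

```python
def min_bytes(num_bits):
    """Smallest number of whole bytes which can hold num_bits bits."""
    return (num_bits + 7) // 8


def padded_size(num_bytes, alignment):
    """Round num_bytes up to the next multiple of alignment."""
    return -(-num_bytes // alignment) * alignment


num_bits = 166                                    # MACCS keys
num_bytes = min_bytes(num_bits)                   # bytes actually needed
storage_size = padded_size(num_bytes, 8)          # assumption: 8-byte alignment
unused_bits = num_bytes * 8 - num_bits            # padding bits in the last byte

# Dropping the storage padding from one row is a plain slice.
row = bytes(storage_size)
trimmed = row[:num_bytes]
```

Here num_bytes is 21, storage_size is 24, and the last of the 21 bytes carries 2 padding bits.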

Warning

Do not attempt to access the contents of a NumPy view of an FPBFingerprintArena (the arena from an FPB file) after the FPB file has been closed, as that will likely cause a segmentation fault or other severe failure.

Returns:

a NumPy array of type uint8

to_numpy_bitarray(bitlist=None)

Get the fingerprint bits in a chemfp arena as a NumPy uint8 array.

This function returns a 2D NumPy array with len(arena) rows and one column for each bit. The default returns arena.num_bits columns, where column 0 is the first bit, etc. Use bitlist to specify the indices of which columns to return. Negative indices are supported; -1 is the last bit, -2 is the second to last. Out of range indices raise an IndexError.

Parameters:
  • bitlist (iterable of integers) – bit column indices to use (default: all bits)
Returns:

a NumPy array of type uint8

train_test_split(train_size=None, test_size=None, reorder=True, rng=None)

Return arenas containing train_size and test_size randomly selected fingerprints, without replacement

If train_size is an integer then it must be between 0 and the size of the arena. If train_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If train_size is None then it is set to the complement of test_size. If both train_size and test_size are None then the default train_size is 0.75.

If test_size is an integer then it must be between 0 and the size of the arena. If test_size is a float then it must be between 0.0 and 1.0 and is interpreted as the proportion of the arena to include. If test_size is None then it is set to the complement of train_size. If both test_size and train_size are None then the default test_size is 0.25.

By default the new arenas are sorted by popcount. Set reorder to False to return the fingerprints in random order.

If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().

This method’s API is modelled on scikit-learn’s model_selection.train_test_split() function, described at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Parameters:
  • train_size (int, float, or None) – number of fingerprints for the training set arena
  • test_size (int, float, or None) – number of fingerprints for the test set arena
  • reorder (True to reorder, False to leave in the sampling order) – describes how to order the sampled fingerprints
  • rng (None, int, or a random.Random()) – method to use for random sampling
Returns:

a training set FingerprintArena and a test set FingerprintArena
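The rules for filling in train_size and test_size can be sketched as follows. This is my reading of the description, not chemfp's code. The function names are hypothetical, float proportions are truncated to whole records (an assumption), and when both sizes are None the test set is taken as the complement of the 0.75 training set.

```python
import random


def resolve_split_sizes(n, train_size=None, test_size=None):
    """Turn train_size/test_size (int, float, or None) into two record counts."""
    def to_count(size):
        if isinstance(size, float):
            if not 0.0 <= size <= 1.0:
                raise ValueError("a float size must be between 0.0 and 1.0")
            return int(size * n)  # assumption: truncate to whole records
        if not 0 <= size <= n:
            raise ValueError("an integer size must be between 0 and the arena size")
        return size

    if train_size is None and test_size is None:
        train_size = 0.75  # the test set then becomes the remaining 0.25
    num_train = None if train_size is None else to_count(train_size)
    num_test = None if test_size is None else to_count(test_size)
    if num_train is None:
        num_train = n - num_test
    if num_test is None:
        num_test = n - num_train
    if num_train + num_test > n:
        raise ValueError("train_size + test_size is larger than the arena")
    return num_train, num_test


def split_indices(n, train_size=None, test_size=None, seed=None):
    """Sample disjoint train and test index lists, without replacement."""
    num_train, num_test = resolve_split_sizes(n, train_size, test_size)
    picked = random.Random(seed).sample(range(n), num_train + num_test)
    return picked[:num_train], picked[num_train:]


train, test = split_indices(100, seed=1)
```

Drawing a single sample and cutting it in two guarantees that no record ends up in both sets.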

class chemfp.arena.FingerprintList(start_padding, storage_size, arena, start, end, num_bytes)

Bases: collections.abc.Sequence

A read-only list-like view of the arena fingerprints

This implements the standard Python list API, including indexing and iteration.

Note: fingerprint searches like “fp in fingerprint_list” and “fingerprint_list.index(fp)” are not fast.
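To show why those operations are slow, here is a minimal read-only view built on collections.abc.Sequence. The class FingerprintView is a hypothetical stand-in, not the real FingerprintList, and its constructor arguments differ from the real ones. Only __len__ and __getitem__ are written by hand. The "in" test and index() come from the Sequence mixin, which checks the items one at a time.

```python
from collections.abc import Sequence


class FingerprintView(Sequence):
    """Read-only list-like view of fixed-size records in a byte block."""

    def __init__(self, arena, start_padding, storage_size, num_bytes, count):
        self._arena = arena
        self._start_padding = start_padding
        self._storage_size = storage_size
        self._num_bytes = num_bytes
        self._count = count

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[j] for j in range(*i.indices(len(self)))]
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("fingerprint index out of range")
        offset = self._start_padding + i * self._storage_size
        return self._arena[offset:offset + self._num_bytes]


# Two 3-byte fingerprints, each stored in 4 bytes, after 2 bytes of padding.
block = b"\x00\x00" + b"abc\x00" + b"def\x00"
view = FingerprintView(block, start_padding=2, storage_size=4, num_bytes=3, count=2)
found = b"def" in view        # linear scan supplied by Sequence
position = view.index(b"def")  # also a linear scan
```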

random_choice(rng=None)

Return a randomly selected fingerprint.

If rng is None then use Python’s random.sample() for the sampling. If rng is an integer then use random.Random(rng).sample(). Otherwise, use rng.sample().

Parameters:
  • rng (None, int, or a random.Random()) – method to use for random sampling
Returns:

the selected fingerprint, as a byte string