chemfp.search module
Search a FingerprintArena and work with the search results
This module implements the different ways to search a FingerprintArena. The search functions are:
Count the number of hits:
- count_tanimoto_hits_fp() - search an arena using a single fingerprint
- count_tanimoto_hits_arena() - search an arena using an arena
- count_tanimoto_hits_symmetric() - search an arena using itself
- partial_count_tanimoto_hits_symmetric() - (advanced use; see the doc string)
- count_tversky_hits_fp() - search an arena using a single fingerprint
- count_tversky_hits_arena() - search an arena using an arena
- count_tversky_hits_symmetric() - search an arena using itself
- partial_count_tversky_hits_symmetric() - (advanced use; see the doc string)
Find all hits at or above a given threshold, sorted arbitrarily:
- threshold_tanimoto_search_fp() - search an arena using a single fingerprint
- threshold_tanimoto_search_arena() - search an arena using an arena
- threshold_tanimoto_search_symmetric() - search an arena using itself
- partial_threshold_tanimoto_search_symmetric() - (advanced use; see the doc string)
- threshold_tversky_search_fp() - search an arena using a single fingerprint
- threshold_tversky_search_arena() - search an arena using an arena
- threshold_tversky_search_symmetric() - search an arena using itself
- partial_threshold_tversky_search_symmetric() - (advanced use; see the doc string)
- fill_lower_triangle() - copy the upper triangle terms to the lower triangle
Find the k-nearest hits at or above a given threshold, sorted by decreasing similarity:
- knearest_tanimoto_search_fp() - search an arena using a single fingerprint
- knearest_tanimoto_search_arena() - search an arena using an arena
- knearest_tanimoto_search_symmetric() - search an arena using itself
- knearest_tversky_search_fp() - search an arena using a single fingerprint
- knearest_tversky_search_arena() - search an arena using an arena
- knearest_tversky_search_symmetric() - search an arena using itself
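The three families of search functions differ mainly in what they return for the same set of similarity scores. The following is a minimal sketch on a plain Python list of precomputed scores; the helper names and the score values are hypothetical and are not part of chemfp.

```python
import heapq

# Hypothetical similarity scores of one query against five targets,
# indexed by target position. Illustrative values only.
scores = [0.91, 0.35, 0.72, 0.70, 0.10]

def count_hits(scores, threshold):
    """Count search: how many targets score at or above the threshold."""
    return sum(1 for score in scores if score >= threshold)

def threshold_search(scores, threshold):
    """Threshold search: every (index, score) hit at or above the threshold."""
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]

def knearest_search(scores, k, threshold):
    """k-nearest search: at most k hits at or above the threshold,
    sorted by decreasing score."""
    hits = threshold_search(scores, threshold)
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])

print(count_hits(scores, 0.7))
print(threshold_search(scores, 0.7))
print(knearest_search(scores, 2, 0.7))
```

The chemfp functions work on fingerprints in an arena rather than on precomputed scores, so this only illustrates the shape of each kind of result.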
The threshold and k-nearest search results use a SearchResult when a fingerprint is used as a query, or a SearchResults when an arena is used as a query. These internally use a compressed sparse row format.
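For readers unfamiliar with compressed sparse row (CSR) storage, here is a minimal sketch of the general idea in plain Python. It is not chemfp's actual in-memory layout, which this page does not describe; the function names and data are hypothetical.

```python
# Hits for three queries, as (target index, score) pairs per query.
# The second query has no hits. Illustrative values only.
rows = [
    [(4, 0.82), (1, 0.75)],
    [],
    [(0, 0.91)],
]

def to_csr(rows):
    """Flatten per-row hit lists into three CSR-style arrays."""
    indptr = [0]   # hits for row i are stored at indptr[i]:indptr[i+1]
    indices = []   # target (column) index of each stored hit
    data = []      # score of each stored hit
    for hits in rows:
        for index, score in hits:
            indices.append(index)
            data.append(score)
        indptr.append(len(indices))
    return indptr, indices, data

def get_row(indptr, indices, data, i):
    """Recover the (index, score) hits for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

indptr, indices, data = to_csr(rows)
print(indptr, indices, data)
print(get_row(indptr, indices, data, 0))
```

Only the hits are stored, so an empty row costs a single repeated offset in indptr.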
-
class chemfp.search.SearchResult(search_results, row)
Bases: object
Search results for a query fingerprint against a target arena.
The result contains a list of hits. Each hit has a target index, a score, and an optional target id. The hits can be reordered based on score or index.
-
as_buffer()
Return a Python buffer object for the underlying indices and scores.
This provides a byte-oriented view of the raw data. You probably want to use as_ctypes() or as_numpy_array() to get the indices and scores in a more structured form.
Warning
Do not attempt to access the buffer contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
Returns: a Python buffer object
-
as_ctypes()
Return a ctypes view of the underlying indices and scores
Each (index, score) pair is represented as a ctypes structure named Hit with fields index (c_int) and score (c_double).
For example, to get the score of the 5th entry use:
    result.as_ctypes()[4].score
This method returns an array of type (Hit*len(search_result)). Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the ctypes array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the search results to NumPy then use as_numpy_array() instead.
Returns: a ctypes array of type Hit*len(self)
-
as_numpy_array()
Return a NumPy array view of the underlying indices and scores
The view uses a structured type with fields ‘index’ (i4) and ‘score’ (f8), mapped directly onto chemfp’s own data structure. For example, to get the score of the 4th entry use:
    result.as_numpy_array()["score"][3]
-or-
    result.as_numpy_array()[3][1]
Modifications to this view will change chemfp’s data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the NumPy array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
As a short-hand to get just the indices or just the scores, use get_indices_as_numpy_array() or get_scores_as_numpy_array().
Returns: a NumPy array with a structured data type
-
clear()
Remove all hits from this result
Deprecated since version 3.5: This function will likely be removed in a future version of chemfp as it doesn’t seem useful and because clearing the hits when there is a NumPy array view of the search results often causes chemfp to crash.
-
count(min_score=None, max_score=None, interval='[]')
Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count
-
cumulative_score(min_score=None, max_score=None, interval='[]')
The sum of the scores which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value
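The interval rules shared by count() and cumulative_score() can be sketched in plain Python as follows. This is an illustration of the documented semantics on a list of scores, not chemfp's implementation; the helper names are hypothetical.

```python
import math

def in_interval(score, min_score=None, max_score=None, interval="[]"):
    """True if score falls in the range described by the interval string."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    # None means "unbounded" on that side.
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    above = score >= lo if interval[0] == "[" else score > lo
    below = score <= hi if interval[1] == "]" else score < hi
    return above and below

def count(scores, min_score=None, max_score=None, interval="[]"):
    """Number of scores in the range."""
    return sum(1 for s in scores
               if in_interval(s, min_score, max_score, interval))

def cumulative_score(scores, min_score=None, max_score=None, interval="[]"):
    """Sum of the scores in the range."""
    return sum(s for s in scores
               if in_interval(s, min_score, max_score, interval))

scores = [0.25, 0.5, 0.75, 1.0]
print(count(scores, 0.5, 1.0, "[]"), count(scores, 0.5, 1.0, "()"))
print(cumulative_score(scores, 0.5, 0.75))
```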
-
format_ids_and_scores_as_bytes(ids=None, precision=4)
Format the ids and scores as the byte string needed for simsearch output
If there are no hits then the result is the empty byte string b"", otherwise it is a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], …
If ids is not specified then the ids come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings.
The precision sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive.
This function is 3-4x faster than the Python equivalent, which is roughly:
    ids = ids if (ids is not None) else self.get_ids()
    formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
    return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores()))
Parameters: - ids (a list of Unicode strings, or None to use the default) – the identifiers to use for each hit.
- precision (an integer from 1 to 10, inclusive) – the precision to use for each score
Returns: a byte string
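As a standalone illustration of the output layout, here is a small runnable sketch. The function name is hypothetical, and encoding the ids as UTF-8 is an assumption (in Python 3, bytes %-formatting with %s needs bytes-like values, so the ids are encoded before joining).

```python
def format_hits_as_bytes(ids, scores, precision=4):
    """Join ids and scores as tab-separated bytes (hypothetical helper)."""
    if not 1 <= precision <= 10:
        raise ValueError("precision must be between 1 and 10, inclusive")
    fields = []
    for target_id, score in zip(ids, scores):
        # Assumption: ids are encoded as UTF-8.
        fields.append(target_id.encode("utf8"))
        # "%.*f" takes the precision from the argument tuple.
        fields.append(("%.*f" % (precision, score)).encode("ascii"))
    return b"\t".join(fields)

print(format_hits_as_bytes(["CHEMBL1", "CHEMBL2"], [1.0, 0.5]))
```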
-
get_ids()
The list of target identifiers (if available), in the current ordering
Returns: a list of strings
-
get_ids_and_scores()
The list of (target identifier, target score) pairs, in the current ordering
Raises a TypeError if the target IDs are not available.
Returns: a Python list of 2-element tuples
-
get_indices()
The list of target indices, in the current ordering.
This returns a copy of the indices. See get_indices_as_numpy_array() to get a NumPy array view of the indices.
Returns: an array.array() of type ‘i’
-
get_indices_and_scores()
The list of (target index, target score) pairs, in the current ordering
Returns: a Python list of 2-element tuples
-
get_indices_as_numpy_array()
Return a NumPy array view of the underlying indices.
This is a short-cut for self.as_numpy_array()["index"]. See that method’s documentation for details and warnings.
Returns: a NumPy array of type ‘i4’
-
get_scores()
The list of target scores, in the current ordering
This returns a copy of the scores. See get_scores_as_numpy_array() to get a NumPy array view of the scores.
Returns: an array.array() of type ‘d’
-
get_scores_as_numpy_array()
Return a NumPy array view of the underlying scores.
This is a short-cut for self.as_numpy_array()["score"]. See that method’s documentation for details and warnings.
Returns: a NumPy array of type ‘f8’
-
iter_ids()
Iterate over target identifiers (if available), in the current ordering
-
max()
Return the value of the largest score
Returns 0.0 if there are no results.
Returns: a float
-
min()
Return the value of the smallest score
Returns 0.0 if there are no results.
Returns: a float
-
query_id
Return the corresponding query id, if available, else None
-
reorder(ordering='decreasing-score-plus')
Reorder the hits based on the requested ordering.
The available orderings are:
- increasing-score - sort by increasing score
- decreasing-score - sort by decreasing score
- increasing-score-plus - sort by increasing score, break ties by increasing index
- decreasing-score-plus - sort by decreasing score, break ties by increasing index
- increasing-index - sort by increasing target index
- decreasing-index - sort by decreasing target index
- move-closest-first - move the hit with the highest score to the first position
- reverse - reverse the current ordering
Parameters: ordering (string) – the name of the ordering to use
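The sort-based orderings can be sketched on a plain list of (index, score) pairs as shown below. This is an illustration only: the function is hypothetical, it returns a new list rather than reordering in place, and it omits move-closest-first because the description above does not say where the displaced hit ends up. For the orderings without "-plus", the tie order here is simply whatever Python's stable sort produces.

```python
def reorder(hits, ordering="decreasing-score-plus"):
    """Return a new list of (index, score) hits in the requested order."""
    if ordering == "increasing-score":
        return sorted(hits, key=lambda hit: hit[1])
    if ordering == "decreasing-score":
        return sorted(hits, key=lambda hit: -hit[1])
    if ordering == "increasing-score-plus":
        # Ties on score are broken by increasing index.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if ordering == "decreasing-score-plus":
        # Ties on score are broken by increasing index.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if ordering == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda hit: -hit[0])
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError("unsupported ordering: %r" % (ordering,))

hits = [(7, 0.5), (2, 0.9), (3, 0.5), (5, 0.7)]
print(reorder(hits))
print(reorder(hits, "increasing-score-plus"))
```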
-
to_pandas(*, columns=['target_id', 'score'])
Return a pandas DataFrame with the target ids and scores
The first column contains the ids, the second column contains the scores. The default column headers are “target_id” and “score”. Use columns to specify different headers.
Parameters: columns (a list of two strings) – column names for the returned DataFrame
Returns: a pandas DataFrame
-
-
class chemfp.search.SearchResults(num_rows, num_cols, query_arena=None, query_ids=None, target_arena=None, target_ids=None, num_bits=2147483647, alpha=1.0, beta=1.0)
Bases: chemfp.search.SearchResults
Search results for a list of query fingerprints against a target arena
This acts like a list of SearchResult elements, with the ability to iterate over each search result, look them up by index, and get the number of scores.
In addition, there are helper methods to iterate over each hit and to get the hit indices, scores, and identifiers directly as Python lists, sort the list contents, and more.
-
query_ids
A list of query ids, one for each result. This comes from the query arena’s ids.
-
clear_all()
Remove all hits from all of the search results
Deprecated since version 3.5: This function will likely be removed in a future version of chemfp as it doesn’t seem useful and because clearing the hits when there is a NumPy array view of the search results often causes chemfp to crash.
-
count_all(min_score=None, max_score=None, interval='[]')
Count the number of hits with a score between min_score and max_score
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count
-
cumulative_score_all(min_score=None, max_score=None, interval='[]')
The sum of all scores in all rows which are between min_score and max_score
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of “[]” uses a closed interval, where min_score <= score <= max_score. The interval “()” uses the open interval where min_score < score < max_score. The half-open/half-closed intervals “(]” and “[)” are also supported.
Parameters: - min_score (a float, or None for -infinity) – the minimum score in the range.
- max_score (a float, or None for +infinity) – the maximum score in the range.
- interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value
-
iter_ids()
For each search result, yield the list of target identifiers
-
iter_ids_and_scores()
For each search result, yield the list of (target id, score) tuples
-
iter_indices()
For each search result, yield the list of target indices
-
iter_indices_and_scores()
For each search result, yield the list of (target index, score) tuples
-
iter_scores()
For each search result, yield the list of target scores
-
reorder_all(ordering='decreasing-score-plus')
Reorder the hits for all of the rows based on the requested order.
The available orderings are:
- increasing-score - sort by increasing score
- decreasing-score - sort by decreasing score
- increasing-score-plus - sort by increasing score, break ties by increasing index
- decreasing-score-plus - sort by decreasing score, break ties by increasing index
- increasing-index - sort by increasing target index
- decreasing-index - sort by decreasing target index
- move-closest-first - move the hit with the highest score to the first position
- reverse - reverse the current ordering
Parameters: ordering (string) – the name of the ordering to use
-
shape
The tuple (number of rows, number of columns)
The number of columns is the size of the target arena.
-
to_csr(dtype=None)
Return the results as a SciPy compressed sparse row matrix.
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype “float64”.
This method requires that SciPy (and NumPy) be installed.
Parameters: dtype (string or NumPy type) – a NumPy numeric data type
-
to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))
Return a pandas DataFrame with query_id, target_id and score columns
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, ‘*’ as the target id, and None as the score (which pandas will treat as a NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame’s groupby() method to group results by query id, for example:
    >>> import chemfp
    >>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
    ...                       k=10, threshold=0.4, progress=False).to_pandas()
    >>> df.groupby("query_id").describe()
Parameters: - columns (a list of three strings) – column names for the returned DataFrame
- empty (a 2-element tuple of target id and score, or None) – the target id and score used for queries with no hits, or None to not include a row for that case
Returns: a pandas DataFrame
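The effect of the empty parameter on the generated rows can be sketched without pandas. The generator below is a hypothetical helper that mirrors the behavior described above; it is not how chemfp builds the DataFrame.

```python
def iter_rows(query_ids, hits_per_query, empty=("*", None)):
    """Yield (query_id, target_id, score) rows, one per hit."""
    for query_id, hits in zip(query_ids, hits_per_query):
        if hits:
            for target_id, score in hits:
                yield (query_id, target_id, score)
        elif empty is not None:
            # Placeholder row for a query with no hits.
            yield (query_id, empty[0], empty[1])

# Illustrative data: the second query has no hits.
query_ids = ["q1", "q2"]
hits_per_query = [[("t1", 0.9), ("t2", 0.8)], []]

print(list(iter_rows(query_ids, hits_per_query)))
print(list(iter_rows(query_ids, hits_per_query, empty=None)))
```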
-
-
chemfp.search.count_tanimoto_hits_fp(query_fp, target_arena, threshold=0.7)
Count the number of hits in target_arena at least threshold similar to the query_fp
Example:
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(chemfp.search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.1))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a FingerprintArena) – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an integer count
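To make "at least threshold similar" concrete, here is a brute-force sketch of a Tanimoto count over byte-string fingerprints in plain Python. The helper names and the one-byte fingerprints are hypothetical, and scoring two all-zero fingerprints as 0.0 is an assumption. chemfp's own search is far faster than this loop.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = popcount(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return popcount(a & b) / union

def count_tanimoto_hits(query_fp, target_fps, threshold=0.7):
    """Count the targets whose similarity to the query is >= threshold."""
    return sum(1 for fp in target_fps if tanimoto(query_fp, fp) >= threshold)

# Illustrative one-byte fingerprints.
query_fp = bytes([0b00001111])
target_fps = [bytes([0b00001111]), bytes([0b00000111]),
              bytes([0b11110000]), bytes([0b00011111])]

print([tanimoto(query_fp, fp) for fp in target_fps])
print(count_tanimoto_hits(query_fp, target_fps, threshold=0.7))
```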
-
chemfp.search.count_tanimoto_hits_arena(query_arena, target_arena, threshold=0.7)
For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tanimoto_hits_arena(queries, targets, threshold=0.1)
    print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: an array of counts
-
chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.7, *, batch_size=100, batch_callback=None)
For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp.search
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.2)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: an array of counts
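A symmetric count only needs each pair of fingerprints once, because Tanimoto similarity is the same in both directions. A brute-force sketch in plain Python, using small integers as hypothetical stand-ins for fingerprints (this is an illustration, not chemfp's implementation):

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def count_hits_symmetric(fps, threshold):
    """For each fingerprint, count the other fingerprints >= threshold."""
    n = len(fps)
    counts = [0] * n
    for i in range(n):
        # Start at i + 1: a fingerprint never matches itself, and each
        # pair (i, j) is scored once and credited to both rows.
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                counts[i] += 1
                counts[j] += 1
    return counts

fps = [0b1111, 0b0111, 0b0011, 0b1100]
print(count_hits_symmetric(fps, 0.5))
```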
-
chemfp.search.partial_count_tanimoto_hits_symmetric(counts, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None)
Compute a portion of the symmetric Tanimoto counts
For most cases, use chemfp.search.count_tanimoto_hits_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end, and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    chemfp.set_num_threads(1)  # Globally disable OpenMP

    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tanimoto_hits_symmetric,
                            counts, arena, threshold=0.2,
                            query_start=row, query_end=min(row+10, n))

    print(counts)
Parameters: - counts (a contiguous block of integers) – the accumulated Tanimoto counts
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
Returns: None
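The idea of accumulating a shared counts array one block of query rows at a time can be sketched in plain Python. This is one reading of the description above, not chemfp's implementation: the function is a hypothetical stand-in, the fingerprints are small integers, and the batches run sequentially rather than in a thread pool.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def partial_count_hits_symmetric(counts, fps, threshold, query_start=0,
                                 query_end=None, target_start=0,
                                 target_end=None):
    """Add the counts for one rectangle of the upper triangle."""
    n = len(fps)
    query_end = n if query_end is None else query_end
    target_end = n if target_end is None else target_end
    for i in range(query_start, query_end):
        # Only the part of the rectangle above the diagonal is scored.
        for j in range(max(i + 1, target_start), target_end):
            if tanimoto(fps[i], fps[j]) >= threshold:
                counts[i] += 1  # hit for row i
                counts[j] += 1  # the same hit, by symmetry, for row j

fps = [0b1111, 0b0111, 0b0011, 0b1100]
n = len(fps)
counts = [0] * n  # initialized to zeros and reused for every call
for row in range(0, n, 2):  # two query rows per batch
    partial_count_hits_symmetric(counts, fps, 0.5,
                                 query_start=row, query_end=min(row + 2, n))
print(counts)
```

Because the row batches partition the upper triangle, each pair is scored exactly once no matter how the rows are split.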
-
chemfp.search.count_tversky_hits_fp(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)
Count the number of hits in target_arena at least threshold similar to the query_fp (Tversky)
Example:
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(chemfp.search.count_tversky_hits_fp(query_fp, targets, threshold=0.1))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns: an integer count
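For reference, the standard Tversky index on bit sets can be sketched as below, with small integers standing in for fingerprints. The helper names are hypothetical, treating the first argument as the query (weighted by alpha) and returning 0.0 for a zero denominator are assumptions, and chemfp's implementation may differ in details.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky index of two fingerprints stored as integers."""
    common = popcount(a & b)   # bits set in both
    only_a = popcount(a & ~b)  # bits set only in a
    only_b = popcount(b & ~a)  # bits set only in b
    denominator = alpha * only_a + beta * only_b + common
    if denominator == 0:
        return 0.0  # assumption: score 0.0 when nothing can be compared
    return common / denominator

a, b = 0b1111, 0b0111
print(tversky(a, b))             # alpha = beta = 1.0 gives Tanimoto
print(tversky(a, b, 0.5, 0.5))   # alpha = beta = 0.5 gives Dice
print(tversky(a, b, 0.0, 1.0))   # ignores bits found only in a
```

With alpha = beta = 1.0 this reduces to the Tanimoto similarity, which matches the default parameter values in the signature above.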
-
chemfp.search.count_tversky_hits_arena(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0)
For each fingerprint in query_arena, count the number of hits in target_arena at least threshold similar to it
Example:
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_arena(
        queries, targets, threshold=0.1, alpha=0.5, beta=0.5)
    print(counts[:10])
The result is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns: an array of counts
-
chemfp.search.count_tversky_hits_symmetric(arena, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100, batch_callback=None)
For each fingerprint in the arena, count the number of other fingerprints at least threshold similar to it
A fingerprint never matches itself.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp.search
    arena = chemfp.load_fingerprints("targets.fps")
    counts = chemfp.search.count_tversky_hits_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    print(counts[:10])
The result object is implementation specific. You’ll always be able to get its length and do an index lookup to get an integer count. Currently it’s a ctypes array of longs, but it could be an array.array or Python list in the future.
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: an array of counts
-
chemfp.search.partial_count_tversky_hits_symmetric(counts, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None)
Compute a portion of the symmetric Tversky counts
For most cases, use chemfp.search.count_tversky_hits_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
counts is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[query_start:query_end] based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end, and using symmetry to fill in the lower half.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    chemfp.set_num_threads(1)  # Globally disable OpenMP

    arena = chemfp.load_fingerprints("targets.fps")  # Load the fingerprints
    n = len(arena)
    counts = array.array("i", [0]*n)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_count_tversky_hits_symmetric,
                            counts, arena, threshold=0.2, alpha=0.5, beta=0.5,
                            query_start=row, query_end=min(row+10, n))

    print(counts)
Parameters: - counts (a contiguous block of integers) – the accumulated Tversky counts
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
Returns: None
-
chemfp.search.threshold_tanimoto_search_fp(query_fp, target_arena, threshold=0.7)
Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:
    import chemfp.search
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.15)))
Parameters: - query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResult
-
chemfp.search.threshold_tanimoto_search_arena(query_arena, target_arena, threshold=0.7, batch_size=None, batch_callback=None)
Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:
    import chemfp.search
    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tanimoto_search_arena(queries, targets, threshold=0.5)
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters: - query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns: a chemfp.search.SearchResults
-
chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.7, include_lower_triangle=True, batch_size=100, batch_callback=None)
Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp.search
    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.2)
    upper_triangle = chemfp.search.threshold_tanimoto_search_symmetric(
        arena, threshold=0.2, include_lower_triangle=False)
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters: - arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: a chemfp.search.SearchResults
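The relationship between the upper triangle and the full symmetric result can be sketched in plain Python. The functions below are hypothetical stand-ins operating on lists of (index, score) pairs, with small integers as fingerprints; they are not the chemfp functions of similar name.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0

def upper_triangle_search(fps, threshold):
    """Hits (j, score) for each row i, restricted to j > i."""
    n = len(fps)
    rows = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = tanimoto(fps[i], fps[j])
            if score >= threshold:
                rows[i].append((j, score))
    return rows

def fill_lower(rows):
    """Copy each upper-triangle hit (i -> j) to the mirrored row (j -> i)."""
    # Snapshot first so newly added hits are not mirrored back again.
    upper = [list(hits) for hits in rows]
    for i, hits in enumerate(upper):
        for j, score in hits:
            rows[j].append((i, score))

fps = [0b1111, 0b0111, 0b0011, 0b1100]
upper = upper_triangle_search(fps, 0.5)
full = [list(hits) for hits in upper]
fill_lower(full)
print(sum(map(len, upper)), sum(map(len, full)))
```

Every hit above the diagonal has exactly one mirror image below it, which is why the full result holds twice as many hits as the upper triangle.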
-
chemfp.search.partial_threshold_tanimoto_search_symmetric(results, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)
Compute a portion of the symmetric Tanimoto search results
For most cases, use chemfp.search.threshold_tanimoto_search_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures

    chemfp.set_num_threads(1)

    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = chemfp.search.SearchResults(n, n, query_ids=arena.ids, target_ids=arena.ids)

    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_threshold_tanimoto_search_symmetric,
                            results, arena, threshold=0.2,
                            query_start=row, query_end=min(row+10, n))

    chemfp.search.fill_lower_triangle(results)
The hits in the chemfp.search.SearchResults are in arbitrary order.
Parameters: - results (a chemfp.search.SearchResults instance) – the intermediate search results
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
- results_offset (an integer) – use results[results_offset] as the base for the results
Returns: None
-
chemfp.search.
threshold_tversky_search_fp
(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0)¶ Search for fingerprint hits in target_arena which are at least threshold similar to query_fp
The hits in the returned chemfp.search.SearchResult are in arbitrary order.
Example:
    import chemfp
    import chemfp.search

    # Use the first fingerprint in the query file as the query.
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.threshold_tversky_search_fp(
        query_fp, targets, threshold=0.15, alpha=0.5, beta=0.5)))
Parameters:
- query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns:
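As a reminder of what alpha and beta control, the following is a minimal pure-Python sketch of the standard Tversky index on byte-string fingerprints. It is not chemfp's implementation. The function names are hypothetical, and returning 0.0 when the denominator is zero is an assumption made here, not a statement about how chemfp handles that case.

```python
def popcount(data):
    """Count the set bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tversky(fp_a, fp_b, alpha=1.0, beta=1.0):
    """Standard Tversky index: c / (c + alpha*a_only + beta*b_only).

    c is the number of bits set in both fingerprints; a_only and b_only
    are the bits set in only one of them.
    """
    both = popcount(bytes(a & b for a, b in zip(fp_a, fp_b)))
    only_a = popcount(fp_a) - both
    only_b = popcount(fp_b) - both
    denominator = both + alpha * only_a + beta * only_b
    if denominator == 0:
        return 0.0  # assumption for this sketch only
    return both / denominator


fp_a = bytes([0b1110])
fp_b = bytes([0b0111])
tanimoto_like = tversky(fp_a, fp_b, alpha=1.0, beta=1.0)  # same as Tanimoto
dice_like = tversky(fp_a, fp_b, alpha=0.5, beta=0.5)      # same as Dice
```

With alpha = beta = 1.0 the formula reduces to the Tanimoto similarity, and with alpha = beta = 0.5 it reduces to the Dice similarity.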
-
chemfp.search.
threshold_tversky_search_arena
(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0, batch_size=None, batch_callback=None)¶ Search for the hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
Example:
    import chemfp
    import chemfp.search

    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.threshold_tversky_search_arena(
        queries, targets, threshold=0.5, alpha=0.5, beta=0.5)
    # Report only the queries which have at least one hit.
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) > 0:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters:
- query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns:
-
chemfp.search.
threshold_tversky_search_symmetric
(arena, threshold=0.7, alpha=1.0, beta=1.0, include_lower_triangle=True, batch_size=100, batch_callback=None)¶ Search for the hits in the arena at least threshold similar to the fingerprints in the arena
When include_lower_triangle is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When include_lower_triangle is False, only compute the upper triangle.
The hits in the returned chemfp.search.SearchResults are in arbitrary order.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search

    arena = chemfp.load_fingerprints("queries.fps")
    full_result = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5)
    upper_triangle = chemfp.search.threshold_tversky_search_symmetric(
        arena, threshold=0.2, alpha=0.5, beta=0.5, include_lower_triangle=False)
    # The full matrix has twice as many hits as the upper triangle alone.
    assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2
Parameters:
- arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- include_lower_triangle (boolean) – if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns:
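The batch_size idea can be illustrated without chemfp. In this minimal sketch, each call to process_rows stands in for one long-running call into compiled code, and the interpreter gets a chance to react to ^C between calls. The names process_in_batches and fake_rows are hypothetical, and the interrupt is simulated by raising KeyboardInterrupt directly.

```python
def process_in_batches(num_rows, batch_size, process_rows):
    """Call process_rows(start, end) on successive slices of batch_size rows.

    Returns the number of rows completed before a KeyboardInterrupt,
    or num_rows if every batch finished.
    """
    rows_done = 0
    try:
        for start in range(0, num_rows, batch_size):
            end = min(start + batch_size, num_rows)
            process_rows(start, end)
            rows_done = end
    except KeyboardInterrupt:
        # Stop cleanly, keeping the batches that already finished.
        pass
    return rows_done


seen = []

def fake_rows(start, end):
    # Simulate a ^C arriving before the third batch is processed.
    if start >= 20:
        raise KeyboardInterrupt
    seen.append((start, end))

rows_done = process_in_batches(45, 10, fake_rows)
```

A smaller batch_size makes the program respond to ^C sooner, at the cost of more calls.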
-
chemfp.search.
partial_threshold_tversky_search_symmetric
(results, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0)¶ Compute a portion of the symmetric Tversky search results
For most cases, use chemfp.search.threshold_tversky_search_symmetric() instead of this function!
This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1.
results is a chemfp.search.SearchResults instance which is at least as large as the arena. It should be reused for successive updates.
The function adds hits to results[query_start:query_end], based on computing the upper-triangle portion contained in the rectangle query_start:query_end and target_start:target_end.
It does not fill in the lower triangle. To get the full matrix, call fill_lower_triangle.
You know, this is pretty complicated. Here’s the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:
    import chemfp
    import chemfp.search
    from chemfp import futures
    import array

    # Use one OpenMP thread per call; the thread pool supplies the parallelism.
    chemfp.set_num_threads(1)
    arena = chemfp.load_fingerprints("targets.fps")
    n = len(arena)
    results = chemfp.search.SearchResults(n, n, query_ids=arena.ids, target_ids=arena.ids)
    with futures.ThreadPoolExecutor(max_workers=4) as executor:
        # Submit one task per band of 10 query rows.
        for row in range(0, n, 10):
            executor.submit(chemfp.search.partial_threshold_tversky_search_symmetric,
                            results, arena, threshold=0.2, alpha=0.5, beta=0.5,
                            query_start=row, query_end=min(row+10, n))
    # All tasks have finished once the "with" block exits.
    chemfp.search.fill_lower_triangle(results)
The hits in the chemfp.search.SearchResults are in arbitrary order.
Parameters:
- results (a chemfp.search.SearchResults instance) – the intermediate search results
- arena (a chemfp.arena.FingerprintArena) – the fingerprints.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- query_start (an integer) – the query start row
- query_end (an integer, or None to mean the last query row) – the query end row
- target_start (an integer) – the target start row
- target_end (an integer, or None to mean the last target row) – the target end row
- results_offset (an integer) – use results[results_offset] as the base for the results
Returns: None
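The thread-pool pattern above can be tried without chemfp by using the standard library's concurrent.futures. This is a sketch under stated assumptions: the precomputed scores matrix and the fill_band worker are hypothetical stand-ins for the arena and the partial search function. Each worker appends only to its own band of rows, so the shared results list needs no locking.

```python
from concurrent.futures import ThreadPoolExecutor


def fill_band(results, scores, threshold, query_start, query_end):
    """Add upper-triangle hits for rows query_start:query_end to results."""
    n = len(scores)
    for row in range(query_start, min(query_end, n)):
        for col in range(row + 1, n):
            if scores[row][col] >= threshold:
                results[row].append((col, scores[row][col]))


# A small symmetric similarity matrix standing in for real fingerprints.
scores = [
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.9],
    [0.5, 0.2, 0.9, 1.0],
]
n = len(scores)
results = [[] for _ in range(n)]

with ThreadPoolExecutor(max_workers=4) as executor:
    pending = [
        executor.submit(fill_band, results, scores, 0.25, row, min(row + 2, n))
        for row in range(0, n, 2)
    ]
    for task in pending:
        task.result()  # re-raises any exception from the worker
```

Calling result() on each future is a design choice in this sketch: without it, an exception raised inside a worker would go unnoticed.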
-
chemfp.search.
fill_lower_triangle
(results)¶ Duplicate each entry of results to its transpose
This is used after the symmetric threshold search to turn the upper-triangle results into a full matrix.
Parameters: results (a chemfp.search.SearchResults) – search results
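The effect of copying each entry to its transpose can be sketched in pure Python with a list of hit lists. This illustrates the idea only and is not how chemfp stores or updates a SearchResults. The function name and the (column, score) tuple layout are assumptions made for this example.

```python
def fill_lower_triangle_rows(rows):
    """Mirror upper-triangle hits in place.

    rows[i] holds (j, score) hits with j > i. After the call, rows[j]
    also holds (i, score) for each of them.
    """
    # Snapshot the upper-triangle entries first so that the mirrored
    # entries added below are not mirrored a second time.
    upper = [(i, j, score)
             for i, row in enumerate(rows)
             for (j, score) in row]
    for i, j, score in upper:
        rows[j].append((i, score))
    return rows


rows = [[(1, 0.9), (2, 0.4)], [(2, 0.7)], []]
fill_lower_triangle_rows(rows)
```

After the call the total number of hits has doubled, which matches the assertion in the include_lower_triangle example earlier in this section.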
-
chemfp.search.
knearest_tanimoto_search_fp
(query_fp, target_arena, k=3, threshold=0.0)¶ Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search

    # Use the first fingerprint in the query file as the query.
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tanimoto_search_fp(query_fp, targets, k=3, threshold=0.0)))
Parameters:
- query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
Returns:
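How k and threshold interact can be sketched in a few lines of pure Python: drop the scores below the threshold, then keep the k best in decreasing order. The function name knearest and the (id, score) input format are hypothetical. How chemfp breaks ties between equal scores is not covered by this sketch.

```python
import heapq


def knearest(scored_targets, k=3, threshold=0.0):
    """Return up to k (target_id, score) pairs with score >= threshold,
    ordered by decreasing score."""
    passing = [(target_id, score)
               for target_id, score in scored_targets
               if score >= threshold]
    # heapq.nlargest returns its result sorted from largest to smallest.
    return heapq.nlargest(k, passing, key=lambda hit: hit[1])


scored = [("a", 0.2), ("b", 0.9), ("c", 0.55), ("d", 0.7), ("e", 0.1)]
top3 = knearest(scored, k=3, threshold=0.5)
strict = knearest(scored, k=2, threshold=0.8)
```

Fewer than k hits come back when not enough targets reach the threshold, as in the second call.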
-
chemfp.search.
knearest_tanimoto_search_arena
(query_arena, target_arena, k=3, threshold=0.0, query_thresholds=None, batch_size=None, batch_callback=None)¶ Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search

    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tanimoto_search_arena(queries, targets, k=3, threshold=0.5)
    # Report only the queries which have at least two hits.
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Use query_thresholds to specify per-query thresholds instead of using the global threshold. The global threshold must still be in range 0.0 to 1.0.
Parameters:
- query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_thresholds (None or a list of Python floats, or an array of C doubles) – optionally specify per-query thresholds
Returns:
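The difference between a global threshold and per-query thresholds can be sketched in pure Python. This is an illustration of the idea described above, not chemfp's code. The function name, the score_rows input, and the length check are assumptions made for this example.

```python
def filter_hits(score_rows, threshold=0.0, query_thresholds=None):
    """Return, for each query row, the (target_index, score) pairs that
    reach that query's threshold.

    If query_thresholds is None, every query uses the global threshold.
    Otherwise query i uses query_thresholds[i].
    """
    if query_thresholds is None:
        query_thresholds = [threshold] * len(score_rows)
    if len(query_thresholds) != len(score_rows):
        raise ValueError("need one threshold per query")
    return [
        [(j, score) for j, score in enumerate(row) if score >= limit]
        for row, limit in zip(score_rows, query_thresholds)
    ]


score_rows = [[0.9, 0.4, 0.6], [0.2, 0.8, 0.5]]
global_hits = filter_hits(score_rows, threshold=0.5)
per_query_hits = filter_hits(score_rows, query_thresholds=[0.5, 0.7])
```

In the second call the stricter limit for the second query removes its 0.5 hit while the first query is unaffected.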
-
chemfp.search.
knearest_tanimoto_search_symmetric
(arena, k=3, threshold=0.0, query_thresholds=None, batch_size=100, batch_callback=None)¶ Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search

    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.8)
    # Print each query id followed by its hit ids and scores.
    for (query_id, hits) in zip(arena.ids, results):
        print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores()))
Parameters:
- arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- query_thresholds (None or a list of Python floats, or an array of C doubles) – optionally specify per-query thresholds
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns:
-
chemfp.search.
knearest_tversky_search_fp
(query_fp, target_arena, k=3, threshold=0.0, alpha=1.0, beta=1.0)¶ Search for k-nearest hits in target_arena which are at least threshold similar to query_fp
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search

    # Use the first fingerprint in the query file as the query.
    query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0]
    targets = chemfp.load_fingerprints("targets.fps")
    print(list(chemfp.search.knearest_tversky_search_fp(
        query_fp, targets, k=3, threshold=0.0, alpha=0.5, beta=0.5)))
Parameters:
- query_fp (a byte string) – the query fingerprint
- target_arena (a chemfp.arena.FingerprintArena) – the target arena
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns:
-
chemfp.search.
knearest_tversky_search_arena
(query_arena, target_arena, k=3, threshold=0.0, alpha=1.0, beta=1.0, query_thresholds=None, batch_size=None, batch_callback=None)¶ Search for the k nearest hits in the target_arena at least threshold similar to the fingerprints in query_arena
The hits in the chemfp.search.SearchResults are ordered by decreasing similarity score.
Example:
    import chemfp
    import chemfp.search

    queries = chemfp.load_fingerprints("queries.fps")
    targets = chemfp.load_fingerprints("targets.fps")
    results = chemfp.search.knearest_tversky_search_arena(
        queries, targets, k=3, threshold=0.5, alpha=0.5, beta=0.5)
    # Report only the queries which have at least two hits.
    for query_id, query_hits in zip(queries.ids, results):
        if len(query_hits) >= 2:
            print(query_id, "->", ", ".join(query_hits.get_ids()))
Parameters:
- query_arena (a chemfp.arena.FingerprintArena) – The query fingerprints.
- target_arena (a chemfp.arena.FingerprintArena) – The target fingerprints.
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
Returns:
-
chemfp.search.
knearest_tversky_search_symmetric
(arena, k=3, threshold=0.0, alpha=1.0, beta=1.0, query_thresholds=None, batch_size=100, batch_callback=None)¶ Search for the k-nearest hits in the arena at least threshold similar to the fingerprints in the arena
The hits in the SearchResults are ordered by decreasing similarity score.
The computation can take a long time. Python won’t check for a ^C until the function finishes. This can be irritating. Instead, process only batch_size rows at a time before checking for a ^C.
Example:
    import chemfp
    import chemfp.search

    arena = chemfp.load_fingerprints("queries.fps")
    results = chemfp.search.knearest_tversky_search_symmetric(
        arena, k=3, threshold=0.8, alpha=0.5, beta=0.5)
    # Print each query id followed by its hit ids and scores.
    for (query_id, hits) in zip(arena.ids, results):
        print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores()))
Parameters:
- arena (a chemfp.arena.FingerprintArena) – the set of fingerprints
- k (positive integer) – the number of nearest neighbors to find.
- threshold (float between 0.0 and 1.0, inclusive) – The minimum score threshold.
- alpha (a value between 0.0 and 100.0, inclusive) – the Tversky alpha value
- beta (a value between 0.0 and 100.0, inclusive) – the Tversky beta value
- query_thresholds (None or a list of Python floats, or an array of C doubles) – optionally specify per-query thresholds
- batch_size (integer) – the number of rows to process before checking for a ^C
Returns: