chemfp.highlevel.similarity module
This module should not be imported directly.
It contains internal implementation details of the high-level API available from the top-level chemfp module.
This module is included in the documentation because some of its classes are returned to the user and are therefore part of the public API.

class chemfp.highlevel.similarity.BaseSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)
Bases: object
This is the base class for the objects returned by simsearch().
It contains the query parameters, search results, and timings.
In addition, it is a context manager for any files which may have been opened.

close()
Close any associated files

get_description()
Return a human-readable description of the simsearch run

matrix_type

matrix_type_name

target_ids
Return the target identifiers


class chemfp.highlevel.similarity.MultiQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)
Bases: chemfp.highlevel.similarity.BaseSimsearch

count_all(min_score=None, max_score=None, interval='[]')
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count_all(). See SearchResults.count_all().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count

cumulative_score_all(min_score=None, max_score=None, interval='[]')
The sum of all scores in all rows which are between min_score and max_score
Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value
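The interval rules shared by count_all() and cumulative_score_all() can be sketched in plain Python. This is only an illustration of the documented semantics, not chemfp's implementation: the helper names in_interval, count_hits and sum_scores are hypothetical, and the nested list of scores stands in for the rows of a SearchResults.

```python
import math

def in_interval(score, min_score=None, max_score=None, interval="[]"):
    """Hypothetical helper: True if score lies in the requested range."""
    if interval not in ("[]", "()", "(]", "[)"):
        raise ValueError("unsupported interval: %r" % (interval,))
    # None means "unbounded" on that side.
    lo = -math.inf if min_score is None else min_score
    hi = math.inf if max_score is None else max_score
    above = score >= lo if interval[0] == "[" else score > lo
    below = score <= hi if interval[1] == "]" else score < hi
    return above and below

def count_hits(rows, **kwargs):
    """Count the scores, across all rows, that fall in the range."""
    return sum(1 for row in rows for s in row if in_interval(s, **kwargs))

def sum_scores(rows, **kwargs):
    """Sum the scores, across all rows, that fall in the range."""
    return sum(s for row in rows for s in row if in_interval(s, **kwargs))

# Made-up scores: one inner list per query, the second query has no hits.
rows = [[1.0, 0.8, 0.5], [], [0.8, 0.4]]
print(count_hits(rows))
print(count_hits(rows, min_score=0.5, max_score=0.8))
print(count_hits(rows, min_score=0.5, max_score=0.8, interval="(]"))
print(sum_scores(rows, min_score=0.8))
```

Only the end-point comparisons change between the four interval strings; the bounds themselves are the same in every case.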

iter_ids()
For each search result (one per query), yield the list of target identifiers for its hits
Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().

iter_ids_and_scores()
For each search result (one per query), yield the list of (target id, score) tuples for its hits
Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().

iter_indices()
For each search result (one per query), yield the list of target indices for its hits
Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().

iter_indices_and_scores()
For each search result (one per query), yield the list of (target index, score) tuples for its hits
Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().

iter_scores()
For each search result (one per query), yield the list of target scores for its hits
Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().

query_ids

reorder_all(order='decreasing-score-plus')
Reorder the hits for all of the rows based on the requested order.
Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().
The available orderings are:
 increasing-score - sort by increasing score
 decreasing-score - sort by decreasing score
 increasing-score-plus - sort by increasing score, break ties by increasing index
 decreasing-score-plus - sort by decreasing score, break ties by increasing index
 increasing-index - sort by increasing target index
 decreasing-index - sort by decreasing target index
 move-closest-first - move the hit with the highest score to the first position
 reverse - reverse the current ordering
Parameters: order (string) – the name of the ordering to use
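A minimal sketch of what some of these orderings mean, applied row by row to lists of (target index, score) pairs. It assumes the ordering names are spelled with hyphens, as in "decreasing-score-plus". The reorder helper is hypothetical and returns a new list, whereas the chemfp method reorders the hits of the existing result. The orderings "increasing-score", "decreasing-score" and "move-closest-first" are left out because the text above does not say how they treat ties or the remaining hits.

```python
def reorder(hits, ordering="decreasing-score-plus"):
    """Hypothetical helper: return a reordered copy of (index, score) pairs."""
    if ordering == "increasing-score-plus":
        # Lowest score first; equal scores ordered by increasing index.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
    if ordering == "decreasing-score-plus":
        # Highest score first; equal scores ordered by increasing index.
        return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    if ordering == "increasing-index":
        return sorted(hits, key=lambda hit: hit[0])
    if ordering == "decreasing-index":
        return sorted(hits, key=lambda hit: hit[0], reverse=True)
    if ordering == "reverse":
        return hits[::-1]
    raise ValueError("ordering not covered by this sketch: %r" % (ordering,))

# Made-up hits for two queries.
rows = [[(7, 0.5), (4, 0.9), (2, 0.9)], [(3, 0.6)]]
reordered = [reorder(row) for row in rows]
print(reordered[0])  # the two 0.9 hits come first, index 2 before index 4
```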

save(destination, format=None, compressed=True)
Save the SearchResults to the given destination
Shortcut for obj.result.save(). See SearchResults.save().
Currently only the "npz" format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row ('csr') matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.
Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.
Parameters:
 destination (a filename, binary file object, or None for stdout) – where to write the results
 format (None or 'npz') – the output format name (default: always 'npz')
 compressed – if True (the default), use zipfile compression

shape
The tuple (number of rows, number of columns)
Shortcut for obj.result.shape. See SearchResults.shape.
The number of columns is the size of the target arena.

to_csr(dtype=None)
Return the results as a SciPy compressed sparse row matrix.
Shortcut for obj.result.to_csr(). See SearchResults.to_csr().
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype of "float64". You may also use "float32", though mind the double rounding from int/int -> double -> float.
This method requires that SciPy (and NumPy) be installed.
Parameters: dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either "float64" or "float32")
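To make the compressed sparse row layout concrete, here is a plain-Python sketch that builds the three CSR arrays from per-query hit lists. It does not call chemfp or SciPy; the helper name to_csr_arrays and the sample hits are hypothetical. The (data, indices, indptr) triple it produces is the form accepted by the scipy.sparse.csr_matrix constructor.

```python
def to_csr_arrays(rows):
    """Hypothetical helper: flatten per-row (column, value) pairs into CSR arrays."""
    data = []      # the scores, row after row
    indices = []   # the target (column) index of each score
    indptr = [0]   # row i occupies data[indptr[i]:indptr[i + 1]]
    for row in rows:
        for target_index, score in row:
            indices.append(target_index)
            data.append(score)
        indptr.append(len(indices))
    return data, indices, indptr

# Made-up hits for three queries against five targets; the second has no hits.
rows = [[(2, 0.9), (4, 0.7)], [], [(0, 1.0)]]
data, indices, indptr = to_csr_arrays(rows)
print(data)
print(indices)
print(indptr)

# Recover the hits of the first query from the flat arrays.
first = list(zip(indices[indptr[0]:indptr[1]], data[indptr[0]:indptr[1]]))
print(first)
```

A query with no hits costs only one repeated entry in indptr, which is why this layout suits thresholded similarity results where most query/target pairs are absent.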

to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))
Return a pandas DataFrame with query_id, target_id and score columns
Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, '*' as the target id, and None as the score (which pandas will treat as an NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame's groupby() method to group results by query id, for example:
>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...            k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()
Parameters:
 columns (a list of three strings) – column names for the returned DataFrame
 empty (a 2-element tuple, or None) – the target id and score used for queries with no hits, or None to not include a row for that case
Returns: a pandas DataFrame
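The effect of the empty parameter can be sketched without pandas by building the table rows directly. This is an illustration of the behavior described above, not chemfp's code; the helper flatten_hits and the identifiers are hypothetical.

```python
def flatten_hits(query_ids, rows, empty=("*", None)):
    """Hypothetical helper: build (query_id, target_id, score) table rows."""
    table = []
    for query_id, hits in zip(query_ids, rows):
        if hits:
            # One table row per hit.
            table.extend((query_id, target_id, score) for target_id, score in hits)
        elif empty is not None:
            # Placeholder row for a query with no hits.
            empty_id, empty_score = empty
            table.append((query_id, empty_id, empty_score))
    return table

query_ids = ["Q1", "Q2"]
rows = [[("T7", 0.9), ("T3", 0.6)], []]

default_table = flatten_hits(query_ids, rows)
skipped_table = flatten_hits(query_ids, rows, empty=None)
custom_table = flatten_hits(query_ids, rows, empty=("no-hit", 0.0))
print(default_table)
print(skipped_table)
print(custom_table)
```

Passing empty=None is the simplest choice when later aggregation should only see real hits, since no placeholder rows have to be filtered out afterwards.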


class chemfp.highlevel.similarity.NxNSimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)
Bases: chemfp.highlevel.similarity.BaseSimsearch

count_all(min_score=None, max_score=None, interval='[]')
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count_all(). See SearchResults.count_all().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count

cumulative_score_all(min_score=None, max_score=None, interval='[]')
The sum of all scores in all rows which are between min_score and max_score
Shortcut for obj.result.cumulative_score_all(). See SearchResults.cumulative_score_all().
Using the default parameters this returns the sum of all of the scores in all of the results. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value

iter_ids()
For each search result (one per row), yield the list of target identifiers for its hits
Shortcut for obj.result.iter_ids(). See SearchResults.iter_ids().

iter_ids_and_scores()
For each search result (one per row), yield the list of (target id, score) tuples for its hits
Shortcut for obj.result.iter_ids_and_scores(). See SearchResults.iter_ids_and_scores().

iter_indices()
For each search result (one per row), yield the list of target indices for its hits
Shortcut for obj.result.iter_indices(). See SearchResults.iter_indices().

iter_indices_and_scores()
For each search result (one per row), yield the list of (target index, score) tuples for its hits
Shortcut for obj.result.iter_indices_and_scores(). See SearchResults.iter_indices_and_scores().

iter_scores()
For each search result (one per row), yield the list of target scores for its hits
Shortcut for obj.result.iter_scores(). See SearchResults.iter_scores().

query_ids

reorder_all(order='decreasing-score-plus')
Reorder the hits for all of the rows based on the requested order.
Shortcut for obj.result.reorder_all(). See SearchResults.reorder_all().
The available orderings are:
 increasing-score - sort by increasing score
 decreasing-score - sort by decreasing score
 increasing-score-plus - sort by increasing score, break ties by increasing index
 decreasing-score-plus - sort by decreasing score, break ties by increasing index
 increasing-index - sort by increasing target index
 decreasing-index - sort by decreasing target index
 move-closest-first - move the hit with the highest score to the first position
 reverse - reverse the current ordering
Parameters: order (string) – the name of the ordering to use

save(destination, format=None, compressed=True)
Save the SearchResults to the given destination
Shortcut for obj.result.save(). See SearchResults.save().
Currently only the "npz" format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResults results are stored in the same structure as a SciPy compressed sparse row ('csr') matrix, which means they can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a SearchResults.
Use chemfp.search.load_npz() to read the similarity search results back into a SearchResults instance.
Parameters:
 destination (a filename, binary file object, or None for stdout) – where to write the results
 format (None or 'npz') – the output format name (default: always 'npz')
 compressed – if True (the default), use zipfile compression

shape
The tuple (number of rows, number of columns)
Shortcut for obj.result.shape. See SearchResults.shape.
The number of columns is the size of the target arena.

to_csr(dtype=None)
Return the results as a SciPy compressed sparse row matrix.
Shortcut for obj.result.to_csr(). See SearchResults.to_csr().
The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm.
By default the scores are stored with the dtype of "float64". You may also use "float32", though mind the double rounding from int/int -> double -> float.
This method requires that SciPy (and NumPy) be installed.
Parameters: dtype (string or NumPy type, or None for float64) – a NumPy numeric data type (either "float64" or "float32")

to_pandas(*, columns=['query_id', 'target_id', 'score'], empty=('*', None))
Return a pandas DataFrame with query_id, target_id and score columns
Shortcut for obj.result.to_pandas(). See SearchResults.to_pandas().
Each query has zero or more hits. Each hit becomes a row in the output table, with the query id in the first column, the hit target id in the second, and the hit score in the third.
If a query has no hits then by default a row is added with the query id, '*' as the target id, and None as the score (which pandas will treat as an NA value).
Use empty to specify different behavior for queries with no hits. If empty is None then no row is added to the table. If empty is a 2-element tuple the first element is used as the target id and the second is used as the score.
Use the DataFrame's groupby() method to group results by query id, for example:
>>> import chemfp
>>> df = chemfp.simsearch(queries="queries.fps", targets="targets.fps",
...            k=10, threshold=0.4, progress=False).to_pandas()
>>> df.groupby("query_id").describe()
Parameters:
 columns (a list of three strings) – column names for the returned DataFrame
 empty (a 2-element tuple, or None) – the target id and score used for queries with no hits, or None to not include a row for that case
Returns: a pandas DataFrame


class chemfp.highlevel.similarity.SingleQuerySimsearch(*, num_queries, num_targets, k, threshold, alpha, beta, NxN, times, result, query_fp=None, queries=None, targets=None, queries_close=None, targets_close=None)
Bases: chemfp.highlevel.similarity.BaseSimsearch

as_buffer()
Return a Python buffer object for the underlying indices and scores.
Shortcut for obj.result.as_buffer(). See SearchResult.as_buffer().
This provides a byte-oriented view of the raw data. You probably want to use as_ctypes() or as_numpy_array() to get the indices and scores in a more structured form.
Warning
Do not attempt to access the buffer contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
Returns: a Python buffer object

as_ctypes()
Return a ctypes view of the underlying indices and scores
Shortcut for obj.result.as_ctypes(). See SearchResult.as_ctypes().
Each (index, score) pair is represented as a ctypes structure named Hit with fields index (c_int) and score (c_double).
For example, to get the score of the 5th entry use:
result.as_ctypes()[4].score
This method returns an array of type (Hit*len(search_result)). Modifications to this view will change chemfp's data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the ctypes array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
This method exists to make it easier to work with C extensions without going through NumPy. If you want to pass the search results to NumPy then use as_numpy_array() instead.
Returns: a ctypes array of type Hit*len(self)
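The layout described above can be reproduced with the standard ctypes module. The sketch below defines its own Hit structure with the two documented fields and fills an array with made-up values; it does not touch chemfp, so it only illustrates what an array of type Hit*n looks like and why writes through an element are visible in the array.

```python
import ctypes

class Hit(ctypes.Structure):
    # Stand-in for the structure described above: an int index, a double score.
    _fields_ = [("index", ctypes.c_int), ("score", ctypes.c_double)]

# Made-up (index, score) pairs standing in for a search result.
pairs = [(12, 1.0), (3, 0.9), (44, 0.8), (7, 0.7), (19, 0.6)]
hits = (Hit * len(pairs))(*(Hit(index, score) for index, score in pairs))

fifth_score = hits[4].score   # score of the 5th entry
print(fifth_score)

# An element obtained by indexing shares memory with the array, so a
# write through it changes the array's data as well.
element = hits[4]
element.score = 0.25
print(hits[4].score)
```

This shared-memory behavior is the reason for the warning: with a real search result the array refers to memory owned by chemfp, so it must not outlive the result.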

as_numpy_array()
Return a NumPy array view of the underlying indices and scores
Shortcut for obj.result.as_numpy_array(). See SearchResult.as_numpy_array().
The view uses a structured type with fields 'index' (i4) and 'score' (f8), mapped directly onto chemfp's own data structure. For example, to get the score of the 4th entry use:
result.as_numpy_array()["score"][3] or result.as_numpy_array()[3][1]
Modifications to this view will change chemfp's data values and vice versa. USE WITH CARE!
Warning
Do not attempt to access the NumPy array contents after the search result has been deallocated as that will likely cause a segmentation fault or other severe failure.
As a shorthand to get just the indices or just the scores, use get_indices_as_numpy_array() or get_scores_as_numpy_array().
Returns: a NumPy array with a structured data type

count(min_score=None, max_score=None, interval='[]')
Count the number of hits with a score between min_score and max_score
Shortcut for obj.result.count(). See SearchResult.count().
Using the default parameters this returns the number of hits in the result.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: an integer count

cumulative_score(min_score=None, max_score=None, interval='[]')
The sum of the scores which are between min_score and max_score
Shortcut for obj.result.cumulative_score(). See SearchResult.cumulative_score().
Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score.
The default min_score of None is equivalent to -infinity. The default max_score of None is equivalent to +infinity.
The interval parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported.
Parameters:
 min_score (a float, or None for -infinity) – the minimum score in the range.
 max_score (a float, or None for +infinity) – the maximum score in the range.
 interval (one of "[]", "()", "(]", "[)") – specify if the end points are open or closed.
Returns: a floating point value

format_ids_and_scores_as_bytes(ids=None, precision=4)
Format the ids and scores as the byte string needed for simsearch output
Shortcut for obj.result.format_ids_and_scores_as_bytes(). See SearchResult.format_ids_and_scores_as_bytes().
If there are no hits then the result is the empty byte string b"", otherwise it returns a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], …
If the ids are not specified then they come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings.
The precision sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive.
This function is 3-4x faster than the Python equivalent, which is roughly:
ids = ids if (ids is not None) else self.get_ids()
formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores()))
Parameters:
 ids (a list of Unicode strings, or None to use the default) – the identifiers to use for each hit.
 precision (an integer from 1 to 10, inclusive) – the precision to use for each score
Returns: a byte string
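A standalone version of the rough Python equivalent, written as a plain function so it can be run without chemfp. Two things here are assumptions rather than documented behavior: the ids are encoded as UTF-8 before formatting (bytes %-formatting in Python 3 does not accept str arguments), and the function name and sample identifiers are made up for illustration.

```python
def format_ids_and_scores(ids, scores, precision=4):
    """Illustrative stand-in: tab-separated id/score pairs as a byte string."""
    if not 1 <= precision <= 10:
        raise ValueError("precision must be between 1 and 10, inclusive")
    # For precision=4 this builds the bytes template b"%s\t%.4f".
    formatter = ("%s\t%." + str(precision) + "f").encode("ascii")
    return b"\t".join(
        # Assumption: ids are encoded as UTF-8 so they can be formatted as bytes.
        formatter % (target_id.encode("utf8"), score)
        for target_id, score in zip(ids, scores)
    )

line = format_ids_and_scores(["CHEMBL1", "CHEMBL2"], [1.0, 0.5])
print(line)
print(format_ids_and_scores([], []))
print(format_ids_and_scores(["CHEMBL1"], [0.5], precision=2))
```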

get_ids()
The list of target identifiers (if available), in the current ordering
Shortcut for obj.result.get_ids(). See SearchResult.get_ids().
Returns: a list of strings

get_ids_and_scores()
The list of (target identifier, target score) pairs, in the current ordering
Shortcut for obj.result.get_ids_and_scores(). See SearchResult.get_ids_and_scores().
Raises a TypeError if the target IDs are not available.
Returns: a Python list of 2-element tuples

get_indices()
The list of target indices, in the current ordering.
Shortcut for obj.result.get_indices(). See SearchResult.get_indices().
This returns a copy of the indices. See get_indices_as_numpy_array() to get a NumPy array view of the indices.
Returns: an array.array() of type 'i'

get_indices_and_scores()
The list of (target index, target score) pairs, in the current ordering
Shortcut for obj.result.get_indices_and_scores(). See SearchResult.get_indices_and_scores().
Returns: a Python list of 2-element tuples

get_indices_as_numpy_array()
Return a NumPy array view of the underlying indices.
Shortcut for obj.result.get_indices_as_numpy_array(). See SearchResult.get_indices_as_numpy_array().
This is a shortcut for self.as_numpy_array()["index"]. See that method's documentation for details and warnings.
Returns: a NumPy array of type 'i4'

get_scores()
The list of target scores, in the current ordering
Shortcut for obj.result.get_scores(). See SearchResult.get_scores().
This returns a copy of the scores. See get_scores_as_numpy_array() to get a NumPy array view of the scores.
Returns: an array.array() of type 'd'

get_scores_as_numpy_array()
Return a NumPy array view of the underlying scores.
Shortcut for obj.result.get_scores_as_numpy_array(). See SearchResult.get_scores_as_numpy_array().
This is a shortcut for self.as_numpy_array()["score"]. See that method's documentation for details and warnings.
Returns: a NumPy array of type 'f8'

iter_ids()
Iterate over target identifiers (if available), in the current ordering
Shortcut for obj.result.iter_ids(). See SearchResult.iter_ids().

max()
Return the value of the largest score
Shortcut for obj.result.max(). See SearchResult.max().
Returns 0.0 if there are no results.
Returns: a float

min()
Return the value of the smallest score
Shortcut for obj.result.min(). See SearchResult.min().
Returns 0.0 if there are no results.
Returns: a float

query_id
Return the corresponding query id, if available, else None
Shortcut for simsearch.result.query_id. See SearchResult.query_id.

reorder(ordering='decreasing-score-plus')
Reorder the hits based on the requested ordering.
Shortcut for obj.result.reorder(). See SearchResult.reorder().
The available orderings are:
 increasing-score - sort by increasing score
 decreasing-score - sort by decreasing score
 increasing-score-plus - sort by increasing score, break ties by increasing index
 decreasing-score-plus - sort by decreasing score, break ties by increasing index
 increasing-index - sort by increasing target index
 decreasing-index - sort by decreasing target index
 move-closest-first - move the hit with the highest score to the first position
 reverse - reverse the current ordering
Parameters: ordering (string) – the name of the ordering to use

save(destination, format=None, compressed=True)
Save the SearchResult to the given destination
Shortcut for obj.result.save(). See SearchResult.save().
Currently only the "npz" format is supported, which is a NumPy format containing multiple arrays, each stored as a file entry in a zipfile. The SearchResult is stored in the same structure as a SciPy compressed sparse row ('csr') matrix, which means it can be read with scipy.sparse.load_npz().
Chemfp also stores the query and target identifiers in the npz file, the chemfp search parameters like the number of bits in the fingerprint or the values for alpha and beta, and a value indicating the array contains a single SearchResult.
Use chemfp.search.load_npz() to read the similarity search result back into a SearchResult instance.
Parameters:
 destination (a filename, binary file object, or None for stdout) – where to write the results
 format (None or 'npz') – the output format name (default: always 'npz')
 compressed – if True (the default), use zipfile compression

to_pandas(*, columns=['target_id', 'score'])
Return a pandas DataFrame with the target ids and scores
Shortcut for obj.result.to_pandas(). See SearchResult.to_pandas().
The first column contains the ids, the second column contains the scores. The default column headers are "target_id" and "score". Use columns to specify different headers.
Parameters: columns (a list of two strings) – column names for the returned DataFrame
Returns: a pandas DataFrame
