chemfp.diversity module¶
This module contains interfaces to chemfp’s diversity selection algorithms.
Terminology¶
The selection algorithms uses different concepts of “dissimilar” to iteratively pick one or more dissimilar fingerprints from an arena containing candidate fingerprints.
The picked fingerprints are dissimilar to all other candidate fingerprints, and optionally also dissimilar to fingerprints in an arena of “reference” fingerprints.
This latter case may be used to select diverse fingerprints from a vendor catalog (“the candidates”) which are also dissimilar to an inhouse compound library (“the references”).
To create a given picker, use one of the get_*_picker
functions
or, alternatively one of the picker class’s from_
methods. Do not
call the class constructor directly.
Each picker implements a pick_n()
method, along with some variations,
to pick an additional n
items. They also implement several iter_*()
methods to iteratively get the next pick.
MaxMin picker¶
The MaxMinPicker implements the MaxMin algorithm[1][2]. This algorithm iteratively picks fingerprints from a set of candidates such that the newly picked fingerprint has the smallest Tanimoto similarity compared to any previously picked fingerprint, and optionally also the smallest Tanimoto similarity to the reference fingerprints.
The MaxMin diversity score for a given pick is the maximum Tanimoto score between that pick and all previous picks and the reference arena. If there is no reference arena then the diversity score of the first pick is 0.0.
HeapSweep picker¶
The HeapSweepPicker implements a sweepbased algorithm to pick fingerprints based on their maximum Tanimoto similarity to any other fingerprint in the arena, from least maximum similarity to most. This method uses a heap to track the current highestknown score for each fingerprints. Each sweep compares a fingerprint with the smallest score to all other fingerprints, while also updating the highestknown score for each other fingerprint.
The heapsweep algorithm is used to find the initial pick for the MaxMin picker if references fingerprints or an initial pick are not specified. This algorithm is significantly slower than MaxMin, and is mostly here to find all initial picks with same minimum maximum score. While it can be used to find the diversity score for all fingerprints, a k=1 NxN nearestneighbor search will be faster and can make use of multiple cores.
The heapsweep diversity score for a given pick is the maximum Tanimoto score between that pick and all other fingerprints in the arena.
The heapsweep algorithm appears to be novel to chemfp. It is strongly influenced by the “Sweep” family of algorithms. See the SumSweep paper [3] for a description of many of those heuristics.
Sphere exclusion picker¶
The SphereExclusionPicker implements the sphere exclusion algorithm[4] with optional ranking for directed sphere exclusion[5]. This method iteratively picks fingerprints from a set of candidates such that the fingerprint is not within a given threshold of similarity to any previously selected fingerprint.
By default it picks fingerprints with the smallest number of set bits. It can also be configured to pick a fingerprint, or to pick a fingerprint by the smallest associated rank (again, either by the smallest number of set bits or randomly).
The DISERanker class implements the Gobbi and Lee[5] ranking algorithm to generate ranks that can be passed to the SphereExclusionPicker.
[1] Ashton M., Barnard J., Casset F., Charlton M., Downs G., Gorse D., Holliday J., Lahana R., Willett P. (2002). Identification of diverse database subsets using propertybased and fragmentbased molecular descriptions. Quantitative StructureActivity Relationships 21 (6) 598604. https://doi.org/10.1002/qsar.200290002
[2] Sayle, R. (2017). Recent Improvements to the RDKit. https://github.com/rdkit/UGM_2017/blob/master/Presentations/Sayle_RDKitDiversity_Berlin17.pdf
[3] Borassi, M., Crescenzi, P., Habib, M., Kosters, W. A., Marino, A., and Takes, F. W. (2015). Fast diameter and radius BFSbased computation in (weakly connected) realworld graphs: With an application to the six degrees of separation games. Theoretical Computer Science 586 (2015) 59–80. http://dx.doi.org/10.1016/j.tcs.2015.02.033
[4] Hudson, B. D., Hyde, R. M., Rahr, E., Wood, J., Osman, J. (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure‐Activity Relationships, 15(4), 285289. https://doi.org/10.1002/qsar.19960150402
[5] Gobbi, A., Lee, M. L. (2003). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317323. https://doi.org/10.1021/ci025554v

class
chemfp.diversity.
MaxMinPicker
¶ Bases:
chemfp.diversity.BaseMaxMinPicker
An implementation of the MaxMin picker algorithm (Ashton, et al.)
The constructor must not be called directly. Instead, use one of:
 MaxMinPicker.from_candidates()
 MaxMinPicker.from_candidates_and_initial_pick()
 MaxMinPicker.from_candidates_and_references()
Once you have a picker, use pick_n() or pick_n_with_scores() to pick the the next n candidates, optionally also with its MaxMin diversity score.
Alternatively, use iter_indices(), iter_ids(), to pick the next candidate, yielding either the pick index or pick id; or use iter_indices_and_scores(), or iter_id_and_scores() to also include the MaxMin score.

static
from_candidates
()¶ Use MaxMin to pick diverse fingerprints from the candidate arena
The initial pick is determined by the heapsweep algorithm, which selects a fingerprint with the globally smallest maximum Tanimoto score to any other fingerprint. This may take a few seconds so use from_candidates_and_initial_pick if you know the initial pick.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tiebreaking.
The heapsweep and shuffle methods depend on a (shared) RNG, which requires an initial seed. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  randomize (True to shuffle, False to leave asis) – shuffle the candidates before picking?
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
Returns:  candidate_arena (a

static
from_candidates_and_initial_pick
()¶ Use MaxMin to pick diverse fingerprints from the candidate arena, starting with an initial pick
This method lets you specify the initial pick as an initial_pick index into the candidate arena.
There are several strategies for the initial MaxMin pick: use the “middle” fingerprint, use a randomly selected fingerprint, or, if heapsweep identifies that multiple fingerprints have the same smallest maximum Tanimoto score, then try each of those as starting point.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tiebreaking.
Shuffling depends on a RNG, which requires an initial seed. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  initial_pick (an integer) – the index of the initial pick, which must be a nonempty fingerprint
 randomize (True to shuffle, False to leave asis) – shuffle the candidates before picking?
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
Returns:  candidate_arena (a

static
from_candidates_and_references
()¶ Use MaxMin to pick diverse fingerprints from the candidate arena, which are also diverse from the reference arena
The fingerprints in candidate_arena are ranked according to their most similar fingerprint in reference_arena. A fingerprint with the the smallest maximum score is used as the initial pick when applying the MaxMin algorithm to the remaining fingerprint in the candidate_arena.
If randomize is True (the default), the candidates are shuffled before the MaxMin algorithm starts. Shuffling gives a sense of how MaxMin is affected by arbitrary tiebreaking.
Shuffling depends on a RNG, which requires an initial seed. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  reference_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing reference fingerprints  randomize (True to shuffle, False to leave asis) – shuffle the candidates before picking?
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
Returns:  candidate_arena (a

pick_n
()¶ Pick up to n candidates with a maximum similarity of max_similarity to any previous pick
The picks are appended to the MaxMinPicker’s picks data and the pick information (picked candidate fingerprint indices and corresponding ids) is returned in a Picks instance.
Use pick_n_with_scores() if you also need the maximum similarity score.
n may zero, in which case an empty Picks instance is returned.
Parameters:  n (an integer) – the maximum number of remaining candidates to pick
 max_similarity (a float) – the maximum allowed pick similarity
Returns: a
chemfp.diversity.Picks

pick_n_with_scores
()¶ Pick up to n candidates with a maximum similarity of max_similarity to any previous pick
The picks are appended to the MaxMinPicker’s picks data and the pick information (picked candidate fingerprint indices, maximum score, and corresponding ids, an) is returned in a PicksAndScores instance.
Use pick_n() if you do not need the maximum similarity score.
n may zero, in which case an empty PicksAndScores instance is returned. This may be useful in combination with the result parameter to accumulate successive picks.
If result is a PicksAndScores instance returned from a previous pick_n_with_scores() call then the pick information will be stored in that instances instead of creating a new one.
Parameters:  n (an integer) – the maximum number of remaining candidates to pick
 max_similarity (a chemfp.diversity.PicksAndScores) – the maximum allowed pick similarity
 result – store picks in the given object instead of creating a new result object
Returns: a
chemfp.diversity.PicksAndScores

class
chemfp.diversity.
SphereExclusionPicker
¶ Bases:
object
An implementation of the sphere picker algorithm, optionally directed
The constructor must not be called directly. Instead, use one of:
 SphereExclusionPicker.from_candidates()
 SphereExclusionPicker.from_candidates_and_initial_pick()
 SphereExclusionPicker.from_candidates_and_initial_picks()
 SphereExclusionPicker.from_candidates_and_references()
Once you have a picker, use pick_n(), pick_n_with_counts() or pick_n_with_neighbors() to pick the the next n candidates, optionally also with the number of fingerprints within its sphere, or with the information about those fingerprints stored in a Neighbors object.
Alternatively, use iter_indices(), iter_ids(), to pick the next candidate, yielding either the pick index or pick id; or use iter_indices_and_counts() or iter_ids_and_counts() to also include the counts; or use iter_indices_and_neighbors() or iter_ids_and_neighbors() to also include the Neighbors for each sphere.

candidates
¶ Get access to the remaining candidates as a
chemfp.diversity.SphereExclusionCandidates
NOTE: This is not part of the public API.

static
from_candidates
()¶ Use sphere exclusion to pick diverse fingerprints from the candidate arena
Each new pick from candidate_arena will be less than threshold similar to any previous pick. The effective sphere radius = 1  threshold
By default randomize = None because the appropriate default value depends on if ranks is specified. If ranks is None the randomize = None is interpreted as randomize = True. If ranks is not None then randomize is interpreted as False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct, and takes O(number of duplicates) time if there are duplicates, which may take quadratic time if there are only a few distinct ranks.
The random methods require an initial seed for the RNG. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
 ranks (None, or an array of unsigned 32bit integers) – rank values for each candidate (optional)
Returns:  candidate_arena (a

static
from_candidates_and_initial_pick
()¶ Use sphere exclusion to pick diverse fingerprints from the candidate arena, starting with an intial pick
This is a shortcut for:
from_candidates_and_initial_picks(candidate_arena, [initial_pick], …)See SphereExclusionPicker.from_candidates_and_initial_picks for full details.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  candidate_indices (a list or array of integers) – the initial picks, as indicies into the candidate arena
 threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
 ranks (None, or an array of unsigned 32bit integers) – rank values for each candidate (optional)
Returns:  candidate_arena (a

static
from_candidates_and_initial_picks
()¶ Use sphere exclusion to pick diverse fingerprints from the candidate arena, starting with an intial pick list
Each new pick from candidate_arena will be less than threshold similar to any previous pick. The effective sphere radius = 1  threshold
Use initial_picks to specify the initial picks. If a specified candidate index was picked by an ealier candidate index then pick will still occur but the new candidate index will not be included in the count nor the neighbors.
By default randomize = None because the appropriate default value depends on if ranks is specified. If ranks is None the randomize = None is interpreted as randomize = True. If ranks is not None then randomize is interpreted as False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct, and takes O(number of duplicates) time if there are duplicates, which may take quadratic time if there are only a few distinct ranks.
The random methods require an initial seed for the RNG. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  initial_picks (a list or array of integers) – the initial picks, as indicies into the candidate arena (duplicates are ignored)
 threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
 ranks (None, or an array of unsigned 32bit integers) – rank values for each candidate (optional)
Returns:  candidate_arena (a

static
from_candidates_and_references
()¶ Use sphere exclusion to pick diverse fingerprints from the candidate arena which are also diverse from the reference arena
Each new pick from candidate_arena will be less than threshold similar any previous pick and any fingerprint in reference_arena. The effective sphere radius = 1  threshold.
By default randomize = None because the appropriate default value depends on if ranks is specified. If ranks is None the randomize = None is interpreted as randomize = True. If ranks is not None then randomize is interpreted as False.
The default method (with ranks = None and randomize = None or randomize = True) picks the next fingerprint at random from the remaining candidates. This is undirected sphere picking.
If ranks = None and randomize = False then the next pick is the available candidate with the smallest index in the arena. Since the candidate arena is ordered by popcount, this directs sphere picking to select fingerprints with the smallest number of on bits. (In practice this does not seem that useful.)
If ranks is specified then it must be an array of unsigned integers, with one rank value for each fingerprint. The ranks are used for directed sphere exclusion; a candidate with a lower rank is chosen before one with a higher rank.
If ranks is not None and randomize = None or randomize = False then the next pick is the fingerprint with the lowest rank, with ties broken by the smallest index in the candidate arena.
If ranks is not None and randomize = True then the next pick is chosen at random from all of the fingerprints with the same lowest rank. The current implementation assumes ranks are nearly all distinct, and takes O(number of duplicates) time if there are duplicates, which may take quadratic time if there are only a few distinct ranks.
The random methods require an initial seed for the RNG. If seed is 1 (the default) then use Python’s own RNG to generate the initial seed, otherwise use the value as the seed.
Parameters:  candidate_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing the candidate fingerprints to pick from  reference_arena (a
chemfp.arena.FingerprintArena
with popcount indices and at least one nonempty fingerprint) – an arena containing reference fingerprints  threshold (a double between 0.0 and 1.0, inclusive) – the Tanimoto similarity threshold used to identify sphere exclusion
 randomize (True for random selection, False for deterministic) – select the next candidate at random from the possible candidates
 seed (a value between 0 and 2**641, or 1) – initial RNG seed, or 1 (the default) to seed from Python’s RNG
 ranks (None, or an array of unsigned 32bit integers) – rank values for each candidate (optional)
Returns:  candidate_arena (a

iter_ids
()¶ Iteratively make a pick, yielding the candidate id each time

iter_ids_and_counts
()¶ Iteratively make a pick, yielding (candidate id, sphere membership count) each time

iter_ids_and_neighbors
()¶ Iteratively make a pick, yielding (candidate id, sphere neigbhors) each time
The neighbors are a Neighbors instance describing the unpicked fingerprints within the given sphere.

iter_indices
()¶ Iteratively make a pick, yielding the candidate index each time

iter_indices_and_counts
()¶ Iteratively make a pick, yielding (candidate index, sphere membership count) each time

iter_indices_and_neighbors
()¶ Iteratively make a pick, yielding (candidate index, sphere neigbhors) each time
The neighbors are a Neighbors instance describing the unpicked fingerprints within the given sphere.

pick_n
()¶ Pick up to n candidate fingerprints
Parameters: n (an nonnegative integer) – the number of candidates to pick Returns: a chemfp.diversity.Picks

pick_n_with_counts
()¶ Pick up to n candidate fingerprints, and the number of fingerprints in its sphere
The count includes the candidate fingerprint.
Parameters: n (an nonnegative integer) – the number of candidates to pick Returns: a chemfp.diversity.PicksAndCounts

pick_n_with_neighbors
()¶ Pick up to n candidate fingerprints, and the neighbor fingerprints in its sphere
The fingerprints in the sphere will include the candidate fingerprint unless it was an initial pick and found in an earlier initial pick.
Parameters: n (an nonnegative integer) – the number of candidates to pick Returns: a chemfp.diversity.PicksAndNeighbors

picks
¶ Get access to all of the picks so far (including initial picks) as a
chemfp.diversity.Picks

threshold
¶ Return the specified threshold value

class
chemfp.diversity.
DISERanker
¶ Bases:
object
Generate a fingerprint ranking based on the DISE algorithm
In directed sphere exclusion, the next pick is the candidate fingerprint closest to a given set of reference fingerprints.
This class can be used to generate values passed to SphereExclusionPicker’s ranks parameter.
The class variable DISE_SMILES_LIST contains the SMILES strings for the three reference compounds used in the DISE paper by Gobbi and Lee.

DISE_SMILES_LIST
= ['CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C.O=C(O)CC(O)(C(O)=O)CC(O)=O', 'O=C(OC)\\C3=C(\\N\\C(=C(\\C(=O)OC(C)C)C3c1cccc2nonc12)C)C', 'O=C(OCC)[C@@H](N[C@@H]2C(=O)N(c1ccccc1CC2)CC(=O)O)CCc3ccccc3']¶

from_dise_paper
¶ Use the structures from the DISE paper to create a DISERanker for a given fingerprint type
The structures are the SMILES strings in DISE_SMILES_LIST.
Parameters:  fptype (a string or a
chemfp.types.FingerprintType
) – the fingerprint type used to process the SMILES strings  reader_args (None, or a dictionary) – optional reader arguments for SMILES processing
Returns:  fptype (a string or a

from_fingerprints
¶ Use a list of fingerprints to create a DISERanker
This is a shorthand for:
arena = load_fingerprints(fingerprints, metadata=metadata, reorder=False) return DISERanker(arena)See func:chemfp.load_fingerprints for full details.
Parameters:  fingerprints – the fingerprints to use
 metadata (a
chemfp.Metata
) – the metadata used if fingerprints is an (id, fp) iterator
Returns:

from_smiles_list
¶ Use a list of SMILES string to create a DISERanker for a given fingerprint type
Parameters:  fptype (a string or a
chemfp.types.FingerprintType
) – the fingerprint type used to process the SMILES strings  smiles_list (a list of strings) – the list of SMILES strings
 reader_args (None, or a dictionary) – optional reader arguments for SMILES processing
Returns:  fptype (a string or a

rank_arena
¶ Return an array of ranks, one for each fingerprint.
The algorithm starts by ranking each arena fingerprint to the first reference fingerprint. Fingerprints with a low rank value are more similar to the reference fingerprint than fingerprints with a high rank value.
Ties are broken by similarity to each successive reference fingerprint (in self.dise_arena).
If rng is None then any final ties are left asis, otherwise ties are broken by the passedin rng using the rng.shuffle() method.
If rng is an integer then use Python’s
random.Random(rng)
to create the rng.Parameters:  arena (a
chemfp.arena.FingerprintArena
) – a fingerprint arena  rng (None, an integer, or an object with a shuffle method.) – an RNG used to break any final ties
Returns: an array.array of ranks, one for each arena fingerprint
 arena (a
