chemfp.highlevel.clustering module

This module should not be imported directly.

It contains internal implementation details of the high-level API available from the top-level chemfp module.

This module is included in the documentation because parts of this module are returned to the user, and are part of the public API.

class chemfp.highlevel.clustering.ButinaClusters(arena, matrix, seed, NxN_threshold, butina_threshold, tiebreaker, false_singletons, num_butina_clusters, rescore, picker, result, times, _arena_close, fingerprints_filename, matrix_filename)

Bases: object

The results of chemfp.butina() with query details, search results, and timing information.

The available properties are:

  • arena - the fingerprint arena, based on the input fingerprints
  • matrix - the NxN similarity matrix, based on the input matrix
  • seed - the seed for the RNG
  • NxN_threshold - the NxN similarity threshold
  • butina_threshold - the Butina algorithm minimum similarity threshold
  • tiebreaker - the specified tiebreaker method
  • false_singletons - the specified method for handling false singletons
  • num_butina_clusters - the specified maximum number of clusters, or None
  • rescore - the flag value used to request that reassigned fingerprints be re-scored
  • picker - the underlying Butina picker object
  • result - the underlying Butina clustering results
  • times - a breakdown of the times for the search, as a dictionary mapping each task to its elapsed time in seconds, or None if the task wasn’t relevant. “load_arena” and “load_matrix” are the times needed to load the arena and the matrix, respectively, with “load” as the total load time. “NxN” is the time needed to compute the NxN matrix. “cluster”, “prune”, and “rescore” are the times for the clustering, pruning, and rescoring steps. “total” is the total time for the butina call.
  • fingerprints_filename - the value of fingerprints, if it is a string
  • matrix_filename - the value of matrix, if it is a string
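As an illustration of the times dictionary described above, the following sketch formats such a mapping for display. Only the key names come from the description; the sample values and the helper name format_times are made up for this example and are not part of chemfp. It also assumes that a task which did not apply is stored with the value None.

```python
# Hypothetical sample shaped like the `times` property; the numbers are invented.
sample_times = {
    "load_arena": 0.12,
    "load_matrix": None,  # assumed: None marks a task that was not relevant
    "load": 0.12,
    "NxN": 3.40,
    "cluster": 0.25,
    "prune": None,
    "rescore": None,
    "total": 3.77,
}


def format_times(times):
    """Return one 'task: seconds' line for each task that was actually timed."""
    lines = []
    for task, elapsed in times.items():
        if elapsed is None:
            continue  # skip tasks that did not apply to this run
        lines.append(f"{task}: {elapsed:.2f} s")
    return "\n".join(lines)


report = format_times(sample_times)
print(report)
```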

The full list of Butina clusters, ordered by cluster index.

The clusters are ordered by cluster index and may include empty clusters, due to moving false singletons or pruning the number of clusters.


Return the assignments as a ctypes array


Return the assignments as a NumPy array


Return the assignments as a ButinaAssignments


Release any assigned resources, like a memory-mapped FPB arena


The final list of clusters.

This list is ordered from largest to smallest.


Return a human-readable description of the Butina clustering


Return a dictionary containing entries for output metadata lines


Return a human-readable break-down of the Butina compute times


Get the ‘type’ string describing the Butina search parameters

save(destination=None, *, format=None, renumber=True, rename=True, include_members=True, metadata=None, include_metadata=True, precision=None)

Save the clusters to destination in one of several formats.

The supported formats are “centroid”, “flat”, “csv”, and “tsv”. If format is not specified, the format is inferred from the destination filename extension. If the extension is not known, “centroid” is used.
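The fallback rule above can be sketched as follows. This is a minimal illustration and not chemfp's implementation: the helper name infer_format and the extension table are assumptions, and the real mapping (for example, how compressed filenames are handled) may differ.

```python
import os

# Assumed extension-to-format table; the mapping chemfp actually uses may differ.
_ASSUMED_EXTENSIONS = {
    ".centroid": "centroid",
    ".flat": "flat",
    ".csv": "csv",
    ".tsv": "tsv",
}


def infer_format(destination, format=None):
    """Pick an output format name, falling back to 'centroid'."""
    if format is not None:
        return format  # an explicit format always wins
    if not isinstance(destination, str):
        return "centroid"  # no filename to inspect (e.g. None)
    ext = os.path.splitext(destination)[1].lower()
    return _ASSUMED_EXTENSIONS.get(ext, "centroid")


print(infer_format("clusters.csv"))
print(infer_format("clusters.txt"))
```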

If renumber is True (the default) then the clusters are renumbered sequentially starting from 1. If False then use the internal cluster index, which starts from 0 and skips empty clusters.

If rename is True (the default) then rename the member types to either “CENTER” or “MEMBER”. If False, use the internal type names.

If include_members is True (the default) then include cluster members in the output.

If metadata is not None then it must be a dictionary used for the metadata lines. The keys and values must be encoded appropriately: neither may contain a tab, NUL, or newline character, and a key must not contain an equals sign.
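A caller may want to check a metadata dictionary against these rules before saving. The sketch below is one way to do that. The function check_metadata is hypothetical and not part of chemfp, and it assumes that both keys and values are strings.

```python
# Characters that the description above says must not appear in keys or values.
_FORBIDDEN = ("\t", "\0", "\n")


def check_metadata(metadata):
    """Raise ValueError if any key or value breaks the stated encoding rules."""
    for key, value in metadata.items():
        for text, label in ((key, "key"), (value, "value")):
            for ch in _FORBIDDEN:
                if ch in text:
                    raise ValueError(f"metadata {label} {text!r} contains {ch!r}")
        if "=" in key:
            raise ValueError(f"metadata key {key!r} must not contain '='")


# A dictionary that satisfies the rules passes silently.
check_metadata({"software": "example/1.0"})
```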

If include_metadata is True (the default) then include metadata information in the output file.

If precision is None then use the minimum number of decimal places needed to distinguish between two scores. This value depends on the number of bits in the fingerprint. Otherwise it must be an integer between 1 and 10, inclusive.

to_pandas(*, columns=['cluster', 'id', 'type', 'score'], rename=True, renumber=True, sort=True)

Return the assignments as a pandas DataFrame

The DataFrame contains four columns, with one row for each input fingerprint:

  • cluster is the cluster index
  • id is the identifier from the input matrix
  • type is a string like “CENTER” or “MEMBER”
  • score is the Tanimoto score

Use columns to specify different column labels.

By default the assignment types are relabeled to use only “CENTER” and “MEMBER”. If rename is False then the full internal labels are used.

By default the cluster indices are renumbered to the contiguous values 1..N where N is the number of clusters. If renumber is False then the internal cluster indices are used, which start from 0 and may skip indices for empty clusters whose elements were moved to other clusters.

  • columns (a list of four strings) – column names for the returned DataFrame
  • rename (bool) – if False use the internal type names rather than using only “CENTER” and “MEMBER”
  • renumber (bool) – if False use the internal cluster ids

a pandas DataFrame
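The effect of rename and renumber can be illustrated without pandas. The sketch below is not chemfp code: the function relabel, the sample rows, and the internal label “MOVED_FALSE_SINGLETON” are hypothetical. It assumes that renumbering follows ascending internal index order and that every label other than “CENTER” collapses to “MEMBER”; the library's actual rules may differ.

```python
def relabel(rows, rename=True, renumber=True):
    """rows: list of (internal_cluster_index, id, internal_type, score) tuples."""
    # Map each internal index that actually occurs to 1..N, in ascending order,
    # so gaps left by empty clusters disappear.
    used = sorted({cluster for cluster, _, _, _ in rows})
    new_index = {old: new for new, old in enumerate(used, start=1)}

    out = []
    for cluster, id_, type_, score in rows:
        if renumber:
            cluster = new_index[cluster]
        if rename and type_ != "CENTER":
            type_ = "MEMBER"  # assumed: collapse every non-center label
        out.append((cluster, id_, type_, score))
    return out


# Invented data: internal cluster 1 is empty, so the internal indices skip it.
rows = [
    (0, "A", "CENTER", 1.0),
    (0, "B", "MEMBER", 0.8),
    (2, "C", "CENTER", 1.0),
    (2, "D", "MOVED_FALSE_SINGLETON", 0.55),  # hypothetical internal label
]

for row in relabel(rows):
    print(row)
```

With the defaults, the two non-empty clusters are numbered 1 and 2 and the last row is reported as “MEMBER”. Passing rename=False and renumber=False returns the rows unchanged.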