.. _chemfp_butina:

chemfp butina command-line options
==========================================================

The following comes from ``chemfp butina --help``:

.. code-block:: none

    Usage: chemfp butina [OPTIONS] FILENAME

      Cluster using the Butina/leader-follower algorithm.

      If FILENAME is not specified then read the fingerprints or similarity
      matrix from stdin.

    Options:
      --in [fps|fps.gz|fps.zst|fpb|flush|npz]
                                      Specify the input file format, either
                                      fingerprint or similarity matrix (default
                                      is based on filename extension, or 'fps')
      --matrix FILE
      --matrix-format [npz]           File format for --matrix (only 'npz' is
                                      supported)
      -o, --output PATH               Output filename
      --out TEXT                      Output format. Must be one of 'centroid'
                                      (the default), 'csv', 'tsv', or 'flat',
                                      with optional compression
      -j, --num-threads INT           The number of threads to use. -1 means all
                                      available cores. This option overrides
                                      $OMP_NUM_THREADS. (default: -1)
      --precision [1|2|3|4|5|6|7|8|9|10]
                                      Number of digits in Tanimoto score
                                      (default: based on the fingerprint size)
      --progress / --no-progress      Show a progress bar (default: show unless
                                      the output is a terminal)
      -t, --NxN-threshold, --threshold FLOAT
                                      Threshold when generating the NxN
                                      similarity matrix from fingerprints
                                      (default: 0.7)
      --seed N                        Specify the random number generator seed
                                      between 0 and 2**64-1, inclusive, or use
                                      -1 to have one picked at random (default:
                                      -1)
      --include-members / --no-members
                                      The default writes all cluster members.
                                      With --no-members only write the cluster
                                      centers.
      --rescore / --no-rescore        Rescore moved false singletons and merged
                                      fingerprints to their new cluster center
      --renumber / --no-renumber      By default, use sequential cluster ids
                                      starting from 1. With --no-renumber use
                                      the internal cluster ids.
      --rename / --no-rename          Use --no-rename to use the internal member
                                      type names instead of renaming them to use
                                      only 'CENTER' and 'MEMBER'
      --include-metadata / --no-metadata
                                      With --no-metadata, do not include header
                                      metadata in 'chemfp' and 'flat' output
                                      formats.
      --times / --no-times            Write timing information to stderr
      -d, --debug                     Print debug information to stderr. Use
                                      twice for more debug output.
      --help                          Show this message and exit.

    NxN matrix options (for fingerprint input):
      --save-matrix, --save FILE      If specified, save the intermediate NxN
                                      matrix to the named file
      --save-format [npz]             File format for --save-matrix (only 'npz'
                                      is supported)

    Butina clustering options:
      --tiebreaker [randomize|first|last]
                                      When multiple candidates have the same
                                      number of neighbors, 'randomize' picks the
                                      next cluster center at random while
                                      'first' and 'last' pick the next candidate
                                      in increasing or decreasing index order.
      -n, --num-clusters N            After clustering, merge the members of the
                                      smallest cluster into other clusters until
                                      there are only N clusters  [x>=1]
      --butina-threshold FLOAT        Minimum Butina cluster threshold (default:
                                      0.0, uses the threshold from the
                                      similarity matrix)
      --false-singletons, --fs [keep|follow-neighbor|nearest-center]
                                      If 'follow-neighbor' (the default) move
                                      each false singleton to the cluster of its
                                      nearest neighbor. If 'nearest-center' move
                                      it to the closest center (requires
                                      fingerprints). If 'keep' leave it as a
                                      singleton group.

      This program implements several variations of the Butina clustering method
      described in Darko Butina's "Unsupervised Data Base Clustering Based on
      Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
      Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci. 1999, 39,
      747-750.

      The general approach is:

      1) Generate an NxN Tanimoto similarity matrix for a given threshold

      2) Sort rows by the number of neighbors in each row, from the most
      neighbors to the fewest.
      By default chemfp will randomize the order of the rows for a given number
      of neighbors. This means that re-running Butina clustering will almost
      certainly give different results. Either specify the initial RNG seed using
      `--seed` or use a tiebreaker of 'first' or 'last' to use a fixed sort order.

      3) Apply sphere exclusion, in sorted order, to the sorted rows. The first
      row is the center of the first cluster, and its neighbors are members of
      the first cluster.

      4) Repeat the process until done. A fingerprint can only be assigned to a
      single cluster, and will not be used to create a new cluster center, nor be
      added to another cluster, even if it is sufficiently similar.

      = False Singletons =

      This process can lead to "false singletons" when a fingerprint forms a new
      cluster center but all of its neighbors are already assigned to another
      cluster. The chemfp butina implementation offers three possibilities for
      handling false singletons:

      * 'keep' - leave the false singleton in its own cluster.

      * 'follow-neighbor' - move the false singleton to the same cluster as its
      first nearest neighbor.

      * 'nearest-center' - move the false singleton to the nearest cluster
      center.

      Note: there may be multiple neighbors with the same similarity as the
      nearest neighbor. Chemfp currently always arbitrarily uses the first
      nearest neighbor. A future version may support choosing the neighbor at
      random from all equally-similar neighbors.

      Note: there may be multiple cluster centers which are equally similar to a
      false singleton. Chemfp currently always arbitrarily uses one of these
      nearest centers. A future version may support choosing the center at random
      from all equally-similar centers.

      = Pruning =

      If --num-clusters / -n is specified, and is smaller than the number of
      identified clusters, then chemfp will use a post-processing step to reduce
      the number of clusters. The clusters are ordered by size, from smallest to
      largest.
      The smallest cluster is selected, with ties broken by selecting the first
      created cluster. Each member is processed (from last to first) to find a
      nearest neighbor in another cluster, with the member then added to that
      cluster before processing the next member. It is possible that a
      fingerprint may be reassigned multiple times during the pruning process.

      = Fingerprints and/or npz similarity matrix =

      The "chemfp butina" command accepts a fingerprint dataset, an npz
      similarity matrix, or both.

      When given a fingerprint dataset, it generates a sparse NxN Tanimoto
      similarity matrix with the similarity threshold given by --NxN-threshold /
      --threshold / -t. Use --save-matrix to save the matrix to an npz file.

      When given a similarity matrix, it carries out the Butina clustering on the
      matrix but operations which require fingerprints, like pruning and the
      "nearest-center" method for false singleton assignment, are not supported.

      The default --butina-threshold of 0.0 means all neighbors in the matrix
      will be used. Matrix values smaller than the Butina threshold are ignored,
      which is useful for parameter tuning as a matrix can be generated once at a
      lower threshold then re-used at higher Butina thresholds.

      When given both a fingerprint data set and a sparse matrix using --matrix,
      the NxN matrix is used for the Butina clustering, and the methods which
      require fingerprints are also supported.

      = Output formats =

      By default the clusters are written in "centroid" format to stdout. The
      format writes one line per cluster, along with a cluster member count and
      optionally including the member ids and scores.

      Use "--out" to specify alternate formats. The "flat" format is a
      tab-delimited description of the fingerprint members, one member per line
      in fingerprint order. The "csv" and "tsv" formats are similar, but include
      the cluster size for each row, and are in cluster order.
      If "--out" is not specified then the format is based on the --output / -o
      filename, or "centroid" if the format cannot be determined from the
      filename.

      = Examples =

      1) Cluster fingerprints at a threshold of 0.4 (could also use '-t' or
      '--threshold'):

        chemfp butina input.fps --NxN-threshold 0.4

      2) Cluster fingerprints at a threshold of 0.4, keep false singletons as
      false singletons, write the output in 'flat' format, and use the full
      internal names, to see which centers are false singletons:

        chemfp butina input.fps -t 0.4 --false-singletons keep --no-rename --out flat

      3) Cluster fingerprints at a threshold of 0.45, move false singletons to
      the nearest cluster center, reduce the number of clusters to 20, and write
      the output in 'tsv' format:

        chemfp butina benzodiazepines.fps --threshold 0.45 \
            --fs nearest-center --num-clusters 20 --out tsv

      4) Cluster fingerprints at a threshold of 0.6, save the intermediate NxN
      matrix to the file 'chembl_33_60.npz', and write the Butina clusters to
      'chembl_33_60.centroids':

        chemfp butina chembl_33.fpb -t 0.6 --save-matrix chembl_33_60.npz \
            -o chembl_33_60.centroids

      5) Use the saved matrix as the input to Butina clustering, with a Butina
      threshold of 0.7. Save the results to 'chembl_33_70.centroids':

        chemfp butina chembl_33_60.npz --butina-threshold 0.7 \
            -o chembl_33_70.centroids

      6) Reduce the number of identified clusters (at 0.6 threshold) from ~330K
      to 250K using the pre-computed NxN similarity matrix for the clustering,
      and fingerprint searches to merge clusters:

        chemfp butina chembl_33.fpb --matrix chembl_33_60.npz \
            --num-clusters 250000 -o chembl_33_70_pruned.centroids
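
The sorting, sphere-exclusion, and 'follow-neighbor' steps described in the
help text can be illustrated with a small pure-Python sketch. It is not
chemfp's implementation and does not use the chemfp API. The function names
(``butina_sketch``, ``false_singletons``, ``follow_neighbor``) and the layout
of the ``neighbors`` table are hypothetical, chosen only for illustration, and
the sketch assumes the table is symmetric and already thresholded.

.. code-block:: python

    import random

    # Hypothetical thresholded neighbor table: index -> [(neighbor, score), ...].
    # It must be symmetric: if j is listed under i, then i is listed under j.
    EXAMPLE = {
        0: [(1, 0.9), (2, 0.8), (3, 0.8)],
        1: [(0, 0.9)],
        2: [(0, 0.8)],
        3: [(0, 0.8), (4, 0.75)],
        4: [(3, 0.75)],
        5: [],
    }


    def butina_sketch(neighbors, tiebreaker="first", seed=0):
        """Return [(center, members), ...] using sphere exclusion."""
        rng = random.Random(seed)
        indices = list(neighbors)

        # The secondary key orders rows that have the same neighbor count.
        if tiebreaker == "first":
            secondary = {i: i for i in indices}
        elif tiebreaker == "last":
            secondary = {i: -i for i in indices}
        elif tiebreaker == "randomize":
            secondary = {i: rng.random() for i in indices}
        else:
            raise ValueError(f"unknown tiebreaker: {tiebreaker!r}")

        # Rows with the most neighbors come first.
        order = sorted(indices, key=lambda i: (-len(neighbors[i]), secondary[i]))

        assigned = set()
        clusters = []
        for center in order:
            if center in assigned:
                continue  # already a member of an earlier cluster
            assigned.add(center)
            members = [j for j, _score in neighbors[center] if j not in assigned]
            assigned.update(members)
            clusters.append((center, members))
        return clusters


    def false_singletons(clusters, neighbors):
        """Centers that have neighbors, all of which were claimed earlier."""
        return [c for c, members in clusters if not members and neighbors[c]]


    def follow_neighbor(clusters, neighbors):
        """Move each false singleton into its first nearest neighbor's cluster."""
        cluster_of = {}
        for center, members in clusters:
            cluster_of[center] = center
            for member in members:
                cluster_of[member] = center
        merged = {center: list(members) for center, members in clusters}
        for fs in false_singletons(clusters, neighbors):
            # max() returns the first of several equally scored neighbors.
            nearest, _score = max(neighbors[fs], key=lambda pair: pair[1])
            del merged[fs]
            merged[cluster_of[nearest]].append(fs)
        return list(merged.items())


    clusters = butina_sketch(EXAMPLE)
    # clusters == [(0, [1, 2, 3]), (4, []), (5, [])]
    # Index 4 is a false singleton (its only neighbor, 3, was already taken);
    # index 5 is a true singleton because it has no neighbors at all.
    merged = follow_neighbor(clusters, EXAMPLE)
    # merged == [(0, [1, 2, 3, 4]), (5, [])]

With a symmetric table, every neighbor of a false singleton is a non-center
member of some earlier cluster, which is why the sketch can look up the target
cluster directly. The 'nearest-center' variant and pruning are omitted because
they need fingerprint comparisons that a thresholded table alone cannot
provide.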