.. _chemfp_butina:

chemfp butina command-line options
==========================================================

The following comes from ``chemfp butina --help``:

.. code-block:: none
                
  Usage: chemfp butina [OPTIONS] FILENAME
  
    Cluster using the Butina/leader-follower algorithm.
  
    If FILENAME is not specified then read the fingerprints or similarity
    matrix from stdin.
  
  Options:
    --in [fps|fps.gz|fps.zst|fpb|flush|npz]
                                    Specify the input file format, either
                                    fingerprint or similarity matrix (default
                                    is based on filename extension, or 'fps')
    --matrix FILE
    --matrix-format [npz]           File format for --matrix (only 'npz' is
                                    supported)
    -o, --output PATH               Output filename
    --out TEXT                      Output format. Must be one of 'centroid'
                                    (the default), 'csv', 'tsv', or 'flat', with
                                    optional compression
    -j, --num-threads INT           The number of threads to use. -1 means all
                                    available cores. This option overrides
                                    $OMP_NUM_THREADS. (default: -1)
    --precision [1|2|3|4|5|6|7|8|9|10]
                                    Number of digits in Tanimoto score (default:
                                    based on the fingerprint size)
    --progress / --no-progress      Show a progress bar (default: show unless
                                    the output is a terminal)
    -t, --NxN-threshold, --threshold FLOAT
                                    Threshold when generating the NxN similarity
                                    matrix from fingerprints (default: 0.7)
    --seed N                        Specify the random number generator seed
                                    between 0 and 2**64-1, inclusive, or use -1
                                    to have one picked at random (default: -1)
    --include-members / --no-members
                                    The default writes all cluster members. With
                                    --no-members only write the cluster centers.
    --rescore / --no-rescore        Rescore moved false singletons and merged
                                    fingerprints to their new cluster center
    --renumber / --no-renumber      By default, use sequential cluster ids
                                    starting from 1. With --no-renumber use the
                                    internal cluster ids.
    --rename / --no-rename          Use --no-rename to use the internal member
                                    type names instead of renaming them to use
                                    only 'CENTER' and 'MEMBER'
    --include-metadata / --no-metadata
                                    With --no-metadata, do not include header
                                    metadata in 'chemfp' and 'flat' output
                                    formats.
    --times / --no-times            Write timing information to stderr
    -d, --debug                     Print debug information to stderr. Use twice
                                    for more debug output.
    --help                          Show this message and exit.
  
  NxN matrix options (for fingerprint input):
    --save-matrix, --save FILE  If specified, save the intermediate NxN matrix
                                to the named file
    --save-format [npz]         File format for --save-matrix (only 'npz' is
                                supported)
  
  Butina clustering options:
    --tiebreaker [randomize|first|last]
                                    When multiple candidates have the same
                                    number of neighbors, 'randomize' picks the
                                    next cluster center at random while 'first'
                                    and 'last' pick the next candidate in
                                    increasing or decreasing index order.
    -n, --num-clusters N            After clustering, merge the members of the
                                    smallest cluster into other clusters until
                                    there are only N clusters  [x>=1]
    --butina-threshold FLOAT        Minimum Butina cluster threshold (default:
                                    0.0, uses the threshold from the similarity
                                    matrix)
    --false-singletons, --fs [keep|follow-neighbor|nearest-center]
                                    If 'follow-neighbor' (the default) move
                                    each false singleton to the cluster of its
                                    nearest neighbor. If 'nearest-center' move
                                    it to the closest center (requires
                                    fingerprints). If 'keep' leave it as a
                                    singleton group.
  
    This program implements several variations of the Butina clustering method
    described in Darko Butina's "Unsupervised Data Base Clustering Based on
    Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To
    Cluster Small and Large Data Sets", J. Chem. Inf. Comput. Sci. 1999, 39,
    747-750.
  
    The general approach is:
  
    1) Generate an NxN Tanimoto similarity matrix for a given threshold
  
    2) Sort rows by the number of neighbors in each row, from the most
    neighbors to the fewest.
  
    By default chemfp will randomize the order of the rows for a given number of
    neighbors. This means that re-running Butina clustering will almost
    certainly give different results. Either specify the initial RNG seed using
    `--seed` or use a tiebreaker of 'first' or 'last' to use a fixed sort order.
  
    3) Apply sphere exclusion, in sorted order, to the sorted rows. The first
    row is the center of the first cluster, and its neighbors are members of the
    first cluster.
  
    4) Repeat the process until done.
  
    Once a fingerprint is assigned to a cluster it stays there: it will not be
    used to create a new cluster center, nor be added to another cluster, even
    if it is sufficiently similar.
  
    = False Singletons =
  
    This process can lead to "false singletons" when a fingerprint forms a new
    cluster center but all of its neighbors are already assigned to another
    cluster.
  
    The chemfp butina implementation offers three possibilities for handling
    false singletons:
  
    * 'keep' - leave the false singleton in its own cluster.
  
    * 'follow-neighbor' - move the false singleton to the same cluster as its
    first nearest-neighbor.
  
    * 'nearest-center' - move the false singleton to the nearest cluster center.
  
    Note: there may be multiple neighbors with the same similarity as the
    nearest neighbor. Chemfp currently always arbitrarily uses the first nearest
    neighbor. A future version may support choosing the neighbor at random from
    all equally-similar neighbors.
  
    Note: there may be multiple cluster centers which are equally similar to a
    false singleton. Chemfp currently always arbitrarily uses one of these
    nearest centers. A future version may support choosing the center at
    random from all equally-similar centers.
  
    = Pruning =
  
    If --num-clusters / -n is specified, and is smaller than the number of
    identified clusters, then chemfp will use a post-processing step to reduce
    the number of clusters.
  
    The clusters are ordered by size, from smallest to largest. The smallest
    cluster is selected, with ties broken by selecting the first created
    cluster. Each member is processed (from last to first) to find a nearest-
    neighbor in another cluster, with the member then added to that cluster
    before processing the next member.
  
    It is possible that a fingerprint may be reassigned multiple times during
    the pruning process.
  
    = Fingerprints and/or npz similarity matrix =
  
    The "chemfp butina" command accepts a fingerprint dataset, an npz similarity
    matrix, or both.
  
    When given a fingerprint dataset, it generates a sparse NxN Tanimoto
    similarity matrix with the similarity threshold given by --NxN-threshold /
    --threshold / -t. Use --save-matrix to save the matrix to an npz file.
  
    When given a similarity matrix, it carries out the Butina clustering on the
    matrix but operations which require fingerprints, like pruning and the
    "nearest-center" method for false singleton assignment, are not supported.
    The default --butina-threshold of 0.0 means all neighbors in the matrix will
    be used. Matrix values smaller than the Butina threshold are ignored, which
    is useful for parameter tuning, as a matrix can be generated once at a lower
    threshold and then re-used at higher Butina thresholds.
  
    When given both a fingerprint data set and a sparse matrix using --matrix,
    the NxN matrix is used for the Butina clustering, and the methods which
    require fingerprints are also supported.
  
    = Output formats =
  
    By default the clusters are written in "centroid" format to stdout. The
    format writes one line per cluster, along with a cluster member count and
    optionally including the member ids and scores.
  
    Use "--out" to specify alternate formats. The "flat" format is a tab-
    delimited description of the fingerprint members, one member per line in
    fingerprint order. The "csv" and "tsv" formats are similar, but include the
    cluster size for each row, and are in cluster order.
  
    If "--out" is not specified then the format is based on the --output / -o
    filename, or "centroid" if that doesn't work.
  
    = Examples =
  
    1) Cluster fingerprints at a threshold of 0.4 (could also use '-t' or
    '--threshold'):
  
      chemfp butina input.fps --NxN-threshold 0.4
  
    2) Cluster fingerprints at a threshold of 0.4, keep false singletons as
    false singletons, write the output in 'flat' format, and use the full
    internal names, to see which centers are false singletons:
  
      chemfp butina input.fps -t 0.4 --false-singletons keep --no-rename --out flat
  
    3) Cluster fingerprints at a threshold of 0.45, move false singletons to the
    nearest cluster center, reduce the number of clusters to 20, and write the
    output in 'tsv' format:
  
     chemfp butina benzodiazepines.fps --threshold 0.45 \
        --fs nearest-center --num-clusters 20 --out tsv
  
    4) Cluster fingerprints at a threshold of 0.6, use an initial seed, save the
    intermediate NxN matrix to the file 'chembl_33_60.npz', and write the
    Butina cluster to 'chembl_33_60.centroids':
  
     chemfp butina chembl_33.fpb -t 0.6 --save-matrix chembl_33_60.npz \
        -o chembl_33_60.centroids
  
    5) Use the saved matrix as the input to Butina clustering, with a Butina
    threshold of 0.7. Save the results to 'chembl_33_70.centroids':
  
     chemfp butina chembl_33_60.npz --butina-threshold 0.7 \
        -o chembl_33_70.centroids
  
    6) Reduce the number of identified clusters (at 0.6 threshold) from ~330K to
    250K using the pre-computed NxN similarity matrix for the clustering, and
    fingerprint searches to merge clusters:
  
     chemfp butina chembl_33.fpb --matrix chembl_33_60.npz \
        --num-clusters 250000 -o chembl_33_70_pruned.centroids
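
The general approach described in the help text above (sort rows by
neighbor count, apply sphere exclusion, then handle false singletons) can
be illustrated with a short pure-Python sketch. This is not chemfp's
implementation and does not use the chemfp API. The function name
``butina_sketch`` and the ``neighbors`` input layout are hypothetical,
chosen only for illustration. It assumes a symmetric, already-thresholded
neighbor list, uses the 'first' tiebreaker, and supports only the 'keep'
and 'follow-neighbor' false singleton options. Pruning and rescoring are
left out.

.. code-block:: python

  def butina_sketch(neighbors, false_singletons="follow-neighbor"):
      """Illustrative Butina clustering (hypothetical helper, not chemfp).

      neighbors[i] is a list of (j, score) pairs for every fingerprint j
      whose similarity to fingerprint i is at or above the threshold.
      Returns a dict mapping each fingerprint index to its center index.
      """
      n = len(neighbors)
      # Most neighbors first; ties broken by increasing index ('first').
      order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

      cluster_of = {}  # fingerprint index -> cluster center index
      singletons = []  # centers whose neighbors were all already taken
      for i in order:
          if i in cluster_of:
              continue  # already a member of an earlier cluster
          cluster_of[i] = i  # i becomes a new cluster center
          unassigned = [j for j, _ in neighbors[i] if j not in cluster_of]
          for j in unassigned:
              cluster_of[j] = i
          if neighbors[i] and not unassigned:
              singletons.append(i)  # has neighbors, but none were free

      if false_singletons == "follow-neighbor":
          for i in singletons:
              # max() returns the first of several equal maxima, which
              # mirrors the "first nearest neighbor" rule.
              j, _ = max(neighbors[i], key=lambda pair: pair[1])
              cluster_of[i] = cluster_of[j]
      return cluster_of

  # Fingerprint 4 is only similar to 2, which is claimed by center 0 first,
  # so 4 is a false singleton. Fingerprint 5 has no neighbors at all.
  example = [
      [(1, 0.90), (2, 0.80), (3, 0.75)],
      [(0, 0.90)],
      [(0, 0.80), (4, 0.85)],
      [(0, 0.75)],
      [(2, 0.85)],
      [],
  ]
  followed = butina_sketch(example)
  kept = butina_sketch(example, false_singletons="keep")

With 'follow-neighbor', fingerprint 4 joins the cluster centered on 0
because its only neighbor, 2, is a member of that cluster. With 'keep' it
stays in its own cluster. Fingerprint 5 is a true singleton in both cases.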