.. _whats_new_41: What’s new in chemfp 4.1 ======================== Released 17 May 2023. The main themes for chemfp 4.1 are "CXSMILES", "Butina clustering", "CSV support", and "embrace click and required third-party dependencies." Required dependencies on click and tqdm ----------------------------------------- Chemfp 4.1 switched to `click `_ for command-line processing. Previously it used `argparse `_ from the Python standard library, but I found it easier to develop and maintain click-based processing. The change resulted in a lot of minor changes. The error messages have changed. Command-line arguments are processed in a different order, which may change when errors and how are detected. Argparse supported ``-h`` as an alias for ``--help`` but click does not. Some of the exit codes have changed. There is a subtle change in how the ``--delimiter`` and ``--has-header`` options work with the ``-R`` option for reader\_args. Previously these were processed in left-to-right order. In chemfp 4.1 the ``delimiter`` and ``has-header`` options are processed after ``-R``, and override those settings. The change took a lot of work, but also led to some improvements. For example, rdkit2fps now accepts ``--morgan1`` through ``-morgan4`` as ways to set the Morgan radius (the old ``--morgan`` is now an alias for ``--morgan2``), and the click API makes it easier to group ``--help`` options. Chemfp specifies click as a required dependency . If click is not installed then tools like pip should install it automatically. In fact, this is chemfp's first required dependency. (chemfp optionally supports NumPy, SciPy and other packages if they are installed, but those are not required for core operations). In order to avoid install-time dependencies, chemfp 4.0 included a copy of the progress bar package `tqdm `_ with the distribution, which was used if tqdm was not already installed. With chemfp 4.1, tqdm is the second required dependency, and the vendored copy of tqdm has been removed. CXSMILES ------------ Chemfp 4.1 by default now reads SMILES file as containing `CXSMILES `_. A line of a CXSMILES file looks like: .. code-block:: none C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example cxsmiles structure with the SMILES followed by an optional CXSMILES extension followed by the title or additional columns. There must be a single space between the SMILES and the CXSMILES extension, and the extension must start and end with the pipe character ``|``. Use ``--no-cxsmiles`` to parse the record as a SMILES. This keeps any ``|`` term as part of the record id/title. (The default is ``--cxsmiles``, which parses the input as CXSMILES. These options are equivalent to the reader arg ``-R cxsmiles=0`` (to disable CXSMILES parsing) and ``-R cxsmiles=1`` (to enable), though they are processed after the ``-R`` option. An alternative method to separate the structure from the identifier is to use tab-separated fields and specify ``---delimiter tab``. Toolkit-specific details ~~~~~~~~~~~~~~~~~~~~~~~~~~~ CXSMILES support proved tricky to handle. For one, there is no easy way to identify the end of a CXSMILES extension without a full CXSMILES parser. Simply looking for a pipe character followed by a space isn't enough, because a CXSMILES can recursively embed other CXSMILES, which can in turn contain spaces. In addition, the underlying cheminformatics toolkits have different levels of support for the extensions, and don't all handle identifiers the same way that chemfp does. - Open Babel does not support CXSMILES. Instead, chemfp uses hueristics to attempt to identify and remove the extension, and pass only the SMILES to Open Babel for processing. - CDK expects CXSMILES by default, with no option to disable the option. If ``--no-cxsmiles`` is specified then chemfp will extract the SMILES from the rest of the line and have CDK only parse the SMILES. - RDKit has the option to enable or disable CXSMILES processing, but doesn't handle identifiers the same way chemfp does, so chemfp tries to process the identifier and pass the rest to RDKit. - OEChem uses OEFormat_SMI for SMILES files and OEFormat_CXSMILES for CXSMILES files. Chemfp will set the appropriate OEChem format flag depending on the cxsmiles option. - Chemfp's "text" toolkit is not a cheminformatics-based toolkit. It can only parse records from a SMILES and SDF files and, for SMILES, figure out which is the SMILES and which is the identifier. If the "cxsmiles" reader argument is true then it will use the same hueristics as the Open Babel parser to identify the end of the CXSMILES extension. CXSMILES output ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The toolkit wrappers can write SMILES files. By default they do not include the CXSMILES extensions. Pass the writer argument ``cxsmiles=True`` to save with the CXSMILES extensions. .. code-block:: python with T.open_molecule_writer( "ouput.smi", writer_args={"cxsmiles": True}) as writer: writer.write_molecules(mols) New structure format specifiers ----------------------------------------- Chemfp 4.1 added the "cxsmi", "cxsmistring", and "sdf3k" format specifiers. The "cxsmi" format can be used to read and write files in SMILES format. It is the same as the "smi" format except that "cxsmi" writes CXSMILES by default while "smi" does not. The "cxsmi" format is the default for filenames which use the ".cxsmi" extension. The "cxsmistring" format parses a SMILES string with optional CXSMILES extension, but does not process any identifier which may be present. The "sdf3k" format is the same as the "sdf" format except that "sdf3k" always writes records in V3000 format, while "sdf" only uses V3000 if required. The "sdf3k" format is the default for fileanmes which use the ".sdf3k" extension. New SearchResult(s) attributes ------------------------------------ The :class:`.SearchResults` class now has the following new attributes: - num_bits: the number of bits in the fingerprint - alpha: the Tversky alpha value, which is 1.0 for Tanimoto - beta: the Tversky beta value, which is 1.0 for Tanimoto - matrix_type: an integer enum value describing the matrix type (see below) - matrix_type_name: a string name describing the matrix type - target_start: the start position in the target arena (non-zero only if the targets was a subarena) These were added so the SearchResults could be saved and loaded from an npz file without losing important chemfp metadata. Previously the ``target_ids`` attribute returned all of the ids for the target areana, even if the targets was a subarena slice of the main arena. This is useful because the search result indices are given in arena coordinates, not in subarean coordinates. However, it proved confusing when, for example, wanting a list of just the relevant target ids. In chemfp 4.1 the ``target_ids`` now returns only the relevant target identifiers, and the new ``target_arena_ids`` returns the full identifiers from the target arena. The `matrix_type` value is from a new :class:`.MatrixType` `enum `_ in the :mod:`chemfp.search` module: .. code-block:: pycon >>> from chemfp.search import MatrixType >>> list(MatrixType) [, , , , ] It describes the expected content of the underlying matrix type. These are used in the new Butina implementation as a check that an imported npz file or csr matrix is the expected type. The values are: * "NxM" - can contain a score for any combination of query and target * "NxN" - the same, except the queries and targets are identical * "NO-DIAGONAL" - same as "NxN" except the diagonal is not included * "UPPER-TRIANGULAR" - same as "NxN" except limited to the upper triangle * "STRICTLY-UPPER-TRIANGULAR" - "upper-triangular" but without the diagonal The corresponding ``matrix_type_name`` values are "NxM", "NxN", "no-diagonal", "upper-triangular", and "strictly-upper-triangular". .. code-block:: pycon >>> results.matrix_type >>> results.matrix_type_name 'NxM' Chemfp currently does not generate an "upper-triangular" matrix type. The :class:`.SearchResult` also has the new ``target_start`` and ``target_ids`` properties, matching the SearchResults. Save simsearch results to "npz" files -------------------------------------------- Similarity search results can now be saved as a NumPy `npz `_ files, as a SciPy `sparse csr matrix `_. This makes it easier to import search results into other analysis tools. For example, I'll do an NxN threshold search of a set of benzodiazepine fingerprints, with a cutoff of 0.4, and save the results to `benzodiazepine.npz`: .. code-block:: console % simsearch --NxN -t 0.4 ../benzodiazepine.fps.gz -o benzodiazepine.npz queries: 100%|██████████████████████████████████████| 12386/12386 [00:00<00:00, 32889.98 fps/s] From the Python shell I'll import the output file into SciPy and show the target indices and scores for row index 686, which has 5 elements: .. code-block:: pycon >>> from scipy import sparse >>> arr = sparse.load_npz("benzodiazepine.npz") >>> arr <12386x12386 sparse matrix of type '' with 9541836 stored elements in Compressed Sparse Row format> >>> arr[686] <1x12386 sparse matrix of type '' with 5 stored elements in Compressed Sparse Row format> >>> arr[686].indices array([ 744, 950, 6083, 77, 196], dtype=int32) >>> arr[686].data array([0.49090909, 0.48214286, 0.53225806, 0.49019608, 0.625 ]) The indices aren't that useful on their own. You'll likely want to map them back to the original identifiers, which you can do because chemfp stores them in the npz file as well. This is possible because NumPy npz file is a zipfile containing NumPy arrays in "npy" format. By convention the SciPy package stores the sparse matrix information in the zipfile entries "format.npy", "shape.npy", "indptr.npy", "indices.npy", and "data.npy", which you can see on the command-line using the zip program: .. code-block:: console % unzip -l benzodiazepine.npz Archive: benzodiazepine.npz Length Date Time Name --------- ---------- ----- ---- 600 01-01-1980 00:00 chemfp.npy 38167472 01-01-1980 00:00 indices.npy 49676 01-01-1980 00:00 indptr.npy 140 01-01-1980 00:00 format.npy 144 01-01-1980 00:00 shape.npy 76334816 01-01-1980 00:00 data.npy 396480 01-01-1980 00:00 ids.npy --------- ------- 114949328 7 files Chemfp adds two or three file entries to the list. The "chemfp.npy" contains chemfp-specific metadata, which I'll get to later. If the file "ids.npy" exists then it contains an array of identifiers used for both the queries and the targets. The file entries "query_ids.npy" and "target_ids.npy" store the query and target identifiers, respectively, but the are not used here because an NxN search has the same ids for the queries and targets. The NumPy `load function `_ returns an NpzFile object with ways to get a list of the available files and load the contents. I'll show how that can be used to get the target ids for ths hits of row 686: .. code-block:: pycon >>> import numpy as np >>> z = np.load("benzodiazepine.npz") >>> z.files ['chemfp', 'indices', 'indptr', 'format', 'shape', 'data', 'ids'] >>> ids = z["ids"] >>> ids array(['76175', '2802905', '18631850', ..., '16198616', '16153475', '11979873'], dtype='>> len(ids) 12386 >>> [ids[i] for i in arr[686].indices] ['21327973', '21327970', '13106808', '12991903', '21817691'] >>> z.close() (The explicit NpzFile close ensures the underlying zipfile gets closed at that point, rather than depend on Python's garbage collector. The NpzFile is also a context manager, which means you could use it in a "with" statement instead.) chemfp.npy file entry in an npz file -------------------------------------------- The "chemfp.npy" entry in the npz file contains chemfp-specific metadata. Its a JSON dictionary, stored as a string, in a NumPy array, encoded in "npy" format, in a zipfile. While that sounds complicated, its meant to be relatively easy to load using NumPy's load command, which the previous section used to load the array identifiers: .. code-block:: pycon >>> import json >>> import numpy as np >>> with np.load("benzodiazepine.npz") as z: ... d = json.loads(z["chemfp"].item()) ... >>> import pprint >>> pprint.pprint(d) {'alpha': 1.0, 'beta': 1.0, 'format': 'multiple', 'matrix_type': 'no-diagonal', 'method': 'Tversky', 'num_bits': 2048} The most important of these is `format`, which can be "single" or "multiple". It is "single" if the npz file stores a single chemfp :class:`.SearchResult`, and "multiple" if it stores a chemfp :class:`.SearchResults`. The `method` indicates the scores were calculated with Tversky similarity, using the `alpha` and `beta` values of 1.0. This setting is much better known as "the Tanimoto". The `matrix_type` describes the expected matrix type, described previously. The `num_bits` is the number of bits in the fingerprints used to generate the Tanimoto/Tversky scores. Chemfp uses it in two places. First, it determines the default number of digits to display when printing a Tanimoto score as a decimal. All 166-bit fingerprints can be distinguished with only 5 bits of precision, while a 2048-bit fingerprint needs 7. These values are available from the :func:`.get_tanimoto_precision` function: .. code-block:: pycon >>> import chemfp.bitops >>> chemfp.bitops.get_tanimoto_precision(166) 5 >>> chemfp.bitops.get_tanimoto_precision(2048) 7 Second, Tanimoto scores don't use the full range of the 64-bit double that chemfp uses internally to store the score. If the number of fingerprint bits is small enough then it's possible to sort the numbers by only comparing parts of the bit-representation of the score, which improves the overall sort performance. Chemfp also doesn't use special values like the 64-bit double representations for -0.0, infinite, or "NaN"/"not a number" values, which further improves perfomance. If you plan to generate Tanimoto similarity scores from another tool, and import them into chemfp then you should set this value appropriately. If the scores come from another sort of similarity calculation, or don't know how the number of bits affects things, then it is safe to use the value 2**31-1, which is the default chemfp uses if `num_bits` is not specified." NaN and infinity are not supported. Working with npz files in chemfp ------------------------------------------------------------------------ The new :func:`.load_npz` and :func:`.save_npz` functions in the :mod:`chemfp.search` module will load and save, respectively similarity search results to an npz file, along with the identifiers and the "chemfp.npy" metadata. The search results can be a :class:`.SearchResults` returned when using multiple queries against multiple targets, or a single :class:`.SearchResult` returned when using a single fingerprint to search the targets, and returned as a row of a SearchResults. They can also be the high-level objects returned from :func:`chemfp.simsearch`, though only the low-level SearchResult or SearchResults are loaded, without the timing information and other metadata in the high-level objects: .. code-block:: pycon >>> import chemfp >>> results = chemfp.simsearch(queries="queries.fps", targets="targets.fps", k=5) targets.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 33.7Mbytes/s] queries.fps: 100%|█████████████████████████████████| 26.7k/26.7k [00:00<00:00, 75.7Mbytes/s] queries: 100%|█████████████████████████████████████| 100/100 [00:00<00:00, 63138.70 fps/s] >>> results MultiQuerySimsearch('5-nearest Tanimoto search. #queries=100, #targets=100 (search: 1.87 ms total: 39.59 ms)', result=SearchResults(#queries=100, #targets=100)) >>> >>> import chemfp.search >>> chemfp.search.save_npz("k5.npz", results) >>> k5_results = chemfp.search.load_npz("k5.npz") >>> k5_results SearchResults(#queries=100, #targets=100) You can also save the results using the new ``save()`` method for the similarity search result objects: .. code-block:: pycon >>> results.save("k5.npz") Importing SciPy csr matrices to chemfp ------------------------------------------------------- The new :func:`.from_csr` function in the :mod:`chemfp.search` module converts a SciPy csr ("compressed sparse row") array into a chemfp :class:`.SearchResults` instance. In the following I'll start with a regular NumPy array, turn it into a sparse matrix, then convert that to a SearchResults: .. code-block:: pycon >>> import numpy >>> import scipy.sparse >>> import chemfp.search >>> arr = numpy.array([[0.0, 0.0, 0.0, 0.9, 0.8], [0.7, 0.0, 0.0, 0.6, 0.0]]) >>> arr array([[0. , 0. , 0. , 0.9, 0.8], [0.7, 0. , 0. , 0.6, 0. ]]) >>> csr = scipy.sparse.csr_array(arr) >>> csr <2x5 sparse array of type '' with 4 stored elements in Compressed Sparse Row format> >>> sr = chemfp.search.from_csr(csr) >>> sr SearchResults(#queries=2, #targets=5) >>> sr[0].get_indices_and_scores() [(3, 0.9), (4, 0.8)] >>> sr[1].get_indices_and_scores() [(0, 0.7), (3, 0.6)] Additional `from_csr` parameters let you specify the query and target ids, matrix type, the number of fingerprint bits, and the Tversky alpha and beta parameters. See the warnings earlier about how `num_bits` affects chemfp. Butina clustering --------------------- Chemfp 4.1 includes Butina clustering, both from the command-line and as a high-level API function. More complete Examples of both are given below. Here's an idea to get you started, first on the command-line: .. code-block:: pycon % chemfp butina chembl_30.fpb --threshold 0.7 -o chembl_30.centroids and then using the Python API: .. code-block:: python import chemfp clusters = chemfp.butina(fingerprints="chembl_30.fpb", NxN_threshold=0.7) print(f"#clusters: {len(clusters)}") clusters.save("chembl_30.centroids") The basic Butina clustering algorithm is from: Darko Butina, "Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets", Journal of Chemical Information and Computer Sciences 1999 39 (4), 747-750 `DOI: 10.1021/ci9803381 `_. (This is sometimes referred to as "Taylor" or "Taylor-Butina" clustering due to Robin Taylor, "Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals”, Journal of Chemical Information and Computer Sciences 1995 35 (1) 59–67. `DOI: 10.1021/ci00023a009 `_. The methods are similar, but Taylor did not apply it to clustering.) In short, compute the NxN similarity matrix. For each fingerprint, count the number of neighbors which are within some minimum similarity threshold. Sort those by count, from most neighbors to least. Pick the first fingerprint as the first cluster center, and include all of its neighbors as cluster members. Iteratively repeat until done for all remaining fingerprints which are not already assigned as members of another cluster. Chemfp supports several variants of the Butina algorithm: 1) breaking ties in the Butina ordering, 2) dealing with false singletons, and 3) pruning the final number of clusters. tiebreaker ~~~~~~~~~~~~~ The initial ordering by count, from largest to smallest, nearly always results in ties, where many fingerprints have the same number of neighbors. How are ties broken? By default chemfp will "randomize" the order, using an initial seed from Python's random number generator, which means each time you run the algorithm you are likely to get different results. You can specify the seed if you want consistent output, or you can choose to break ties by order in chemfp's internal fingerprint arena, either "first" or "last". The fingerprints are ordered from the smallest number of bits to the largest, so "first" will bias the initial pick towards small molecules, while "last" will bias it towards larger ones with more internal diversity. Here's an example how to specify the tiebreaker on the command-line: .. code-block:: console % chemfp butina dataset.fps --tiebreaker first and an example using the API, which also sets the seed: .. code-block:: python import chemfp chemfp.butina("dataset.fps", tiebreaker="randomize", seed=12345) false-singletons ~~~~~~~~~~~~~~~~~~~~ If there is only one fingerprint in the cluster, then that fingerprint (which is the cluster center) is termed a "singleton." A "false singleton" is a singleton with one or more neighbors within the minimum similarity threshold but where those neighbors were all assigned to one or more other clusters. Chemfp supports three ways to deal with a false singleton. The default is "follow-neighbor": move the false singleton to the same cluster as its first nearest-neighor. (If there are multiple equally-nearest neighbors then currently chemfp makes an arbitrary, implementation-defined choice. A future version may break ties using a random choice.) The second option is to "keep" the false singleton as a singleton. The third is to move the false singleton to the cluster with the "nearest-center". (Ties are currently broken by an arbitrary, implementation-defined choice. A future version may break ties using a random choice.) This method was described by Niklas Blomberg, David A. Cosgrove, and Peter W. Kenny, JCAMD 23, pp 513–525 (2009) `doi:10.1007/s10822-009-9264-5 `_ though chemfp's implementation does not yet support a user-specified minimum required center threshold. The "nearest-center" variant requires the original fingerprints in order to find the nearest cluster center. The other two options only require the NxN similarity matrix. Here's an example of keeping false singletons using the command-line tool, and the ``--fs`` short form of writing ``--false-singletons``: .. code-block:: console % chemfp butina chembl_30.fpb --threshold 0.7 --fs keep Here's what it looks like in the Python API: .. code-block:: python import chemfp clusters = chemfp.butina("chembl_30.fpb", NxN_threshold=0.7, false_singletons="keep") clusters.save(None) # write to stdout Pruning ~~~~~~~~~~~~ Chemfp also implements a post-processing step to reduce the number of clusters. The clusters are ordered by size, from smallest to largest. The smallest cluster is selected, with ties broken by selecting the first created cluster. Each fingerprint in the cluster is processed (currently from last to first) to find its nearest-neighbor in another cluster, with the fingerprint added to that cluster before processing the next fingerprint in the smallest cluster. (If fingerprint X moved to cluster C and fingerprint Y's nearest neigbhor is X then Y wil also be moved to cluster C.) The post-processing step requires the original fingerprints and cannot be done with the NxN similarity matrix. This algorithm was devised by a chemfp customer, who also funded its integration into chemfp. Here's a command-line example of pruning a dataset down to 1,000 clusters: .. code-block:: console % chemfp butina dataset.fpb -t 0.6 -n 1000 -o dataset.centroids and the equivalent version using the API: .. code-block:: python import chemfp with chemfp.butina("dataset.fpb", NxN_threshold=0.6, num_clusters=1000) as clusters: clusters.save("daset.centroids") Other options ~~~~~~~~~~~~~~~~~ If fingerprints are available then the default will rescore moved fingerprints (ie, false singletons moved to a "nearest-center", and fingerprints moved by the post-processing pruning step) to their cluster center. Rescoring can be disabled. By default the output clusters are ordered and numbered so the largest cluster (after any moves) are first, though there's an option to show the original Butina ordering, which chemfp uses internally. By default the cluster center is labeled "CENTER" and the others labeled "MEMBER". If renaming is not used then other labels are "FALSE_SINGLETON" (for singletons which are cluster centers), "MOVED_FALSE_SINGLETON" for false singletons which were reassigned to another cluster), and "MERGED" for fingerprints which were moved to another cluster as part of the post-processing cluster pruning step. Butina on the command-line ------------------------------- The Butina algorithm is available as a subcommand of :command:`chemfp`. Using a small set of fingerprints as an example: .. code-block:: console % cat distinct.fps 1200ea4311018100 id1 894dc00a870b0800 id2 82b180a0943d0254 id3 d04a450013d8a479 id4 8025719c2c907c2e id5 918125cf96a2a360 id6 6b0d4215ca47fc03 id7 2eb000ce2e4337ef id8 849a6de873554c59 id9 b67121e96f48a365 id10 2ab0f5b64cb4b8cf id11 e7fedcefa4e7560e id12 I'll cluster at a threshold of 0.41: .. code-block:: console % chemfp butina distinct.fps -t 0.41 #Centroid/1 include-members=1 #type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1 #software=chemfp/4.1 #fingerprints=distinct.fps i center_id count members 1 id8 5 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256 2 id9 2 id9 1.0000 id4 0.4103 3 id3 1 id3 1.0000 4 id7 1 id7 1.0000 5 id1 1 id1 1.0000 6 id2 1 id2 1.0000 7 id6 1 id6 1.0000 This output follows the usual chemfp output format style where the first line describes the format, followed by header lines starting with '#', then the output data proper, using tab-separated columns. The default uses a variable number of columns, with one line per cluster. The first column is the column id (an integer), the second is the fingerprint id for the center fingerprint, and the third is the number of elements in the cluster (including the center). After the count are the id and score of every fingerprint in the cluster, starting with the cluster center. Pruning the number of clusters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use ``--num-clusters`` or simply ``-n`` option to prune the clusters down to a smaller value. In the following I've specified I want only 3 clusters, so four of the clusters will be pruned, and their fingerprints moved to other clusters: .. code-block:: console % chemfp butina distinct.fps -t 0.41 --num-clusters 3 #Centroid/1 include-members=1 #type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=3601395311 false-singletons=follow-neighbor num-clusters=3 rescore=1 #software=chemfp/4.1 #fingerprints=distinct.fps i center_id count members 1 id8 8 id8 1.0000 id10 0.4651 id12 0.4400 id11 0.4130 id5 0.3256 id3 0.2381 id4 0.1702 id9 0.2917 2 id6 2 id6 1.0000 id1 0.2353 3 id7 2 id7 1.0000 id2 0.2973 Alternate output formats ~~~~~~~~~~~~~~~~~~~~~~~~~ Some other options include disabling the metadata in the output (the lines starting with "#"), only showing the cluster centers, and specifying the initial seed for random number generation: .. code-block:: console % chemfp butina distinct.fps -t 0.41 --no-members --no-metadata --seed 19573998 i center_id count 1 id8 5 2 id9 2 3 id3 1 4 id7 1 5 id1 1 6 id2 1 7 id6 1 The "flat" variant of the centroid format writes one output line for each fingerprint, with the centroid id in the first column, the fingerprint id in the second, the cluster element label in the third, and the score in the fourth: .. code-block:: console % chemfp butina distinct.fps -t 0.41 --out flat --seed 19573998 #Centroid-flat/1 include-members=1 #type=Butina NxN-threshold=0.41 tiebreaker=randomize seed=19573998 false-singletons=follow-neighbor rescore=1 #software=chemfp/4.1 #fingerprints=distinct.fps centroid id type score 5 id1 CENTER 1.0000 6 id2 CENTER 1.0000 3 id3 CENTER 1.0000 2 id4 MEMBER 0.4103 1 id5 MEMBER 0.3256 7 id6 CENTER 1.0000 4 id7 CENTER 1.0000 1 id8 CENTER 1.0000 2 id9 CENTER 1.0000 1 id10 MEMBER 0.4651 1 id11 MEMBER 0.4130 1 id12 MEMBER 0.4400 The ``-no-rename`` option tells the "flat" output to use the full internal labels for each fingerprint, which in this case shows that "id5" was a false singleton moved to cluster 1, and "id1", "id2", "id3", and "id4" were moved by the pruning step: .. code-block:: console % chemfp butina distinct.fps -t 0.41 --num-clusters 3 \ --seed 3601395311 --no-metadata --out flat --no-rename centroid id type score 2 id1 MERGED 0.2353 3 id2 MERGED 0.2973 1 id3 MERGED 0.2381 1 id4 MERGED 0.1702 1 id5 MOVED_FALSE_SINGLETON 0.3256 2 id6 CENTER 1.0000 3 id7 CENTER 1.0000 1 id8 CENTER 1.0000 1 id9 MERGED 0.2917 1 id10 MEMBER 0.4651 1 id11 MEMBER 0.4130 1 id12 MEMBER 0.4400 The "csv" and "tsv" formats are comma- and tab-separated formats which are a hybrid between the "centroid" and "flat" formats. They have one output line for each fingerprint, ordered by cluster number (largest first) and further ordered by score, with the cluster center always first. The output fields are the cluster id, the fingerprint id, the number of elements in the cluster, the fingerprint type ("CENTER", "MEMBER", and more if using ``--no-rename``), and the score. .. code-block:: console % chemfp butina distinct.fps -t 0.41 --num-clusters 3 \ --seed 3601395311 --out csv cluster,id,count,type,score 1,id8,8,CENTER,1.0000 1,id10,8,MEMBER,0.4651 1,id12,8,MEMBER,0.4400 1,id11,8,MEMBER,0.4130 1,id5,8,MEMBER,0.3256 1,id3,8,MEMBER,0.2381 1,id4,8,MEMBER,0.1702 1,id9,8,MEMBER,0.2917 2,id6,2,CENTER,1.0000 2,id1,2,MEMBER,0.2353 3,id7,2,CENTER,1.0000 3,id2,2,MEMBER,0.2973 The output was formatted using Python's `csv `_ module, with the "excel" and "excel-tab" dialects, respectively, which means the results should be easy to import into Excel. The writer may use special quoting rules should the identifier contain a comma, newline, or other special character. Butina parameter tuning ------------------------------- The Butina parameters almost certainly require some tuning to find the appropriate threshold, and to decide if cluster pruning is appropriate. The Butina algorithm is quite fast. It takes just a few seconds, in the default case, to cluster the NxN similarity matrix from the 2+ million fingerprints in ChEMBL. Most of the time is spent formatting the output correctly! On the other hand, even with chemfp it can take 30 minutes or even a couple of hours to generate the matrix in the first place - making tuning a practice in patience! This is where the new "npz" format comes in handy. You can generate the NxN matrix once, using the lower-bound similarity threshold for your search space, and save the result to an npz file. This pre-computed matrix can then be used as input to Butina clustering. All chemfp need to do then is count the number of entries at or above the Butina threshold, which is must faster then computing milions of Tanimoto calculations. (Note: currently only a "no-diagonal" NxN matrix is supported. A future version may allow a full NxN matrix. Also, the entries should be ordered by decreasing score. If they are not in that order, chemfp will reorder them.) You can use the `simsearch` command to generate the npz file but it's easier to use the ``--save-matrix`` option of the Butina command. In the following I'll write the output to /dev/null and I'll show timing details for the individual steps: .. code-block:: console % chemfp butina --threshold 0.4 benzodiazepine.fps.gz \ --save-matrix benzodiazepine.npz --times -o /dev/null load_fingerprints 0.06 NxN 0.50 save_matrix 2.79 cluster 0.00 rescore 0.00 output 0.01 total 3.43 What took the longest was the time to write the matrix to an npz file. The relatively low threshold of 0.4, combined with the high similarity of the benzodiazepines, mean there are 9,541,836 non-zero entries out of 153,400,610 possible non-zero values (the diagonals are not included), giving a density of 6.2%. I can now cluster with the saved NxN matrix instead of using fingerprints, first, using all of the entries in the matrix: .. code-block:: console % chemfp butina benzodiazepine.npz --times -o /dev/null load_matrix 0.25 cluster 0.00 output 0.01 total 0.34 and then using ``--butina-threshold`` to change the minimum similarity threshold for the Butina algorithm: .. code-block:: console % chemfp butina benzodiazepine.npz --times -o /dev/null \ --butina-threshold 0.5 load_matrix 0.25 cluster 0.00 output 0.02 total 0.35 Number of Butina clusters in ChEMBL ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Those timing numbers probably don't look that impressive for the small benzodiazepine data set. I'll switch to "chembl_30_60.npz", which contains the NxN similarity matrix for CheMBL 30 at threshold of 0.60 using RDKit's default Morgan2 fingerprints, and see how the number of clusters changes for for 0.60, 0.61, 0.62, ... 0.70. I can get this number by looking at the last line of the output. I'll also show the breakdown of the times for each clustering run: .. code-block:: console % chemfp butina chembl_30_60.npz --butina-threshold 0.6 --times | tail -1 load_matrix 0.77 cluster 0.35 output 3.11 total 4.51 327022 CHEMBL4790513 1 CHEMBL4790513 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.61 --times | tail -1 load_matrix 0.73 cluster 0.37 output 2.96 total 4.35 354362 CHEMBL1980515 1 CHEMBL1980515 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.62 --times | tail -1 load_matrix 0.75 cluster 0.39 output 2.96 total 4.40 377894 CHEMBL1477846 1 CHEMBL1477846 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.63 --times | tail -1 load_matrix 0.74 cluster 0.42 output 3.00 total 4.47 405647 CHEMBL1497171 1 CHEMBL1497171 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.64 --times | tail -1 load_matrix 0.78 cluster 0.45 output 3.12 total 4.67 433390 CHEMBL287360 1 CHEMBL287360 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.65 --times | tail -1 load_matrix 0.75 cluster 0.45 output 3.05 total 4.56 464418 CHEMBL3472504 1 CHEMBL3472504 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.66 --times | tail -1 elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4235635.50/s] load_matrix 0.75 cluster 0.50 output 3.22 total 4.79 499199 CHEMBL1387287 1 CHEMBL1387287 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.67 --times | tail -1 elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4124939.25/s] load_matrix 0.74 cluster 0.52 output 3.13 total 4.70 536382 CHEMBL471678 1 CHEMBL471678 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.68 --times | tail -1 elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4162832.16/s] load_matrix 0.74 cluster 0.51 output 3.17 total 4.73 573835 CHEMBL1402501 1 CHEMBL1402501 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.69 --times | tail -1 elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 4030396.68/s] load_matrix 0.74 cluster 0.53 output 3.24 total 4.81 617398 CHEMBL4552562 1 CHEMBL4552562 1.0000000 % chemfp butina chembl_30_60.npz --butina-threshold 0.70 --times | tail -1 elements: 100%|█████████████████████████| 2136187/2136187 [00:00<00:00, 3597378.39/s] load_matrix 0.76 cluster 0.59 output 3.34 total 4.99 657984 CHEMBL1724866 1 CHEMBL1724866 1.0000000 The progress bars appear once clustering takes more than 0.5 seconds. You can use ``--progress`` or ``--no-progress`` to always enable or diable them, or set the environment variable "CHEMFP_PROGRESS" to "1" or "0". Each of these clustering jobs took about 5 seconds to run, with most of the time spent writing the output (!). That's a lot of overhead when I only want the count. High-level Butina API ------------------------------- You can also do Butina clustering using chemfp's Python API, using the new :func:`chemfp.butina()` function, as in the following: .. code-block:: pycon >>> import chemfp >>> clusters = chemfp.butina(matrix="chembl_30_60.npz" , butina_threshold=0.70, progress=False) >>> clusters ButinaClusters('2136187x2136187 matrix, butina-threshold=0.7 tiebreaker=randomize seed=4095063231 false-singletons=follow-neighbor, took 694.85 ms', result=[657865 clusters], picker=...) There's a lot going on there! I loaded the npz file to use as the search matrix, then carried out a Butina clustering with a threshold of 0.7. I also disabled the progress bars, which currently don't have the 0.5 second delay. The ButinaClusters includes timing information. This analysis did not save the results, so it was much faster at about 0.7 seconds. For a more detailed description, see The returned :class:`.ButinaClusters` repr(), which is designed for interactive use like this, summarizes the analysis parameters, and gives a list-like API to the 657,870 clusters found. .. code-block:: pycon >>> clusters.butina_threshold 0.7 >>> clusters.seed 4095063231 >>> clusters[0] ButinaCluster(cluster_idx=0, #members=1540) >>> clusters[0].members[:10] [2123772, 2119185, 2119187, 2119188, 2119190, 2119192, 2119199, 2119202, 2119206, 2119215] >>> [m.target_ids[i] for i in clusters[0].members[:3]] ['CHEMBL3958999', 'CHEMBL3907845', 'CHEMBL3908293'] The API makes it easy (and significantly faster!) to produce the parameter scan of the previous section. One optimization is to load the matrix once, so it can be re-used for each of the butina calls: .. code-block:: pycon >>> import chemfp.search >>> m = chemfp.search.load_npz("chembl_30_60.npz") >>> m SearchResults(#queries=2136187, #targets=2136187) >>> m.matrix_type >>> min(row.min() for row in m if row) 0.6 >>> max(row.max() for row in m if row) 1.0 >>> chemfp.butina(matrix=m, progress=False) ButinaClusters('2136187x2136187 matrix, tiebreaker=randomize seed=2054201120 false-singletons=follow-neighbor, took 513.50 ms', result=[327076 clusters], picker=...) With that in place, I'll count the number of Butina clusters for every threshold from 0.6 to 0.7: .. code-block:: pycon >>> import time >>> if 1: ... start_time = time.time() ... for threshold in range(60, 71, 1): ... t = threshold/100.0 ... print(t, len(chemfp.butina(matrix=m, butina_threshold=t, progress=False))) ... print("Elapsed time:", time.time()-start_time) ... 0.6 326954 0.61 354204 0.62 377855 0.63 405589 0.64 433320 0.65 464492 0.66 499330 0.67 536273 0.68 573910 0.69 617330 0.7 657888 Elapsed time: 6.543123960494995 That's significantly faster! Use the :meth:`.ButinaClusters.save` method to save the results to a file. The output format can be specified directly, or inferred by the filename extension: .. code-block:: pycon >>> clusters.save("output.txt") >>> open("output.txt").readline() '#Centroid/1 include-members=1\n' >>> clusters.save("output.flat") '#Centroid-flat/1 include-members=1\n' >>> clusters.save("output.txt", format="flat") >>> open("output.txt").readline() '#Centroid-flat/1 include-members=1\n' >>> clusters.save("output.csv") >>> open("output.csv").readline() 'cluster,id,count,type,score\n' Butina clustering with fingerprints ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When fingerprints are passed to the butina function, it will generate the NxN similarity matrix at the given threshold (default: 0.7) before doing the clustering: .. code-block:: pycon >>> with chemfp.butina(fingerprints="benzodiazepine.fps.gz", NxN_threshold=0.9) as clusters: ... print(f"#clusters: {len(clusters)}") ... print(f"#elements in largest cluster: {len(clusters[0])}") ... ids = clusters.matrix.target_ids ... print("ids:", ", ".join(ids[i] for i in clusters[0])) ... #clusters: 9845 #elements in largest cluster: 13 ids: 9806545, 15978353, 18346686, 18346717, 21941544, 9914920, 21941549, 23376564, 9806982, 9828811, 9850353, 21941556, 21941589 Some of the Butina clustering variants require fingerprints and cannot work with only a similarity matrix. These can both be passed to the function, in this case using filenames: .. code-block:: pycon >>> chemfp.butina(matrix="benzodiazepine.npz", false_singletons="nearest-center") Traceback (most recent call last): ... ValueError: Cannot use false_singletons='nearest-center' without fingerprints >>> chemfp.butina(matrix="benzodiazepine.npz", fingerprints="benzodiazepine.fps.gz", ... false_singletons="nearest-center") ButinaClusters('12386 fingerprints, NxN-threshold=0.7 tiebreaker=randomize seed=3668915671 false-singletons=nearest-center rescore=1, took 333.22 ms', result=[314 clusters], picker=...) Butina clusters and pandas ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The :meth:`.ButinaClusters.to_pandas` method for the clusters returned by the :func:`chemfp.butina` function exports the fingerprint details to a pandas data frame, with one row per fingerprint: .. code-block:: pycon >>> import chemfp >>> clusters = chemfp.butina(matrix="benzodiazepine.npz") >>> clusters.to_pandas() cluster id type score 1632 1 10572033 CENTER 1.000000 1720 1 18968942 MEMBER 1.000000 5632 1 18665677 MEMBER 0.814815 5637 1 18968923 MEMBER 0.814815 4441 1 10223808 MEMBER 0.811321 ... ... ... ... ... 9995 225 20360404 CENTER 1.000000 9645 225 20360405 MEMBER 0.984127 11260 226 20360429 CENTER 1.000000 11069 226 20360430 MEMBER 0.985294 12385 227 11979873 CENTER 1.000000 [12386 rows x 4 columns] The method takes several parameters, including the column titles to use, and the choice to rename the type to "CENTER" and "MEMBER" or to keep the internal names, like the following which shows only the moved false singletons: .. code-block:: pycon >>> clusters.to_pandas(columns=["i", "ID", "type", "score"], ... rename=False).query("type == 'MOVED_FALSE_SINGLETON'") i ID type score 7097 1 16001255 MOVED_FALSE_SINGLETON 0.786885 2421 1 10980601 MOVED_FALSE_SINGLETON 0.711538 12088 1 20711717 MOVED_FALSE_SINGLETON 0.636364 12341 1 9854335 MOVED_FALSE_SINGLETON 0.631579 3046 1 20063249 MOVED_FALSE_SINGLETON 0.629630 ... ... ... ... ... 12210 118 21797426 MOVED_FALSE_SINGLETON 0.509091 7844 124 25262692 MOVED_FALSE_SINGLETON 0.690141 7245 144 21489312 MOVED_FALSE_SINGLETON 0.588235 3526 149 19865801 MOVED_FALSE_SINGLETON 0.560606 7302 162 22153741 MOVED_FALSE_SINGLETON 0.746032 [86 rows x 4 columns] If you want to be really hard-core, you can dig down to the low-level fingerprint assignments:: >>> clusters.assignments.to_pandas() assignment_type cluster_idx score 0 2 11 0.477273 1 2 11 0.413043 2 2 11 0.500000 3 2 11 0.500000 4 2 11 0.404255 ... ... ... ... 12381 2 121 0.402878 12382 2 121 0.402878 12383 2 134 0.493243 12384 2 134 0.439490 12385 1 226 1.000000 [12386 rows x 3 columns] or work with them directly, in-place, as NumPy array with a complex dtype: .. code-block:: pycon >>> clusters.assignments.as_numpy() array([(2, 11, 0.47727273), (2, 11, 0.41304348), (2, 11, 0.5 ), ..., (2, 134, 0.49324324), (2, 134, 0.43949045), (1, 226, 1. )], dtype={'names': ['assignment_type', 'cluster_idx', 'score'], 'formats': ['>> import chemfp >>> with chemfp.open_fingerprint_writer(None, include_metadata=False) as writer: ... writer.write_fingerprint("AB123", b"\xfa\x49") ... fa49 AB123 There are new command-line for the "date" line in the metadata. The default saves the current time (in UTC) to the output, which means the output is not bytewise reproducible, which is sometimes useful. The new ``--date`` option lets you specify the date to use, as an ISO date string. The new ``--no-date`` omits the date line from the metadata. NOTE: A future version of chemfp will support local time zones in FPS files. This will use the `zoneinfo `_ module added to Python 3.9, with fallback to `tzdata `_ for Python 3.8. (Note: Python 3.8 end-of-life for Python core maintainer support is October 2024.) spherex changes ------------------------------- There are several changes related to chemfp's spherical exclusion implementation, called ":command:`spherex`". Multi-threaded ~~~~~~~~~~~~~~~~~~ For each iteration, the spherex algorithm first picks a fingerprint (using one of several methods) then finds all other remaining fingerprints which are sufficiently close to the newly picked fingerprint, and assigns them to a new cluster. In chemfp 4.1 the "finds all other remaining fingerprints" step has been parallelized using OpenMP. The default number of threads depends on the OpenMP configuration, and likely the number of available cores. I've found that you can use several times more threads than cores and still get useful speedup, so if performance is important you should try larger values. As with simsearch, you can specify the number of threads using the OMP_NUM_THREADS environment variable. You can also set it on the :command:`spherex` command-line using the ``--num-threads``/``-j`` option, and with the :func:`chemfp.spherex` API using the ``num_threads`` parameter. Parallelism introduces some non-determinism in the order that sphere members are added to a cluster. The diversity result classes now implement :meth:`.PicksAndScores.reorder` and :meth:`.Neighbors.reorder` with the same orderings as :meth:`.SearchResult.reorder`, which can be used to sort the list by decreasing score, and generate a canonical ordering. The new :meth:`.PicksAndScores.move_pick_index_to_first` function moves the given member to the first element. Together with the reorder function these can be used to put the list in canonical order, with the cluster center as the first element. Output format ~~~~~~~~~~~~~~~~~ By default the :command:`spherex` command generates output in "spherex" format (previously called "chemfp" format). While the format hasn't changed, there are additional guarantees about the default output order. The sphere center is always the first element in the members list (if present), and the remaining members are always given in a canonical order, sorted by decreasing score. Spherex now also supports the "centroid" output format, matching the same format for Butina clustering. The idea is to make it easier to switch between the two algorithms. In this release "centroid" output by default only includes the picked sphere centers. Use ``--include-members`` to include all of the sphere members in the output. NOTE! the default "centroid" output from :command:`chemfp butina` is currently different than the one from :command:`chemfp spherex`. The butina output includes both the center and the members, and requires the ``--no-members`` option to only output the centers. The long-term goal is to make these the same, where the default includes the members. For future compatibility, please use ``--no-members`` if you really want the current default. The ``--include-members`` and ``--no-members`` options are aliases for the older ``--include-hits`` and ``--no-hits`` options. The older options are currently supported, for backwards compatibility, but hidden from the ``--help``. They should not be used and will likely eventually be removed. :command:`spherex` does not yet implement the "flat" centroid format of the :command:`butina` command. New ranking formats ~~~~~~~~~~~~~~~~~~~~~~ There are two new ways to specify fingerprint ranks for directed sphere exclusion when using the ``--ranks`` option. Instead of specifying a rank order or weight they order the picks by file position - the first pick is the first id in the file, the second pick is the next available id, and so on. The "txt" format contains one line per id, with an optional header. Use ``--ranks-has-header`` if the first line is a header. Alternatively you can specify a fingerprint file, which will be processed in input order. (As a reminder, by default FPB files are sorted in popcount order. It's possible to store the FPB file in input order using the :command:`fpcat` command-line tool, but in that case it's probably better to use the original FPS directly.) The fingerprint ranking option exists for cases where the fingerprints in the candidate file are in rank order. In that case, the file is specified once for the candidates and again for ``--ranks``: .. code-block:: console chemfp spherex dataset.fps --ranks dataset.fps csv background ------------------------------- Here is some background about what "csv" means in a chemfp context. It sets the stage for the next section. The term `csv `_ comes from the initialism for "Comma-Separated Values", but is also used to describe several other related formats, like tab-separated values. I will use to mean the files that can be parsed with Python's `csv module `_. In chemfp, the "csv" format is the dialect of the comma-separated format following the quoting rules the Excel uses. (Quoting rules are used to, for example, allow a comma in a field without it being intepreted as a separated.) It is the same as Python's "`excel `_" dialect, which chemfp accepts as an alias. In chemfp, the "tsv" format is the dialect of the tab-separated format following the quoting rules the Excel uses. It is the same as Python's "`excel-tab `_" format, which chemfp accepts as an alias. Python's csv module lets you specify `your own csv dialect `_, with options for which delimiter character to use, how escaping and quoting work, and whether or not initial spaces should be ignored. These are: - delimiter: The character used to separate fields - escapechar: On reading, the escapechar removes any special meaning from the following character - quotechar: A one-character string used to quote fields containing special characters - quoting: Controls if the reader should recognize quoting - skipinitialspace: If True, ignore leading spaces in a field. For details see `the Python documentation `_, The :command:`csv2fps` command-line tool lets you specify these via command-line options. It starts with the specified ``--dialect`` (defaulting to "csv"), then applies the other specified terms, like ``--separator``. Certain characters are difficult to write on the command-line so the ``--separator``, ``--quotechar`` and ``--escapechar`` also accept the aliases "tab", "backslash", "space", "quote", "doublequote", "singlequote", or "bang". The :func:`.get_dialect` function in the new :mod:`chemfp.csv_readers` module is the equivalent in the Python API. The returned :class:`.CSVDialect` can be used with both the chemfp API and Python's csv module. "sniffing" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The csv module has a way to "`sniff `_" the dialect. It reads the start of the file and uses a number of heuristics to determine the dialect parameters, and to determine if the first line is a header. Chemfp implements the special "sniff" dialect which will use these method to get the (hopefully) correct settings. My experience has been a bit iffy, and it strongly depends on the dataset you have. It's best used with :command:`chemfp csv2fps --describe` to get an initial idea of the actual format than to use it as the default dialect. MolPort csv file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some datasets are in csv format with a structure (usually SMILES or InChI) stored in one of the columns, and an identifier stored in another. One such example is the MolPort distribution, which I'll use for these examples. It is a tab-delimited format with a header line containing titles, followed by the data. Here's an example, which unfortunately overflows the available display space, so I've trimmed it the first 70 or so characters: .. code-block:: console % gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | head -2 SMILES SMILES_CANONICAL MOLPORTID STANDARD_INCHI INCHIKEY... Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 Cl.CCC(C(=O)OC1CC2CCC(C... That's still enough to see there are a couple of SMILES fields, a MolPort id, and an InChI. You can also see it's a tab-separated file, though I've converted them to multiple spaces for this output. There are a lot of tools for working with CSV files, including the popular `xsv `_ and its variants. Those are very fast and flexible, with ways to filter, join, and manipulate the data. For example, the following line for the Bash shell will select the SMILES and MOLPORTID columns (from a tab-delimited file), then re-format the comma-delimited output to tab-delimited, to get a tab-delimited SMILES file with a header: .. code-block:: console % gzcat fulldb_smiles-009-000-000--009-499-999.txt.gz | \ xsv select SMILES,MOLPORTID -d $'\t' | xsv fmt -t $'\t' | head -3 SMILES MOLPORTID Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 MolPort-009-675-630 CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1 MolPort-009-675-631 This uses the Bash's ANSI-C quoting to insert the tab character. There are other ways to insert the tab character, depending on your shell and/or terminal emulator. csv2fps command ------------------------------- While it's not hard to convert a CSV file to SMILES for input to chemfp fingerprint generation, with chemfp 4.1 you can now process CSV files directly using the :command:`chemfp cvs2fps` command. I'll use "fulldb_smiles-009-000-000--009-499-999.txt.gz" from the MolPort distribution, but because that's rather long I'll refer to it using the "$FILENAME" environment variable. csv2fps \-\-describe ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When getting started with a new CSV file, use the special ``--describe`` option to have chemfp describe what it thinks the file contains. By default it parses the file as the "csv" dialect, that is comma-delimited, with a header. It tries to show the titles and columns for the first data line: .. code-block:: console % chemfp csv2fps --describe $FILENAME === Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' === Dialect: csv delimiter: ',' (default) quotechar: '"' (default) escapechar: None (default) doublequote: True (default) skipinitialspace: False (default) quoting: minimal has_header: True Titles: #1: 'SMILES\tSMILES_CANONICAL\tMOLPORTID\tSTANDARD_INCHI\tINCHIK ... ' Columns for the first row: #1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1\tCl.CCC(C(=O)OC1CC ... ' #2: '14-17H' #3: '3' #4: '9-12H2' #5: '1-2H3;1H\tVEUYDINLFCGWRM-UHFFFAOYSA-N\tin stock' Long strings are trimmed, and replace with the "...", so they easily fit into an 80 column display. The output fields are printed using Python's string encoding. The many "\\t" characters is strong evidence that the file is tab-separated. I can use the "sniff" dialect to have Python's csv module use heuristics to identify the dialect: .. code-block:: console % chemfp csv2fps --describe --dialect sniff $FILENAME === Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' === Dialect: delimiter: '\t' quotechar: '"' (default) escapechar: None (default) doublequote: False skipinitialspace: False (default) quoting: minimal has_header: True Titles: #1: 'SMILES' #2: 'SMILES_CANONICAL' #3: 'MOLPORTID' #4: 'STANDARD_INCHI' #5: 'INCHIKEY' #6: 'AVAILABILITY' Columns for the first row: #1: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1' #2: 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)c1ccccc1' #3: 'MolPort-009-675-630' #4: 'InChI=1S/C18H25NO2.ClH/c1-3-17(13-7-5-4-6-8-13)18(20)21-16- ... ' #5: 'VEUYDINLFCGWRM-UHFFFAOYSA-N' #6: 'in stock' or I can specify the "tsv" dialect, which gives the same titles and same columns for the first line. Use ``--no-header`` if the CSV file does not start with a header. For this file that option produces the following: .. code-block:: console % chemfp csv2fps --describe --dialect tsv $FILENAME --no-header === Description for 'fulldb_smiles-009-000-000--009-499-999.txt.gz' === Dialect: tsv delimiter: '\t' quotechar: '"' (default) escapechar: None (default) doublequote: True (default) skipinitialspace: False (default) quoting: minimal has_header: False Titles: not available Columns for the first row: #1: 'SMILES' #2: 'SMILES_CANONICAL' #3: 'MOLPORTID' #4: 'STANDARD_INCHI' #5: 'INCHIKEY' #6: 'AVAILABILITY' csv2fps and column specifiers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The default for the :command:`csv2fps` command is to parse the file in "csv" dialect (or "tsv" if the file ends in ".tsv"), treat the first line as a header, get the identifers from the first column and SMILES from the second column, and generate RDKit Morgan2 fingerprints. I know the file is tab-delimited (see the previous section), so I'll specify "tsv" format and show the first two lines of fingerprint output: .. code-block:: console % chemfp csv2fps -d tsv $FILENAME | head -8 | fold -w 70 #FPS1 #num_bits=2048 #type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1 #software=RDKit/2021.09.4 chemfp/4.1 #source=fulldb_smiles-009-000-000--009-499-999.txt.gz #date=2023-05-15T09:10:17 0200000000000000000001000000800000000800000000000000000000000000000000 0040000000000004000000002020000000000000000000080000000000000000000000 4000000000000000000000040000000080000000000000000000000000008800000040 0000000000000000000040800000000000000000000008000000040202000001000000 4000000200000000008000000000000000000000001040000020000000002000100000 0000000000000000000000000000000000040100000020000000000002000000000000 0000000000000000400000000000000000000000000000000200000040300000000000 0000000000000000000000 Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1 0200000000000000008001000000200000000000000000000000001000002000000000 0800000000000000000000000020000000000000000000000000000000000000000000 0020000000000000000000040000000000000000000001000000000000408000000000 0000020000000000000000000000000000000000000002000000000204000001004000 0000000001000000008000000000000000000000000000000000000000002000100000 0000000000000000000000000000000000000000000020000000000108000000000000 0080000000000000400400000000000000000000000000010200000000200002000000 0000000000400000000000 CC(NC(=O)C(CC1=CC=CC=C1)NC(C)=O)C1=CC=CC=C1 This ended up interpreting the "'SMILES" column as the id, and the "SMILES_CANONICAL" column as the SMILES, which is why :command:`csv2fps` is able to generate output, though likely most people want the id from the MOLPORTID column. Use ``--id-column`` (or ``--id-col`` for short) to specify the id column to use. In the following I'll specify it by title, and I'll disable the header lines (the "metadata") using the ``--no-metadata``, so the output contains just the first two lines of fingerprint data: .. code-block:: console % chemfp csv2fps -d tsv $FILENAME --id-col MOLPORTID --no-metadata | head -2 | fold 02000000000000000000010000008000000008000000000000000000000000000000000040000000 00000400000000202000000000000000000008000000000000000000000040000000000000000000 00040000000080000000000000000000000000008800000040000000000000000000004080000000 00000000000000080000000402020000010000004000000200000000008000000000000000000000 00104000002000000000200010000000000000000000000000000000000000000401000000200000 00000002000000000000000000000000000040000000000000000000000000000000020000004030 00000000000000000000000000000000 MolPort-009-675-630 02000000000000000080010000002000000000000000000000000010000020000000000800000000 00000000000000002000000000000000000000000000000000000000000000200000000000000000 00040000000000000000000001000000000000408000000000000002000000000000000000000000 00000000000000020000000002040000010040000000000001000000008000000000000000000000 00000000000000000000200010000000000000000000000000000000000000000000000000200000 00000108000000000000008000000000000040040000000000000000000000000001020000000020 00020000000000000000400000000000 MolPort-009-675-631 You can only specify a column title if the file contains headers. You can also specify the column index by column number, starting from 1. I'll use ``--molecule-column`` (or ``--mol-col`` for short) to read the SMILES from column #2 ("SMILES_CANONICAL"), and ``--id-col`` to read the SMILES from column #3: .. code-block:: console % chemfp csv2fps -d tsv $FILENAME --id-col 3 --mol-column 2 --no-metadata | head -2 | fold 02000000000000000000010000008000000008000000000000000000000000000000000040000000 00000400000000202000000000000000000008000000000000000000000040000000000000000000 00040000000080000000000000000000000000008800000040000000000000000000004080000000 00000000000000080000000402020000010000004000000200000000008000000000000000000000 00104000002000000000200010000000000000000000000000000000000000000401000000200000 00000002000000000000000000000000000040000000000000000000000000000000020000004030 00000000000000000000000000000000 MolPort-009-675-630 02000000000000000080010000002000000000000000000000000010000020000000000800000000 00000000000000002000000000000000000000000000000000000000000000200000000000000000 00040000000000000000000001000000000000408000000000000002000000000000000000000000 00000000000000020000000002040000010040000000000001000000008000000000000000000000 00000000000000000000200010000000000000000000000000000000000000000000000000200000 00000108000000000000008000000000000040040000000000000000000000000001020000000020 00020000000000000000400000000000 MolPort-009-675-631 If the specified column name is an integer then it is treated as an index, with 1 indicating the first column (0 and negative values are out of range). If the first character of the column specifier is an "@" then the rest of the specifier is used as the column title. For example, if the second column is titled "10" then use "@10" as the column specifier. If the column title is "@A" then use "@@A" as the column specifier. If the first character of the column specifier is a "#" then the rest of the specifier must be an integer, which is interpreted as the column index. For example, use "#3" to match the third column. If the column title starts with a "#" then use the "@" prefix (eg, if the column is "#entry" then specify it as "@#entry"). Specify the csv2fps input structure format ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default :command:`mol2fps` parses the molecule column as SMILES (in "smi" format) . Use ``--format`` to specify another format. For example, the MolPort file column STANDARD_INCHI contains an InChI string, and the INCHIKEY column contains the corresponding InChIKey. I'll read the structures from the former column, and the id from the latter, and have chemfp parse the molecule as "inchi": .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \ --format inchi --no-metadata | head -2 | fold 02000000000000000000010000008000000008000000000000000000000000000000000040000000 00000400000000202000000000000000000008000000000000000000000040000000000000000000 00040000000080000000000000000000000000008800000040000000000000000000004080000000 00000000000000080000000402020000010000004000000200000000008000000000000000000000 00104000002000000000200010000000000000000000000000000000000000000401000000200000 00000002000000000000000000000000000040000000000000000000000000000000020000004030 00000000000000000000000000000000 VEUYDINLFCGWRM-UHFFFAOYSA-N 02000000000000000080010000000000000000000000000000000000000000000010000800000000 00000000000000002000000000000000000000000000000800000000000000000004000000000000 00000000000000000000210000000000000000008000000000000002000000004000000000000000 00000000000000020000000002040000010000000000000000000000008000000000000000000000 00000000000000000000200010000000000010000000000000000000000000000100000000200000 00000000000000000000000000000008000440040800080000000000000000000001020000000200 00000000000000000000000000000000 BAHWBLGZSGIKQC-UHFFFAOYSA-N If you leave out the format, or specify "smi" format, then it will try to interpret the InChI strings as SMILES, which will fail. As a safety mechanism, :command:`csv2fps` will exit if none of the molecules in the first 100 records can be parsed. That looks like: .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col INCHIKEY --mol-col STANDARD_INCHI \ --format smi --no-metadata | head -2 | fold ... many error lines removed ... [11:56:09] SMILES Parse Error: syntax error while parsing: InChI=1S/C10H8C l2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16) [11:56:09] SMILES Parse Error: Failed parsing SMILES 'InChI=1S/C10H8Cl2N4/ c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)' for in put: 'InChI=1S/C10H8Cl2N4/c11-6-3-1-2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5 H,(H3,13,14,15,16)' Error: Each of the first 100 records contained structure errors. Final error: RDKit cannot parse the SMILES 'InChI=1S/C10H8Cl2N4/c11-6-3-1- 2-4-7(6)14-9-5-8(12)15-10(13)16-9/h1-5H,(H3,13,14,15,16)', column 'STANDAR D_INCHI' (#4), record 100, line 101, 'fulldb_smiles-009-000-000--009-499-9 99.txt.gz'. Exiting. Use the ``-R`` command-line option to specify toolkit- and format-specific reader arguments. As a short-cut to enable or disable CXSMILES extension support, use ``--cxsmiles`` (enabled by default) and ``--no-cxsmiles``. structure formats with embedded identifiers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The previous examples read the id from one column, and the molecule from another, as a SMILES or InChI string. Sometimes the column contains a more complex record, like a SMILES record with both a SMILES string and identifier, or even as an embedded SDF. (The CSV format supports multi-line column entries.) Use the ``--id-from-molecule`` option (or ``--id-from-mol`` for short) to have :command:`csv2fps` get the identifier from the parsed molecule record. For a SMILES or InChI record this defaults to the rest of the column entry. This can be changed with the ``--delimiter`` option. If the column entry is an SDF record then by default the identifier comes from the title line. Use ``--id-tag`` to get the identifier from the named data value. Specify an alternate fingerprint type ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The examples so far generated RDKit Morgan fingerprints. To use an alternative name specify a chemfp type string using ``--type`` or use the ``--with`` option to get the fingerprints from the metadata of a fingerprint file. The following uses Open Babel to generate FP2 fingerprints: .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type OpenBabel-FP2 \ | head -8 | fold #FPS1 #num_bits=1021 #type=OpenBabel-FP2/1 #software=OpenBabel/3.1.0 chemfp/4.1 #source=fulldb_smiles-009-000-000--009-499-999.txt.gz #date=2023-05-15T10:21:04 00260200002002000600060000020000001000100000000000000080010400001000500080000001 100a001820010100010200000020c004000000020000a20808100800000004400000880002800005 0090015000003008c008080000080000400000200000000010000000400008000007010000011000 0000000140000000 MolPort-009-675-630 00060000004004200b0006000001000c0000009000000000000000c0000800001800408300000000 00862038000100800102000008000000000000020000000000000400000400000020000002880006 00b0004000003008c0080900800800000c1000220000000000000000000000000006010400011000 0032800000000000 MolPort-009-675-631 The following uses CDK to generate CDK's own implementation of the 881-bit PubChem fingerprints: .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --type CDK-Pubchem \ | head -8 | fold #FPS1 #num_bits=881 #type=CDK-Pubchem/2.0 #software=CDK/2.8 chemfp/4.1 #source=fulldb_smiles-009-000-000--009-499-999.txt.gz #date=2023-05-15T10:22:50 005e0c002000000000000000000000000080060000003c0200000000000000800000007800000000 00b03c8719604c10c10020001140044b100040000004000010118010001110044c01a90861040064 03801111e01913077101000000000000000000000000000000000000000000 MolPort-009-675- 630 00de0c000000000000000000000000000000000000000c0600000000000000800200007800080000 0030148319204c00410300001140844a080041000004000010118111201110064c01898c29041006 69001111e01811017100000000000000000000000000000000000000000000 MolPort-009-675- 631 The chemfp type string may contain fingerprint parameters. The following uses OEChem to generate OEGraphSim circular fingerprints, folded to 128 bit, instead of the default of 4096 bits: .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 \ --type "OpenEye-Circular numbits=128" | head -8 | fold #FPS1 #num_bits=128 #type=OpenEye-Circular/2 numbits=128 minradius=0 maxradius=5 atype=Arom|AtmNum|C hiral|EqHalo|FCharge|HCount btype=Order #software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1 #source=fulldb_smiles-009-000-000--009-499-999.txt.gz #date=2023-05-15T10:24:44 905a04205c4473c607f95c06123b2026 MolPort-009-675-630 a07873c1407c5c7617830d8428910170 MolPort-009-675-631 Finally, here's an example using ``--with`` to get the fingerprint file from a file: .. code-block:: console % chemfp csv2fps $FILENAME -d tsv --id-col 3 --mol-col 1 --using reference.fps | \ head -7 | fold #FPS1 #num_bits=4096 #type=OpenEye-Circular/2 numbits=4096 minradius=0 maxradius=5 atype=Arom|AtmNum| Chiral|EqHalo|FCharge|HCount btype=Order #software=OEGraphSim/2.5.4.1 (20220607) chemfp/4.1 #source=fulldb_smiles-009-000-000--009-499-999.txt.gz #date=2023-05-15T10:29:18 00000000000000000000000000000000000000000000080000000010000000000000000800000000 00200000000040000000000000000000000000000000000000000000000008000004000001080000 00000200000000000000000200000000000000000000100000000000002000200000000000010800 00008000000000000000000000000000000000008000040000000000040000000000000000002000 00008000020000000000000000000000000000040000000000000000000000000000000000800000 00000000000080000000000000000000000000000000000000000000000000000000000000000000 00800000000002000000000000000020002000000100000000080000000000000000000000000000 00000000000000000000000000000000000020000000000000000000000000000000080000000000 00000000000080008000000000000008000000002400000000000000000000000000000000020000 00000000000000004000000100100000000048000801840000000000001000000000020000000000 00000000000000000000000000000000000000000000000000000000000000000000000000001000 20008000000000000000000000200000000000000000000000000000000000000400000000000000 0020000804000000000000000000000000000000000000000000000000000000 MolPort- 009-675-630 CSV files with fingerprint data ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Instead of storing a molecule, the CSV file might contain fingerprint data, represented in one of various encodings like hex or base64. If this is the case, use ``--fingerprint-column`` (or ``--fp-column`` or ``--fp-col`` for short) to indicate that the CSV file contains fingerprint data, and to indicate which column contains that data. csv2fps supports the same decoder options as sdf2fps. The default, ``--hex``, treats the fingerprint as hex-encoded with bit - as the first bit of the first byte. Use :command:`chemfp csv2fps --help` for the full list of fingerprint decoder options. csv character encoding ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The "character set" describes how the text in the CSV file is encoded as a sequence of bytes. By default :command:`csv2fps` assumes the CSV file is UTF-8 encoded. This may be a problem with some programs which write data in another encoding, like "utf16" and "cp1252" for Windows or "latin1" for older Unix tools. Use ``--encoding`` to specify the input encoding, which must be one of the `supported Python encoding `_. Use ``--encoding-errors`` to describe how to handle input which could not be decoded. This must be one of the supported `Python encoding handlers `_ "strict", "ignore", "replace", or "backslashreplace". csv2fps processing errors ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :command:`csv2fps` can run into two main types of problems when processing a row of the file: 1) the row may not have enough colums, and 2) the toolkit may be unable to parse the molecule column or the fingerprint column. The ``--errors`` option specifies how to handle molecule and fingerprint parse errors. The default is "report" for molecules and "strict" for fingerprints. The "strict" handler prints an error message and stops processing. The "report" handler prints an error message and keeps on processing. The "keep" handler simply keeps on processing. The ``--csv-errrors`` option specifies how to handle CSV parse errors. The default is "strict", which requires the id and molecule or fingerprint column to exist, or print an error message and exit if not. It can also be "report" and "keep". csv2fps TODO ------------------------------------------ There is currently no high-level ``chemfp.csv2fps`` Python function which is equivalent to the command-line tool. This will be added in a future version of chemfp. chemfp.csv_readers module ------------------------------------------ Chemfp 4.1 added the :mod:`.csv_readers` module with functions for working with CSV files. To start with, there is a utility function for specifying the csv dialect. It starts with a base dialect and lets you modify different values. The result can be used as a chemfp dialect and as input to Python's own `csv module `_. .. code-block:: pycon >>> from chemfp import csv_readers >>> csv_readers.get_dialect("csv") CSVDialect(delimiter=',', quotechar='"', escapechar=None, doublequote=True, skipinitialspace=False, quoting=0 ('minimal'), lineterminator='\r\n', strict=False) >>> csv_readers.get_dialect("csv", doublequote=False) CSVDialect(delimiter=',', quotechar='"', escapechar=None, \ doublequote=False, skipinitialspace=False, quoting=0 ('minimal'), lineterminator='\r\n', strict=False) The lowest-level interface reads a CSV file, parses any available header, and iterates over rows in the file. .. code-block:: pycon >>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz" >>> reader = csv_readers.read_csv_rows(filename, dialect="tsv") >>> reader.titles ['SMILES', 'SMILES_CANONICAL', 'MOLPORTID', 'STANDARD_INCHI', 'INCHIKEY', 'AVAILABILITY'] >>> next(reader) ['Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1', 'Cl.CCC(C(=O)OC1CC2CCC (C1)N2C)c1ccccc1', 'MolPort-009-675-630', 'InChI=1S/C18H25NO2.ClH/c 1-3-17(13-7-5-4-6-8-13)18(20)21-16-11-14-9-10-15(12-16)19(14)2;/h4- 8,14-17H,3,9-12H2,1-2H3;1H', 'VEUYDINLFCGWRM-UHFFFAOYSA-N', 'in sto ck'] >>> reader.close() It also supports a context-manager, if you don't want an explicit close() nor want to depend on Python's garbage collector. The :func:`.read_csv_ids_and_molecules_with_parser` function gives a higher-level, albeit somewhat complicated, way to iterate over (id, molecule) pairs in the file. It requires a function which takes the text of the molecule field and returns an (id, molecule) pair -- the id is only used if ``id_column`` is None: .. code-block:: pycon >>> def convert(mol_text): ... return (None, (len(mol_text), mol_text)) ... >>> reader = csv_readers.read_csv_ids_and_molecules_with_parser( ... filename, convert, dialect="tsv", id_column=3, mol_column=1) >>> reader CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol from column 'SMILES' (column #1), dialect='tsv', record_format=None, filename='fulldb_smiles-009-000-000--009-499-999.txt.gz') >>> for title in reader.titles: print(repr(title)) ... 'SMILES' 'SMILES_CANONICAL' 'MOLPORTID' 'STANDARD_INCHI' 'INCHIKEY' 'AVAILABILITY' >>> next(reader) ('MolPort-009-675-630', (40, 'Cl.CCC(C(=O)OC1CC2CCC(C1)N2C)C1=CC=CC=C1')) This is easily modified to return only the molecule text rather than the (length, text) pair. The columns can be specified as integers (the value 1 means the first column) or strings (to use the named column title). This API is designed to be compatible with the "id and molecule" parser create by the toolkit wrappers: .. code-block:: pycon >>> from chemfp import openeye_toolkit >>> converter = openeye_toolkit.make_id_and_molecule_parser("smi") >>> reader = csv_readers.read_csv_ids_and_molecules_with_parser( ... filename, converter, dialect="tsv", id_column=3, mol_column=1) >>> next(reader) ('MolPort-009-675-630', >) NOTE: If you want to use a toolkit wrapper to parse molecules from a CSV file then you almost certainly want to use the new - and simpler - toolkit wrapper function ``read_csv_ids_and_molecules``. Finally, use :func:`.read_csv_ids_and_fingerprints` to read the ids and encoded fingerprints from a file. The default uses the "hex" decoder (:func:`chemfp.bitops.hex_decode`) but you can pass in your own function returning the number of bits (which can be None) and the decoded fingerprint as a byte string: .. code-block:: pycon >>> def decode_fp(fp_text): ... return 32, x.to_bytes(4, "big") ... >>> reader CSVFingerprintReader(id from column 'MOLPORTID' (column #3), fp from column 'SMILES_CANONICAL' (column #2), dialect='tsv', filename= 'fulldb_smiles-009-000-000--009-499-999.txt.gz') >>> next(reader) ('MolPort-009-675-630', b'\x00\x00\x00*') New toolkit wrapper functions to read CSV files ---------------------------------------------------- The toolkit wrappers have a new ``read_csv_ids_and_molecules`` function which reads ids and molecules from a CSV file, using a specified structure format. .. code-block:: pycon >>> from chemfp import rdkit_toolkit as T >>> filename = "fulldb_smiles-009-000-000--009-499-999.txt.gz" >>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv", ... id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi") >>> >>> reader CSVIdAndMoleculeReader(id from column 'MOLPORTID' (column #3), mol from column 'STANDARD_INCHI' (column #4), dialect='tsv', record_format= 'inchi', filename='fulldb_smiles-009-000-000--009-499-999.txt.gz') >>> next(reader) ('MolPort-009-675-630', ) This works for all of the supported cheminformatics toolkits: .. code-block:: pycon >>> from chemfp import rdkit_toolkit as T >>> from chemfp import openbabel_toolkit as T >>> from chemfp import openeye_toolkit as T >>> from chemfp import cdk_toolkit as T There is currently no direct way to read the molecules and convert them to fingerprints as an (id, fingerprint) pair. You will need to handle the conversion yourself, like the following: .. code-block:: pycon >>> from chemfp import rdkit_toolkit as T >>> fptype = T.morgan(fpSize=128) >>> reader = T.read_csv_ids_and_molecules(filename, dialect="tsv", ... id_column="MOLPORTID", mol_column="STANDARD_INCHI", format="inchi") >>> >>> fptype = T.morgan(fpSize=128) >>> id_fp_iter = ((id, fptype.from_mol(mol)) for (id, mol) in reader) >>> next(id_fp_iter) ('MolPort-009-675-630', b'&\x15HD\xca\xa2\xc0\x00A\x00o\x02P\x00\xc0:') The next version of chemfp will likely have a function for this operation. translate command ----------------------------------- Chemfp uses a "toolkit wrapper" to provide a consistent API for reading and writing structure files for each of the supported toolkits, and for generating fingerprints using the toolkit-specific fingerprint generators. Chemfp 4.1 adds a new :command:`chemfp translate` command-line tool which uses that wrapper API to read structure files in one format and write them to a file in another format. If you do not specify which toolkit to use (via the ``--in-toolkit`` option, or ``-in-tk`` or ``-T`` for short), chemfp will use the value of the "CHEMFP_TOOLKIT" environment variable, which must be a comma-separated list of toolkit names. If that variable is not available then it uses the string "rdkit,openbabel,openeye,cdk", that is, it first checks for the RDKit toolkit, then the Open Babel toolkit, then OpenEye's OEChem toolkit, and finally the CDK toolkit (via the jpype Python/Java adapter). These examples will use the RDKit toolkit. If unspecified, the translate command reads and writes in "smi" format, including possible CXSMILES extensions, and the output will be canonical SMILES but without CXSMILES: .. code-block:: console % echo "OCC ethyl alcohol" | chemfp translate CCO ethyl alcohol % echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | chemfp translate C[C@H](F)[C@H](C)[C@@H](C)Br example Open Babel generate the same canonical SMILES: .. code-block:: console % echo "OCC ethyl alcohol" | chemfp translate -T openbabel CCO ethyl alcohol Use ``--in`` and ``--out`` to specify the input and output format: .. code-block:: console % echo "OCC phenol" | chemfp translate --out sdf3k phenol RDKit 0 0 0 0 0 0 0 0 0 0999 V3000 M V30 BEGIN CTAB M V30 COUNTS 3 2 0 0 0 M V30 BEGIN ATOM M V30 1 O 0.000000 0.000000 0.000000 0 M V30 2 C 0.000000 0.000000 0.000000 0 M V30 3 C 0.000000 0.000000 0.000000 0 M V30 END ATOM M V30 BEGIN BOND M V30 1 1 1 2 M V30 2 1 2 3 M V30 END BOND M V30 END CTAB M END $$$$ % echo "C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example" | \ chemfp translate --out cxsmi C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example or depend on the input and filename extension: .. code-block:: console % chemfp translate phenol.smi -o phenol.inchi % cat phenol.inchi InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 phenol While chemfp can convert between different formats, it is not meant as a replacement for a tool like Open Babel, which is able to add or remove hydrogens, assign coordinates, and more as part of the translation process. Translate with the 'text' toolkit ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Chemfp includes the "text" toolkit, which knows just enough about SMILES and SDF formatted files to read and write records. It is not a cheminformatics toolkit and it does not handle chemistry or format conversion. The text toolkit, as part of chemfp 4.1 support for CXSMILES, now supports heuristics which try to identify the CXSMILES extension in the SMILES file record. This means the text toolkit can be used with :command:`translate` as a way to remove the CXSMILES extensions from a SMILES file, without fully processing the SMILES string into a molecule object: .. code-block:: console % cat test.smi C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example % chemfp translate test.smi -T text C[C@H](F)[C@H](C)[C@@H](C)Br example % chemfp translate test.smi -T text --out cxsmi C[C@H](F)[C@H](C)[C@@H](C)Br |a:1,o1:3,5| example Unlike the chemistry toolkits, which only support UTF8-encoded input formats, the text toolkit supports other `extended ASCII `_ encodings, like latin1, which means it can be used to convert between two different encodings: .. code-block:: console % chemfp translate -T text --in-encoding latin1 latin1.sdf -o utf8.sdf However, you should almost certainly use a dedicated conversion tool for this purpose, like `iconv `_: .. code-block:: console % iconv -f latin1 -t utf-8 latin1.sdf > utf8.sdf Translate via an intermediate format ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The :command:`translate` command can also do the conversion in two steps, that is, read the record in the input format, save it internally to an intermediate "via" format (which defaults to "sdf"), which is then parsed and written in the output format. For example, if the input file is in OpenEye's "oeb" format, and you want to convert it to Java code which will create the corresponding CDK molecule, then you can use OEChem to convert the OEB input to SDF and have CDK convert the SDF to "cdkjava" format, with the following: .. code-block:: console % chemfp translate -T openeye theobromine.oeb -U cdk --out cdkjava { IChemObjectBuilder builder = DefaultChemObjectBuilder.getInstance(); IAtomContainer mol = builder.newInstance(IAtomContainer.class); IAtom a1 = builder.newInstance(IAtom.class,"C"); a1.setFormalCharge(0); a1.setPoint2d(new Point2d(2.1348, 0.7541)); mol.addAtom(a1); .... Admittedly this is mostly a technically interesting gimmick as I haven't come up with a good example where this would be a useful way to convert between two chemistry formats. One place it might be useful is to handle a legacy Latin-1 encoded structure file (or one which uses another extended ASCII encoding), and convert it to another chemistry format, like the following, which uses the "text" toolkit to read "latin1.sdf", which is Latin-1 encoded, into an intermediate UTF-8-encoded SDF, then have RDKit translate the result into SMILES, sending the results to stdout: .. code-block:: console % chemfp translate -T text --in-encoding latin1 -U rdkit latin1.sdf translate_record function ----------------------------------- The toolkit wrappers have a new ``translate_record`` function which convert a single record from one format to another format. By default is parses from "smi" to a molecule then creates and returns in "smi" format. The following shows how the four supported cheminformatics toolkits canonicalize phenol: .. code-block:: pycon >>> import chemfp >>> from chemfp import rdkit_toolkit as T >>> T.translate_record("c1ccccc1O phenol") 'Oc1ccccc1 phenol\n' >>> from chemfp import openeye_toolkit as T >>> T.translate_record("c1ccccc1O phenol") 'c1ccc(cc1)O phenol\n' >>> from chemfp import openbabel_toolkit as T >>T.translate_record("c1ccccc1O phenol") 'Oc1ccccc1 phenol\n' >>> from chemfp import cdk_toolkit as T >>> T.translate_record("c1ccccc1O phenol") 'C1=CC=C(C=C1)O phenol\n' while here you can see they all generate the same InChI string: .. code-block:: pycon >>> import chemfp >>> for T in (chemfp.rdkit, chemfp.openeye, chemfp.openbabel, chemfp.cdk): ... print(T.name.rjust(10), ... T.translate_record("c1ccccc1", in_format="smistring", out_format="inchistring")) ... rdkit InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H openeye InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H openbabel InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H cdk InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H Structure I/O helper functions ------------------------------- Chemfp 4.0 introduced a few helper functions to make it easier to read and write structure files, and process structure records, when the format was known. These were: - ``from_smi()``: parse a SMILES record into a molecule - ``from_smi_file()``: read a SMILES file iterating over molecules - ``to_smi()``: convert a molecule to SMILES - ``to_smi_file()``: open an output file for writing molecule as SMILES plus equivalents for SDF and InChI, and functions for reading and writing SMILES string and InChI strings (these only contain a single struture, with neither identifier nor newline). These names proved hard to remember because they were not consistent with the rest of the chemfp API and did not include enough information about what they did. For example, if I wanted to read a SMILES file I would start `read_` then use tab-complete to see what was available. That would give me `read_ids_and_molecules` and `read_molecules`, but not show `from_smi_file`. Even if I remembered to use `from_smi_file`, I would forget if the iterator only returned molecules, or the more useful (id, molecule) pairs. I would also forget that `from_smi` was to read a record in "smi" format, and not (as I thought) to read a file in "smi". The new shortcut functions in chemfp 4.1 are: - ``read_smi_ids_and_molecules()``: read a SMILES file iterating over (id, molecule) pairs. - ``read_smi_molecules()``: parse a SMILES file iterating over molecules - ``parse_smi()``: parse a "smi" record into a molecule - ``create_smi()``: convert a molecule to a "smi" record - ``open_smi_writer()``: open an output file for writing molecule as SMILES These are also available for "sdf" and "inchi" formats, as well as the parse/create functions for "smistring" and "inchistring", which are the format names for just the SMILES and InChI components of the "smi" and "inchi" records, respectively. A few other formats, like "helm", and "fasta" are present on a toolkit-specific basis, but are probably not useful and are likely to be removed in the future unless people tell me they like them. In addition, "\*_from_string()` and "\*_to_string()" variants are also now supported, like ``read_smi_ids_and_molecules_from_string()`` and ``open_smi_writer_to_string()``. The old shortcuts are still available but generate a PendingDeprecationWarning. They will be removed in a future version of chemfp. Other API changes ---------------------- Chemfp uses a random number generator for some of the diversity selection and clustering options. The RNG is initialized with a 64-bit integer seed. If the seed is not specified then chemfp uses Python's RNG to generate the seed at random (Python's RNG uses a number of methods to generate its own seed). Chemfp 4.0 requested a 64-bit number from Python, which on average takes about 20 characters to represent in the output parameters, which drew the eye far more than it was useful. Chemfp 4.1 requests a 32-bit numbers as its initial RNG, which shouldn't meaningfully affect the scientific results. The :meth:`.FingerprintArena.copy` command now accepts ``ids``. If specified, it contains the list of ids to use in the copy. This is especially useful when making a subset given a list of indices, and you want to be able to map the new subset back to the original index. In that case use: .. code-block:: python b = a.copy(indices=offsets, ids=offsets) so the fingerprint at position *i* in the new arena was at position :code:`b.ids[i]` in the original arena. The diversity pickers now support a "max_time" parameter, to stop picking after a specified amount of seconds (given as a 64-bit float). The timeout is checked after each pick has finished. This helps gives smooth progress bars or other interactive feedback about the picking process, DEPRECATION NOTICE: The :func:`.bitops.byte_difference` and :func:`.bitops.hex_difference` functions have been renamed :func:`.bitops.byte_xor` and :func:`.bitops.hex_xor`, mostly because I couldn't remember what "difference" meant but I knew exactly what "xor" meant. The old functions still work, but generate a PendingDeprecationWarning. They will likely be removed in a future version of chemfp.