chemfp 4.1 documentation
chemfp is a set of command-line tools and a Python package for working with binary cheminformatics fingerprints, typically between several hundred and a few thousand bits long.
This is the documentation for chemfp 4.1. It is a work-in-progress and currently only covers the major changes since chemfp 3.5. While chemfp 4.1 is nearly completely backwards compatible with chemfp 3.5, the recent additions change and (hopefully!) improve how to use the chemfp API.
Chemfp 4.1 was released on 17 May 2023 with CXSMILES support, methods to save and load similarity search results as SciPy sparse matrices, Butina clustering, methods to work with CSV files, and a tool to convert between structure formats.
It was tested with Python 3.8, 3.9, 3.10, and 3.11. It requires the “click” and “tqdm” third-party packages, which should be installed automatically as part of the normal installation process. Some optional features, like the NumPy, SciPy, and Pandas integration, will only work if those packages are installed separately.
Chemfp 4.0 added new methods for diversity selection and improved API usability with new high-level functions and better feedback for interactive use (including progress bars!).
For details on the older parts of the API, please also consult the chemfp 3.5 documentation.
Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit. The supported toolkits are OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter).
Table of Contents
- What’s new in chemfp 4.1
- Required dependencies on click and tqdm
- New structure format specifiers
- New SearchResult(s) attributes
- Save simsearch results to “npz” files
- chemfp.npy file entry in an npz file
- Working with npz files in chemfp
- Importing SciPy csr matrices to chemfp
- Butina clustering
- Butina on the command-line
- Butina parameter tuning
- High-level Butina API
- Changed default output format name
- Output metadata options
- spherex changes
- csv background
- csv2fps command
- csv2fps TODO
- chemfp.csv_readers module
- New toolkit wrapper functions to read CSV files
- translate command
- translate_record function
- Structure I/O helper functions
- Other API changes
- What’s new in chemfp 4.0
- Installing chemfp
- Working with the command-line tools
- Generate fingerprint files from PubChem SD tags
- k-nearest neighbor search
- simsearch CSV output
- Threshold search
- simsearch CSV output when no hits
- Combined k-nearest and threshold search
- NxN (self-similar) searches
- Using a toolkit to process the ChEBI dataset
- Use structures as input to simsearch
- Make new fingerprints matching the type in an existing file
- Alternate error handlers
- chemfp’s two cross-toolkit substructure fingerprints
- substruct fingerprints
- Generate binary FPB files from a structure file
- Convert between FPS and FPB formats
- Specify the fpcat output format
- Alternate fingerprint file formats
- The FPB format
- Get licensed FPB files containing ChEMBL 29 fingerprints
- Similarity search with the FPB format
- Multi-core similarity search
- Converting large data sets to FPB format
- Faster gzip decompression
- Generate fingerprints in parallel and merge to FPB format
- Help for the command-line tools
- Getting started with the API
- Get ChEMBL 32 fingerprints in FPB format
- Similarity search of ChEMBL by id
- Similarity search of ChEMBL using a SMILES
- Sorting the search results
- Fingerprints as byte strings
- Generating Fingerprints
- Similarity Search of ChEMBL by fingerprint
- Loading fingerprints into an arena
- Generate an NxN similarity matrix
- Generate an NxM similarity matrix
- Generating fingerprint files
- Generating fingerprints with an alternative type
- Extracting fingerprints from SDF tags
- Select diverse fingerprints with MaxMin
- Use MaxMin with references
- Select diverse fingerprints with Heapsweep
- Sphere Exclusion
- Directed Sphere Exclusion
- Butina clustering
- Fingerprint family and type examples
- Fingerprint families and types
- Fingerprint family
- Fingerprint family discovery
- get_fingerprint_type() and get_type()
- Create a fingerprint using text settings
- FingerprintType properties and methods
- Convert a structure record to a fingerprint
- Convert a structure record to an id and fingerprint
- Make a specialized id and molecule fingerprint parser
- Read a structure file and compute fingerprints
- Structure-based fingerprint reader location
- Read fingerprints from a string containing structures
- Structure-based fingerprint reader errors
- Experimental error handler
- Compute a fingerprint for a native toolkit molecule
- Fingerprint many native toolkit molecules
- Make a specialized molecule fingerprinter
- Toolkit API examples
- Get a chemfp toolkit
- Parse and create SMILES
- Canonical, non-isomeric, and arbitrary SMILES
- Use format to create a record in SDF format
- Use zlib record compression
- Use zst record compression
- Get a list of available formats and distinguish between input and output formats
- Determine the format for a given filename
- Parse the id and the molecule at the same time
- Specify alternate error behavior
- Specify a SMILES delimiter through reader_args
- Specify an output SMILES delimiter through writer_args
- RDKit-specific SMILES reader_args and writer_args
- OpenEye-specific SMILES reader_args and writer_args
- OpenEye-specific aromaticity
- Open Babel-specific SMILES reader_args and writer_args
- CDK-specific SMILES reader_args and writer_args
- Get the default reader_args or writer_args for a format
- Convert text settings into reader and writer arguments
- Multi-toolkit reader_args and writer_args
- Qualified reader and writer parameters names
- Qualified parameter priorities
- Qualified names and text settings
- Read molecules from an SD file or stdin
- Read ids and molecules from an SD file at the same time
- Read ids and molecules using an SD tag for the id
- Read from a string instead of a file
- The reader may reuse molecule objects!
- Write molecules to a SMILES file
- Reader and writer context managers
- Write molecules to stdout in a specified format
- Write molecules to a string (and a bit of InChI)
- Handling errors when reading molecules from a string
- Handling errors when reading molecules from a file
- Ignore errors in create_string() and create_bytes()
- Ignore errors when writing molecules
- Reader and writer format metadata
- Location information: filename, record_format, recno and output_recno
- Location information: record position and content
- Writing your own error handler (Experimental)
- A Babel-like structure format converter
- argparse text settings to reader and writer args
- Creating a specialized record parser
- Molecule API: Get and set the molecule id
- Molecule API: Copy a molecule
- Molecule API: Working with SD tags
- Add fingerprints to an SD file using a toolkit
- Text toolkit examples
- Toolkits may modify the molecular structure
- Toolkits may modify SDF syntax
- The text toolkit “molecules”
- The text toolkit implements the toolkit API
- Reading and adding SD tags with the text_toolkit
- Synchronizing readers from different toolkits through the text toolkit
- Add multiple toolkit fingerprints to an SD file
- Text toolkit and SDF files
- Read id and tag value pairs from an SD file
- Extract the id and atom and bond counts from an SD file
- SDF-specific parser parameters
- Working with SD records as strings
- Unicode and other character encoding
- Mixed encodings and raw bytes
- chemfp API
- chemfp top-level
Chemfp started because around 2007 a project I worked on needed a way to include nearest-neighbor information for a property prediction calculator. The cheminformatics toolkits at the time didn’t include that as a built-in tool, though they did supply the components to build your own. Indeed, asking showed that nearly everyone had built their own similar sorts of tools, each with a different format, and varying levels of performance that were nowhere near the hardware limits.
The first step was to develop the FPS format, a human-readable text-based exchange format for fingerprint data that is easy for software to read and write. It stores fingerprint records containing a hex-encoded fingerprint and a record identifier, as well as metadata like the associated fingerprint type.
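To make the record layout concrete, here is a minimal pure-Python sketch of reading FPS-style text. It assumes header lines start with “#” (a version line followed by key=value metadata) and that each record is a hex-encoded fingerprint, a tab, and an identifier. The parse_fps function, the sample data, and the “Example-Fingerprint/1” type name are made up for illustration; this is not chemfp’s own reader, which handles many more details.

```python
def parse_fps(text):
    """Split FPS-style text into a metadata dict and a list of (id, fingerprint bytes).

    Illustrative sketch only; not chemfp's reader.
    """
    metadata = {}
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Header lines: a version line such as "#FPS1", then "#key=value" lines.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            else:
                metadata["format"] = key
            continue
        # Record lines: hex-encoded fingerprint, a tab, then the record identifier.
        hex_fp, record_id = line.split("\t")[:2]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


# Hypothetical 16-bit fingerprints, written by hand for this example.
SAMPLE = (
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example-Fingerprint/1\n"
    "0f1e\trecord_A\n"
    "ff00\trecord_B\n"
)

metadata, records = parse_fps(SAMPLE)
for record_id, fp in records:
    print(record_id, fp.hex())
```

Because the format is line-oriented text, the same records can also be inspected with ordinary text tools.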
People don’t use an alternate format just because it exists, so the next step was to develop useful command-line tools for fingerprint generation and similarity search, as well as a Python API for working with fingerprints in a discovery setting, like adding similarity search to a web app! As an alternative to generating fingerprints, the sdf2fps tool can extract existing fingerprint data from an SDF tag field.
People don’t use alternative tools just because they exist, so the third step was to improve similarity search performance. This was done by improving the search algorithm and implementation and adding multi-threaded support for NxN or NxM search cases. Similarity search with modern chemfp is over 10x faster than chemfp 1.0!
Similarity search is fast enough that for many cases the FPS read performance became the limiting factor. This is especially noticeable in web development where modern practices restart the web app after every change. The FPB binary format was developed as a way to quickly load a fingerprint dataset. Its internal layout is identical to what’s needed for a similarity search, so the load step needs little additional processing. The fpcat program converts between the FPS and FPB formats.
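For readers new to fingerprint similarity, the following sketch shows the kind of calculation being optimized: a Tanimoto score between two byte-string fingerprints and a naive k-nearest search over a small list. It is a slow reference version in pure Python, not chemfp’s algorithm or API. The function names and test fingerprints are invented here, and scoring two all-zero fingerprints as 0.0 is a convention chosen for this example.

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        # Convention for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return bin(a & b).count("1") / union


def knearest(query: bytes, targets, k: int):
    """Return the k best (id, score) pairs by a linear scan of (id, fingerprint) targets."""
    scored = [(target_id, tanimoto(query, fp)) for target_id, fp in targets]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 8-bit fingerprints for illustration.
targets = [("a", b"\x0f"), ("b", b"\x03"), ("c", b"\xf0")]
hits = knearest(b"\x0f", targets, k=2)
print(hits)
```

A linear scan like this touches every target for every query, which is why algorithmic pruning and multi-threading matter for NxN and NxM searches.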
Chemfp supports four different cheminformatics toolkits, which are used for molecule I/O and fingerprint generation. One of the goals of the chemfp API is to make it easy to work with fingerprints from different toolkits without learning the details of each toolkit. In the usual computer science fashion, this is done with the “toolkit wrapper API”, which gives a consistent API across the supported toolkits.
The “text toolkit” implements a subset of this API, to work with SDF and SMILES files as text records. The text toolkit also includes special support for working with SDF files, for example, to add fingerprint data as tag data to an SDF record without round-tripping the record through a chemistry toolkit.
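The idea of editing an SD record as text can be sketched in a few lines of plain Python. The example assumes the usual SD layout, where data items of the form “> <TAG>”, a value line, and a blank line sit between the molfile block and the “$$$$” terminator. The add_sd_tag function, the EXAMPLE_FP tag name, and the hand-written methane record are hypothetical. This is not the text toolkit’s implementation, which also deals with details such as existing tags and differing line endings.

```python
def add_sd_tag(record: str, tag: str, value: str) -> str:
    """Insert a data item before the "$$$$" terminator, leaving the molfile text untouched."""
    lines = record.splitlines()
    # Search backwards for the record terminator.
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].strip() == "$$$$":
            break
    else:
        raise ValueError("SD record has no $$$$ terminator")
    # A data item is a header line, the value, and a blank line.
    data_item = [f"> <{tag}>", value, ""]
    return "\n".join(lines[:index] + data_item + lines[index:]) + "\n"


# A hand-written single-atom record, used only for illustration.
RECORD = (
    "methane\n"
    "  hand-written example\n"
    "\n"
    "  1  0  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "M  END\n"
    "$$$$\n"
)

tagged = add_sd_tag(RECORD, "EXAMPLE_FP", "0f1e")
print(tagged)
```

Since the molecule block is never parsed, the original atom and bond lines pass through byte for byte, which is the point of avoiding a toolkit round trip.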
With the 4.0 release, chemfp added support for diversity, including MaxMin, sphere exclusion, and heapsweep. Rather than add a new command-line program for each new tool, the “chemfp” command-line tool was added, with subcommands for each tool. The 4.1 release added the “butina”, “csv2fps” and “translate” subcommands, along with Python API additions for clustering and CSV processing.
For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .
This program was developed by Andrew Dalke <firstname.lastname@example.org>, Andrew Dalke Scientific, AB. The Base License Agreement does not allow you to:
- generate FPB files;
- create in-memory fingerprint arenas with more than 50,000 fingerprints;
- search in-memory fingerprint arenas with more than 50,000 fingerprints, unless they are licensed FPB files;
- perform Tversky searches;
- perform Tanimoto searches of FPS files with more than 20 queries at a time.
See https://chemfp.com/license/ for details on licensing options, which includes no-cost academic licensing and source code licensing.
If you have questions, or wish to request a demo license or purchase a license, send an email to email@example.com.
I also maintain the chemfp-1.x series under a no-cost open source license. Version chemfp-1.6.1 is available at no cost from chemfp.com. This version requires Python 2.7 and is meant to give an open source baseline for benchmarking purposes.
In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield, Jeff van Santen, and Jakub Gunera.
Thanks also to my wife, Sara Marie, for her many years of support.