chemfp 4.0 documentation¶
chemfp is a set of command-line tools and a Python package for working with binary cheminformatics fingerprints, typically between several hundred and a few thousand bits long.
This is the documentation for chemfp 4.0. It is a work-in-progress and currently only covers the major changes since chemfp 3.5. While chemfp 4.0 is nearly completely backwards compatible with chemfp 3.5, the recent additions change and (hopefully!) improve how to use the chemfp API.
Chemfp 4.0 adds new methods for diversity selection and improves API usability with new high-level functions and improved feedback for interactive use (including progress bars!).
It will take many months to fully import and update the old examples. Until then, please also consult the chemfp 3.5 documentation.
Chemfp 4.0 was released on 12 June 2022. It was tested on Python 3.8, 3.9, and 3.10. It has no required dependencies, though some features, like NumPy and Pandas integration of course require those packages
Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit. The supported toolkits are OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter).
Table of Contents¶
- What’s new in chemfp?
- Installing chemfp
- Working with the command-line tools
- Generate fingerprint files from PubChem SD tags
- k-nearest neighbor search
- simsearch CSV output
- Threshold search
- simsearch CSV output when no hits
- Combined k-nearest and threshold search
- NxN (self-similar) searches
- Using a toolkit to process the ChEBI dataset
- Use structures as input to simsearch
- Make new fingerprints matching the type in an existing file
- Alternate error handlers
- chemfp’s two cross-toolkit substructure fingerprints
- substruct fingerprints
- Generate binary FPB files from a structure file
- Convert between FPS and FPB formats
- Specify the fpcat output format
- Alternate fingerprint file formats
- The FPB format
- Get licensed FPB files containing ChEMBL 29 fingerprints
- Similarity search with the FPB format
- Multi-core similarity search
- Converting large data sets to FPB format
- Faster gzip decompression
- Generate fingerprints in parallel and merge to FPB format
- Help for the command-line tools
- Getting started with the API
- Get ChEMBL 30 fingerprints in FPB format
- Similarity search of ChEMBL by id
- Similarity search of ChEMBL using a SMILES
- Sorting the search results
- Fingerprints as byte strings
- Generating Fingerprints
- Similarity Search of ChEMBL by fingerprint
- Loading fingerprints into an arena
- Generate an NxN similarity matrix
- Generate an NxM similarity matrix
- Generating fingerprint files
- Generating fingerprints with an alternative type
- Extracting fingerprints from SDF tags
- Select diverse fingerprints with MaxMin
- Use MaxMin with references
- Select diverse fingerprints with Heapsweep
- Sphere Exclusion
- Directed Sphere Exclusion
- Fingerprint family and type examples
- Fingerprint families and types
- Fingerprint family
- Fingerprint family discovery
- get_fingerprint_type() and get_type()
- Create a fingerprint using text settings
- FingerprintType properties and methods
- Convert a structure record to a fingerprint
- Convert a structure record to an id and fingerprint
- Make a specialized id and molecule fingerprint parser
- Read a structure file and compute fingerprints
- Structure-based fingerprint reader location
- Read fingerprints from a string containing structures
- Structure-based fingerprint reader errors
- Experimental error handler
- Compute a fingerprint for a native toolkit molecule
- Fingerprint many native toolkit molecules
- Make a specialized molecule fingerprinter
- Toolkit API examples
- Get a chemfp toolkit
- Parse and create SMILES
- Canonical, non-isomeric, and arbitrary SMILES
- Use format to create a record in SDF format
- Use zlib record compression
- Use zst record compression
- Get a list of available formats and distinguish between input and output formats
- Determine the format for a given filename
- Parse the id and the molecule at the same time
- Specify alternate error behavior
- Specify a SMILES delimiter through reader_args
- Specify an output SMILES delimiter through writer_args
- RDKit-specific SMILES reader_args and writer_args
- OpenEye-specific SMILES reader_args and writer_args
- OpenEye-specific aromaticity
- Open Babel-specific SMILES reader_args and writer_args
- CDK-specific SMILES reader_args and writer_args
- Get the default reader_args or writer_args for a format
- Convert text settings into reader and writer arguments
- Multi-toolkit reader_args and writer_args
- Qualified reader and writer parameters names
- Qualified parameter priorities
- Qualified names and text settings
- Read molecules from an SD file or stdin
- Read ids and molecules from an SD file at the same time
- Read ids and molecules using an SD tag for the id
- Read from a string instead of a file
- The reader may reuse molecule objects!
- Write molecules to a SMILES file
- Reader and writer context managers
- Write molecules to stdout in a specified format
- Write molecules to a string (and a bit of InChI)
- Handling errors when reading molecules from a string
- Handling errors when reading molecules from a file
- Ignore errors in create_string() and create_bytes()
- Ignore errors when writing molecules
- Reader and writer format metadata
- Location information: filename, record_format, recno and output_recno
- Location information: record position and content
- Writing your own error handler (Experimental)
- A Babel-like structure format converter
- argparse text settings to reader and writer args
- Creating a specialized record parser
- Molecule API: Get and set the molecule id
- Molecule API: Copy a molecule
- Molecule API: Working with SD tags
- Add fingerprints to an SD file using a toolkit
- Text toolkit examples
- Toolkits may modify the molecular structure
- Toolkits may modify SDF syntax
- The text toolkit “molecules”
- The text toolkit implements the toolkit API
- Reading and adding SD tags with the text_toolkit
- Synchronizing readers from different toolkits through the text toolkit
- Add multiple toolkit fingerprints to an SD file
- Text toolkit and SDF files
- Read id and tag value pairs from an SD file
- Extract the id and atom and bond counts from an SD file
- SDF-specific parser parameters
- Working with SD records as strings
- Unicode and other character encoding
- Mixed encodings and raw bytes
- chemfp API
- chemfp top-level
Chemfp started because around 2007 a project I worked on needed a way to include nearest-neighbor information for a property prediction calculator. The cheminformatics toolkits at the time didn’t include that as a built-in tool, though they did supply the components to build your own. Indeed, asking showed that nearly everyone had built their own similar sorts of tools, each with a different format, and varying levels of performance that were nowhere near the hardware limits.
The first step was to develop the FPS format, a human-readable text-based exchange format for fingerprint data that is easy for software to read and write. It stores fingerprint records containing a hex-encoded fingerprint and a record identifier, as well as metadata like the associated fingerprint type.
People don’t use an alternate format just because it exists, so the next step was to develop useful command-line tools for fingerprint generation and similarity search, as well as a Python API for working with fingerprints in a discovery setting - like adding similarity search to a web app! Alternatively, the sdf2fps tool can extract fingerprint data from an SDF tag field.
People don’t use alternative tools just because they exist, so the third step was improve similarity search performance. This was done by improving the search algorithm and implementation and adding multi-threaded support for NxN or NxM search cases. Similarity search with modern chemfp is over 10x faster than chemfp 1.0!
Similarity search is fast enough that for many cases the FPS read performance became the limiting factor. This is especially noticable in web development where modern practices restart the web app after every change. The FPB binary format was developed as a way to quickly load a fingerprint dataset. Its internal layout is identical to what’s needed for a similarity search so the load step needs little additional processing. The fpcat program converts between the FPS and FPB formats.
Chemfp supports four different cheminformatics toolkits, which are used for molecule I/O and fingerprint generation. One of the goals of the chemfp API is to make it easy to work with fingerprints from different toolkits without learning the details of each toolkit. In the usual computer science fashion, this is done with the “toolkit wrapper API”, which gives a consistent API across the supported toolkits.
The”text toolkit” implements a subset of this API, to work with SDF and SMILES files as text records. The text toolkit also includes special support for working with SDF files, for example, to add fingerprint data as tag data to an SDF record without round-tripping the record through a chemistry toolkit.
Most recently, chemfp added support for diversity, including MaxMin, sphere exclusion, and heapsweep. Rather than add a new command-line program for each new tool, the “chemfp” command-line tool was added, with subcommands for each tool.
For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .
This program was developed by Andrew Dalke <firstname.lastname@example.org>, Andrew Dalke Scientific, AB. The Base License Agreement <base_license> does not allow you to:
- generate FPB files;
- create in-memory fingerprint arenas with more than 50,000 fingerprints;
- search in-memory fingerprint arenas with more than 50,000 fingerprints, unless they are licensed FPB files;
- perform Tversky searches;
- perform Tanimoto searches of FPS files with more than 20 queries at a time.
See https://chemfp.com/license/ for details on licensing options, which includes no-cost academic licensing and source code licensing.
If you have questions, or wish to request a demo license or purchase a license, send an email to email@example.com.
I also maintain the chemfp-1.x series under a no-cost open source license. Version chemfp-1.6.1 is available at no cost from chemfp.com. This version requires Python 2.7 and is meant to give an open source baseline for benchmarking purposes.
In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Phil Evans, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield, Jeff van Santen, and Jakub Gunera.
Thanks also to my wife, Sara Marie, for her many years of support.