.. _intro:
chemfp 4.1 documentation
==================================
`chemfp `_ is a set of :ref:`command-line tools
` and a :ref:`Python package ` for working
with binary cheminformatics fingerprints, typically between several
hundred and a few thousand bits long.
This is the documentation for chemfp 4.1. It is a work-in-progress and
currently only covers the major changes since `chemfp 3.5
`_. While chemfp 4.1 is
nearly completely backwards compatible with chemfp 3.5, the recent
additions change and (hopefully!) improve how to use the chemfp API.
Chemfp 4.1 was released on 17 May 2023 with CXSMILES support, methods
to save and load similarity search results to a SciPy sparse matrix,
Butina clustering, methods to work with CSV files, and a tool to
convert between structure formats.
It was tested with Python 3.8, 3.9, 3.10, and 3.11. It requires the
"`click `_" and "`tqdm
`_" third-party packages, which should be
installed automatically as part of the normal :ref:`installation
process `. Some optional features, like the NumPy, SciPy, and Pandas
integration, only work if those packages are installed separately.
Chemfp 4.0 added new methods for diversity selection and improved API
usability with new high-level functions and improved feedback for
interactive use (including progress bars!).
For details on the older parts of the API, please also consult the `chemfp 3.5 documentation
`_.
Remember: chemfp cannot generate fingerprints from a structure file
without a third-party chemistry toolkit. The supported toolkits are
OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter).
Table of Contents
--------------------
.. toctree::
   :maxdepth: 2

   whats_new_in_41
   whats_new_in_40
   installing
   tools
   tool_help
   getting_started_api
   fingerprint_types
   toolkit
   text_toolkit
   examples
   chemfp_api
   licenses

Background
----------
Chemfp started because around 2007 a project I worked on needed a way
to include nearest-neighbor information for a property prediction
calculator. The cheminformatics toolkits at the time didn't include
that as a built-in tool, though they did supply the components to
build your own. Indeed, asking showed that nearly everyone had built
their own similar sorts of tools, each with a different format, and
varying levels of performance that were nowhere near the hardware
limits.
The first step was to develop the FPS format, a human-readable
text-based exchange format for fingerprint data that is easy for
software to read and write. It stores fingerprint records containing a
hex-encoded fingerprint and a record identifier, as well as metadata
like the associated fingerprint type.
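To make the layout concrete, here is a minimal sketch of reading FPS
text in plain Python, without chemfp. It assumes the usual layout of
"#"-prefixed header lines followed by tab-separated records. The
sample data, the ``Example-Fingerprint/1`` type name, and the
``read_fps`` helper are made up for illustration and are not part of
the chemfp API.

```python
import io

# A tiny hand-written FPS document (hypothetical data, for illustration only).
FPS_TEXT = (
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example-Fingerprint/1\n"
    "00ff\trecord1\n"
    "a5c3\trecord2\n"
)

def read_fps(infile):
    """Return (metadata dict, list of (id, fingerprint bytes)) from FPS text."""
    metadata = {}
    records = []
    for line in infile:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; "#FPS1" is the format marker.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Record lines: hex-encoded fingerprint, a tab, then the identifier.
        hex_fp, _, rest = line.partition("\t")
        record_id = rest.split("\t")[0]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records

metadata, records = read_fps(io.StringIO(FPS_TEXT))
```

In real code the chemfp readers do this work, along with the
validation and error reporting that this sketch skips.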
People don't use an alternate format just because it exists, so the
next step was to develop useful :ref:`command-line tools
` for fingerprint generation and similarity search, as
well as a Python API for working with fingerprints in a discovery
setting - like adding similarity search to a web app! Alternatively,
the sdf2fps tool can extract fingerprint data from an SDF tag field.
People don't use alternative tools just because they exist, so the
third step was to improve similarity search performance. This was done by
improving the search algorithm and implementation and adding
multi-threaded support for NxN or NxM search cases. Similarity search
with modern chemfp is over 10x faster than chemfp 1.0!
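For reference, the baseline that an optimized search improves on is
a linear scan that computes a Tanimoto score for every target. The
sketch below shows that naive approach in plain Python. It is not
chemfp's algorithm or API; the ``tanimoto`` and ``knearest`` helpers
and the sample fingerprints are hypothetical, and returning 0.0 when
both fingerprints are empty is a choice made for this sketch.

```python
import heapq

def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    if union == 0:
        # Both fingerprints are empty; this sketch scores that as 0.0.
        return 0.0
    return bin(a & b).count("1") / union

def knearest(query, targets, k=3, threshold=0.0):
    """Return up to k (score, id) pairs at or above threshold, best first."""
    hits = []
    for record_id, fp in targets:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((score, record_id))
    return heapq.nlargest(k, hits)

# Hypothetical 8-bit fingerprints, for illustration only.
targets = [
    ("a", bytes.fromhex("0f")),
    ("b", bytes.fromhex("ff")),
    ("c", bytes.fromhex("f0")),
]
nearest = knearest(bytes.fromhex("0f"), targets, k=2)
```

Every query here touches every target, once per query, which is why
a better algorithm and multi-threading pay off so quickly in the NxN
and NxM cases.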
Similarity search is fast enough that for many cases the FPS read
performance became the limiting factor. This is especially noticeable
in web development where modern practices restart the web app after
every change. The FPB binary format was developed as a way to quickly
load a fingerprint dataset. Its internal layout is identical to what's
needed for a similarity search so the load step needs little
additional processing. The fpcat program converts between the FPS and
FPB formats.
Chemfp supports four different cheminformatics toolkits, which are
used for molecule I/O and fingerprint generation. One of the goals of
the chemfp API is to make it easy to work with fingerprints from
different toolkits without learning the details of each toolkit. In the
usual computer science fashion, this is done with the "toolkit wrapper
API", which gives a consistent API across the supported toolkits.
The "text toolkit" implements a subset of this API, to work with SDF
and SMILES files as text records. The text toolkit also includes
special support for working with SDF files, for example, to add
fingerprint data as tag data to an SDF record without round-tripping
the record through a chemistry toolkit.
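The idea of editing an SDF record as text can be sketched in a few
lines of plain Python. This is not the text toolkit API; the
``add_sd_tag`` helper, the ``FP`` tag name, and the hand-written
methane record are hypothetical. It only relies on the SDF
convention that data items sit between the molfile block and the
``$$$$`` terminator.

```python
def add_sd_tag(record: str, tag: str, value: str) -> str:
    """Return a copy of one SDF record with a new data item before '$$$$'."""
    lines = record.splitlines()
    # Search backwards for the record terminator.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].startswith("$$$$"):
            break
    else:
        raise ValueError("not an SDF record: missing $$$$ terminator")
    # A data item is a header line, the value, and a blank line.
    new_item = [f">  <{tag}>", value, ""]
    return "\n".join(lines[:i] + new_item + lines[i:]) + "\n"

# Hand-written record for illustration only (hypothetical data).
RECORD = "\n".join([
    "methane",
    "  hand-written",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
]) + "\n"

tagged = add_sd_tag(RECORD, "FP", "00ff")
```

Because the molfile block is never parsed, the atom and bond lines
come through byte-for-byte, which is the point of skipping the
chemistry toolkit.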
With the 4.0 release, chemfp added support for diversity, including
MaxMin, sphere exclusion, and heapsweep. Rather than add a new
command-line program for each new tool, the "chemfp" command-line tool
was added, with subcommands for each tool. The 4.1 release added the
"butina", "csv2fps" and "translate" subcommands, along with Python API
additions for clustering and CSV processing.
Citation
--------
For a different, more scholarly discussion of chemfp see "`The chemfp
project
`_"
in the Journal of Cheminformatics. That paper covers the purpose of
the project, its architecture and design, the FPS and FPB file
formats, and the experience in trying to run chemfp as a self-funded
open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76
(2019). https://doi.org/10.1186/s13321-019-0398-8 .
Advertisement
-------------------------
This program was developed by Andrew Dalke, Andrew Dalke Scientific,
AB. The `Base
License Agreement ` does not allow you to:
* generate FPB files;
* create in-memory fingerprint arenas with more than
  50,000 fingerprints;
* search in-memory fingerprint arenas with more than 50,000
  fingerprints, unless they are :ref:`licensed FPB `
  files;
* perform Tversky searches;
* perform Tanimoto searches of FPS files with
  more than 20 queries at a time.
See ``_ for details on licensing options,
which includes no-cost academic licensing and source code licensing.
If you have questions, or wish to request a demo license or purchase a
license, send an email to `sales@dalkescientific.com
`_.
I also maintain the chemfp-1.x series under a no-cost open source
license. Version chemfp-1.6.1 is available at no cost from
chemfp.com. This version requires Python 2.7 and is meant to give an
open source baseline for benchmarking purposes.
Thanks
------
In no particular order, the following contributed to chemfp in some
way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg
Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich
Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel
Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn
Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley,
Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield,
Jeff van Santen, and Jakub Gunera.
Thanks also to my wife, Sara Marie, for her many years of support.
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`