.. _intro:
chemfp 4.0 documentation
==================================
`chemfp `_ is a set of :ref:`command-line tools
` and a :ref:`Python package ` for working
with binary cheminformatics fingerprints, typically between several
hundred and a few thousand bits long.
This is the documentation for chemfp 4.0. It is a work-in-progress and
currently only covers the major changes since chemfp 3.5. While chemfp
4.0 is nearly completely backwards compatible with chemfp 3.5, the
recent additions change and (hopefully!) improve how to use the chemfp
API.
Chemfp 4.0 adds new methods for diversity selection and improves API
usability with new high-level functions and improved feedback for
interactive use (including progress bars!).
It will take many months to fully import and update the old examples.
Until then, please also consult the `chemfp 3.5 documentation
`_.
Chemfp 4.0 was released on 12 June 2022. It was tested on Python 3.8,
3.9, and 3.10. It has no required dependencies, though some features,
like NumPy and Pandas integration, of course require those packages.
Remember: chemfp cannot generate fingerprints from a structure file
without a third-party chemistry toolkit. The supported toolkits are
OEChem/OEGraphSim, Open Babel, RDKit and CDK (via the JPype adapter).
Table of Contents
--------------------
.. toctree::
   :maxdepth: 2

   whatsnew.ipynb
   installing
   tools
   tool_help
   getting_started_api
   fingerprint_types
   toolkit
   text_toolkit
   examples
   chemfp_api
   licenses
Background
----------
Chemfp started because around 2007 a project I worked on needed a way
to include nearest-neighbor information for a property prediction
calculator. The cheminformatics toolkits at the time didn't include
that as a built-in tool, though they did supply the components to
build your own. Indeed, asking showed that nearly everyone had built
their own similar sorts of tools, each with a different format, and
varying levels of performance that were nowhere near the hardware
limits.
The first step was to develop the FPS format, a human-readable
text-based exchange format for fingerprint data that is easy for
software to read and write. It stores fingerprint records containing a
hex-encoded fingerprint and a record identifier, as well as metadata
like the associated fingerprint type.
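As a rough illustration of that layout, here is a minimal sketch of a reader written with only the Python standard library. It is not chemfp's own parser. It assumes header lines start with ``#`` (a format line such as ``#FPS1`` followed by ``#key=value`` metadata) and that each data line holds the hex-encoded fingerprint, a tab, and the identifier. The ``parse_fps`` name and the sample type string are made up for the example; consult the FPS format specification for the authoritative details.

```python
import io


def parse_fps(lines):
    """Return (metadata, records) from an iterable of FPS-style text lines.

    Hypothetical helper for illustration only; not part of the chemfp API.
    """
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines: "#FPS1" (no "=") or "#key=value" metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Data lines: hex fingerprint, tab, identifier (extra fields ignored).
        hex_fp, _, rest = line.partition("\t")
        record_id = rest.split("\t")[0]
        records.append((record_id, bytes.fromhex(hex_fp)))
    return metadata, records


# A tiny hand-written sample; the type name is invented for this sketch.
SAMPLE = (
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example-Type/1\n"
    "0f00\trecord1\n"
    "ff01\trecord2\n"
)

metadata, records = parse_fps(io.StringIO(SAMPLE))
```

In real code, use chemfp's own readers, which also validate the header and fingerprint lengths.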
People don't use an alternate format just because it exists, so the
next step was to develop useful :ref:`command-line tools
` for fingerprint generation and similarity search, as
well as a Python API for working with fingerprints in a discovery
setting - like adding similarity search to a web app! Alternatively,
the sdf2fps tool can extract fingerprint data from an SDF tag field.
People don't use alternative tools just because they exist, so the
third step was to improve similarity search performance. This was done by
improving the search algorithm and implementation and adding
multi-threaded support for NxN or NxM search cases. Similarity search
with modern chemfp is over 10x faster than chemfp 1.0!
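For readers new to the topic, the following sketch shows what a similarity search computes, using a brute-force Tanimoto comparison in pure Python. It is a baseline for understanding only and says nothing about how chemfp implements its search. The function names are hypothetical, and returning 0.0 when both fingerprints have no bits set is a choice made for this sketch.

```python
def popcount(data):
    """Count the number of set bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity: bits in common divided by bits in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    either = popcount(bytes(a | b for a, b in zip(fp1, fp2)))
    if either == 0:
        # Assumption for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return both / either


def nearest_neighbors(query, targets, k=3):
    """Return the k best (id, score) pairs, highest score first."""
    scored = [(tanimoto(query, fp), record_id) for record_id, fp in targets]
    # Sort by decreasing score, breaking ties by identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(record_id, score) for score, record_id in scored[:k]]


targets = [
    ("a", bytes.fromhex("0f00")),
    ("b", bytes.fromhex("ff01")),
    ("c", bytes.fromhex("0000")),
]
hits = nearest_neighbors(bytes.fromhex("0f00"), targets, k=2)
```

Every query here is compared against every target, so the cost grows with the product of the two set sizes. That is the baseline that better algorithms and multi-threading improve on.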
Similarity search is fast enough that for many cases the FPS read
performance became the limiting factor. This is especially noticeable
in web development where modern practices restart the web app after
every change. The FPB binary format was developed as a way to quickly
load a fingerprint dataset. Its internal layout is identical to what's
needed for a similarity search so the load step needs little
additional processing. The fpcat program converts between the FPS and
FPB formats.
Chemfp supports four different cheminformatics toolkits, which are
used for molecule I/O and fingerprint generation. One of the goals of
the chemfp API is to make it easy to work with fingerprints from
different toolkits without learning the details of each toolkit. In the
usual computer science fashion, this is done with the "toolkit wrapper
API", which gives a consistent API across the supported toolkits.
The "text toolkit" implements a subset of this API, to work with SDF
and SMILES files as text records. The text toolkit also includes
special support for working with SDF files, for example, to add
fingerprint data as tag data to an SDF record without round-tripping
the record through a chemistry toolkit.
Most recently, chemfp added support for diversity selection, including MaxMin,
sphere exclusion, and heapsweep. Rather than add a new command-line
program for each new tool, the "chemfp" command-line tool was added,
with subcommands for each tool.
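To give a feel for what a diversity picker does, here is a textbook greedy MaxMin sketch in pure Python. Each new pick is the candidate whose nearest already-picked neighbor is farthest away. The distance used is 1 minus the Tanimoto similarity. This is not chemfp's implementation, and the ``maxmin_pick`` name, its parameters, and the choice of the first fingerprint as the seed are all assumptions made for the example.

```python
def popcount(data):
    """Count the number of set bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")


def distance(fp1, fp2):
    """Return 1 - Tanimoto similarity (1.0 if neither has any bits set)."""
    both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    either = popcount(bytes(a | b for a, b in zip(fp1, fp2)))
    if either == 0:
        return 1.0
    return 1.0 - both / either


def maxmin_pick(fingerprints, num_picks, seed_index=0):
    """Greedy MaxMin: return the indices of the picked fingerprints."""
    if not fingerprints or num_picks <= 0:
        return []
    picks = [seed_index]
    picked = {seed_index}
    # min_dist[i] is the distance from candidate i to its nearest pick.
    min_dist = [distance(fingerprints[seed_index], fp) for fp in fingerprints]
    while len(picks) < min(num_picks, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        # Choose the candidate farthest from all picks; lowest index on ties.
        best = max(candidates, key=lambda i: (min_dist[i], -i))
        picks.append(best)
        picked.add(best)
        # Update each candidate's distance to its nearest pick.
        for i, fp in enumerate(fingerprints):
            d = distance(fingerprints[best], fp)
            if d < min_dist[i]:
                min_dist[i] = d
    return picks


fingerprints = [b"\x0f", b"\x1f", b"\xf0", b"\xff"]
picks = maxmin_pick(fingerprints, 3)
```

Starting from the first fingerprint, the sketch picks the one with no bits in common next, and only then the one that overlaps both.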
Citation
--------
For a different, more scholarly discussion of chemfp see "`The chemfp
project
`_"
in the Journal of Cheminformatics. That paper covers the purpose of
the project, its architecture and design, the FPS and FPB file
formats, and the experience in trying to run chemfp as a self-funded
open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76
(2019). https://doi.org/10.1186/s13321-019-0398-8 .
Advertisement
-------------------------
This program was developed by Andrew Dalke
, Andrew Dalke Scientific, AB. The `Base
License Agreement ` does not allow you to:
* generate FPB files;
* create in-memory fingerprint arenas with more than
  50,000 fingerprints;
* search in-memory fingerprint arenas with more than 50,000
  fingerprints, unless they are :ref:`licensed FPB `
  files;
* perform Tversky searches;
* perform Tanimoto searches of FPS files with
  more than 20 queries at a time.
See ``_ for details on licensing options,
which include no-cost academic licensing and source code licensing.
If you have questions, or wish to request a demo license or purchase a
license, send an email to `sales@dalkescientific.com
`_.
I also maintain the chemfp-1.x series under a no-cost open source
license. Version chemfp-1.6.1 is available at no cost from
chemfp.com. This version requires Python 2.7 and is meant to give an
open source baseline for benchmarking purposes.
Thanks
------
In no particular order, the following contributed to chemfp in some
way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg
Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich
Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel
Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Phil Evans, Björn
Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley,
Lionel Uran Landaburu, Sereina Riniker, Brian Cole, John Mayfield,
Jeff van Santen, and Jakub Gunera.
Thanks also to my wife, Sara Marie, for her many years of support.
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`