.. _intro:
========================
chemfp 3.1 documentation
========================
`chemfp `_ is a set of tools for working with
cheminformatics fingerprints.
This is the documentation for the commercial version of chemfp. The
documentation for chemfp 1.3, the no-cost version of chemfp, is
available from http://chemfp.readthedocs.io/en/chemfp-1.3/.
Most people will use the command-line programs to generate and search
fingerprint files. :ref:`ob2fps `, :ref:`oe2fps `, and
:ref:`rdkit2fps ` use respectively the `Open Babel
`_, `OpenEye `_, and
`RDKit `_ chemistry toolkits to convert
structure files into fingerprint files. :ref:`sdf2fps `
extracts fingerprints encoded in SD tags to make the fingerprint
file. :ref:`simsearch ` finds targets in a fingerprint file
which are sufficiently similar to the queries. :ref:`fpcat `
converts between FPS and FPB formats and merges multiple fingerprint
files into one.
The programs are built using the :ref:`chemfp Python library API
`. The search capabilities are part of the public API, as
well as a cross-toolkit API for reading and writing molecules from
structure files or strings, and for computing molecular
fingerprints.
Remember: chemfp cannot generate fingerprints from a structure file
without a third-party chemistry toolkit.
Chemfp 3.1 was released on 18 September 2017. It supports Python 2.7
and 3.5+ and can be used with any recent version of OEChem/OEGraphSim,
Open Babel, or RDKit.
List of chapters
================
.. toctree::
   :maxdepth: 3

   installing
   using-tools
   tool-help
   using-api
   fingerprint_types
   toolkit
   text_toolkit
   api

License and advertisement
=========================
This program was developed by Andrew Dalke, Andrew Dalke Scientific, AB.
It is distributed under the "MIT" license, shown below.
Further chemfp development depends on funding from people like
you. Asking for voluntary contributions almost never works. Instead,
starting with chemfp 1.1, the source code is distributed under an
incentive program. You can pay for the commercial distribution, or use
the no-cost version.
If you pay for the commercial distribution then you will get the most
recent version of chemfp, free upgrades for one year, support, and a
discount on renewing participation in the incentive program.
I also maintain the chemfp-1.x series. Version chemfp-1.3 is available
at no cost from chemfp.com, or if you know someone with chemfp 2.x or
3.x you might be able to get it from them at no cost. It's free/open
source software, after all.
If you have questions about or wish to purchase the commercial
distribution, send an email to sales@dalkescientific.com .
.. highlight:: none
::

  Copyright (c) 2010-2017 Andrew Dalke Scientific, AB (Sweden)

  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
  "Software"), to deal in the Software without restriction, including
  without limitation the rights to use, copy, modify, merge, publish,
  distribute, sublicense, and/or sell copies of the Software, and to
  permit persons to whom the Software is furnished to do so, subject to
  the following conditions:

  The above copyright notice and this permission notice shall be
  included in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
  EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
  WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Copyright to portions of the code is held by other people or
organizations, and may be under a different license. See the specific
code for details. These are:

- OpenMP, cpuid, POPCNT, and Lauradoux implementations by Kim
  Walisch, under the MIT license
- SSSE3.2 popcount implementation by Stanford University (written by
  Imran S. Haque) under the BSD license
- The AVX2 popcount implementation by Daniel Lemire, Nathan Kurz,
  Owen Kaser, et al. under the Apache 2 license
- heapq and ascii_buffer_converter by the Python Software Foundation
  under the Python license
- TimSort code by Christopher Swenson under the MIT License
- tests/unittest2 by Steve Purcell, the Python Software Foundation,
  and others, under the Python license
- chemfp/rdmaccs.patterns and chemfp/rdmaccs2.patterns by Rational
  Discovery LLC, Greg Landrum, and Julie Penzotti, under the 3-Clause
  BSD License

What's new in version 3.1
=========================
Released 17 September 2017
The new specialized POPCNT implementation for PubChem/CACTVS keys
increases search performance for that case by about 15%.
The SearchResults object gained the
:meth:`to_csr() <.SearchResults.to_csr>` method and the :attr:`shape
<.SearchResults.shape>` attribute. The new method returns a `SciPy
compressed sparse row matrix
`_ containing
the similarity scores, which can be passed into `scikit-learn `_ for
clustering.
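
To make the compressed sparse row layout concrete, here is a minimal
pure-Python sketch. It does not call chemfp or SciPy; the hit lists,
the ``hits_to_csr`` helper, and the scores are made up for
illustration. Each row is one query, each column index is a target,
and only the scores that passed the threshold are stored.

```python
# Illustrative only: hypothetical search hits for 3 queries against 5 targets.
# Each entry is a list of (target_index, score) pairs for one query.
hits_per_query = [
    [(0, 0.92), (3, 0.81)],             # query 0 matched targets 0 and 3
    [],                                 # query 1 had no hits
    [(1, 0.75), (2, 0.70), (4, 1.00)],  # query 2 matched three targets
]
num_targets = 5


def hits_to_csr(hits_per_query):
    """Flatten per-query hit lists into the three CSR arrays."""
    data = []      # the similarity scores, row after row
    indices = []   # the column (target) index of each score
    indptr = [0]   # row i occupies data[indptr[i]:indptr[i + 1]]
    for hits in hits_per_query:
        for target_index, score in hits:
            indices.append(target_index)
            data.append(score)
        indptr.append(len(data))
    return data, indices, indptr


data, indices, indptr = hits_to_csr(hits_per_query)
shape = (len(hits_per_query), num_targets)
```

If SciPy is installed, these three lists are the form accepted by
``scipy.sparse.csr_matrix((data, indices, indptr), shape=shape)``.
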
The fall 2017 release of OEChem will accept InChI strings as structure
input. The chemfp wrapper now knows about this, as well as the two new
InChI output flavors "RelativeStereo" and "RacemicStereo".
The fall 2017 release of RDKit will fix a bug in the pattern
fingerprint definitions. The new chemfp fingerprint type is
RDKit-Pattern/4.
Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty
identifiers. Previously the default :option:`--errors` setting of
"ignore" simply skipped those records, without any warning
messages. This caused problems processing the ChEBI SD file. Most of
the records have an empty title line, so only a few fingerprint
records were generated. It wasn't obvious that the resulting data set
was useless. The new code always reports a warning for empty or
missing identifiers, even with "ignore". If the :option:`--errors` is
"strict" then the warning becomes an error and processing stops.
Updated the #software line to include "chemfp/3.1" in addition to the
toolkit information. This helps distinguish between, say, two
different programs which generate RDKit Morgan fingerprints. It's also
possible that a chemfp bug can affect the fingerprint output, so the
extra term makes it easier to identify a bad dataset.
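
As a rough illustration of how the extra term can be used, the sketch
below reads the header of an FPS file and splits the ``#software``
value into its parts. The header text and version strings are invented
for the example, and ``read_fps_header`` is a hypothetical helper
rather than a chemfp function. It assumes the usual FPS layout, where
header lines start with ``#`` and carry ``key=value`` pairs.

```python
import io

# Invented example file; real files will have different values.
EXAMPLE_FPS = (
    "#FPS1\n"
    "#num_bits=16\n"
    "#software=RDKit/2017.03.1 chemfp/3.1\n"
    "0f3c\trecord1\n"
)


def read_fps_header(fileobj):
    """Collect '#key=value' header lines until the first fingerprint record."""
    header = {}
    for line in fileobj:
        if not line.startswith("#"):
            break  # first fingerprint record; the header is finished
        key, sep, value = line[1:].rstrip("\n").partition("=")
        if sep:  # skip the '#FPS1' version line, which has no '='
            header[key] = value
    return header


header = read_fps_header(io.StringIO(EXAMPLE_FPS))
# Map each 'name/version' term to its version string.
software = dict(term.split("/", 1) for term in header["software"].split())
```

A data set built with a known-bad release could then be flagged by
checking ``software.get("chemfp")``.
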
There are several small fixes related to memory leaks, the
bytes/Unicode distinction in Python 3, error messages, and error
handling.
Removed chemfp.progressbar and chemfp.futures. These were included in
chemfp 1.1 because I used them in a project for one customer and
thought they might be useful in future chemfp projects. They were
not. Also removed chemfp.argparse because chemfp 3.0 dropped support
for Python 2.6.
What's new in version 3.0.1
===========================
Released 28 August 2017
This is a bug-fix release. It fixes a critical bug in the
general-purpose POPCNT popcount implementation and a bug in the code
to detect the RDKit Pattern fingerprint change in 2017.3.
See the CHANGELOG for details.
What's new in version 3.0
=========================
Released 2 May 2017
Chemfp now supports both Python 2.7 and Python 3.5 or later. It no
longer supports versions before Python 2.7. Chemfp will support Python
2.7 at least until 2020, which is the end-of-life for Python 2.7.
This required extensive changes to distinguish between text/Unicode
strings and byte strings. The biggest user-facing change is that
identifiers are now treated as Unicode strings. Fingerprints are still
treated as byte strings.
This change is not backwards compatible. The API's function parameters
are polymorphic, so in most cases you can pass in either a Unicode
string or a UTF-8 encoded byte string. However, the return type for an
identifier is Unicode, which will likely cause problems with existing
code which expects bytes.
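
The pattern described here, accepting either kind of string on input
but always returning Unicode, can be sketched in a few lines of
Python 3. The ``as_text_id`` helper is hypothetical and is not part of
chemfp. It only illustrates why code that expects byte-string
identifiers needs an explicit encode step after the change.

```python
def as_text_id(identifier):
    """Return the identifier as a Unicode string.

    Accepts either text or UTF-8 encoded bytes (hypothetical helper).
    """
    if isinstance(identifier, bytes):
        return identifier.decode("utf-8")
    return identifier


# Both input styles give the same Unicode identifier.
text_id = as_text_id("caf\u00e9")
bytes_id = as_text_id(b"caf\xc3\xa9")

# Older code that needs bytes must now encode explicitly.
legacy_bytes = text_id.encode("utf-8")
```
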
All of the chemistry toolkits have decided to treat files as UTF-8
encoded. Chemfp's "text toolkit" offers limited support for reading
Latin-1 encoded files. This is a tricky topic so contact me if you
have questions or problems.
I have removed the "make_string_creator()" function because it was
hard to explain, hard to maintain, and had little performance
improvement over passing in the arguments to
:func:`chemfp.create_string`. This will break compatibility, but then
again, I don't think anyone used it. If it is a problem, I suggest
creating a function, as in the following::

  >>> from chemfp import rdkit_toolkit as T
  >>> mol = T.parse_molecule("c1ccccc1O", "smistring")
  >>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  u'O-c1:c:c:c:c:c:1'
  >>> def make_string(mol):
  ...     return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  ...
  >>> make_string(mol)
  u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you'll see the other
major backwards incompatibility. The function :func:`chemfp.create_string` now
returns a Unicode string instead of a byte string. This also means its
`format` parameter no longer accepts the ".zlib" or ".gzip" extensions.
Instead, to get the old behavior use the new API function
:func:`chemfp.create_bytes`::

  >>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
  'O-c1:c:c:c:c:c:1'
  >>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
  'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'

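
The ".zlib" result is an ordinary zlib stream, so it can be checked
with Python's standard ``zlib`` module, with no chemfp call involved.
This sketch (Python 3 syntax, hence the ``b`` prefix) decompresses the
byte string from the example and recovers the SMILES.

```python
import zlib

# The compressed bytes copied from the create_bytes() example output.
compressed = b"x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d"

# Decompress back to the encoded SMILES, then decode it to text.
smiles_bytes = zlib.decompress(compressed)
smiles = smiles_bytes.decode("utf-8")
```
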
There's a similar change between :func:`chemfp.open_molecule_writer_to_string`
and the new function :func:`chemfp.open_molecule_writer_to_bytes`.
There are also some new features in version 3.0 which don't break
compatibility.
Similarity search is faster because there are now specialized popcount
implementations based on the fingerprint length. On one benchmark,
166-bit searches are 35% faster, 1024-bit searches are 25% faster, and
2048-bit searches are 5% faster.
There is a new popcount implementation for processors with the AVX2
instruction set. It is about 15% faster than the POPCNT version for
2048-bit fingerprints. To test it out you will have to compile chemfp
with :option:`--with-avx2` enabled.
Added support for the Avalon fingerprints in RDKit, if RDKit has been
compiled with Avalon support.
What's new in version 2.1
=========================
Released 2 July 2015
Version 2.1 adds Tversky support for every place there was Tanimoto
search (except the handful of deprecated APIs). There are new search
routines for FPS and arena searches, including OpenMP support, and new
bitops functions to compute a Tversky index between two fingerprints.
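
For readers unfamiliar with the Tversky index, the following
pure-Python sketch computes it for two byte-string fingerprints. It is
not chemfp's implementation and the function names are hypothetical.
It also assumes a score of 0.0 when the denominator is zero, which may
differ from what chemfp does. With ``alpha = beta = 1`` the Tversky
index reduces to the Tanimoto similarity.

```python
def popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    common = popcount(a & b)    # bits set in both fingerprints
    only_a = popcount(a & ~b)   # bits set only in fp1
    only_b = popcount(b & ~a)   # bits set only in fp2
    denominator = alpha * only_a + beta * only_b + common
    if denominator == 0:
        return 0.0  # assumption: treat the undefined case as zero similarity
    return common / denominator


def tanimoto(fp1, fp2):
    """Tanimoto similarity is the Tversky index with alpha = beta = 1."""
    return tversky(fp1, fp2, 1.0, 1.0)


query = b"\x0f"    # bits 00001111
target = b"\x3c"   # bits 00111100
symmetric = tanimoto(query, target)  # 2 shared bits out of 6 set in either
asymmetric = tversky(query, target, alpha=1.0, beta=0.0)  # ignores target-only bits
```

Setting ``beta`` to 0 makes the score depend only on how much of the
query is covered by the target, which is the usual reason to prefer
Tversky over Tanimoto for substructure-like searches.
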
The k-nearest arena searches now support OpenMP. Previously they were
single threaded even though the other search functions supported
multiple threads.
The built-in SDF parser saw a couple of improvements, including support
for both "\\n" and "\\r\\n" newlines, instead of only "\\n" newlines.
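
Handling both newline conventions matters because SD records are
delimited by a ``$$$$`` line. As a minimal sketch, not chemfp's
parser, the hypothetical ``split_sd_records`` function below
normalizes Windows newlines before splitting, so the same input gives
the same records regardless of which convention the file uses.

```python
def split_sd_records(text):
    """Split SD file text into records, accepting Unix or Windows newlines."""
    normalized = text.replace("\r\n", "\n")
    records = []
    current = []
    for line in normalized.split("\n"):
        current.append(line)
        if line == "$$$$":  # end-of-record delimiter
            records.append("\n".join(current) + "\n")
            current = []
    return records  # trailing text without a "$$$$" line is ignored


# Two skeletal records; real SD records also contain atom and bond blocks.
unix_text = "mol1\n\n\nM  END\n$$$$\nmol2\n\n\nM  END\n$$$$\n"
windows_text = unix_text.replace("\n", "\r\n")
unix_records = split_sd_records(unix_text)
windows_records = split_sd_records(windows_text)
```
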
There were a number of bug fixes that concern edge cases. For example,
some 64-bit double calculations could be off-by-one in the last digit,
and fingerprints with 0 bits set could cause a few problems.
What's new in version 2.0
=========================
Released 8 April 2015
Version 2.0 includes many new features designed for web service
development. The new "FPB" binary fingerprint file format is very fast
to load, which is great for web server reloading during development
and on the command-line. The speed comes from using a memory-mapped
file, which also means that multiple chemfp instances can use the same
file on the same machine without extra memory overhead.
The most extensive improvement is the new portable API for working
with structure files and fingerprint types. The moment you start
working with multiple chemistry toolkits, you realize that they all
have different ways to read and write molecules, and to generate
fingerprints from a molecule. Chemfp tries hard to have a consistent
API for these common tasks, without sacrificing performance, so you
can get your work done. For example, with the new API it's easy to
take an SD record as an input string, compute the MACCS fingerprints
for each available toolkit, add the results as new SD tags, and return
the updated record.
This sounds so easy, doesn't it? It took about a year to develop. The
API is quite extensive, and includes the ability to pass
toolkit-specific options to the underlying parsers, a low-level SDF
parser that can be used to index a file, a way to get a list of
available formats and fingerprint types, and methods to parse
fingerprint arguments from strings.
New with version 2.0 is the ability to handle PubChem-sized
data. Previous versions used 32-bit indexing and had a limit of 4GB,
which is enough for 33M 1024-bit fingerprints, but PubChem has about
twice that many structures.
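
The 33M figure follows directly from the sizes involved, as this short
calculation shows. It only counts the raw fingerprint bytes and
ignores identifiers and any file overhead.

```python
# A 32-bit index can address 2**32 bytes, i.e. 4 GiB.
max_bytes = 2 ** 32

# A 1024-bit fingerprint occupies 1024 / 8 = 128 bytes.
bytes_per_fingerprint = 1024 // 8

# Largest number of fingerprints whose raw bytes fit under the limit.
max_fingerprints = max_bytes // bytes_per_fingerprint
print(max_fingerprints)  # 33554432, about 33.5 million
```
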
There are also a lot of improvements, bug fixes, and performance
tweaks. For example, the FPS reader is now almost twice as fast! For
details, see the CHANGELOG file of the release.
Future
======
The chemfp code base is solid and in use at many companies, some of
whom have paid for the commercial version. It has great support for
fingerprint generation, fast similarity search, and toolkit
portability, but there's plenty left to do in future. Here's a mixture
of things that are likely and things which are possibilities. Of course
funding and feedback would help prioritize things. `Let me know
`_ if you need something like one of
these.
Right now you're limited to the built-in toolkit fingerprint types,
plus chemfp's own SMARTS-based fingerprints. There should be a
registration system so you can tell chemfp about user-defined
fingerprint types.
I would like some way to select fingerprint subsets. My original
thought was something like an awk for the FPS format, with the ability
to select N fingerprints at random, or those matching a given set of
identifiers, etc. My current thought is to implement it as a sqlite
virtual table.
Chemfp supports Tanimoto and Tversky similarity. I could also add
support for other measures; cosine and Hamming seem like the most
useful other alternatives.
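
For reference, the two measures mentioned here can be written down in
a few lines. This is a pure-Python sketch with hypothetical function
names, not a proposed chemfp API. It uses the common
binary-fingerprint definitions: Hamming distance as the number of
differing bits, and cosine similarity as the shared bit count divided
by the geometric mean of the two bit counts.

```python
import math


def _bits(fp):
    """Convert a byte-string fingerprint to an integer bit set."""
    return int.from_bytes(fp, "big")


def _popcount(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def hamming_distance(fp1, fp2):
    """Number of bit positions where the two fingerprints differ."""
    return _popcount(_bits(fp1) ^ _bits(fp2))


def cosine_similarity(fp1, fp2):
    """Shared on-bits divided by sqrt(on-bits in fp1 * on-bits in fp2)."""
    a, b = _bits(fp1), _bits(fp2)
    norm = math.sqrt(_popcount(a) * _popcount(b))
    if norm == 0:
        return 0.0  # assumption: an empty fingerprint has zero similarity
    return _popcount(a & b) / norm


# Compare the bit patterns 00001111 and 00111100.
distance = hamming_distance(b"\x0f", b"\x3c")
similarity = cosine_similarity(b"\x0f", b"\x3c")
```
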
Chemfp does not currently support Microsoft Windows because
the code assumes the LP64 model, where "int" is 32 bits and "long" is
64 bits. It will require a lot of low-level work to tweak everything
correctly for the Windows LLP64 model, where "int" and "long" are 32
bits and "long long" is 64 bits. Once that's done, I'll have to figure
out how to make an installer. I've decided to put it off until
someone (or someones) funds it.
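
To see which data model a given machine uses, the C type sizes can be
inspected from Python with the standard ``ctypes`` module. This is
only a diagnostic sketch and says nothing about how chemfp itself
detects the platform.

```python
import ctypes

int_bits = ctypes.sizeof(ctypes.c_int) * 8
long_bits = ctypes.sizeof(ctypes.c_long) * 8
long_long_bits = ctypes.sizeof(ctypes.c_longlong) * 8
pointer_bits = ctypes.sizeof(ctypes.c_void_p) * 8

# On a 64-bit platform, LP64 has a 64-bit "long" and LLP64 a 32-bit "long".
if pointer_bits == 64 and long_bits == 64:
    data_model = "LP64"
elif pointer_bits == 64 and long_bits == 32:
    data_model = "LLP64"
else:
    data_model = "other"  # for example, a 32-bit build

print(int_bits, long_bits, long_long_bits, data_model)
```
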
The threshold and k-nearest arena search results store hits using
compressed sparse rows. These work well for sparse results, but when
you want the entire similarity matrix (i.e., with a minimum threshold of
0.0) of a large arena, the time and space needed to maintain the sparse
data structure become noticeable. In that case you likely want
to store the scores in a 2D NumPy matrix.
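
A back-of-the-envelope calculation shows why. The sketch below assumes
8-byte scores, 4-byte column indices, and 4-byte row offsets for the
sparse form. Those sizes are assumptions chosen for illustration and
are not taken from chemfp's internals.

```python
def dense_bytes(n, score_size=8):
    """Memory for an n x n matrix of scores."""
    return n * n * score_size


def csr_bytes(n, num_hits, score_size=8, index_size=4):
    """Memory for CSR storage: scores, column indices, and n + 1 row offsets."""
    return num_hits * (score_size + index_size) + (n + 1) * index_size


n = 100000
# A threshold of 0.0 keeps every pair, so the "sparse" form stores n * n hits.
full_sparse = csr_bytes(n, n * n)
full_dense = dense_bytes(n)
ratio = full_sparse / full_dense  # roughly 1.5 with the assumed sizes
```

Under these assumptions the sparse form needs about half again as much
memory as the dense matrix once every pair is a hit, before counting
the time spent building it.
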
I'm really interested in using chemfp to handle different sorts of
clustering. Let me know if there are things I can add to the API which
would help you do that.
If you are not a Python programmer then you might prefer that the core
search routines be made accessible through a C API. That's possible,
in that the software was designed with that in mind, but it needs more
development and testing.
Chemfp has supported OpenMP since version 1.1. That's great for
shared-memory machines. Are you interested in supporting a distributed
computing version?
There are any number of higher-level tools which can be built on the
chemfp components. For example, what about a wsgi component which
implements a web-based search API for your local network? Wouldn't it
be nice to say::

  fpserver filename1.fpb

and have a simple search service?
What about an IPython visualization tool?
There's a paper (doi:10.1093/bioinformatics/btq067) on using
locality-sensitive hashing to find highly similar fingerprints. Are
there cases where it's more useful than chemfp's direct search?
Several people have asked about GPU implementations. My feeling is
that the CPU is fast enough, and much easier to deploy. That's not
saying I wouldn't be interested in a GPU implementation, only
describing why it's not at the top of the list.
Thanks
======
In no particular order, the following contributed to chemfp in some
way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg
Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich
Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel
Lemire, Nathan Kurz, Chris Morley, Jörg Kurt Wegner, Björn
Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, and
Lionel Uran Landaburu.
Thanks also to my wife, Sara Marie, for her many years of support.
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`