.. _intro:
==========================
chemfp 3.2.1 documentation
==========================
`chemfp `_ is a set of tools for working with
cheminformatics fingerprints.
This is the documentation for the commercial version of chemfp. The
documentation for chemfp 1.4, the no-cost version of chemfp, is
available from http://chemfp.readthedocs.io/en/chemfp-1.4/.
Most people will use the command-line programs to generate and search
fingerprint files. :ref:`ob2fps `, :ref:`oe2fps `, and
:ref:`rdkit2fps ` use respectively the `Open Babel
`_, `OpenEye `_, and
`RDKit `_ chemistry toolkits to convert
structure files into fingerprint files. :ref:`sdf2fps `
extracts fingerprints encoded in SD tags to make the fingerprint
file. :ref:`simsearch ` finds targets in a fingerprint file
which are sufficiently similar to the queries. :ref:`fpcat `
converts between FPS and FPB formats and merges multiple fingerprint
files into one.
The programs are built using the :ref:`chemfp Python library API
`. The search capabilities are part of the public API, as
well as a cross-toolkit API for reading and writing molecules from
structure files or strings, and for computing molecular
fingerprints.
Remember: chemfp cannot generate fingerprints from a structure file
without a third-party chemistry toolkit.
Chemfp 3.2.1 was released on 12 April 2018. It supports Python 2.7
and 3.5+ and can be used with any recent version of OEChem/OEGraphSim,
Open Babel, or RDKit. See :ref:`What's New ` for a
description of the changes.
List of chapters
================
.. toctree::
   :maxdepth: 3

   installing
   using-tools
   tool-help
   using-api
   fingerprint_types
   toolkit
   text_toolkit
   api

License and advertisement
=========================
This program was developed by Andrew Dalke, Andrew Dalke Scientific,
AB. It is available for purchase under an academic license, a
commercial
proprietary license, or an open source (MIT) license. A purchase of a
license includes free upgrades and support for one year, and a
discount on support renewal. (The support for the academic license is
more limited than the other two options.)
I also maintain the chemfp-1.x series. Version chemfp-1.4 is available
at no cost from chemfp.com, or if you know someone with a copy of
chemfp 2.x or 3.x under the MIT license, you might be able to get it
from them at no cost.
If you have questions about, or wish to purchase, the commercial
distribution, send an email to sales@dalkescientific.com. You may
also request a demo license for evaluation.
.. highlight:: none

::

  Copyright (c) 2010-2018 Andrew Dalke Scientific, AB (Sweden)

  The chemfp source code contains confidential and proprietary source
  code. It is NOT for public release unless you have purchased a copy
  of the source code under the MIT license. Otherwise you may use and
  modify this software ONLY under the terms of the License agreement
  you or your organization made with Andrew Dalke Scientific AB.

  Andrew Dalke Scientific AB warrants that; a) it has the right to
  grant a license for the Use of the Software according to the terms
  of this agreement; b) for the ninety days following the delivery
  (the “Software Warranty Period”), the software will perform
  substantially in accordance with the specifications described in the
  Manual, when properly operated on the designed Platform. In case of
  a breach of the Limited Warranty, Customer’s exclusive remedy is to
  destroy all copies of the Software and Supporting Materials and
  receive a full refund.

  Dalke Scientific makes no other warranties, express, implied or
  statutory, regarding the software or services, and expressly
  disclaims all implied warranties of merchantability,
  non-infringement, title and fitness for a particular
  purpose. Neither Dalke Scientific nor its suppliers warrant that the
  software or any services performed hereunder will be free from
  defects.

  Copyright to portions of the code are held by other people or
  organizations, and under different licenses. These are:

   - OpenMP, cpuid, POPCNT, and Lauradoux implementations by Kim
     Walisch, under the MIT license
   - SSSE3.2 popcount implementation by Stanford University (written by
     Imran S. Haque) under the BSD license
   - The AVX2 popcount implementation by Daniel Lemire, Nathan Kurz,
     Owen Kaser, et al. under the Apache 2 license
   - heapq and ascii_buffer_converter by the Python Software Foundation
     under the Python license
   - TimSort code by Christopher Swenson under the MIT License
   - tests/unittest2 by Steve Purcell, the Python Software Foundation,
     and others, under the Python license
   - chemfp/rdmaccs.patterns and chemfp/rdmaccs2.patterns by Rational
     Discovery LLC, Greg Landrum, and Julie Penzotti, under the 3-Clause
     BSD License

.. _whats-new:
What's new in version 3.2.1
===========================
Released 12 April 2018
The biggest change is in the chemfp license. The commercial version is
now distributed under a proprietary license instead of the MIT open
source license.
There are two other minor changes. The build process now includes
support for AVX2 by default, and the fingerprint writer classes have a
new 'format' attribute which is either "fps" or "fpb", or is None if
not defined.
License key
-----------
This marks the first release of chemfp with a proprietary license.
Or rather, licenses. There is an academic license and commercial
licenses in various flavors. In addition, chemfp is still available
under the open source MIT license, though that option is the most
expensive. The chemfp 1.x series (currently chemfp 1.4) is still
available for no cost under the MIT license, and receives updates, but
it only supports Python 2.7 and it does not have as many features.
Chemfp 3.2.1 is available in source code and as a pre-compiled Python
package which should run under `most x86 64-bit Linux-based OSes
`_. The pre-compiled
package requires a license key.
The license key is date locked. If a valid key is not found then
"import chemfp" will print diagnostic messages to stderr and
fingerprint search and arena generation functionality will be
disabled. If you call one of the disabled functions then it will raise
a NotImplementedError exception. Simsearch will not work, and neither
will FPB generation.
Chemfp will look for the license key in the CHEMFP_LICENSE environment
variable. For example, in bash::

  export CHEMFP_LICENSE=20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB

The first 8 digits are the year, month, and day on which the license
expires, in GMT. In this demo example the license expired at the end
of Christmas Day of 2010.
After the date comes optional configuration data including a user
identifier, followed by the '@', and ending with a validation key.
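The key layout described above can be illustrated with a short
sketch. This is not part of chemfp: the function name
``parse_license_expiration`` is hypothetical, and the code only reads
the expiration date from the first 8 digits. It does not check the
validation key.

.. code-block:: python

   import datetime

   def parse_license_expiration(key):
       """Return the expiration date encoded at the start of a license key.

       Hypothetical helper for illustration; it does not validate the key.
       """
       # Everything before the '@' is the date plus optional configuration data.
       prefix, sep, validation = key.partition("@")
       if not sep or not validation:
           raise ValueError("expected '<date>[<config>]@<validation key>'")
       date_part = prefix[:8]
       if len(date_part) != 8 or not date_part.isdigit():
           raise ValueError("key must start with 8 digits (YYYYMMDD)")
       return datetime.datetime.strptime(date_part, "%Y%m%d").date()

   demo_key = "20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB"
   print(parse_license_expiration(demo_key).isoformat())   # 2010-12-25

In real code, use chemfp.get_license_date(), described below, which
also accounts for the security check.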
There is no centralized license manager, and you may run chemfp on as
many computers at your site as you wish, within the limits of your
license agreement.
There are two new API functions:
- chemfp.is_licensed() - return True if the license key is valid or no
license key is needed, otherwise return False.
- chemfp.get_license_date() - return the license key expiration date
as a 3-element tuple in the form (year, month, day). If the license
key is not found or does not pass the security check then the
function returns None. If this version of chemfp does not need a
license key then it returns (9999, 12, 25).
Why the change in license policy?
---------------------------------
In 2009 or so I decided to see if I could make a living `selling free
software `_. Most people
who develop open source software for chemistry get their funding from
other sources. Academics might be funded from grants, a company might
use an open source project for business reasons, as a way to lower
overall costs. Some companies sell a proprietary product or access to
a service which uses an open source component, where the income from
the non-free sources funds the free software development. But I can
only think of one or two cases where people tried to make a
living off of the source code itself, and they were not that successful.
I had some ideas of how it might be successful, and tried them
out. While I had some sales, I never made anywhere near what I would
have made for the same effort as a consultant or contractor.
I also ran into some difficulties. Most software companies provide
their software either free or with steep discounts to academic
organizations. If I do that with the most recent version of chemfp, I
take a rather large risk that some grad student will post the source
on GitHub. (Pharmaceutical company employees are much less likely to
do that.)
I charge a lot of money for chemfp, because the few people who need
high performance similarity search are willing to pay for
it. Potential customers want to try it out. Since I either control the
copyright or use components which allow proprietary use, I was able to
make a non-disclosure agreement for the evaluation period. Had I been
using GPL-based components, and thus restricted to a free software
license, that would have been impossible.
I could continue to work at it trying to make a living selling free
software, but after 9 years of trying I decided it's time to switch to
a more standard proprietary licensing scheme.
The chemfp 1.x line will still be available at no cost under the MIT
license.
AVX2 popcount enabled by default
--------------------------------
AVX2 compilation is now enabled by default. It was disabled in earlier
releases because the AVX2 command-line flag was used to compile every
file and I was worried that it might result in a binary which couldn't
be used by older hardware. For this release I figured out how to use
the -mssse3 and -mavx2 flags only for the relevant popcount
calculations.
At run-time chemfp will detect which CPU-specific features are
available and only use the SSSE3 or AVX2 implementations when
appropriate.
What's new in version 3.2
=========================
Released 19 March 2018
This version mostly contains bug fixes and internal improvements. The
biggest additions are support for Dave Cosgrove's 'flush' fingerprint
file format, and support for 'fromAtoms' in some of the RDKit
fingerprints.
The configuration has changed to use setuptools.
Previously the command-line programs were installed as small
scripts. Now they are created and installed using the
"console_scripts" entry_point as part of the install process. This is
more in line with the modern way of installing command-line tools for
Python.
If these scripts are no longer installed correctly, please let me
know.
If you have installed the `chemfp_converters package
`_ then chemfp will
use it to read and write fingerprint files in flush format. It can be
used as output from the \*2fps programs, as input and output to fpcat,
and as query input to simsearch.
Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and
pair fingerprints. This is primarily useful if you want to generate
the circular environment around specific atoms of a single molecule,
and you know the atom indices. If you pass in multiple molecules then
the same indices will be used for all of them. Out-of-range values are
ignored.
The command-line option is :option:`--from-atoms`, which takes a
comma-separated list of non-negative integer atom indices. For
example::

  --from-atoms 0
  --from-atoms 29,30

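If the atom indices come from another program, a small helper can
build the option value. The following is a minimal sketch, not part of
chemfp: ``format_from_atoms`` is a hypothetical name, and rejecting an
empty list is an assumption made for this example.

.. code-block:: python

   def format_from_atoms(indices):
       """Format atom indices as a --from-atoms value (hypothetical helper)."""
       values = []
       for index in indices:
           # The option takes non-negative integer atom indices.
           if not isinstance(index, int) or index < 0:
               raise ValueError("not a non-negative integer: %r" % (index,))
           values.append(str(index))
       if not values:
           # Assumption for this sketch: an empty list is treated as an error.
           raise ValueError("at least one atom index is required")
       return ",".join(values)

   print("--from-atoms " + format_from_atoms([29, 30]))   # --from-atoms 29,30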
The corresponding fingerprint type strings have also been updated. If
fromAtoms is specified then the string `fromAtoms=i,j,k,...` is added
to the string. If it is not specified then the fromAtoms term is not
present, in order to maintain compatibility with older type
strings. (The philosophy is that two fingerprint types are equivalent
if and only if their type strings are equivalent.)
The :option:`--from-atoms` option is only useful when there's a single
query and when you have some other mechanism to determine which subset
of the atoms to use. For example, you might parse a SMILES, use a
SMARTS pattern to find the subset, get the indices of the SMARTS
match, and pass the SMILES and indices to rdkit2fps to generate the
fingerprint for that substructure.
Be aware that the union of the fingerprint for :option:`--from-atoms`
X and the fingerprint for :option:`--from-atoms` Y might not be equal
to the fingerprint for :option:`--from-atoms` X,Y. However, if a bit
is present in the union of the X and Y fingerprints then it will be
present in the X,Y fingerprint.
Why? The fingerprint implementation first generates a sparse count
fingerprint, then converts that to a bitstring fingerprint. The
conversion is affected by the feature count. If a feature is present
in both X and Y then the X,Y fingerprint may have additional bits set
over the individual fingerprints.
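A toy model shows the effect. The count-to-bit scheme below is an
assumption made for illustration and is not RDKit's actual encoding:
each feature owns two bit positions, the first set when its count is
at least 1 and the second when its count is at least 2. The feature
names are also hypothetical.

.. code-block:: python

   from collections import Counter

   # Toy scheme (an assumption): bit (feature, 0) is set when the count is
   # >= 1 and bit (feature, 1) is set when the count is >= 2.
   THRESHOLDS = (1, 2)

   def to_bits(counts):
       """Convert a sparse count fingerprint into a set of bit labels."""
       bits = set()
       for feature, count in counts.items():
           for level, minimum in enumerate(THRESHOLDS):
               if count >= minimum:
                   bits.add((feature, level))
       return bits

   counts_x = Counter({"A": 1, "B": 1})   # features found from atom set X
   counts_y = Counter({"B": 1, "C": 1})   # features found from atom set Y
   counts_xy = counts_x + counts_y        # feature "B" now has a count of 2

   union = to_bits(counts_x) | to_bits(counts_y)
   fp_xy = to_bits(counts_xy)

   print(union <= fp_xy)          # True: every bit in the union is in X,Y
   print(sorted(fp_xy - union))   # [('B', 1)]: the extra bit from the count

In this model the shared feature "B" reaches the second threshold only
when both atom sets are used together, so the combined fingerprint has
one bit that neither individual fingerprint has.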
Bug fixes
---------
Fixed a bug in FPB identifier index lookup. When the id's hash didn't
exist, it got stuck in an infinite loop. There is a special token to
identify the end of the hash chain. Unfortunately, that token wasn't
marked as a b"byte string" during the Python 2to3 conversion, so that
token was never found, causing the code to loop over the chain
forever. It is now a byte string, and a check was added to prevent
infinite loops.
Fixed a bug where a k=0 similarity search using an FPS file as the
targets caused a segfault. The code assumed that k would be at least
1. If you do a k=0 search, it will currently read the entire file,
checking for format errors, and return no hits.
Chemfp no longer generates Python warnings. That is, the regression
tests all pass under "python -W error unit2 discover". The biggest
problem was the ResourceWarning from all of the files which were never
explicitly closed. They used to depend on the garbage collector to
close the file but are now closed either through a context manager or
with close(). In addition, several strings contained invalid escape
sequences and some regression tests used deprecated APIs.
The context manager and close() method for the FPBFingerprintArena
now close the underlying file object/mmap rather than depend on the
garbage collector.
The readers and writers which are wrappers to an iterator which may
hold a file object, and where the file object was created by chemfp,
now know to close() the wrapped iterator when processing is over.
Added a check that the threshold and count symmetric arena searches
have a popcount. Unordered arenas caused the code to segfault.
What's new in version 3.1
=========================
Released 17 September 2017
The new specialized POPCNT implementation for PubChem/CACTVS keys
increases search performance for that case by about 15%.
The SearchResults object gained the
:meth:`to_csr() <.SearchResults.to_csr>` method and the :attr:`shape
<.SearchResults.shape>` attribute. The new method returns a `SciPy
compressed sparse row matrix
`_ containing
the similarity scores, which can be passed into `scikit-learn `_ for
clustering.
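For readers unfamiliar with the compressed sparse row layout, here is
a minimal pure-Python sketch of how per-query hit lists flatten into
the three CSR arrays. The hit data and the ``to_csr_arrays`` function
are hypothetical; this is not how chemfp implements ``to_csr()``.

.. code-block:: python

   # Hypothetical search hits: one list per query (row), holding
   # (target column index, similarity score) pairs.
   hits_per_query = [
       [(1, 0.82), (4, 0.71)],
       [],
       [(0, 0.95)],
   ]

   def to_csr_arrays(rows):
       """Flatten per-row hit lists into CSR-style (indptr, indices, data)."""
       indptr, indices, data = [0], [], []
       for row in rows:
           for column, score in row:
               indices.append(column)
               data.append(score)
           # indptr[i]:indptr[i+1] is the slice of hits belonging to row i
           indptr.append(len(indices))
       return indptr, indices, data

   indptr, indices, data = to_csr_arrays(hits_per_query)
   print(indptr)    # [0, 2, 2, 3]
   print(indices)   # [1, 4, 0]
   print(data)      # [0.82, 0.71, 0.95]

With SciPy installed, the same three arrays can be turned into a
matrix with ``scipy.sparse.csr_matrix((data, indices, indptr),
shape=(3, 5))``.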
The fall 2017 release of OEChem will accept InChI strings as structure
input. The chemfp wrapper now knows about this, as well as the two new
InChI output flavors "RelativeStereo" and "RacemicStereo".
The fall 2017 release of RDKit will fix a bug in the pattern
fingerprint definitions. The new chemfp fingerprint type is
RDKit-Pattern/4.
Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty
identifiers. Previously the default :option:`--errors` setting of
"ignore" simply skipped those records, without any warning
messages. This caused problems processing the ChEBI SD file. Most of
the records have an empty title line, so only a few fingerprint
records were generated. It wasn't obvious that the resulting data set
was useless. The new code always reports a warning for empty or
missing identifiers, even with "ignore". If :option:`--errors` is
"strict" then the warning becomes an error and processing stops.
Updated the #software line to include "chemfp/3.1" in addition to the
toolkit information. This helps distinguish between, say, two
different programs which generate RDKit Morgan fingerprints. It's also
possible that a chemfp bug can affect the fingerprint output, so the
extra term makes it easier to identify a bad dataset.
There are several small fixes related to memory leaks, the
bytes/Unicode distinction in Python 3, error messages, and error
handling.
Removed chemfp.progressbar and chemfp.futures. These were included in
chemfp 1.1 because I used them in a project for one customer and
thought they might be useful in future chemfp projects. They were
not. Also removed chemfp.argparse because chemfp 3.0 dropped support
for Python 2.6.
What's new in version 3.0.1
===========================
Released 28 August 2017
This is a bug-fix release. This fixes a critical bug in the
general-purpose POPCNT popcount implementation and a bug in the code
to detect the RDKit Pattern fingerprint change in 2017.3.
See the CHANGELOG for details.
What's new in version 3.0
=========================
Released 2 May 2017
Chemfp now supports both Python 2.7 and Python 3.5 or later. It no
longer supports versions before Python 2.7. Chemfp will support Python
2.7 at least until 2020, which is the end-of-life for Python 2.7.
This required extensive changes to distinguish between text/Unicode
strings and byte strings. The biggest user-facing change is that
identifiers are now treated as Unicode strings. Fingerprints are still
treated as byte strings.
This change is not backwards compatible. The API's function parameters
are polymorphic, so in most cases you can pass in either a Unicode
string or a UTF-8 encoded byte string. However, the return type for an
identifier is Unicode, which will likely cause problems with existing
code which expects bytes.
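Existing code that mixes byte and Unicode identifiers can normalize
them at its boundary. The following is a minimal sketch and not part
of chemfp: ``as_text_id`` is a hypothetical helper, and it assumes
that byte-string identifiers are UTF-8 encoded.

.. code-block:: python

   def as_text_id(identifier):
       """Return an identifier as a text (Unicode) string.

       Hypothetical helper: byte strings are assumed to be UTF-8 encoded,
       and text strings are returned unchanged.
       """
       if isinstance(identifier, bytes):
           return identifier.decode("utf-8")
       return identifier

   # The byte and text forms of the same identifier now compare equal.
   print(as_text_id(b"CHEMBL25") == as_text_id(u"CHEMBL25"))   # True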
All of the chemistry toolkits have decided to treat files as UTF-8
encoded. Chemfp's "text toolkit" offers limited support for reading
Latin-1 encoded files. This is a tricky topic so contact me if you
have questions or problems.
I have removed the "make_string_creator()" function because it was
hard to explain, hard to maintain, and had little performance
improvement over passing in the arguments to
:func:`chemfp.create_string`. This will break compatibility, but then
again, I don't think anyone used it. If it is a problem, I suggest
creating a function, as in the following::

  >>> from chemfp import rdkit_toolkit as T
  >>> mol = T.parse_molecule("c1ccccc1O", "smistring")
  >>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  u'O-c1:c:c:c:c:c:1'
  >>> def make_string(mol):
  ...     return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  ...
  >>> make_string(mol)
  u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you'll see the other
major backwards incompatibility. The function :func:`chemfp.create_string` now
returns a Unicode string instead of a byte string. This also means its
`format` parameter no longer accepts the ".zlib" or ".gzip" extensions.
Instead, to get the old behavior use the new API function
:func:`chemfp.create_bytes`::

  >>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
  'O-c1:c:c:c:c:c:1'
  >>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
  'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'

There's a similar change between :func:`chemfp.open_molecule_writer_to_string`
and the new function :func:`chemfp.open_molecule_writer_to_bytes`.
There are also some new features in version 3.0 which don't break
compatibility.
Similarity search is faster because there are now specialized popcount
implementations based on the fingerprint length. On one benchmark,
166-bit searches are 35% faster, 1024-bit searches are 25% faster, and
2048-bit searches are 5% faster.
There is a new popcount implementation for processors with the AVX2
instruction set. It is about 15% faster than the POPCNT version for
2048 bit fingerprints. To test it out you will have to compile chemfp
with :option:`--with-avx2` enabled.
Added support for the Avalon fingerprints in RDKit, if RDKit has been
compiled with Avalon support.
What's new in version 2.1
=========================
Released 2 July 2015
Version 2.1 adds Tversky support for every place there was Tanimoto
search (except the handful of deprecated APIs). There are new search
routines for FPS and arena searches, including OpenMP support, and new
bitops functions to compute a Tversky index between two fingerprints.
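As a reminder of what is computed, here is a pure-Python sketch of the
Tversky index between two byte-string fingerprints. It assumes the
usual definition, c / (alpha*a + beta*b + c), where c is the number of
bits in common and a and b are the bits unique to each fingerprint.
The function names are hypothetical and this is not chemfp's
implementation; returning 0.0 for an empty denominator is a choice
made for this sketch.

.. code-block:: python

   def popcount(fp):
       """Count the set bits in a byte-string fingerprint."""
       return sum(bin(byte).count("1") for byte in bytearray(fp))

   def tversky(fp1, fp2, alpha=1.0, beta=1.0):
       """Tversky index between two equal-length byte-string fingerprints."""
       if len(fp1) != len(fp2):
           raise ValueError("fingerprints must have the same length")
       pairs = zip(bytearray(fp1), bytearray(fp2))
       common = sum(bin(x & y).count("1") for x, y in pairs)
       only_1 = popcount(fp1) - common   # bits set only in fp1
       only_2 = popcount(fp2) - common   # bits set only in fp2
       denominator = alpha * only_1 + beta * only_2 + common
       # Sketch-specific choice: report 0.0 when the denominator is zero.
       return common / denominator if denominator else 0.0

   fp_a = b"\x0f\x01"   # 5 bits set
   fp_b = b"\x03\x01"   # 3 bits set, all of them also set in fp_a
   print(tversky(fp_a, fp_b))             # 0.6
   print(tversky(fp_a, fp_b, 0.0, 1.0))   # 1.0

With alpha = beta = 1 the Tversky index is the Tanimoto similarity.
The second call ignores the bits unique to ``fp_a``, so a fingerprint
that is a subset of ``fp_a`` scores 1.0.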
The k-nearest arena searches now support OpenMP. Previously they were
single threaded even though the other search functions supported
multiple threads.
The built-in SDF parser saw a couple improvements, including support
for both "\\n" and "\\r\\n" newlines, instead of only "\\n" newlines.
There were a number of bug fixes that concern edge cases. For example,
some 64-bit double calculations could be off-by-one in the last digit,
and fingerprints with 0 bits set could cause a few problems.
What's new in version 2.0
=========================
Released 8 April 2015
Version 2.0 includes many new features designed for web service
development. The new "FPB" binary fingerprint file format is very fast
to load, which is great for web server reloading during development
and on the command-line. The speed comes from using a memory-mapped
file, which also means that multiple chemfp instances can use the same
file on the same machine without extra memory overhead.
The most extensive improvement is the new portable API for working
with structure files and fingerprint types. The moment you start
working with multiple chemistry toolkits, you realize that they all
have different ways to read and write molecules, and to generate
fingerprints from a molecule. Chemfp tries hard to have a consistent
API for these common tasks, without sacrificing performance, so you
can get your work done. For example, with the new API it's easy to
take an SD record as an input string, compute the MACCS fingerprints
for each available toolkit, add the results as new SD tags, and return
the updated record.
This sounds so easy, doesn't it? It took about a year to develop. The
API is quite extensive, and includes the ability to pass
toolkit-specific options to the underlying parsers, a low-level SDF
parser that can be used to index a file, a way to get a list of
available formats and fingerprint types, and methods to parse
fingerprint arguments from strings.
New with version 2.0 is the ability to handle PubChem-sized
data. Previous versions used 32 bit indexing and had a limit of 4GB,
which is enough for 33M 1024-bit fingerprints, but PubChem has about
twice that many structures.
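The 33M figure follows from simple arithmetic, sketched here. It
counts only the fingerprint bytes and ignores any per-record or
per-file overhead.

.. code-block:: python

   BYTES_PER_FINGERPRINT = 1024 // 8   # a 1024-bit fingerprint is 128 bytes
   MAX_32BIT_SIZE = 2 ** 32            # 4 GiB reachable with 32-bit offsets

   max_fingerprints = MAX_32BIT_SIZE // BYTES_PER_FINGERPRINT
   print(max_fingerprints)   # 33554432, or about 33 million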
There are also a lot of improvements, bug fixes, and performance
tweaks. For example, the FPS reader is now almost twice as fast! For
details, see the CHANGELOG file of the release.
Future
======
The chemfp code base is solid and in use at many companies, some of
whom have paid for the commercial version. It has great support for
fingerprint generation, fast similarity search, and toolkit
portability, but there's plenty left to do in future. Here's a mixture
of things that are likely and things which are possibilities. Of course
funding and feedback would help prioritize things. `Let me know
`_ if you need something like one of
these.
Right now you're limited to the built-in toolkit fingerprint types,
plus chemfp's own SMARTS-based fingerprints. There should be a
registration system so you can tell chemfp about user-defined
fingerprint types.
I would like some way to select fingerprint subsets. My original
thought was something like an awk for the FPS format, with the ability
to select N fingerprints at random, or those matching a given set of
identifiers, etc. My current thought is to implement it as a sqlite
virtual table.
Chemfp supports Tanimoto and Tversky similarity. I could also add
support for other measures; cosine and Hamming seem like the most
useful other alternatives.
Chemfp does not currently support Microsoft Windows computers because
the code assumes the LP64 model, where "int" is 32 bits and "long" is
64 bits. It will require a lot of low-level work to tweak everything
correctly for the Windows LLP64 model, where "int" and "long" are 32
bits and "long long" is 64 bits. Once that's done, I'll have to figure
out how to make an installer. I've decided to put it off until
someone (or someones) funds it.
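To see which data model a given Python build was compiled for, you can
ask the standard library's ``ctypes`` module for the C type sizes.
This is only a diagnostic sketch; it says nothing about whether chemfp
itself will work on that platform.

.. code-block:: python

   import ctypes

   # Sizes, in bytes, of the C types for the compiler that built this Python.
   sizes = {
       "int": ctypes.sizeof(ctypes.c_int),
       "long": ctypes.sizeof(ctypes.c_long),
       "long long": ctypes.sizeof(ctypes.c_longlong),
       "void *": ctypes.sizeof(ctypes.c_void_p),
   }
   for name in ("int", "long", "long long", "void *"):
       print("%-9s %d bits" % (name, sizes[name] * 8))

On a 64-bit build, a 64-bit "long" indicates the LP64 model and a
32-bit "long" indicates the LLP64 model.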
The threshold and k-nearest arena search results store hits using
compressed sparse rows. These work well for sparse results, but when
you want the entire similarity matrix (i.e., with a minimum threshold of
0.0) of a large arena, then the time and space needed to maintain the
sparse data structure become noticeable. It's likely in that case that
you want to store the scores in a 2D NumPy matrix.
I'm really interested in using chemfp to handle different sorts of
clustering. Let me know if there are things I can add to the API which
would help you do that.
If you are not a Python programmer then you might prefer that the core
search routines be made accessible through a C API. That's possible,
in that the software was designed with that in mind, but it needs more
development and testing.
Chemfp has supported OpenMP since version 1.1. That's great for
shared-memory machines. Are you interested in supporting a distributed
computing version?
There are any number of higher-level tools which can be built on the
chemfp components. For example, what about a wsgi component which
implements a web-based search API for your local network? Wouldn't it
be nice to say::

  fpserver filename1.fpb

and have a simple search service?
What about an IPython visualization tool?
There's a paper (doi:10.1093/bioinformatics/btq067) on using
locality-sensitive hashing to find highly similar fingerprints. Are
there cases where it's more useful than chemfp's direct search?
Several people have asked about GPU implementations. My feeling is
that the CPU is fast enough, and much easier to deploy. That's not
saying I wouldn't be interested in a GPU implementation, only
describing why it's not at the top of the list.
Thanks
======
In no particular order, the following contributed to chemfp in some
way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg
Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich
Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel
Lemire, Nathan Kurz, Chris Morley, Jörg Kurt Wegner, Björn
Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley,
Lionel Uran Landaburu, and Sereina Riniker.
Thanks also to my wife, Sara Marie, for her many years of support.
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`