chemfp 3.2 documentation

chemfp is a set of tools for working with cheminformatics fingerprints.

This is the documentation for the commerical version of chemfp. The documentation for chemfp 1.4, the no-cost version of chemfp, is available from http://chemfp.readthedocs.io/en/chemfp-1.4/.

Most people will use the command-line programs to generate and search fingerprint files. ob2fps, oe2fps, and rdkit2fps use respectively the Open Babel, OpenEye, and RDKit chemistry toolkits to convert structure files into fingerprint files. sdf2fps extracts fingerprints encoded in SD tags to make the fingerprint file. simsearch finds targets in a fingerprint file which are sufficiently similar to the queries. fpcat converts between FPS and FPB formats and merges multiple fingerprint files into one.

The programs are built using the chemfp Python library API. The search capabilities are part of the public API, as well as a cross-toolkit API for reading and writing molecules from structure files or strings, and for computing molecular fingerprints.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.

Chemfp 3.2 was released on 19 March 2018. It supports Python 2.7 and 3.5+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit.

List of chapters

License and advertisement

This program was developed by Andrew Dalke <dalke@dalkescientific.com>, Andrew Dalke Scientific, AB. It is distributed under the “MIT” license, shown below.

Further chemfp development depends on funding from people like you. Asking for voluntary contributions almost never works. Instead, starting with chemfp 1.1, the source code is distributed under an incentive program. You can pay for the commerical distribution, or use the no-cost version.

If you pay for the commercial distribution then you will get the most recent version of chemfp, free upgrades for one year, support, and a discount on renewing participation in the incentive program.

I also maintain the chemfp-1.x series. Version chemfp-1.4 is available at no cost from chemfp.com, or if you know someone with chemfp 2.x or 3.x you might be able to get it from them at no cost. It’s free/open source software, after all.

If you have questions about or with to purchase the commercial distribution, send an email to sales@dalkescientific.com .

Copyright (c) 2010-2018 Andrew Dalke Scientific, AB (Sweden)

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Copyright to portions of the code are held by other people or organizations, and may be under a different license. See the specific code for details. These are:

  • OpenMP, cpuid, POPCNT, and Lauradoux implementations by Kim Walisch, <kim.walisch@gmail.com>, under the MIT license
  • SSSE3.2 popcount implementation by Stanford University (written by Imran S. Haque <ihaque@cs.stanford.edu>) under the BSD license
  • The AVX2 popcount implementation by Daniel Lemire, Nathan Kurz, Owen Kaser, et al. under the Apache 2 license
  • heapq and ascii_buffer_converter by the Python Software Foundation under the Python license
  • TimSort code by Christopher Swenson under the MIT License
  • tests/unittest2 by Steve Purcell, the Python Software Foundation, and others, under the Python license
  • chemfp/rdmaccs.patterns and chemfp/rdmaccs2.patterns by Rational Discovery LLC, Greg Landrum, and Julie Penzotti, under the 3-Clause BSD License

What’s new in version 3.2

Released 19 March 2018

This version mostly contains bug fixes and internal improvements. The biggest additions are support for Dave Cosgrove’s ‘flush’ fingerprint file format, and support for ‘fromAtoms’ in some of the RDKit fingerprints.

The configuration has changed to use setuptools.

Previously the command-line programs were installed as small scripts. Now they are created and installed using the “console_scripts” entry_point as part of the install process. This is more in line with the modern way of installing command-line tools for Python.

If these scripts are no longer installed correctly, please let me know.

If you have installed the chemfp_converters package then chemfp will use it to read and write fingerprint files in flush format. It can be used as output from the *2fps programs, as input and output to fpcat, and as query input to simsearch.

Added “fromAtoms” support for the RDKit hash, torsion, Morgan, and pair fingerprints. This is primarily useful if you want to generate the circular environment around specific atoms of a single molecule, and you know the atom indices. If you pass in multiple molecules then the same indices will be used for all of them. Out-of-range values are ignored.

The command-line option is --from-atoms, which takes a comma-separated list of non-negative integer atom indices. For examples:

--from-atoms 0
--from-atoms 29,30

The corresponding fingerprint type strings have also been updated. If fromAtoms is specified then the string fromAtoms=i,j,k,… is added to the string. If it is not specified then the fromAtoms term is not present, in order to maintain compability with older types strings. (The philosophy is that two fingerprint types are equivalent if and only if their type strings are equivalent.)

The --from-atoms option is only useful when there’s a single query and when you have some other mechanism to determine which subset of the atoms to use. For example, you might parse a SMILES, use a SMARTS pattern to find the subset, get the indices of the SMARTS match, and pass the SMILES and indices to rdk2fps to generate the fingerprint for that substructure.

Be aware that the union of the fingerprint for --from-atoms X and the fingerprint for --from-atoms Y might not be equal to the fingerprint for --from-atoms X,Y. However, if a bit is present in the union of the X and Y fingerprints then it will be present in the X,Y fingerprint.

Why? The fingerprint implementation first generates a sparse count fingerprint, then converts that to a bitstring fingerprint. The conversion is affected by the feature count. If a feature is present in both X and Y then X,Y fingerprint may have additional bits sets over the individual fingerprints.

Bug fixes

Fixed a bug in FPB identifier index lookup. When the id’s hash didn’t exist, it got stuck in an infinite loop. There is a special token to identify the end of the hash chain. Unfortunately, that token wasn’t marked as a b”byte string” during the Python 2to3 conversion, so that token was never found, causing the code to loop over the chain forever. It is now a byte string, and a check was added to prevent infinite loops.

Fixed a bug where a k=0 similarity search using an FPS file as the targets caused a segfault. The code assumed that k would be at least 1. If you do a k=0 search, it will currently read the entire file, checking for format errors, and return no hits.

Chemfp no longer generates Python warnings. That is, the regression tests all pass under “python -W error unit2 discover”. The biggest problem was the ResourceWarning from all of the files which were never explicitly closed. They used to depend on the garbage collector to close the file but now use either through a context manager or with close(). In addition, several strings contains invalid escape characters and some regression tests used deprecated APIs.

The context manager and close() method for the FPBFingerprintAreana now close the underlying file object/mmap rather than depend on the garbage collector.

The readers and writers which are wrappers to an iterator which may hold a file object, and where the file object was created by chemfp, now know to close() the wrapped iterator when processing is over.

Added a check that the threshold and count symmetric arena searches have a popcount. Unordered arenas caused the code to segfault.

What’s new in version 3.1

Released 17 September 2017

The new specialized POPCNT implementation for PubChem/CACTVS keys increases search performance for that case by about 15%.

The SearchResults object gained the to_csr() method and the shape attribute. The new method returns a SciPy compressed sparse row matrix containing the similarity scores, which can be passed into scikit-learn for clustering.

The fall 2017 release of OEChem will accept InChI strings as structure input. The chemfp wrapper now knows about this, as well as the two new InChI output flavors “RelativeStereo” and “RacemicStereo”.

The fall 2017 release of RDKit will fix a bug in the pattern fingerprint definitions. The new chemfp fingerprint type is RDKit-Pattern/4.

Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty identifiers. Previously the default --errors setting of “ignore” simply skipped those records, without any warning messages. This caused problems processing the ChEBI SD file. Most of the records have an empty title line, so only a few fingerprint records were generated. It wasn’t obvious that the resulting data set was useless. The new code always reports a warning for empty or missing identifiers, even with “ignore”. If the --errors is “strict” then the warning becomes an error and processing stops.

Updated the #software line to include “chemfp/3.1” in addition to the toolkit information. This helps distinguish between, say, two different programs which generate RDKit Morgan fingerprints. It’s also possible that a chemfp bug can affect the fingerprint output, so the extra term makes it easier to identify a bad dataset.

There are several small fixes related to memory leaks, the bytes/Unicode distinction in Python 3, error messages, and error handling.

Removed chemfp.progressbar and chemfp.futures. These were included in chemfp 1.1 because I used them in a project for one customer and thought they might be useful in future chemfp projects. They were not. Also removed chemfp.argparse because chemfp 3.0 dropped support for Python 2.6.

What’s new in version 3.0.1

Released 28 August 2017

This is a bug-fix release. This fixes a critical bug in the general-purpose POPCNT popcount implementation and a bug in the code to detect the RDKit Pattern fingerprint change in 2017.3.

See the CHANGELOG for details.

What’s new in version 3.0

Released 2 May 2017

Chemfp now supports both Python 2.7 and Python 3.5 or later. It no longer supports version before Python 2.7. Chemfp will support Python 2.7 at least until 2020, which is the end-of-life for Python 2.7.

This required extensive changes to distinguish between text/Unicode strings and byte strings. The biggest user-facing change is that identifiers are now treated as Unicode strings. Fingerprints are still treated as byte strings.

This change is not backwards compatible. The APIs function parameters are polymorphic, so in most cases you can pass in either a Unicode string or a UTF-8 encoded byte string. However, the return type for an identifier is Unicode, which will likely cause problems with existing code which expects bytes.

All of the chemistry toolkits have decided to treat files as UTF-8 encoded. Chemfp’s “text toolkit” offers limited support for reading Latin-1 encoded files. This is a tricky topic so contact me if you have questions or problems.

I have removed the “make_string_creator()” function because it was hard to explain, hard to maintain, and had little performance improvement over passing in the arguments to chemfp.create_string(). This will break compatibility, but then again, I don’t think anyone used it. If it is a problem, I suggest creating a function, as in the following:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
u'O-c1:c:c:c:c:c:1'
>>> def make_string(mol):
...   return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
...
>>> make_string(mol)
u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you’ll see the other major backwards incompatibility. The function chemfp.create_string() now return a Unicode string instead of a byte string. This also means its format parameter no longer accepts the “.zlib” or “.gzip” extensions.

Instead, to get the old behavior use the new API function chemfp.create_bytes():

>>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
'O-c1:c:c:c:c:c:1'
>>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'

There’s a similar change between chemfp.open_molecule_writer_to_string() and the new function chemfp.open_molecule_writer_to_bytes().

There are also some new features in version 3.0 which don’t break compatibility.

Similarity search is faster because there are now specialized popcount implementations based on the fingerprint length. On one benchmark, 166-bit searches are 35% faster, 1024-bit searches are 25% faster, and 2048-bit searches are 5% faster.

There is a new popcount implementation for processors with the AVX2 instruction set. It is about 15% faster than the POPCNT version for 2048 bit fingerprints. To test it out you will have to compile chemfp with --with-avx2 enabled.

Added support for the Avalon fingerprints in RDKit, if RDKit has been compiled with Avalon support.

What’s new in version 2.1

Released 2 July 2015

Version 2.1 adds Tversky support for every place there was Tanimoto search (except the handful of deprecated APIs). There are new search routines for FPS and arena searches, including OpenMP support, and new bitops functions to compute a Tversky index between two fingerprints.

The k-nearest arena searches now support OpenMP. Previously they were single threaded even though the other search functions supported multiple threads.

The built-in SDF parser saw a couple improvements, including support for both “\n” and “\r\n” newlines, instead of only “\n” newlines.

There were a number of bug fixes that concern edge cases. For example, some 64-bit double calculations could be off-by-one in the last digit, and fingerprints with 0 bits set could cause a few problems.

What’s new in version 2.0

Released 8 April 2015

Version 2.0 includes many new features designed for web service development. The new “FPB” binary fingerprint file format is very fast to load, which is great for web server reloading during development and on the command-line. The speed comes from using a memory-mapped file, which also means that multiple chemfp instances can use the same file on the same machine without extra memory overhead.

The most extensive improvement is the new portable API for working with structure files and fingerprint types. The moment you start working with multiple chemistry toolkits, you realize that they all have different ways to read and write molecules, and to generate fingerprints from a molecule. Chemfp tries hard to have a consistent API for these common tasks, without sacrificing performance, so you can get get your work done. For example, with the new API it’s easy to take an SD record as an input string, compute the MACCS fingerprints for each available toolkit, add the results as new SD tags, and return the updated record.

This sounds so easy, doesn’t it? It took about a year to develop. The API is quite extensive, and includes the ability to pass toolkit-specific options to the underlying parsers, a low-level SDF parser that can be used to index a file, a way to get a list of available formats and fingerprint types, and methods to parse fingerprint arguments from strings.

New with version 2.0 is the ability to handle PubChem-sized data. Previous versions used 32 bit indexing and had a limit of 4GB, which is enough for 33M 1024-bit fingerprints, but PubChem has about twice that many structures.

There are also a lot of improvements, bug fixes, and performance tweaks. For example, the FPS reader is now almost twice as fast! For details, see the CHANGELOG file of the release.

Future

The chemfp code base is solid and in use at many companies, some of whom have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there’s plenty left to do in future. Here’s a mixture of things that are likely and things which are possibilties. Of course funding and feedback would help prioritize things. Let me know if you need something like one of these.

Right now you’re limited to the built-in toolkit fingerprint types, plus chemfp’s own SMARTS-based fingerprints. There should be a registration system so you can tell chemfp about user-defined fingerprint types.

I would like some way to select fingeprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.

Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful other alternatives.

Chemfp does not currently support Microsoft Windows computer because the code assumes the LP64 model, where “int” is 32 bits and “long” is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where “int” and “long” are 32 bits and “long long” is 64 bits. Once that’s done, I’ll have to figure out how to make an installer. I’ve decided to put it off until a someone (or someones) fund it.

The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (ie, with a minimum threshold of 0.0) of a large arena, then time and space to maintain the sparse data structure becomes noticable. It’s likely in that case that you want to store the scores in a 2D NumPy matrix.

I’m really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.

If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That’s possible, in that the software was designed with that in mind, but it needs more development and testing.

Chemfp ever since version 1.1 supports OpenMP. That’s great for shared-memory machines. Are you interested in supporting a distributed computing version?

There are any number of higher-level tools which can be built on the chemfp components. For example, what about a wsgi component which implements a web-based search API for your local network? Wouldn’t it be nice to say:

fpserver filename1.fpb

and have a simple search service?

What about an IPython visualization tool?

There’s a paper (doi:10.1093/bioinformatics/byq067) on using locality-sensitive hashing to find highly similar fingerprints. Are there cases where it’s more useful than chemfp’s direct search?

Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That’s not saying I wouldn’t be interested in a GPU implementation, only describing why it’s not at the top of the list.

Thanks

In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Phil Evans, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, and Lionel Uran Landaburu.

Thanks also to my wife, Sara Marie, for her many years of support.

Indices and tables