chemfp 3.2.1 documentation

chemfp is a set of tools for working with cheminformatics fingerprints.

This is the documentation for the commercial version of chemfp. The documentation for chemfp 1.4, the no-cost version of chemfp, is available from

Most people will use the command-line programs to generate and search fingerprint files. ob2fps, oe2fps, and rdkit2fps use respectively the Open Babel, OpenEye, and RDKit chemistry toolkits to convert structure files into fingerprint files. sdf2fps extracts fingerprints encoded in SD tags to make the fingerprint file. simsearch finds targets in a fingerprint file which are sufficiently similar to the queries. fpcat converts between FPS and FPB formats and merges multiple fingerprint files into one.

The programs are built using the chemfp Python library API. The search capabilities are part of the public API, as well as a cross-toolkit API for reading and writing molecules from structure files or strings, and for computing molecular fingerprints.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.

Chemfp 3.2.1 was released on 12 April 2018. It supports Python 2.7 and 3.5+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit. See What’s New for a description of the changes.


License and advertisement

This program was developed by Andrew Dalke <>, Andrew Dalke Scientific, AB. It is available for purchase under an academic license, a commercial proprietary license, or an open source (MIT) license. A purchase of a license includes free upgrades and support for one year, and a discount on support renewal. (The support for the academic license is more limited than the other two options.)

I also maintain the chemfp-1.x series. Version chemfp-1.4 is available at no cost from, or if you know someone with a copy of chemfp 2.x or 3.x under the MIT license, you might be able to get it from them at no cost.

If you have questions about or wish to purchase the commercial distribution, send an email to . You may also request a demo license for evaluation.

Copyright (c) 2010-2018 Andrew Dalke Scientific, AB (Sweden)

The chemfp source code contains confidential and proprietary source
code. It is NOT for public release unless you have purchased a copy
of the source code under the MIT license. Otherwise you may use and
modify this software ONLY under the terms of the License agreement
you or your organization made with Andrew Dalke Scientific AB.

Andrew Dalke Scientific AB warrants that; a) it has the right to
grant a license for the Use of the Software according to the terms
of this agreement; b) for the ninety days following the delivery
(the “Software Warranty Period”), the software will perform
substantially in accordance with the specifications described in the
Manual, when properly operated on the designed Platform. In case of
a breach of the Limited Warranty, Customer’s exclusive remedy is to
destroy all copies of the Software and Supporting Materials and
receive a full refund.

Dalke Scientific makes no other warranties, express, implied or
statutory, regarding the software or services, and expressly
disclaims all implied warranties of merchantability,
non-infringement, title and fitness for a particular
purpose. Neither Dalke Scientific nor its suppliers warrant that the
software or any services performed hereunder will be free from

Copyright to portions of the code are held by other people or organizations, and under different licenses. These are:

  • OpenMP, cpuid, POPCNT, and Lauradoux implementations by Kim Walisch, <>, under the MIT license
  • SSSE3.2 popcount implementation by Stanford University (written by Imran S. Haque <>) under the BSD license
  • The AVX2 popcount implementation by Daniel Lemire, Nathan Kurz, Owen Kaser, et al. under the Apache 2 license
  • heapq and ascii_buffer_converter by the Python Software Foundation under the Python license
  • TimSort code by Christopher Swenson under the MIT License
  • tests/unittest2 by Steve Purcell, the Python Software Foundation, and others, under the Python license
  • chemfp/rdmaccs.patterns and chemfp/rdmaccs2.patterns by Rational Discovery LLC, Greg Landrum, and Julie Penzotti, under the 3-Clause BSD License

What’s new in version 3.2.1

Released 12 April 2018

The biggest change is in the chemfp license. The commercial version is now distributed under a proprietary license instead of the MIT open source license.

There are two other minor changes. The build process now includes support for AVX2 by default, and the fingerprint writer classes have a new ‘format’ attribute which is either “fps” or “fpb”, or is None if not defined.

License key

This marks the first release of chemfp with a proprietary license.

Or rather, licenses. There is an academic license and commercial licenses in various flavors. In addition, chemfp is still available under the open source MIT license, though that option is the most expensive. The chemfp 1.x series (currently chemfp 1.4) is still available for no cost under the MIT license, and receives updates, but it only supports Python 2.7 and it does not have as many features.

Chemfp 3.2.1 is available in source code and as a pre-compiled Python package which should run under most x86 64-bit Linux-based OSes. The pre-compiled package requires a license key.

The license key is date locked. If a valid key is not found then “import chemfp” will print diagnostic messages to stderr and fingerprint search and arena generation functionality will be disabled. If you call one of the disabled functions then it will raise a NotImplementedError exception. Simsearch will not work, and neither will FPB generation.

Chemfp will look for the license key in the CHEMFP_LICENSE environment variable. For example, in bash:


The first 8 digits are the year, month, and date that the license expires, in GMT. In this demo example the license expired at the end of Christmas Day of 2010.

After the date comes optional configuration data including a user identifier, followed by the ‘@’, and ending with a validation key.
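The key layout described above can be sketched in a few lines of Python. This is only an illustration of the layout, not chemfp's own validation code: the helper names are hypothetical, the key string is a made-up placeholder rather than a real license, and splitting at the last '@' is an assumption.

```python
import datetime
import re


def parse_license_expiry(key):
    """Return the expiration date encoded in the first 8 digits (YYYYMMDD)."""
    match = re.match(r"([0-9]{4})([0-9]{2})([0-9]{2})", key)
    if match is None:
        raise ValueError("license key must start with 8 digits (YYYYMMDD)")
    year, month, day = (int(group) for group in match.groups())
    return datetime.date(year, month, day)


def split_validation_key(key):
    """Split a key into (date and configuration data, validation key).

    Assumption: the validation key is everything after the last '@'.
    """
    head, sep, validation = key.rpartition("@")
    if not sep:
        raise ValueError("license key has no '@' separator")
    return head, validation


# A made-up placeholder, NOT a real key. The "-demo" configuration part and
# its separator are invented for this example.
example_key = "20101225-demo@PLACEHOLDER"

print(parse_license_expiry(example_key))
print(split_validation_key(example_key))
```

Within chemfp itself, chemfp.get_license_date() is the supported way to get the expiration date.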

There is no centralized license manager, and you may run chemfp on as many computers at your site as you wish, within the limits of your license agreement.

There are two new API functions:

  • chemfp.is_licensed() - return True if the license key is valid or no license key is needed, otherwise return False.
  • chemfp.get_license_date() - return the license key expiration date as a 3-element tuple in the form (year, month, day). If the license key is not found or does not pass the security check then the function returns None. If this version of chemfp does not need a license key then it returns (9999, 12, 25).

Why the change in license policy?

In 2009 or so I decided to see if I could make a living selling free software. Most people who develop open source software for chemistry get their funding from other sources. Academics might be funded from grants. A company might use an open source project for business reasons, as a way to lower overall costs. Some companies sell a proprietary product or access to a service which uses an open source component, where the income from the non-free sources funds the free software development. But I can only think of one or two cases where people tried to make a living off of the source code itself, and they were not that successful.

I had some ideas of how it might be successful, and tried them out. While I had some sales, I never made anywhere near what I would have made for the same effort as a consultant or contractor.

I also ran into some difficulties. Most software companies provide their software either free or with steep discounts to academic organizations. If I do that with the most recent version of chemfp, I take a rather large risk that some grad student will post the source on GitHub. (Pharmaceutical company employees are much less likely to do that.)

I charge a lot of money for chemfp, because the few people who need high performance similarity search are willing to pay for it. Potential customers want to try it out. Since I either control the copyright or use components which allow proprietary use, I was able to make a non-disclosure agreement for the evaluation period. Had I been using GPL-based components, and thus restricted to a free software license, that would have been impossible.

I could continue to work at it trying to make a living selling free software, but after 9 years of trying I decided it’s time to switch to a more standard proprietary licensing scheme.

The chemfp 1.x line will still be available at no cost under the MIT license.

AVX2 popcount enabled by default

AVX2 compilation is now enabled by default. It was disabled in earlier releases because the AVX2 command-line flag was used to compile every file and I was worried that it might result in a binary which couldn’t be used by older hardware. For this release I figured out how to use the -mssse3 and -mavx2 flags only for the relevant popcount calculations.

At run-time chemfp will detect which CPU-specific features are available and only use the SSSE3 or AVX2 implementations when appropriate.

What’s new in version 3.2

Released 19 March 2018

This version mostly contains bug fixes and internal improvements. The biggest additions are support for Dave Cosgrove’s ‘flush’ fingerprint file format, and support for ‘fromAtoms’ in some of the RDKit fingerprints.

The configuration has changed to use setuptools.

Previously the command-line programs were installed as small scripts. Now they are created and installed using the “console_scripts” entry_point as part of the install process. This is more in line with the modern way of installing command-line tools for Python.

If these scripts are no longer installed correctly, please let me know.

If you have installed the chemfp_converters package then chemfp will use it to read and write fingerprint files in flush format. It can be used as output from the *2fps programs, as input and output to fpcat, and as query input to simsearch.

Added “fromAtoms” support for the RDKit hash, torsion, Morgan, and pair fingerprints. This is primarily useful if you want to generate the circular environment around specific atoms of a single molecule, and you know the atom indices. If you pass in multiple molecules then the same indices will be used for all of them. Out-of-range values are ignored.

The command-line option is --from-atoms, which takes a comma-separated list of non-negative integer atom indices. For example:

--from-atoms 0
--from-atoms 29,30

The corresponding fingerprint type strings have also been updated. If fromAtoms is specified then the string fromAtoms=i,j,k,… is added to the string. If it is not specified then the fromAtoms term is not present, in order to maintain compatibility with older type strings. (The philosophy is that two fingerprint types are equivalent if and only if their type strings are equivalent.)

The --from-atoms option is only useful when there’s a single query and when you have some other mechanism to determine which subset of the atoms to use. For example, you might parse a SMILES, use a SMARTS pattern to find the subset, get the indices of the SMARTS match, and pass the SMILES and indices to rdkit2fps to generate the fingerprint for that substructure.

Be aware that the union of the fingerprint for --from-atoms X and the fingerprint for --from-atoms Y might not be equal to the fingerprint for --from-atoms X,Y. However, if a bit is present in the union of the X and Y fingerprints then it will be present in the X,Y fingerprint.

Why? The fingerprint implementation first generates a sparse count fingerprint, then converts that to a bitstring fingerprint. The conversion is affected by the feature count. If a feature is present in both X and Y then the X,Y fingerprint may have additional bits set over the individual fingerprints.
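A toy model makes the effect easier to see. The sketch below is not RDKit's algorithm. It assumes a made-up conversion where a feature seen n times sets one bit for each count level from 1 up to a small maximum, and the feature names are invented for the example.

```python
from collections import Counter


def toy_bits(feature_counts, max_count=2):
    """Toy count-to-bit conversion: set bit (feature, n) for n = 1..min(count, max_count)."""
    bits = set()
    for feature, count in feature_counts.items():
        for n in range(1, min(count, max_count) + 1):
            bits.add((feature, n))
    return bits


# Hypothetical features found around atom X and around atom Y.
features_x = Counter({"ring-carbon": 1})
features_y = Counter({"ring-carbon": 1, "hydroxyl": 1})

# With both atoms selected, the counts for shared features add up.
features_xy = features_x + features_y

bits_x = toy_bits(features_x)
bits_y = toy_bits(features_y)
bits_xy = toy_bits(features_xy)
union = bits_x | bits_y

print(union <= bits_xy)          # every bit in the union is also in the X,Y fingerprint
print(sorted(bits_xy - union))   # extra bits caused by the higher feature count
```

In this toy model the shared feature is counted twice in the combined fingerprint, which sets a second bit that neither individual fingerprint has.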

Bug fixes

Fixed a bug in FPB identifier index lookup. When the id’s hash didn’t exist, it got stuck in an infinite loop. There is a special token to identify the end of the hash chain. Unfortunately, that token wasn’t marked as a b”byte string” during the Python 2to3 conversion, so that token was never found, causing the code to loop over the chain forever. It is now a byte string, and a check was added to prevent infinite loops.
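The underlying Python 3 behavior is easy to reproduce: a byte string never compares equal to a text string, so a sentinel of the wrong type is never matched. The sketch below uses a made-up circular table, not the actual FPB hash layout, and adds the same kind of loop guard.

```python
def chain_length(entries, sentinel, max_steps=1000):
    """Count the entries before the sentinel in a circular table.

    The max_steps guard turns a would-be infinite loop into an error.
    """
    for step in range(max_steps):
        if entries[step % len(entries)] == sentinel:
            return step
    raise RuntimeError("end-of-chain token not found")


# Hypothetical table contents; the real FPB index is laid out differently.
entries = [b"id1", b"id2", b"\x00"]

print(chain_length(entries, b"\x00"))  # byte-string sentinel: found

try:
    chain_length(entries, "\x00")      # text-string sentinel: never matches under Python 3
except RuntimeError as err:
    print("guard triggered:", err)
```

Under Python 2 the two sentinels would compare equal, which is why this kind of bug only appears after a 2to3 conversion.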

Fixed a bug where a k=0 similarity search using an FPS file as the targets caused a segfault. The code assumed that k would be at least 1. If you do a k=0 search, it will currently read the entire file, checking for format errors, and return no hits.

Chemfp no longer generates Python warnings. That is, the regression tests all pass under “python -W error unit2 discover”. The biggest problem was the ResourceWarning from all of the files which were never explicitly closed. They used to depend on the garbage collector to close the file but are now closed either through a context manager or with close(). In addition, several strings contained invalid escape sequences and some regression tests used deprecated APIs.

The context manager and close() method for the FPBFingerprintArena now close the underlying file object/mmap rather than depend on the garbage collector.

The readers and writers which are wrappers to an iterator which may hold a file object, and where the file object was created by chemfp, now know to close() the wrapped iterator when processing is over.

Added a check that the arena passed to the threshold and count symmetric arena searches has a popcount index. Previously, unordered arenas caused the code to segfault.

What’s new in version 3.1

Released 17 September 2017

The new specialized POPCNT implementation for PubChem/CACTVS keys increases search performance for that case by about 15%.

The SearchResults object gained the to_csr() method and the shape attribute. The new method returns a SciPy compressed sparse row matrix containing the similarity scores, which can be passed into scikit-learn for clustering.

The fall 2017 release of OEChem will accept InChI strings as structure input. The chemfp wrapper now knows about this, as well as the two new InChI output flavors “RelativeStereo” and “RacemicStereo”.

The fall 2017 release of RDKit will fix a bug in the pattern fingerprint definitions. The new chemfp fingerprint type is RDKit-Pattern/4.

Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty identifiers. Previously the default --errors setting of “ignore” simply skipped those records, without any warning messages. This caused problems processing the ChEBI SD file. Most of the records have an empty title line, so only a few fingerprint records were generated. It wasn’t obvious that the resulting data set was useless. The new code always reports a warning for empty or missing identifiers, even with “ignore”. If the --errors is “strict” then the warning becomes an error and processing stops.

Updated the #software line to include “chemfp/3.1” in addition to the toolkit information. This helps distinguish between, say, two different programs which generate RDKit Morgan fingerprints. It’s also possible that a chemfp bug can affect the fingerprint output, so the extra term makes it easier to identify a bad dataset.

There are several small fixes related to memory leaks, the bytes/Unicode distinction in Python 3, error messages, and error handling.

Removed chemfp.progressbar and chemfp.futures. These were included in chemfp 1.1 because I used them in a project for one customer and thought they might be useful in future chemfp projects. They were not. Also removed chemfp.argparse because chemfp 3.0 dropped support for Python 2.6.

What’s new in version 3.0.1

Released 28 August 2017

This is a bug-fix release. This fixes a critical bug in the general-purpose POPCNT popcount implementation and a bug in the code to detect the RDKit Pattern fingerprint change in 2017.3.

See the CHANGELOG for details.

What’s new in version 3.0

Released 2 May 2017

Chemfp now supports both Python 2.7 and Python 3.5 or later. It no longer supports versions before Python 2.7. Chemfp will support Python 2.7 at least until 2020, which is the end-of-life for Python 2.7.

This required extensive changes to distinguish between text/Unicode strings and byte strings. The biggest user-facing change is that identifiers are now treated as Unicode strings. Fingerprints are still treated as byte strings.

This change is not backwards compatible. The API’s function parameters are polymorphic, so in most cases you can pass in either a Unicode string or a UTF-8 encoded byte string. However, the return type for an identifier is Unicode, which will likely cause problems with existing code which expects bytes.
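The pattern, and the way it can break older code, can be sketched as follows. The helper name is hypothetical and is not part of the chemfp API; it only mimics the "accept either, return Unicode" behavior described above.

```python
def to_unicode_id(identifier):
    """Accept a Unicode string or UTF-8 encoded bytes; always return a Unicode string."""
    if isinstance(identifier, bytes):
        return identifier.decode("utf8")
    return identifier


# Both input types are accepted and give the same Unicode result.
print(to_unicode_id("CHEMBL25") == to_unicode_id(b"CHEMBL25"))

# Existing code which compares the returned identifier against bytes
# silently stops matching under Python 3.
print(to_unicode_id(b"CHEMBL25") == b"CHEMBL25")
```

Code which stores identifiers as dictionary keys or compares them against values read in binary mode is the most likely place to notice the change.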

All of the chemistry toolkits have decided to treat files as UTF-8 encoded. Chemfp’s “text toolkit” offers limited support for reading Latin-1 encoded files. This is a tricky topic so contact me if you have questions or problems.

I have removed the “make_string_creator()” function because it was hard to explain, hard to maintain, and had little performance improvement over passing in the arguments to chemfp.create_string(). This will break compatibility, but then again, I don’t think anyone used it. If it is a problem, I suggest creating a function, as in the following:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
>>> def make_string(mol):
...   return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
>>> make_string(mol)

If you look carefully at the previous example, you’ll see the other major backwards incompatibility. The function chemfp.create_string() now returns a Unicode string instead of a byte string. This also means its format parameter no longer accepts the “.zlib” or “.gzip” extensions.

Instead, to get the old behavior use the new API function chemfp.create_bytes():

>>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
>>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})

There’s a similar change between chemfp.open_molecule_writer_to_string() and the new function chemfp.open_molecule_writer_to_bytes().

There are also some new features in version 3.0 which don’t break compatibility.

Similarity search is faster because there are now specialized popcount implementations based on the fingerprint length. On one benchmark, 166-bit searches are 35% faster, 1024-bit searches are 25% faster, and 2048-bit searches are 5% faster.

There is a new popcount implementation for processors with the AVX2 instruction set. It is about 15% faster than the POPCNT version for 2048 bit fingerprints. To test it out you will have to compile chemfp with --with-avx2 enabled.

Added support for the Avalon fingerprints in RDKit, if RDKit has been compiled with Avalon support.

What’s new in version 2.1

Released 2 July 2015

Version 2.1 adds Tversky support for every place there was Tanimoto search (except the handful of deprecated APIs). There are new search routines for FPS and arena searches, including OpenMP support, and new bitops functions to compute a Tversky index between two fingerprints.
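For readers unfamiliar with the measure, the Tversky index of fingerprints A and B is |A∩B| / (|A∩B| + α|A−B| + β|B−A|), and α = β = 1 gives the Tanimoto. The pure-Python sketch below only illustrates the formula on byte-string fingerprints. It is far slower than chemfp's compiled bitops functions, the function names are not chemfp's, and returning 0.0 for a zero denominator is an assumption.

```python
def popcount(fp):
    """Number of on-bits in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Tversky index between two equal-length byte-string fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    only1 = popcount(fp1) - both   # bits set only in fp1
    only2 = popcount(fp2) - both   # bits set only in fp2
    denominator = both + alpha * only1 + beta * only2
    if denominator == 0:
        return 0.0                 # assumption: define the 0/0 case as 0.0
    return both / denominator


fp1 = bytes([0b00001111])
fp2 = bytes([0b00111100])
print(tversky(fp1, fp2))             # alpha = beta = 1 is the Tanimoto
print(tversky(fp1, fp2, 1.0, 0.0))   # asymmetric: ignores bits found only in fp2
```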

The k-nearest arena searches now support OpenMP. Previously they were single threaded even though the other search functions supported multiple threads.

The built-in SDF parser saw a couple of improvements, including support for both “\n” and “\r\n” newlines, instead of only “\n” newlines.

There were a number of bug fixes that concern edge cases. For example, some 64-bit double calculations could be off-by-one in the last digit, and fingerprints with 0 bits set could cause a few problems.

What’s new in version 2.0

Released 8 April 2015

Version 2.0 includes many new features designed for web service development. The new “FPB” binary fingerprint file format is very fast to load, which is great for web server reloading during development and on the command-line. The speed comes from using a memory-mapped file, which also means that multiple chemfp instances can use the same file on the same machine without extra memory overhead.

The most extensive improvement is the new portable API for working with structure files and fingerprint types. The moment you start working with multiple chemistry toolkits, you realize that they all have different ways to read and write molecules, and to generate fingerprints from a molecule. Chemfp tries hard to have a consistent API for these common tasks, without sacrificing performance, so you can get your work done. For example, with the new API it’s easy to take an SD record as an input string, compute the MACCS fingerprints for each available toolkit, add the results as new SD tags, and return the updated record.

This sounds so easy, doesn’t it? It took about a year to develop. The API is quite extensive, and includes the ability to pass toolkit-specific options to the underlying parsers, a low-level SDF parser that can be used to index a file, a way to get a list of available formats and fingerprint types, and methods to parse fingerprint arguments from strings.

New with version 2.0 is the ability to handle PubChem-sized data. Previous versions used 32 bit indexing and had a limit of 4GB, which is enough for 33M 1024-bit fingerprints, but PubChem has about twice that many structures.
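The 33M figure follows from simple arithmetic, shown here as a short check. It counts only the raw fingerprint bytes and ignores any per-record overhead.

```python
INDEX_LIMIT_BYTES = 2**32            # 4 GB addressable with 32-bit indexing
BYTES_PER_FINGERPRINT = 1024 // 8    # a 1024-bit fingerprint takes 128 bytes

max_fingerprints = INDEX_LIMIT_BYTES // BYTES_PER_FINGERPRINT
print(max_fingerprints)              # 33554432, or about 33 million
```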

There are also a lot of improvements, bug fixes, and performance tweaks. For example, the FPS reader is now almost twice as fast! For details, see the CHANGELOG file of the release.


The chemfp code base is solid and in use at many companies, some of whom have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there’s plenty left to do in the future. Here’s a mixture of things that are likely and things which are possibilities. Of course funding and feedback would help prioritize things. Let me know if you need something like one of these.

Right now you’re limited to the built-in toolkit fingerprint types, plus chemfp’s own SMARTS-based fingerprints. There should be a registration system so you can tell chemfp about user-defined fingerprint types.

I would like some way to select fingerprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.

Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful other alternatives.

Chemfp does not currently support Microsoft Windows because the code assumes the LP64 model, where “int” is 32 bits and “long” is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where “int” and “long” are 32 bits and “long long” is 64 bits. Once that’s done, I’ll have to figure out how to make an installer. I’ve decided to put it off until someone (or someones) funds it.

The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (i.e., with a minimum threshold of 0.0) of a large arena, the time and space needed to maintain the sparse data structure become noticeable. It’s likely in that case that you want to store the scores in a 2D NumPy matrix.
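As a reminder of what compressed sparse rows look like, here is a minimal pure-Python sketch. It follows the usual three-array CSR layout and is not a description of chemfp's internal data structure; the function name and the example hits are made up.

```python
def hits_to_csr(rows):
    """Pack per-query hit lists of (column, score) pairs into CSR arrays."""
    indptr = [0]    # indptr[i]:indptr[i+1] is the slice holding row i's hits
    indices = []    # target (column) index of each hit
    data = []       # similarity score of each hit
    for hits in rows:
        for column, score in hits:
            indices.append(column)
            data.append(score)
        indptr.append(len(indices))
    return indptr, indices, data


# Three queries: one hit, no hits, two hits.
rows = [[(1, 0.8)], [], [(0, 0.8), (2, 0.5)]]
indptr, indices, data = hits_to_csr(rows)
print(indptr, indices, data)
```

With a threshold of 0.0 every row holds every column, so the indices array repeats information that a dense N×N matrix gets for free from position alone.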

I’m really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.

If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That’s possible, in that the software was designed with that in mind, but it needs more development and testing.

Chemfp has supported OpenMP since version 1.1. That’s great for shared-memory machines. Are you interested in supporting a distributed computing version?

There are any number of higher-level tools which can be built on the chemfp components. For example, what about a wsgi component which implements a web-based search API for your local network? Wouldn’t it be nice to say:

fpserver filename1.fpb

and have a simple search service?

What about an IPython visualization tool?

There’s a paper (doi:10.1093/bioinformatics/btq067) on using locality-sensitive hashing to find highly similar fingerprints. Are there cases where it’s more useful than chemfp’s direct search?

Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That’s not saying I wouldn’t be interested in a GPU implementation, only describing why it’s not at the top of the list.


In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morley, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, and Sereina Riniker.

Thanks also to my wife, Sara Marie, for her many years of support.
