What’s New / CHANGELOG

What’s new in 3.4.1 (27 August 2020)

The main changes in this release improve support for “non-standard” fingerprint lengths. Previous releases had special support for the most common fingerprint lengths in cheminformatics: 166-bit (24-byte), 512-bit (64-byte), 881-bit (112-byte), 1024-bit (128-byte), and 2048-bit (256-byte) fingerprints. This release extends that special support to a wider set of lengths.
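The byte counts in parentheses are storage sizes, not the minimum number of bytes needed to hold the bits. They are all consistent with rounding up to a multiple of 8 bytes. That rounding rule is inferred from the listed values rather than stated here, and the function name below is made up for the illustration:

```python
def storage_bytes(num_bits, alignment=8):
    """Bytes used to store num_bits, rounded up to a multiple of `alignment`.

    The 8-byte alignment is an assumption inferred from the sizes listed above.
    """
    min_bytes = (num_bits + 7) // 8                 # smallest whole number of bytes
    return -(-min_bytes // alignment) * alignment   # round up to the alignment


for bits in (166, 512, 881, 1024, 2048):
    print(bits, "bits ->", storage_bytes(bits), "bytes")
```

For example, 166 bits need 21 bytes, which rounds up to the 24 bytes given above.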

New POPCNT implementations

Added specialized POPCNT implementations for all 8-byte-multiple fingerprint lengths up to 1024 bytes, plus faster implementations for 8-byte and 32-byte multiple lengths beyond that.

The new specialized POPCNT algorithms are 10-30% faster than chemfp 3.4 for tiny fingerprints (<512 bits) and 0-10% faster for larger fingerprints. For tiny fingerprints the new algorithms are about 20% faster than chemfp 1.6.1. For larger fingerprints, 3.4.1 is about 5% faster than 1.6.1.

New AVX2 implementations

Added a specialized AVX2 implementation for 1024 bits. This is only about 0.5% faster than the version in 3.4, but it is meant to avoid the slight overhead from the bug fix described below.

Added specialized AVX2/POPCNT hybrid implementations for 160, 192, and 224 bytes (1280, 1536, and 1792 bits). These are about 33%, 25%, and 20% faster than the POPCNT equivalents. Note that the 2048-bit AVX2 search performance is about the same as the 1536-bit performance, so if all you care about is performance then you should never use a length between 160 and 256 bytes.

New FingerprintArena methods

Added two new FingerprintArena methods. FingerprintArena.sample() randomly selects a subset of the fingerprints and returns them in a new arena. FingerprintArena.train_test_split() returns two randomly selected and disjoint subsets of the arena, typically used as a training set and a test set.
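To make “randomly selected and disjoint” concrete, here is a minimal sketch that works on fingerprint indices only. It does not use the chemfp API, and the function names, parameters, and seed handling are hypothetical; the real methods return arenas and may take different arguments.

```python
import random


def sample_indices(num_fingerprints, num_samples, seed=None):
    """Pick num_samples distinct fingerprint indices at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_fingerprints), num_samples))


def train_test_split_indices(num_fingerprints, train_size, test_size, seed=None):
    """Pick two disjoint random index sets from one sample without replacement."""
    rng = random.Random(seed)
    chosen = rng.sample(range(num_fingerprints), train_size + test_size)
    # Slicing one sample without replacement guarantees the two sets are disjoint.
    return sorted(chosen[:train_size]), sorted(chosen[train_size:])


train, test = train_test_split_indices(1000, train_size=80, test_size=20, seed=42)
print(len(train), len(test), set(train).isdisjoint(test))
```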

Bug fixes

There are also two bug fixes:

  • Fixed bug in fpcat where using --reorder would write the FPS header twice.
  • Fixed bug in AVX2 implementation when the storage size was a multiple of 1024 bits but the fingerprint size was smaller.

What’s new in 3.4 (24 June 2020)

This is a summary of the changes since chemfp 3.3. For more details, see the individual intermediate changelog entries below.

J. Cheminf. publication

There is a two-year gap between the 3.3 and 3.4 releases. More than six months of that time went to writing the paper “The chemfp project” for the Journal of Cheminformatics, which covers all of the major aspects of chemfp.

Towards the end of writing the paper I realized there was an improvement to the basic search algorithm. The naive way to test a Tanimoto score against a threshold requires a floating-point division. I had replaced that with a faster comparison using integer multiplication. The newest version replaces that with a simple comparison of the intersection popcount to an expected minimum value.

This increases the MACCS search performance by roughly 15%. For larger fingerprint lengths the improvement is only a few percent at best, which is expected as chemfp is mostly memory bandwidth bound, not CPU bound.
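The three forms of the threshold test can be compared with a small sketch. This is not chemfp’s C implementation. It assumes the threshold is an exact rational number, that a and b are the popcounts of the two fingerprints, that c is the popcount of their intersection, and that at least one fingerprint has a bit set. All function names are made up for the illustration.

```python
from fractions import Fraction
import math


def passes_by_division(a, b, c, threshold):
    """Naive test: compute the Tanimoto score, then compare it to the threshold."""
    return Fraction(c, a + b - c) >= threshold


def passes_by_multiplication(a, b, c, threshold):
    """Division-free test: c * denominator >= numerator * (a + b - c)."""
    return c * threshold.denominator >= threshold.numerator * (a + b - c)


def min_intersection(a, b, threshold):
    """Smallest intersection popcount that can reach the threshold.

    c / (a + b - c) >= t   is equivalent to   c >= t * (a + b) / (1 + t)
    """
    return math.ceil(threshold * (a + b) / (1 + threshold))


def passes_by_min_count(a, b, c, threshold):
    """Single integer comparison against a value that depends only on a + b."""
    return c >= min_intersection(a, b, threshold)


t = Fraction(7, 10)
print(min_intersection(40, 45, t))         # minimum shared bits needed
print(passes_by_min_count(40, 45, 35, t))  # agrees with the two tests above
```

Because the minimum count depends only on the threshold and the two popcounts, it can be computed once for each pair of popcounts instead of once for each pair of fingerprints.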

New licensing options

Pre-compiled chemfp distributions for Linux-based operating systems are now available at no cost under the “Chemfp Base License Agreement”. Most of the chemfp features are available for internal use, except that:

  • fingerprint arenas may not be larger than 50,000 fingerprints;
  • in-memory arena searches may not have more than 50,000 queries or targets;
  • FPS searches may not have more than 20 queries;
  • Tversky search is disabled;
  • writing FPB files is disabled.

These features can be enabled with a valid license key, set via the environment variable CHEMFP_LICENSE. Email sales@dalkescientific.com to request an evaluation license or to purchase a license. Source distributions are also available.

To download the pre-compiled package for “manylinux” use:

python -m pip install chemfp -i https://chemfp.com/packages/

See LICENSE from the distribution or https://chemfp.com/BaseLicense.txt for full details.

Chemistry toolkit changes

RDKit: Added support for the “SECFP” SMILES-based circular fingerprints from the Reymond group. Added RDKit-Fingerprint branchedPaths and useBondOrder options. Added RDKit-Morgan includeRedundantEnvironments option. Added RDKit-AtomPair nBitsPerEntry, includeChirality, and use2D options. Added RDKit-Torsion nBitsPerEntry and includeChirality options.

RDKit (continued): New SMILES output option cxsmiles to include extra annotations. New SDF input option includeTags to disable importing SD tags. New SDF output option v3k to always use v3000 format. Added support for RDKit’s Mol2, PDB, Maestro, XYZ, HELM, and FASTA parsers. Added a new “sequence” format to handle just the 1D sequence string.

Open Babel: Added support for 3.0. Added support for ECFP fingerprints, with family names: OpenBabel-ECFP0, OpenBabel-ECFP2, OpenBabel-ECFP4, OpenBabel-ECFP6, OpenBabel-ECFP8, OpenBabel-ECFP10. Open Babel 3.0 includes new formats, which were automatically supported by chemfp.

OpenEye: Added support for OEChem’s OEZ, CIF, mmCIF, PDB, FASTA, and CSV parsers. Added a new “sequence” format to handle just the 1D sequence string. Added experimental support for substructure screens, with the family names OpenEye-MoleculeScreen, OpenEye-MDLScreen, and OpenEye-SMARTSScreen.

Tool changes

Simsearch now accepts a structure input, either as a command-line argument or from a filename. It will use the fingerprint type from the target data set or a user-specified fingerprint type to convert the structures into fingerprints.

Added a --help-formats option to rdkit2fps, ob2fps and oe2fps which shows all available input formats and their reader options.

I/O changes

Added support for Zstandard compression everywhere that gzip compression is supported. Use the filename or format extension “.zst” to indicate that compression type. Chemfp’s RDKit toolkit adapter also supports Zstandard, but the Open Babel and OpenEye adapters do not.

Note: Zstandard compression requires the third-party “zstandard” Python package be installed.

Improved the gzip reader performance by about 15%. Improved the FPS reader by about 20%. Overall, sdf2fps is about 10% faster extracting PubChem fingerprints from the PubChem sdf.gz files.

Improved FPB output performance by about 10% by using a C extension.

Chemfp now supports reading compressed FPB files, and reading FPB files from stdin. These are read entirely into memory before use as they cannot be memory-mapped. This was a feature request from a customer who stored large fingerprint files on a network-based filesystem. It was faster to read a compressed file and decompress into memory than it was to memory-map and use the contents of an uncompressed file.

What’s new in 3.4b3 (18 June 2020)

  • Changed --list-formats to --help-formats.
  • Updated oe2fps --help-formats.
  • Fixed a bug in several OEChem create_string() and create_bytes() implementations where a non-None ‘id’ changed the molecule title.
  • Finished updating the documentation.

What’s new in 3.4b2 (12 June 2020)

  • Changed the licensing model to let people use chemfp without a valid license key, with restrictions:
    • fingerprint arenas may not be larger than 50,000 fingerprints,
    • arena searches may not have more than 50,000 queries or targets,
    • FPS searches may not have more than 20 queries,
    • Tversky searches are disabled, and
    • writing FPB files is disabled.
  • Added “includeTags” option for the RDKit toolkit SDF reader. The default of True parses the SD tag data. This isn’t needed if you just want to generate fingerprints. rdkit2fps sets includeTags=False by default, for a ~5% speedup in parsing a PubChem file.
  • Added Zstandard input and output options to sdf2fps.
  • Fixed a couple of bugs in the new gzio module. Better code to handle finding libz, and support for different response codes for older versions of libz.
  • Added a compression --level option to fpcat.
  • Support OEChem 2.3 from 2019.Oct.
  • Added support for OEChem formats OEZ, CIF, mmCIF, PDB, FASTA, and CSV. Also implemented a “sequence” format based on the FASTA reader.
  • Added experimental support for OpenEye’s fingerprint-like screens. The new fingerprint family names are OpenEye-MoleculeScreen, OpenEye-MDLScreen, and OpenEye-SMARTSScreen. The functions are type-based: QMols produce query screens and “regular” molecules produce target screens.
  • get_fingerprint_families() now supports an optional “toolkit_name” parameter which loads and returns only the fingerprint families for the specified toolkit.
  • BUG FIX: some OpenEye toolkit writers, when passed a new identifier, called SetTitle(new_id) on the molecule before writing, but did not call SetTitle(old_id) to restore the original id.
  • BUG FIX: the code did not check for fingerprint generation failures when using OEChem/OEGraphSim. Fixed the code so that an empty molecule returns an empty fingerprint, instead of reusing whatever the previous fingerprint was.
  • Fixed a number of issues identified by PyFlakes, including some bugs, mostly related to error conditions which weren’t tested.

What’s new in 3.4b1 (24 April 2020)

  • Support Open Babel 3.0.
  • Support Open Babel ECFP fingerprints. Requires Open Babel 3.0 or later. Use --nBits to specify a size other than the default of 4096 bits. (Must be a power of 2, and at least 32.)
  • Support RDKit parsers for FASTA, sequence, HELM, Mol2, PDB, Maestro and XYZ formats.
  • RDKit SMILES writers now support the “cxsmiles” boolean flag to generate CXSMILES strings. RDKit SDF and Molfile writers support “v3k” boolean flag to always generate V3000 records.
  • Added support for RDKit SECFP fingerprints, developed by the Reymond group. These are circular fingerprints similar to ECFP fingerprints except they use canonical fragment SMILES for the circular substructures to generate hash values.
  • Added support for additional RDKit fingerprint parameters:
    • RDKit-Fingerprint: branchedPaths and useBondOrder
    • RDKit-Morgan: includeRedundantEnvironments
    • RDKit-AtomPair: nBitsPerEntry, includeChirality, and use2D
    • RDKit-Torsion: nBitsPerEntry, includeChirality
  • Added support for Zstandard compression everywhere gzip is supported, except for the Open Babel and OpenEye toolkits, where the native toolkits do not support Zstandard and do not accept a Python file object.
  • Sped up FPB generation for ChEMBL by about 9% by rewriting several parts of the FPID block writer code in C.
  • Faster gzip read performance when reading from stdin or a named file. The new module calls zlib functions directly, which gives 15-25% improved performance. If you have problems with the new gzip reader, you can disable it by setting the environment variable CHEMFP_USE_SYSTEM_GZIP to 1.
  • For even faster gzip read performance, chemfp can use an external program to decompress stdin or a named file. If the environment variable CHEMFP_GZCAT is set then chemfp will interpret it as command-line arguments to use in a subprocess. This may be zcat, gzcat, gzip -dc, or pigz -dc. (NOTE: this variable was named CHEMFP_GZCAT_BINARY in the a4 release.)

In one test of simsearch, using 1.7M 2048-bit RDKit Morgan fingerprints from ChEMBL 23, measuring wall-clock time:

  • a search of the uncompressed file took 1.45 seconds
  • CHEMFP_GZCAT=gzcat took 2.16 seconds (3.07 seconds of total user time)
  • the new gzip reader took 3.65 seconds
  • CHEMFP_USE_SYSTEM_GZIP=1 took 4.36 seconds

Note that part of the speedup is because gzcat runs in another process, so it can take better advantage of multicore hardware. (That is, I measured wall-clock time on a multicore machine, not overall CPU time.)

  • Improved the error handling when chemfp uses an external program to decompress a gzip’ed file. NOTE: IT IS NOT FOOLPROOF! Chemfp waits 0.01 seconds to see if gzcat has exited unexpectedly, which might happen if the file does not exist or cannot be read. However, there is a chance that gzcat may take longer to report an error. In addition, chemfp does not detect if gzip exited early because the file was corrupt or incomplete.
  • Added a --list-formats option to oe2fps, rdkit2fps, and ob2fps, which gives more detailed information about the supported input structure file formats and their options.
  • No longer including or using a copy of unittest2, which was needed for Python 2.6 support.
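The external-decompressor items above (the CHEMFP_GZCAT variable and the 0.01 second startup check) can be pictured with the following sketch. It is an assumption about the general approach, not chemfp’s actual code: the function name and its parameters are hypothetical, and only standard-library calls are used.

```python
import os
import shlex
import subprocess
import time


def start_external_decompressor(filename, env_var="CHEMFP_GZCAT", startup_wait=0.01):
    """Run the command named in `env_var` on `filename`; return the Popen object.

    The decompressed bytes are available from the returned object's .stdout.
    Raises OSError if the program has already failed after `startup_wait` seconds.
    """
    # Interpret the variable as command-line arguments, e.g. "gzip -dc".
    args = shlex.split(os.environ[env_var]) + [filename]
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)

    # Give the program a moment to fail on a missing or unreadable file.
    time.sleep(startup_wait)
    returncode = proc.poll()  # None means the program is still running
    if returncode:
        raise OSError(f"{args[0]} exited early with status {returncode}")
    # A slow failure, or a corrupt file detected later, is NOT caught here.
    return proc
```

The short wait is a trade-off: a longer wait catches more startup failures but delays every successful read, which is why the check cannot be foolproof.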

What’s new in 3.4a4 (18 March 2020)

  • simsearch accepts a structure file as query input. Use --in or --query-format to specify the format type, or let chemfp try to figure it out from the filename extension.

If the fingerprint type is not specified with --query-type then the target file metadata must specify the type.

The --id-tags, --delimiter, --has-header, -R and --errors options from the *2fps programs are also supported.

  • The OEChem SMILES and InChI readers now support the has_header reader_arg to skip the first line of the file. Use --has-header to enable that feature in oe2fps.
  • FPB files may now be read from stdin, and fpb.gz files are supported. Unlike regular FPB files, which are memory-mapped, the contents of these files are read into memory before use. The main use case for fpb.gz files is to reduce network I/O if the files are on a remote disk.
  • Changed the FPS reader block size from around 11K to 100K, giving a 20% boost in read performance and 10% boost in fpcat performance. The smaller block size was chosen 10 years ago, on much less powerful hardware.
  • Experimental support for zstd compression, based on the filename ending with either .fps.zst or .fpb.zst. This depends on the third-party “zstandard” package. My experience is that piping gzip output to chemfp is faster than letting chemfp use Python’s built-in gzip reader or using zstandard.
  • Experimental support to use an external binary to decompress a gzip file. Set “CHEMFP_GZCAT_BINARY” to “gzcat” or “zcat” or whatever program you use to read a gzip-compressed file (passed on the command-line) and write the uncompressed contents to stdout. My timings show using an external program is 25% faster than using Python’s built-in gzip module.

What’s new in version 3.4a2

Released 7 June 2019

Performance improvements for Tanimoto search. Older versions used a fast rejection test based on a rational approximation to the threshold. It required two multiplications for each test. The new implementation uses an exact test based on the minimum required intersection count, with only one comparison per test.

The chemfp benchmark suggests timing improvements like:

  • 10-20% faster for 166 bits (POPCNT)
  • 1-10% faster for 881 bits (POPCNT)
  • 2-7% faster for 1024 bits (POPCNT)
  • 0-9% faster for 1024 bits (AVX2)
  • 0-2% faster for 2048 bits (POPCNT)
  • 0-10% faster for 2048 bits (AVX2)

These numbers will be firmed up for the 3.4 release.

Improved error handling for oe2fps, ob2fps, and rdkit2fps when the underlying toolkit is not installed.

BUG FIX: Fixed several errors related to storing 4GB or more of record identifier strings. This can occur if your id contains both the id and the SMILES or other large data, or if you have many fingerprints each with a large id (eg, an IUPAC name). The FPB format has a design limit of about 250M records, corresponding to 17.2 characters per id before the old code would break.
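The 17.2 figure follows from dividing the 4GB limit by the record limit. The short calculation below assumes “4GB” means 2**32 bytes, which is the reading that reproduces the stated number:

```python
ID_BYTES_LIMIT = 2**32        # assumed meaning of "4GB" of identifier text
MAX_RECORDS = 250_000_000     # approximate FPB design limit from the text

average_id_length = ID_BYTES_LIMIT / MAX_RECORDS
print(f"{average_id_length:.1f} characters per id")  # prints "17.2 characters per id"
```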

BUG FIX: the Avalon fingerprint type is now registered. Previously it worked only if one of the other RDKit fingerprint types was used first.

BUG FIX: the simsearch metadata now uses #query_source and #target_source instead of #query_sources and #target_sources.

BUG FIX: Fixed bug which prevented reading FPS files using the Windows newline convention.

BUG FIX: Fixed segfault when hex_to_bitlist or hex_contains were called with the wrong number of arguments.

BUG FIX: simsearch --query incorrectly included a #query_sources in the output, as a duplicate of #target_sources. Now it correctly omits #query_sources.

What’s new in version 3.4a1

Released 6 November 2018

Added the arena methods to_numpy_array() and to_numpy_bitarray(). The first returns a NumPy array view of the underlying fingerprint data, as uint8 values, including pad bytes. This array makes it easier for other programs to work directly with the chemfp fingerprint data. The second creates a new NumPy array with one uint8 byte per fingerprint bit. The default returns all bits, or you can specify which bit columns to use. This function makes it easy to use fingerprint bits as descriptors for clustering or other predictive algorithms.

Added the fingerprints attribute to the FingerprintArena class. It gives list-like access to the fingerprints. For example, it can be used to iterate over the fingerprints.

BUG FIX: count_all() now uses a 64-bit integer. Previously it used a signed 32-bit integer, which could overflow for large results.

BUG FIX: removed a memory leak in symmetric threshold searches.

BUG FIX: Calling the Tversky threshold arena search with the Tanimoto values alpha=beta=1.0 now calls the (more optimized) Tanimoto arena search. Previously it called the Tanimoto arena search and then did the general Tversky search, taking over twice as long to give the same results.

BUG FIX: The knearest Tversky symmetric arena search did not release Python references if there was an allocation failure during the search. Now fixed.

BUG FIX: The FPS fingerprint writer didn’t verify that the fingerprint length matched the number of bytes in the metadata. Fixed, and normalized the length change error message across the writers.

BUG FIX: Version 3.3 broke support for compiling with --no-openmp. Fixed.

What’s new in version 3.3

Released 16 August 2018

BUG FIX: the k-nearest symmetric Tanimoto and Tversky search code contained a flaw when there was more than one fingerprint with no bits set and the threshold was 0.0. Since all of the scores will be 0.0, the code uses the first k fingerprints as the matches. However, it put all of the hits into the first search result (item 0), rather than the corresponding result for each given query. This also opened up a race condition for the OpenMP implementation, which could cause chemfp to crash.

Performance improvements for the POPCNT and AVX2-based searches. This was done by developing specialized versions of the Tanimoto and Tversky search functions for each of the POPCNT and AVX2 implementations, by initializing some of the AVX2 registers only once per search rather than once per popcount, by improving the rejection test for obvious mismatches, and by improving the alignment for AVX2 loads.

Relative to chemfp 1.5 (the latest free version of chemfp), version 3.3 is about 20-35% faster for 166-bit searches, 20-25% faster for 881-bit searches, and around 50% faster for 1024- and 2048-bit searches.

Relative to chemfp 3.2.1 (the previous version of chemfp), version 3.3 is 60% faster for 166-bit fingerprints, 15% faster for 881-bit fingerprints, 25% faster for 1024-bit fingerprints, and 15% faster for 2048-bit fingerprints.

Unindexed search (which occurs when the fingerprints are not in popcount order) now uses the fast popcount implementations rather than the generic byte-based one. The result is about 6x faster.

Changed the simsearch --times option for more fine-grained reporting. The output (sent to stderr) now looks like:

open 0.01 read 0.08 search 0.10 output 0.27 total 0.46

where ‘open’ is the time to open the file and read the metadata, ‘read’ is the time spent reading the file, ‘search’ is the time for the actual search, ‘output’ is the time to write the search results, and ‘total’ is the total time from when the file is opened to when the last output is written.
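If you want to collect these timings from a script, the line is easy to split into name/value pairs. The helper below is a hypothetical sketch whose field layout is taken only from the single example line above, so check it against real output before relying on it.

```python
def parse_times_line(line):
    """Turn 'open 0.01 read 0.08 ...' into a dict mapping stage name to seconds."""
    fields = line.split()
    # Names are at the even positions and their values at the odd positions.
    return {name: float(value) for name, value in zip(fields[0::2], fields[1::2])}


times = parse_times_line("open 0.01 read 0.08 search 0.10 output 0.27 total 0.46")
print(times["search"], times["total"])
```

In this example the four stages happen to add up to the reported total, and the output stage accounts for more than half of it.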

Added SearchResult.format_ids_and_scores_as_bytes() to improve the simsearch output performance when there are many hits. It turns out the limiting factor in that case is not the search time but output formatting. The old code used Python calls to convert each score to a double. The new code pushes that work into C. My benchmark used a k=all NxN search of ~2,000 PubChem fingerprints to generate about 4M scores. The output time went from 15.60s to 5.62s. (The search time was only 0.11s on my laptop.)

There is a new option, “report-algorithm” with the corresponding environment variable CHEMFP_REPORT_ALGORITHM. The default does nothing. Set it to “1” to have chemfp print a description of the search algorithm used, including any specialization, and the number of threads. For example:

chemfp search using threshold Tanimoto arena, index, single threaded (generic)
chemfp search using threshold Tversky arena, index, single threaded (popcnt_128_128)
chemfp search using knearest Tanimoto arena symmetric, OpenMP (popcnt_112), 8 threads

For the ‘generic’ searches, use CHEMFP_REPORT_INTERSECT=1 to see which specific popcount function is used.

There is a new option, “use-specialized-algorithms” with the corresponding environment variable CHEMFP_USE_SPECIALIZED_ALGORITHMS. The default, “1”, uses the new specialized algorithms mentioned above. Set it to “0” to have chemfp fall back to the generic algorithm. This option is primarily used for timing comparisons and may be removed in future versions of chemfp.

There is experimental multi-threaded support for single-query searches. By default it is disabled because on newer hardware it is slower than single-threaded search, and it will take time to figure out why.

The new option “num-column-threads” controls this feature. (In chemfp nomenclature, each query is a row, and the targets are columns.) By default it is 1, meaning that single-query searches are single-threaded. Change it to 2 or higher to enable the “OpenMP columns” algorithm. The number of threads used is the smaller of the number of column threads and the value of chemfp.get_num_threads().

For one benchmark, based on a threshold Tanimoto search of RDKit’s 2048-bit fingerprint, the search time on my 2011 MacBook Pro laptop using POPCNT goes from 19.7 to 16.1 seconds when I use 2 threads instead of 1. On the other hand, on a Skylake machine using AVX2 the time goes from 5.3 to 9.3 seconds.

Better error handling in simsearch so that an I/O error prints an error message and exits rather than giving a full stack trace. Testing this feature also identified bugs in the error handling code, which have been fixed.

What’s new in version 3.2.1

Released 12 April 2018

The biggest change is in the chemfp license. The commercial version is now distributed under a proprietary license instead of the MIT open source license.

There are two other minor changes. The build process now includes support for AVX2 by default, and the fingerprint writer classes have a new ‘format’ attribute which is either “fps” or “fpb”, or is None if not defined.

License key

This marks the first release of chemfp with a proprietary license.

Or rather, licenses. There is an academic license and commercial licenses in various flavors. In addition, chemfp is still available under the open source MIT license, though that option is the most expensive. The chemfp 1.x series (currently chemfp 1.5) is still available for no cost under the MIT license, and receives updates, but it only supports Python 2.7 and it does not have as many features.

Chemfp 3.2.1 is available in source code and as a pre-compiled Python package which should run under most x86 64-bit Linux-based OSes. The pre-compiled package requires a license key.

The license key is date locked. If a valid key is not found then “import chemfp” will print diagnostic messages to stderr and fingerprint search and arena generation functionality will be disabled. If you call one of the disabled functions then it will raise a NotImplementedError exception. Simsearch will not work, and neither will FPB generation.

Chemfp will look for the license key in the CHEMFP_LICENSE environment variable. For example, in bash:

export CHEMFP_LICENSE=20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB

The first 8 digits are the year, month, and day that the license expires, in GMT. In this demo example the license expired at the end of Christmas Day of 2010.

After the date comes optional configuration data including a user identifier, followed by the ‘@’, and ending with a validation key.
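The layout described above can be split apart with a few lines of Python. This sketch only separates the parts and checks that the date is a real calendar date. It does not validate the key, and the function name is hypothetical rather than part of chemfp.

```python
from datetime import date


def split_license_key(license_key):
    """Return (expiration_date, configuration_text, validation_key).

    Only the layout is checked; the validation key itself is not verified.
    """
    body, separator, validation_key = license_key.rpartition("@")
    if not separator:
        raise ValueError("license key must contain '@'")
    date_part = body[:8]
    if len(date_part) != 8 or not date_part.isdigit():
        raise ValueError("license key must start with 8 digits (YYYYMMDD)")
    # date() raises ValueError for impossible dates such as month 13.
    expires = date(int(date_part[:4]), int(date_part[4:6]), int(date_part[6:8]))
    # Whatever sits between the date and the '@' is returned unchanged.
    return expires, body[8:], validation_key


expires, config, key = split_license_key(
    "20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB")
print(expires.isoformat(), config, key)
```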

There is no centralized license manager, and you may run chemfp on as many computers at your site as you wish, within the limits of your license agreement.

There are two new API functions:

  • chemfp.is_licensed() - return True if the license key is valid or no license key is needed, otherwise return False.
  • chemfp.get_license_date() - return the license key expiration date as a 3-element tuple in the form (year, month, day). If the license key is not found or does not pass the security check then the function returns None. If this version of chemfp does not need a license key then it returns (9999, 12, 25).

Why the change in license policy?

In 2009 or so I decided to see if I could make a living selling free software. Most people who develop open source software for chemistry get their funding from other sources. Academics might be funded from grants, and a company might use an open source project for business reasons, as a way to lower overall costs. Some companies sell a proprietary product or access to a service which uses an open source component, where the income from the non-free sources funds the free software development. But I can only think of one or two cases where people tried to make a living off of the source code itself, and they were not that successful.

I had some ideas of how it might be successful, and tried them out. While I had some sales, I never made anywhere near what I would have made for the same effort as a consultant or contractor.

I also ran into some difficulties. Most software companies provide their software either free or with steep discounts to academic organizations. If I do that with the most recent version of chemfp, I take a rather large risk that some grad student will post the source on GitHub. (Pharmaceutical company employees are much less likely to do that.)

I charge a lot of money for chemfp, because the few people who need high performance similarity search are willing to pay for it. Potential customers want to try it out. Since I either control the copyright or use components which allow proprietary use, I was able to make a non-disclosure agreement for the evaluation period. Had I been using GPL-based components, and thus restricted to a free software license, that would have been impossible.

I could continue to work at it trying to make a living selling free software, but after 9 years of trying I decided it’s time to switch to a more standard proprietary licensing scheme.

The chemfp 1.x line will still be available at no cost under the MIT license.

AVX2 popcount enabled by default

AVX2 compilation is now enabled by default. It was disabled in earlier releases because the AVX2 command-line flag was used to compile every file and I was worried that it might result in a binary which couldn’t be used by older hardware. For this release I figured out how to use the -mssse3 and -mavx2 flags only for the relevant popcount calculations.

At run-time chemfp will detect which CPU-specific features are available and only use the SSSE3 or AVX2 implementations when appropriate.

What’s new in version 3.2

Released 19 March 2018

This version mostly contains bug fixes and internal improvements. The biggest additions are support for Dave Cosgrove’s ‘flush’ fingerprint file format, and support for ‘fromAtoms’ in some of the RDKit fingerprints.

The configuration has changed to use setuptools.

Previously the command-line programs were installed as small scripts. Now they are created and installed using the “console_scripts” entry_point as part of the install process. This is more in line with the modern way of installing command-line tools for Python.

If these scripts are no longer installed correctly, please let me know.

If you have installed the chemfp_converters package then chemfp will use it to read and write fingerprint files in flush format. It can be used as output from the *2fps programs, as input and output to fpcat, and as query input to simsearch.

Added “fromAtoms” support for the RDKit hash, torsion, Morgan, and pair fingerprints. This is primarily useful if you want to generate the circular environment around specific atoms of a single molecule, and you know the atom indices. If you pass in multiple molecules then the same indices will be used for all of them. Out-of-range values are ignored.

The command-line option is --from-atoms, which takes a comma-separated list of non-negative integer atom indices. For example:

--from-atoms 0
--from-atoms 29,30

The corresponding fingerprint type strings have also been updated. If fromAtoms is specified then the string fromAtoms=i,j,k,… is added to the string. If it is not specified then the fromAtoms term is not present, in order to maintain compatibility with older type strings. (The philosophy is that two fingerprint types are equivalent if and only if their type strings are equivalent.)

The --from-atoms option is only useful when there’s a single query and when you have some other mechanism to determine which subset of the atoms to use. For example, you might parse a SMILES, use a SMARTS pattern to find the subset, get the indices of the SMARTS match, and pass the SMILES and indices to rdkit2fps to generate the fingerprint for that substructure.

Be aware that the union of the fingerprint for --from-atoms X and the fingerprint for --from-atoms Y might not be equal to the fingerprint for --from-atoms X,Y. However, if a bit is present in the union of the X and Y fingerprints then it will be present in the X,Y fingerprint.

Why? The fingerprint implementation first generates a sparse count fingerprint, then converts that to a bitstring fingerprint. The conversion is affected by the feature count. If a feature is present in both X and Y then the X,Y fingerprint may have additional bits set over the individual fingerprints.
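A toy model shows the effect. It assumes that each feature sets one bit for every count threshold it reaches and that the counts for X and Y simply add when both atom sets are used. The thresholds and feature names here are hypothetical and do not reproduce RDKit’s actual hashing.

```python
from collections import Counter

COUNT_THRESHOLDS = (1, 2, 4, 8)  # hypothetical: one bit per threshold reached


def to_bits(feature_counts):
    """Convert a sparse count fingerprint into a set of (feature, level) bits."""
    return {
        (feature, level)
        for feature, count in feature_counts.items()
        for level, minimum in enumerate(COUNT_THRESHOLDS)
        if count >= minimum
    }


counts_x = Counter({"C-N": 1, "C-O": 1})   # features seen from atom set X
counts_y = Counter({"C-N": 1})             # features seen from atom set Y
counts_xy = counts_x + counts_y            # toy assumption: counts add up

union_bits = to_bits(counts_x) | to_bits(counts_y)
combined_bits = to_bits(counts_xy)

print(sorted(combined_bits - union_bits))  # bits only in the X,Y fingerprint
print(union_bits <= combined_bits)         # the union is always contained
```

Here “C-N” occurs once in X and once in Y, so its count of 2 in the combined fingerprint reaches a second threshold and sets a bit that neither individual fingerprint has.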

Bug fixes

Fixed a bug in FPB identifier index lookup. When the id’s hash didn’t exist, it got stuck in an infinite loop. There is a special token to identify the end of the hash chain. Unfortunately, that token wasn’t marked as a b”byte string” during the Python 2to3 conversion, so that token was never found, causing the code to loop over the chain forever. It is now a byte string, and a check was added to prevent infinite loops.

Fixed a bug where a k=0 similarity search using an FPS file as the targets caused a segfault. The code assumed that k would be at least 1. If you do a k=0 search, it will currently read the entire file, checking for format errors, and return no hits.

Chemfp no longer generates Python warnings. That is, the regression tests all pass under “python -W error unit2 discover”. The biggest problem was the ResourceWarning from all of the files which were never explicitly closed. They used to depend on the garbage collector to close the file but are now closed either through a context manager or with close(). In addition, several strings contained invalid escape characters and some regression tests used deprecated APIs.

The context manager and close() method for the FPBFingerprintArena now close the underlying file object/mmap rather than depend on the garbage collector.

The readers and writers which are wrappers to an iterator which may hold a file object, and where the file object was created by chemfp, now know to close() the wrapped iterator when processing is over.

Added a check that the threshold and count symmetric arena searches have a popcount. Unordered arenas caused the code to segfault.

What’s new in version 3.1

Released 17 September 2017

The new specialized POPCNT implementation for PubChem/CACTVS keys increases search performance for that case by about 15%.

The SearchResults object gained the to_csr() method and the shape attribute. The new method returns a SciPy compressed sparse row matrix containing the similarity scores, which can be passed into scikit-learn for clustering.

The fall 2017 release of OEChem will accept InChI strings as structure input. The chemfp wrapper now knows about this, as well as the two new InChI output flavors “RelativeStereo” and “RacemicStereo”.

The fall 2017 release of RDKit will fix a bug in the pattern fingerprint definitions. The new chemfp fingerprint type is RDKit-Pattern/4.

Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty identifiers. Previously the default --errors setting of “ignore” simply skipped those records, without any warning messages. This caused problems processing the ChEBI SD file. Most of the records have an empty title line, so only a few fingerprint records were generated. It wasn’t obvious that the resulting data set was useless. The new code always reports a warning for empty or missing identifiers, even with “ignore”. If the --errors is “strict” then the warning becomes an error and processing stops.
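
The new policy can be sketched as below. This is not the code used by the command-line tools: the iter_records_with_ids() helper, its message text, and the (id, record) input shape are all hypothetical, and it models only the two settings named above.

```python
import io
import sys

def iter_records_with_ids(records, errors="ignore", log=sys.stderr):
    """Yield (id, record) pairs, skipping records without an identifier.

    Hypothetical helper. `records` is an iterable of (id, record) pairs
    where the id may be None or an empty string.
    """
    for recno, (record_id, record) in enumerate(records, 1):
        if not record_id:
            msg = "Missing or empty identifier in record #%d" % recno
            if errors == "strict":
                raise ValueError(msg)       # the warning becomes an error
            # Always warn, even with "ignore", so the loss is visible.
            log.write("WARNING: " + msg + ". Skipping.\n")
            continue
        yield record_id, record

records = [("CHEBI:1", "rec1"), ("", "rec2"), (None, "rec3"), ("CHEBI:4", "rec4")]
log = io.StringIO()
kept = list(iter_records_with_ids(records, errors="ignore", log=log))
num_warnings = log.getvalue().count("WARNING")
```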

Updated the #software line to include “chemfp/3.1” in addition to the toolkit information. This helps distinguish between, say, two different programs which generate RDKit Morgan fingerprints. It’s also possible that a chemfp bug can affect the fingerprint output, so the extra term makes it easier to identify a bad dataset.

There are several small fixes related to memory leaks, the bytes/Unicode distinction in Python 3, error messages, and error handling.

Removed chemfp.progressbar and chemfp.futures. These were included in chemfp 1.1 because I used them in a project for one customer and thought they might be useful in future chemfp projects. They were not. Also removed chemfp.argparse because chemfp 3.0 dropped support for Python 2.6.

What’s new in version 3.0.1

Released 28 August 2017

This is a bug-fix release. This fixes a critical bug in the general-purpose POPCNT popcount implementation and a bug in the code to detect the RDKit Pattern fingerprint change in 2017.3.

See the CHANGELOG for details.

What’s new in version 3.0

Released 2 May 2017

Chemfp now supports both Python 2.7 and Python 3.5 or later. It no longer supports versions before Python 2.7. Chemfp will support Python 2.7 at least until 2020, which is the end-of-life for Python 2.7.

This required extensive changes to distinguish between text/Unicode strings and byte strings. The biggest user-facing change is that identifiers are now treated as Unicode strings. Fingerprints are still treated as byte strings.

This change is not backwards compatible. The API’s function parameters are polymorphic, so in most cases you can pass in either a Unicode string or a UTF-8 encoded byte string. However, the return type for an identifier is Unicode, which will likely cause problems with existing code which expects bytes.
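
As a rough illustration of what “polymorphic” means here, the following Python 3 sketch normalizes an identifier to Unicode. The as_text_id() helper is hypothetical and is not part of the chemfp API; it only shows the accept-either, return-Unicode pattern and why code that compares against bytes will stop matching.

```python
def as_text_id(id_value):
    """Return an identifier as a Unicode string.

    Hypothetical helper: accepts a Unicode string or UTF-8 encoded bytes.
    """
    if isinstance(id_value, bytes):
        return id_value.decode("utf8")
    if isinstance(id_value, str):
        return id_value
    raise TypeError("identifier must be str or bytes, not %s"
                    % type(id_value).__name__)

from_bytes = as_text_id(b"caf\xc3\xa9")     # UTF-8 encoded input
from_text = as_text_id("CHEMBL25")          # already Unicode

# Existing code which compares a returned identifier to a byte string
# will no longer find a match under Python 3.
matches_bytes = (from_text == b"CHEMBL25")
```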

All of the chemistry toolkits have decided to treat files as UTF-8 encoded. Chemfp’s “text toolkit” offers limited support for reading Latin-1 encoded files. This is a tricky topic so contact me if you have questions or problems.

I have removed the “make_string_creator()” function because it was hard to explain, hard to maintain, and had little performance improvement over passing in the arguments to chemfp.create_string(). This will break compatibility, but then again, I don’t think anyone used it. If it is a problem, I suggest creating a function, as in the following:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
u'O-c1:c:c:c:c:c:1'
>>> def make_string(mol):
...   return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
...
>>> make_string(mol)
u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you’ll see the other major backwards incompatibility. The function chemfp.create_string() now returns a Unicode string instead of a byte string. This also means its format parameter no longer accepts the “.zlib” or “.gzip” extensions.

Instead, to get the old behavior use the new API function chemfp.create_bytes():

>>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
'O-c1:c:c:c:c:c:1'
>>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'
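
The second output starts with the bytes x\x9c, which is the usual zlib header, so it should be possible to recover the text with Python’s standard zlib module. The sketch below uses only the standard library and does not call chemfp. The compressed bytes it produces are not guaranteed to match chemfp’s output byte for byte; it only demonstrates the round trip between text and compressed bytes.

```python
import zlib

smiles = "O-c1:c:c:c:c:c:1"

# Text must be encoded to bytes before it can be compressed.
compressed = zlib.compress(smiles.encode("utf8"))

# Decompress and decode to get back a Unicode string.
restored = zlib.decompress(compressed).decode("utf8")
```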

There’s a similar change between chemfp.open_molecule_writer_to_string() and the new function chemfp.open_molecule_writer_to_bytes().

There are also some new features in version 3.0 which don’t break compatibility.

Similarity search is faster because there are now specialized popcount implementations based on the fingerprint length. On one benchmark, 166-bit searches are 35% faster, 1024-bit searches are 25% faster, and 2048-bit searches are 5% faster.
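
To make clear what a popcount is used for, here is a pure-Python sketch of a Tanimoto calculation on byte-string fingerprints. It is far slower than chemfp’s C implementations and is not how chemfp computes the score; the popcount() and tanimoto() helpers are illustrative names, and returning 0.0 when both fingerprints are empty is a convention chosen for this sketch.

```python
def popcount(fp):
    """Number of on-bits in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")

def tanimoto(fp1, fp2):
    """Tanimoto similarity between two equal-length byte strings."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    in_both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    in_either = popcount(fp1) + popcount(fp2) - in_both
    if in_either == 0:
        return 0.0      # sketch convention for two empty fingerprints
    return in_both / in_either

fp1 = bytes([0b00001111])
fp2 = bytes([0b00111100])
score = tanimoto(fp1, fp2)   # 2 bits in common, 6 bits in the union
```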

There is a new popcount implementation for processors with the AVX2 instruction set. It is about 15% faster than the POPCNT version for 2048-bit fingerprints. To test it out you will have to compile chemfp with --with-avx2 enabled.

Added support for the Avalon fingerprints in RDKit, if RDKit has been compiled with Avalon support.

What’s new in version 2.1

Released 2 July 2015

Version 2.1 adds Tversky support for every place there was Tanimoto search (except the handful of deprecated APIs). There are new search routines for FPS and arena searches, including OpenMP support, and new bitops functions to compute a Tversky index between two fingerprints.
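
For reference, the Tversky index of fingerprints A and B with weights alpha and beta is |A∩B| / (alpha·|A−B| + beta·|B−A| + |A∩B|). Setting alpha = beta = 1 gives the Tanimoto, and alpha = beta = 0.5 gives the Dice similarity. The sketch below implements that formula in pure Python. It does not use chemfp’s bitops functions, the helper names are illustrative, and chemfp’s handling of edge cases such as a zero denominator may differ from the 0.0 returned here.

```python
def popcount(fp):
    """Number of on-bits in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")

def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Tversky index between two equal-length byte strings."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    in_both = popcount(bytes(a & b for a, b in zip(fp1, fp2)))
    only_in_1 = popcount(fp1) - in_both
    only_in_2 = popcount(fp2) - in_both
    denominator = alpha * only_in_1 + beta * only_in_2 + in_both
    if denominator == 0:
        return 0.0      # sketch convention, not necessarily chemfp's
    return in_both / denominator

query = bytes([0b00001111])      # 4 bits set
target = bytes([0b00001100])     # 2 bits set, all of them also in the query

as_tanimoto = tversky(query, target, 1.0, 1.0)   # 2 / (2 + 0 + 2)
as_dice = tversky(query, target, 0.5, 0.5)       # 2 / (1 + 0 + 2)
# Asymmetric weights: penalize only bits unique to the target.
target_in_query = tversky(query, target, 0.0, 1.0)   # 2 / (0 + 0 + 2)
```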

The k-nearest arena searches now support OpenMP. Previously they were single threaded even though the other search functions supported multiple threads.

The built-in SDF parser saw a couple of improvements, including support for both “\n” and “\r\n” newlines, instead of only “\n” newlines.
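
Handling both newline conventions can be sketched as follows. This is not chemfp’s SDF parser; the split_sdf_records() helper is a hypothetical, minimal record splitter which treats a line containing only “$$$$” as the end of a record, whether the line ends with LF or CR LF.

```python
import io

def split_sdf_records(data):
    """Split the bytes of an SD file into records.

    Hypothetical helper. A record ends at a "$$$$" line terminated by
    either LF or CR LF. Line endings are preserved in the output.
    """
    records = []
    current = []
    for line in io.BytesIO(data):        # splits on LF, keeps the ending
        current.append(line)
        if line.rstrip(b"\r\n") == b"$$$$":
            records.append(b"".join(current))
            current = []
    if current:                          # unterminated final record
        records.append(b"".join(current))
    return records

unix_data = b"mol1\n  title\n$$$$\nmol2\n  title\n$$$$\n"
dos_data = unix_data.replace(b"\n", b"\r\n")

unix_records = split_sdf_records(unix_data)
dos_records = split_sdf_records(dos_data)
```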

There were a number of bug fixes that concern edge cases. For example, some 64-bit double calculations could be off-by-one in the last digit, and fingerprints with 0 bits set could cause a few problems.

What’s new in version 2.0

Released 8 April 2015

Version 2.0 includes many new features designed for web service development. The new “FPB” binary fingerprint file format is very fast to load, which is great for web server reloading during development and on the command-line. The speed comes from using a memory-mapped file, which also means that multiple chemfp instances can use the same file on the same machine without extra memory overhead.
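
The memory-mapping idea can be shown with Python’s standard mmap module. The sketch below does not read the FPB format; it writes a made-up file of fixed-size raw records and maps it read-only, and the read_first_fingerprint() helper is a hypothetical name. Because the operating system pages the data in on demand and can share those pages between processes, opening the file is fast regardless of its size.

```python
import mmap
import os
import tempfile

def read_first_fingerprint(path, bytes_per_fp):
    """Map a file of raw fixed-size records; return (count, first record).

    Hypothetical helper for a made-up file layout, not the FPB format.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Only the pages actually touched are read from disk.
            return len(m) // bytes_per_fp, bytes(m[:bytes_per_fp])

with tempfile.TemporaryDirectory() as dirname:
    path = os.path.join(dirname, "demo.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(16)) * 4)    # four 16-byte "fingerprints"
    num_fps, first_fp = read_first_fingerprint(path, 16)
```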

The most extensive improvement is the new portable API for working with structure files and fingerprint types. The moment you start working with multiple chemistry toolkits, you realize that they all have different ways to read and write molecules, and to generate fingerprints from a molecule. Chemfp tries hard to have a consistent API for these common tasks, without sacrificing performance, so you can get your work done. For example, with the new API it’s easy to take an SD record as an input string, compute the MACCS fingerprints for each available toolkit, add the results as new SD tags, and return the updated record.

This sounds so easy, doesn’t it? It took about a year to develop. The API is quite extensive, and includes the ability to pass toolkit-specific options to the underlying parsers, a low-level SDF parser that can be used to index a file, a way to get a list of available formats and fingerprint types, and methods to parse fingerprint arguments from strings.

New with version 2.0 is the ability to handle PubChem-sized data. Previous versions used 32-bit indexing and had a limit of 4GB, which is enough for 33M 1024-bit fingerprints, but PubChem has about twice that many structures.
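
The 33M figure follows directly from the index size. As a quick check, assuming the fingerprints are stored back to back with no per-record overhead (a simplification):

```python
index_limit = 2**32                 # bytes addressable with a 32-bit index (4 GiB)
bytes_per_fp = 1024 // 8            # a 1024-bit fingerprint is 128 bytes
max_fingerprints = index_limit // bytes_per_fp   # 2**25, about 33.5 million
```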

There are also a lot of improvements, bug fixes, and performance tweaks. For example, the FPS reader is now almost twice as fast! For details, see the CHANGELOG file of the release.