.. _whats-new:
######################
What's New / CHANGELOG
######################
What's new in 3.4.1 (27 June 2020)
===================================
The main changes in this release were to improve support for
"non-standard" fingerprint lengths. Previous releases had special
support for the most common fingerprint lengths in cheminformatics:
166-bit (24-byte), 512-bit (64-byte), 881-bit (112-byte), 1024-bit
(128-byte), and 2048-bit (256-byte) fingerprints. This release extends
that special support to a wider set of lengths.
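As a rough illustration of the byte sizes quoted above, the following
sketch assumes that fingerprint storage is padded up to the next
multiple of 8 bytes. The function name is made up for this example and
is not part of the chemfp API.

```python
def storage_size_in_bytes(num_bits, alignment=8):
    """Bytes needed to hold num_bits, rounded up to a multiple of alignment.

    Assumption for illustration: storage is padded to an 8-byte multiple.
    """
    num_bytes = (num_bits + 7) // 8                 # whole bytes needed for the bits
    return -(-num_bytes // alignment) * alignment   # round up to the alignment


for num_bits in (166, 512, 881, 1024, 2048):
    print(num_bits, "bits ->", storage_size_in_bytes(num_bits), "bytes")
```

Under that assumption the sizes match the ones listed above, for
example 24 bytes for 166 bits and 112 bytes for 881 bits.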
New POPCNT implementations
--------------------------
Added specialized POPCNT implementations for all 8-byte-multiple
fingerprint lengths up to 1024 bytes, plus faster implementations
for 8-byte and 32-byte multiple lengths beyond that.
The new specialized POPCNT algorithms are 10-30% faster than chemfp
3.4 for tiny fingerprints (<512 bits) and 0-10% faster for larger
fingerprints. For tiny fingerprints the new algorithms are about 20%
faster than chemfp 1.6.1. For larger fingerprints, 3.4.1 is about 5%
faster than 1.6.1.
New AVX2 implementations
------------------------
Added a specialized AVX2 implementation for 1024 bits. This is only
about 0.5% faster than the version in 3.4, but it is meant to avoid the
slight overhead from the bug fix described below.
Added specialized AVX2/POPCNT hybrid implementations for 160, 192,
and 224 bytes (1280, 1536, and 1792 bits). These are about 33%, 25%,
and 20% faster than the POPCNT equivalents. Note that the 2048-bit
AVX2 search performance is about the same as the 1536-bit
performance, so if all you care about is performance then you should
never use a length between 160 and 256 bytes.
New FingerprintArena methods
----------------------------
Added two new FingerprintArena
methods. :meth:`.FingerprintArena.sample` randomly selects a subset of
the fingerprints and returns them in a new arena.
:meth:`.FingerprintArena.train_test_split` returns two randomly
selected and disjoint subsets of the arena, typically used as a
training set and a test set.
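The idea behind the two methods can be sketched with plain lists of
fingerprint indices. This is not chemfp's implementation, and the
function names and parameters here are hypothetical; see the API
documentation for the actual method signatures.

```python
import random


def sample_indices(num_fingerprints, num_samples, rng):
    """Pick num_samples distinct indices, returned in arena order."""
    return sorted(rng.sample(range(num_fingerprints), num_samples))


def train_test_split_indices(num_fingerprints, train_size, test_size, rng):
    """Return two disjoint, randomly selected lists of indices."""
    chosen = rng.sample(range(num_fingerprints), train_size + test_size)
    return sorted(chosen[:train_size]), sorted(chosen[train_size:])


rng = random.Random(2020)  # fixed seed so the example is repeatable
subset = sample_indices(1000, 10, rng)
train, test = train_test_split_indices(1000, 80, 20, rng)
print(len(subset), len(train), len(test))
```

Drawing both subsets from a single sample is what guarantees that the
training and test indices never overlap.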
Bug fixes
---------
There are also two bug fixes:
* Fixed bug in fpcat where using --reorder would write the FPS header
twice.
* Fixed bug in AVX2 implementation when the storage size was a
multiple of 1024 bits but the fingerprint size was smaller.
What's new in 3.4 (24 June 2020)
================================
This is a summary of the changes since chemfp 3.3. For more details, see
the individual intermediate changelog entries below.
J. Cheminf. publication
-----------------------
There is a two-year gap between the 3.3 and 3.4 releases. More than
six months of that time went to writing the paper "The chemfp project"
for the Journal of Cheminformatics, which covers all of the major
aspects of chemfp.
Dalke, Andrew. The chemfp project. J. Cheminformatics 11, 76
(2019). https://doi.org/10.1186/s13321-019-0398-8
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8
Towards the end of writing the paper I realized there was an
improvement to the basic search algorithm. The naive way to test a
Tanimoto score against a threshold requires a floating-point
division. I had replaced that with a faster comparison using integer
multiplication. The newest version replaces that with a simple
comparison of the intersection popcount to an expected minimum value.
This increases the MACCS search performance by roughly 15%. For larger
fingerprint lengths the improvement is only a few percent at best,
which is expected as chemfp is mostly memory bandwidth bound, not CPU
bound.
New licensing options
---------------------
Pre-compiled chemfp distributions for Linux-based operating systems
are now available at no cost under the "Chemfp Base License
Agreement". Most of the chemfp features are available for internal
use, except that:
- fingerprint arenas may not be larger than 50,000 fingerprints;
- in-memory arena searches may not have more than 50,000 queries or targets;
- FPS searches may not have more than 20 queries;
- Tversky search is disabled;
- writing FPB files is disabled.
These features can be enabled with a valid license key, set via the
environment variable ``CHEMFP_LICENSE``. Email
sales@dalkescientific.com to request an evaluation license or to
purchase a license. Source distributions are also available.
To download the pre-compiled package for "manylinux" use::

  python -m pip install chemfp -i https://chemfp.com/packages/

See ``LICENSE`` from the distribution or
https://chemfp.com/BaseLicense.txt for full details.
Chemistry toolkit changes
-------------------------
RDKit: Added support for the "SECFP" SMILES-based circular
fingerprints from the Reymond group. Added RDKit-Fingerprint
``branchedPaths`` and ``useBondOrder`` options. Added RDKit-Morgan
``includeRedundantEnvironments`` option. Added RDKit-AtomPair
``nBitsPerEntry``, ``includeChirality``, and ``use2D``
options. Added RDKit-Torsion ``nBitsPerEntry`` and ``includeChirality`` options.
RDKit (continued): New SMILES output option ``cxsmiles`` to include
extra annotations. New SDF input option ``includeTags`` to disable
importing SD tags. New SDF output option ``v3k`` to always use v3000
format. Added support for RDKit's Mol2, PDB, Maestro, XYZ, HELM, and
FASTA parsers. Added a new "sequence" format to handle just the 1D
sequence string.
Open Babel: Added support for 3.0. Added support for ECFP
fingerprints, with family names: ``OpenBabel-ECFP0``,
``OpenBabel-ECFP2``, ``OpenBabel-ECFP4``, ``OpenBabel-ECFP6``,
``OpenBabel-ECFP8``, ``OpenBabel-ECFP10``. Open Babel 3.0 includes new
formats, which are automatically supported by chemfp.
OpenEye: Added support for OEChem's OEZ, CIF, mmCIF, PDB, FASTA, and
CSV parsers. Added a new "sequence" format to handle just the 1D
sequence string. Added experimental support for substructure screens,
with the family names ``OpenEye-MoleculeScreen``,
``OpenEye-MDLScreen``, and ``OpenEye-SMARTSScreen``.
Tool changes
------------
Simsearch now accepts a structure input, either as a command-line
argument or from a filename. It will use the fingerprint type from the
target data set or a user-specified fingerprint type to convert the
structures into fingerprints.
Added a ``--help-formats`` option to rdkit2fps, ob2fps and oe2fps which
shows all available input formats and their reader options.
I/O changes
-----------
Added support for Zstandard compression everywhere that gzip
compression is supported. Use the filename or format extension ".zst"
to indicate that compression type. Chemfp's RDKit toolkit adapter
also supports Zstandard, but not the Open Babel and OpenEye adapters.
Note: Zstandard compression requires the third-party "zstandard"
Python package to be installed.
Improved the gzip reader performance by about 15%. Improved the FPS
reader by about 20%. Overall, sdf2fps is about 10% faster extracting
PubChem fingerprints from the PubChem sdf.gz files.
Improved FPB output performance by about 10% by using a C extension.
Chemfp now supports reading compressed FPB files, and reading FPB
files from stdin. These are read entirely into memory before use as
they cannot be memory-mapped. This was a feature request from a
customer who stored large fingerprint files on a network-based
filesystem. It was faster to read a compressed file and decompress
into memory than it was to memory-map and use the contents of an
uncompressed file.
What's new in 3.4b3 (18 June 2020)
======================================
* Changed ``--list-formats`` to ``--help-formats``.
* Updated oe2fps ``--help-formats``.
* Fixed a bug in several OEChem ``create_string()`` and
``create_bytes()`` implementations where a non-None 'id' changed the
molecule title.
* Finished updating the documentation.
What's new in 3.4b2 (12 June 2020)
======================================
* Changed the licensing model to let people use chemfp without a valid
license key, with restrictions:
- fingerprint arenas may not be larger than 50,000 fingerprints,
- arena searches may not have more than 50,000 queries or targets,
- FPS searches may not have more than 20 queries,
- Tversky searches are disabled, and
- writing FPB files is disabled.
* Added "includeTags" option for the RDKit toolkit SDF reader. The
default of True parses the SD tag data. This isn't needed if you
just want to generate fingerprints. rdkit2fps sets includeTags=False
by default, for a ~5% speedup in parsing a PubChem file.
* Added Zstandard input and output options to sdf2fps.
* Fixed a couple of bugs in the new gzio module. Better code to handle
finding libz, and support for different response codes for older
versions of libz.
* Added compression ``--level`` option to fpcat
* Support OEChem 2.3 from 2019.Oct.
* Added support for OEChem formats OEZ, CIF, mmCIF, PDB, FASTA, and
CSV. Also implemented a "sequence" format based on the FASTA reader.
* Added experimental support for OpenEye's fingerprint-like
screens. The new fingerprint family names are
``OpenEye-MoleculeScreen``, ``OpenEye-MDLScreen``, and
``OpenEye-SMARTSScreen``. The functions are type-based: QMols produce
query screens and "regular" molecules produce target screens.
* ``get_fingerprint_families()`` now supports an optional
"toolkit_name" parameter which loads and returns only the
fingerprint families for the specified toolkit.
* BUG FIX: some OpenEye toolkit writers, when passed a new
identifier, called SetTitle(new_id) on the molecule before writing, but did
not call SetTitle(old_id) afterwards to restore the original id.
* BUG FIX: the code did not check for fingerprint generation failures
when using OEChem/OEGraphSim. Fixed the code so that an empty
molecule returns an empty fingerprint, instead of reusing whatever
the previous fingerprint was.
* Fixed a number of issues identified by PyFlakes, including some
bugs, mostly related to error conditions which weren't tested.
What's new in 3.4b1 (24 April 2020)
======================================
* Support Open Babel 3.0.
* Support Open Babel ECFP fingerprints. Requires Open Babel 3.0 or
later. Use ``--nBits`` to specify a size other than the default of
4096 bits. (Must be a power of 2, and at least 32.)
* Support RDKit parsers for FASTA, sequence, HELM, Mol2, PDB, Maestro
and XYZ formats.
* RDKit SMILES writers now support the "cxsmiles" boolean flag to
generate CXSMILES strings. RDKit SDF and Molfile writers support
"v3k" boolean flag to always generate V3000 records.
* Added support for RDKit SECFP fingerprints, developed by the Reymond
group. These are circular fingerprints similar to ECFP fingerprints
except they use canonical fragment SMILES for the circular
substructures to generate hash values.
* Added support for additional RDKit fingerprint parameters:
- RDKit-Fingerprint: branchedPaths and useBondOrder
- RDKit-Morgan: includeRedundantEnvironments
- RDKit-AtomPair: nBitsPerEntry, includeChirality, and use2D
- RDKit-Torsion: nBitsPerEntry, includeChirality
* Added support for Zstandard compression everywhere gzip is
supported, except for the Open Babel and OpenEye toolkits, where the
native toolkits do not support Zstandard and do not accept a Python
file object.
* Sped up FPB generation for ChEMBL by about 9% by rewriting several
parts of the FPID block writer code in C.
* Faster gzip read performance when reading from stdin or a named
file. The new module calls zlib functions directly, which gives
15-25% improved performance. If you have problems with the new gzip
reader, you can disable it by setting the environment variable
``CHEMFP_USE_SYSTEM_GZIP`` to ``1``.
* For even faster gzip read performance, chemfp can use an external
program to decompress stdin or a named file. If the environment
variable ``CHEMFP_GZCAT`` is set then chemfp will interpret it as
command-line arguments to use in a subprocess. This may be ``zcat``,
``gzcat``, ``gzip -dc``, or ``pigz -dc``. (NOTE: this variable was named
``CHEMFP_GZCAT_BINARY`` in the a4 release.)
In one test of simsearch, using 1.7M 2048-bit RDKit Morgan
fingerprints from ChEMBL 23, *measuring wall-clock time*:
- a search of the uncompressed file took 1.45 seconds
- ``CHEMFP_GZCAT=gzcat`` took 2.16 seconds (3.07 of total user time)
- the new gzip reader took 3.65 seconds
- ``CHEMFP_USE_SYSTEM_GZIP=1`` took 4.36 seconds
Note that part of the speedup is because gzcat runs in another process
so it takes better advantage of multicore hardware. (That is, I measured
wall-clock time on a multicore machine, not overall CPU time.)
* Improved the error handling when chemfp uses an external program to
decompress a gzip'ed file. NOTE: IT IS NOT FOOLPROOF! Chemfp waits
0.01 seconds to see if gzcat has exited unexpectedly, which might
happen if the file does not exist or cannot be read. However, there
is a chance that gzcat may take longer to report an error. In
addition, chemfp does not detect if gzip exited early because the
file was corrupt or incomplete.
* Added a ``--list-formats`` option to oe2fps, rdkit2fps, and ob2fps,
which gives more detailed information about the supported input
structure file formats and their options.
* No longer including or using a copy of unittest2, which was needed
for Python 2.6 support.
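To illustrate the ``CHEMFP_GZCAT`` item above, here is a minimal sketch
of how an environment variable can be split into command-line arguments
and run as a subprocess. The function names are hypothetical, this is
not chemfp's code, and it omits the error handling discussed above.

```python
import os
import shlex
import subprocess


def build_gzcat_command(filename, environ=os.environ):
    """Return the argv list for the external decompressor, or None if unset."""
    value = environ.get("CHEMFP_GZCAT")
    if not value:
        return None
    # "gzip -dc" becomes ["gzip", "-dc"]; the filename is the last argument.
    return shlex.split(value) + [filename]


def open_with_external_gzcat(filename, environ=os.environ):
    """Start the decompressor and return (process, readable binary stream)."""
    command = build_gzcat_command(filename, environ)
    if command is None:
        raise ValueError("CHEMFP_GZCAT is not set")
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    return process, process.stdout


print(build_gzcat_command("chembl_23.fps.gz", {"CHEMFP_GZCAT": "gzip -dc"}))
```

Because the decompressor runs in its own process, reading from the pipe
can overlap with the search work, which matches the wall-clock
observation in the timing note above.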
What's new in 3.4a4 (18 March 2020)
======================================
* simsearch accepts a structure file as query input. Use ``--in`` or
``--query-format`` to specify the format type, or let chemfp try to
figure it out from the filename extension.
If the fingerprint type is not specified with ``--query-type`` then the
target file metadata must specify the type.
The ``--id-tags``, ``--delimiter``, ``--has-header``, ``-R`` and
``--errors`` options from the \*2fps programs are also supported.
* The OEChem SMILES and InChI readers now support the ``has_header``
reader_arg to skip the first line of the file. Use ``--has-header`` to
enable that feature in oe2fps.
* FPB files may now be read from stdin, and fpb.gz files are
supported. Unlike regular FPB files, which are memory-mapped, the
contents of these files are read into memory before use. The main
use case for fpb.gz files is to reduce network I/O if the files are
on a remote disk.
* Changed the FPS reader block size from around 11K to 100K, giving a
20% boost in read performance and 10% boost in fpcat
performance. The smaller block size was chosen 10 years ago, on much
less powerful hardware.
* Experimental support for zstd compression, based on the filename
ending with either ``.fps.zst`` or ``.fpb.zst``. This depends on the
third-party "zstandard" package. My experience is that piping gzip
output to chemfp is faster than letting chemfp use Python's built-in
gzip reader or using zstandard.
* Experimental support to use an external binary to decompress a gzip
file. Set "``CHEMFP_GZCAT_BINARY``" to "``gzcat``" or "``zcat``" or
whatever program you use to read a gzip-compressed file (passed on
the command-line) and write the uncompressed contents to stdout. My
timings show using an external program is 25% faster than using
Python's built-in gzip module.
What's new in version 3.4a2
===========================
Released 7 June 2019
Performance improvements for Tanimoto search. Older versions used a
fast rejection test based on a rational approximation to the
threshold. It required two multiplications for each test. The new
implementation uses an exact test based on the minimum required
intersection count, with only one comparison per test.
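The exact test can be sketched as follows. For popcounts ``a`` and
``b`` and intersection count ``c``, the Tanimoto score is
``c/(a+b-c)``, so for a threshold ``p/q`` the condition
``c/(a+b-c) >= p/q`` rearranges to ``c*(p+q) >= p*(a+b)``. The minimum
intersection count can then be computed once per pair of popcounts.
This is an illustrative Python sketch with made-up function names, not
chemfp's C implementation, and it assumes the threshold is given as an
exact fraction.

```python
from fractions import Fraction


def passes_by_division(a, b, c, threshold):
    """Naive test: compute the Tanimoto score, then compare it."""
    union = a + b - c
    if union == 0:
        return threshold <= 0      # assumption: two empty fingerprints score 0
    return Fraction(c, union) >= threshold


def min_intersection(a, b, threshold):
    """Smallest intersection count c with c/(a+b-c) >= threshold."""
    p, q = threshold.numerator, threshold.denominator
    # c/(a+b-c) >= p/q  <=>  c*(p+q) >= p*(a+b); take the ceiling.
    return -(-(p * (a + b)) // (p + q))


def passes_by_min_count(a, b, c, threshold):
    """Exact test with a single integer comparison per candidate."""
    if a + b == 0:
        return threshold <= 0
    return c >= min_intersection(a, b, threshold)


print(min_intersection(40, 50, Fraction(7, 10)))
```

In a search loop the minimum count depends only on the two popcounts
and the threshold, so it can be looked up or computed outside the
innermost comparison.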
The chemfp benchmark suggests timing improvements like:
- 10-20% faster for 166 bits (POPCNT)
- 1-10% faster for 881 bits (POPCNT)
- 2- 7% faster for 1024 bits (POPCNT)
- 0- 9% faster for 1024 bits (AVX2)
- 0- 2% faster for 2048 bits (POPCNT)
- 0-10% faster for 2048 bits (AVX2)
These numbers will be firmed up for the 3.4 release.
Improved error handling for oe2fps, ob2fps, and rdkit2fps when the
underlying toolkit is not installed.
BUG FIX: Fixed several errors related to storing 4GB or more of
record identifier strings. This can occur if your id contains both the
id and the SMILES or other large data, or if you have many
fingerprints each with a large id (eg, an IUPAC name). The FPB format
has a design limit of about 250M records, corresponding to 17.2
characters per id before the old code would break.
BUG FIX: the Avalon fingerprint type is now registered. Previously
it worked only if one of the other RDKit fingerprint types was used
first.
BUG FIX: the simsearch metadata now uses ``#query_source`` and
``#target_source`` instead of ``#query_sources`` and
``#target_sources``.
BUG FIX: Fixed bug which prevented reading FPS files using the
Windows newline convention.
BUG FIX: Fixed segfault when :func:`hex_to_bitlist <.hex_to_bitlist>`
or :func:`hex_contains <.hex_contains>` were called with the wrong
number of arguments.
BUG FIX: ``simsearch --query`` incorrectly included a ``#query_sources`` in
the output, as a duplicate of ``#target_sources``. Now it correctly omits
``#query_sources``.
What's new in version 3.4a1
===========================
Released 6 November 2018
Added the arena methods :meth:`~.FingerprintArena.to_numpy_array`
and :meth:`~.FingerprintArena.to_numpy_bitarray`. The first returns a
NumPy array view of the underlying fingerprint data, as uint8 values,
including pad bytes. This array makes it easier for other programs to
work directly with the chemfp fingerprint data. The second creates a
new NumPy array with one uint8 byte per fingerprint bit. The default
returns all bits, or you can specify which bit columns to use. This
function makes it easy to use fingerprint bits as descriptors for
clustering or other predictive algorithms.
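The "one value per fingerprint bit" layout can be sketched in plain
Python. The sketch assumes that bit 0 is the least-significant bit of
the first byte; the helper names are hypothetical and it does not use
the chemfp or NumPy APIs.

```python
def fingerprint_to_bits(fingerprint, num_bits):
    """Expand a byte-string fingerprint to a list with one 0/1 value per bit.

    Assumption: bit 0 is the least-significant bit of the first byte.
    """
    return [(fingerprint[i // 8] >> (i % 8)) & 1 for i in range(num_bits)]


def select_columns(bit_rows, columns):
    """Keep only the requested bit columns from each row."""
    return [[row[col] for col in columns] for row in bit_rows]


fingerprints = [b"\x01\x80", b"\x03\x00"]   # two toy 16-bit fingerprints
rows = [fingerprint_to_bits(fp, 16) for fp in fingerprints]
print(select_columns(rows, [0, 1, 15]))
```

Each row of the result can be used directly as a descriptor vector,
with one column per selected fingerprint bit.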
Added the :attr:`~.FingerprintArena.fingerprints` attribute to the
FingerprintArena class. It gives list-like access to the
fingerprints. For example, it can be used to iterate over the fingerprints.
BUG FIX: :meth:`~.SearchResults.count_all` now uses a 64-bit integer.
Previously it used a signed 32-bit integer, which could overflow for
large results.
BUG FIX: removed a memory leak in symmetric threshold searches.
BUG FIX: Calling the Tversky threshold arena search with the Tanimoto
values alpha=beta=1.0 now calls the (more optimized) Tanimoto arena
search. Previously it called the Tanimoto arena search and then did the
general Tversky search, taking over twice as long to give the same
results.
BUG FIX: The knearest Tversky symmetric arena search did not
release Python references if there was an allocation failure during
the search. Now fixed.
BUG FIX: The FPS fingerprint writer didn't verify that the
fingerprint length matched the number of bytes in the metadata. Fixed,
and normalized the length change error message across the writers.
BUG FIX: Version 3.3 broke support for compiling with ``--no-openmp``.
Fixed.
What's new in version 3.3
===========================
Released 16 August 2018
BUG FIX: the k-nearest symmetric Tanimoto and Tversky search code
contained a flaw when there was more than one fingerprint with no bits
set and the threshold was 0.0. Since all of the scores will be 0.0,
the code uses the first k fingerprints as the matches. However, it
put all of the hits into the first search result (item 0), rather than
the corresponding result for each given query. This also opened up a
race condition for the OpenMP implementation, which could cause chemfp
to crash.
Performance improvements for the POPCNT and AVX2-based
searches. This was done by developing specialized versions of the
Tanimoto and Tversky search functions for each of the POPCNT and AVX2
implementations, by initializing some of the AVX2 registers only once
per search rather than once per popcount, by improving the rejection
test for obvious mismatches, and by improving the alignment for AVX2
loads.
Relative to chemfp 1.5 (the latest free version of chemfp), version
3.3 is about 20-35% faster for 166-bit searches, 20-25% faster for
881-bit searches, and around 50% faster for 1024- and 2048-bit
searches.
Relative to chemfp 3.2.1 (the previous version of chemfp), version 3.3
is 60% faster for 166-bit fingerprints, 15% faster for 881-bit
fingerprints, 25% faster for 1024-bit fingerprints, and 15% faster for
2048-bit fingerprints.
Unindexed search (which occurs when the fingerprints are not in
popcount order) now uses the fast popcount implementations rather than
the generic byte-based one. The result is about 6x faster.
Changed the simsearch :option:`--times` option for more fine-grained
reporting. The output (sent to stderr) now looks like::

  open 0.01 read 0.08 search 0.10 output 0.27 total 0.46

where 'open' is the time to open the file and read the metadata,
'read' is the time spent reading the file, 'search' is the time for
the actual search, 'output' is the time to write the search results,
and 'total' is the total time from when the file is opened to when the
last output is written.
Added :meth:`.SearchResult.format_ids_and_scores_as_bytes` to improve the
simsearch output performance when there are many hits. Turns out the
limiting factor in that case is not the search time but output
formatting. The old code uses Python calls to convert each score to a
double. The new code pushes that code into C. My benchmark used a
k=all NxN search of ~2,000 PubChem fingerprints to generate about 4M
scores. The output time went from 15.60s to 5.62s. (The search time
was only 0.11s on my laptop.)
There is a new option, "report-algorithm" with the corresponding
environment variable CHEMFP_REPORT_ALGORITHM. The default does
nothing. Set it to "1" to have chemfp print a description of the
search algorithm used, including any specialization, and the number of
threads. For example::

  chemfp search using threshold Tanimoto arena, index, single threaded (generic)
  chemfp search using threshold Tversky arena, index, single threaded (popcnt_128_128)
  chemfp search using knearest Tanimoto arena symmetric, OpenMP (popcnt_112), 8 threads

For the 'generic' searches, use CHEMFP_REPORT_INTERSECT=1 to see which
specific popcount function is used.
There is a new option, "use-specialized-algorithms" with the
corresponding environment variable CHEMFP_USE_SPECIALIZED_ALGORITHMS.
The default, "1", uses the new specialized algorithms mentioned
above. Set it to "0" to have chemfp fall back to the generic
algorithm. This option is primarily used for timing comparisons and
may be removed in future versions of chemfp.
There is experimental multi-threaded support for single-query
searches. By default it is disabled because on newer hardware it is
slower than single-threaded search, and it will take time to figure
out why.
The new option "num-column-threads" controls this feature. (In chemfp
nomenclature, each query is a row, and the targets are columns.) By
default it is 1, meaning that single-query searches are
single-threaded. Change it to 2 or higher to enable the "OpenMP
columns" algorithm. The number of threads used is the smaller of the
number of column threads and the value of chemfp.get_num_threads().
For one benchmark, based on a threshold Tanimoto search of RDKit's
2048-bit fingerprint, the search time on my 2011 MacBook Pro laptop
using POPCNT goes from 19.7 to 16.1 seconds when I use 2 threads
instead of 1. On the other hand, on a Skylake machine using AVX2 the
time goes from 5.3 to 9.3 seconds.
Better error handling in simsearch so that an I/O error prints an error
message and exits rather than giving a full stack trace. Testing this
feature also identified bugs in the error handling code, which have
been fixed.
What's new in version 3.2.1
===========================
Released 12 April 2018
The biggest change is in the chemfp license. The commercial version is
now distributed under a proprietary license instead of the MIT open
source license.
There are two other minor changes. The build process now includes
support for AVX2 by default, and the fingerprint writer classes have a
new 'format' attribute which is either "fps" or "fpb", or is None if
not defined.
License key
-----------
This marks the first release of chemfp with a proprietary license.
Or rather, licenses. There is an academic license and commercial
licenses in various flavors. In addition, chemfp is still available
under the open source MIT license, though that option is the most
expensive. The chemfp 1.x series (currently chemfp 1.5) is still
available for no cost under the MIT license, and receives updates, but
it only supports Python 2.7 and it does not have as many features.
Chemfp 3.2.1 is available in source code and as a pre-compiled Python
package which should run under most x86 64-bit Linux-based OSes. The
pre-compiled package requires a license key.
The license key is date locked. If a valid key is not found then
"import chemfp" will print diagnostic messages to stderr and
fingerprint search and arena generation functionality will be
disabled. If you call one of the disabled functions then it will raise
a NotImplementedError exception. Simsearch will not work, and neither
will FPB generation.
Chemfp will look for the license key in the CHEMFP_LICENSE environment
variable. For example, in bash::

  export CHEMFP_LICENSE=20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB

The first 8 digits are the year, month, and day that the license
expires, in GMT. In this demo example the license expired at the end
of Christmas Day of 2010.
After the date comes optional configuration data including a user
identifier, followed by the '@', and ending with a validation key.
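As a small illustration of the key layout described above, this sketch
reads the expiration date from the leading 8 digits. It only looks at
the layout and does not check the validation key; the function name is
hypothetical and is not part of the chemfp API.

```python
import datetime


def parse_license_expiration(license_key):
    """Return (year, month, day) from the leading 8 digits of a license key."""
    date_field = license_key[:8]
    if len(date_field) != 8 or not (date_field.isascii() and date_field.isdigit()):
        raise ValueError("license key must start with 8 digits (YYYYMMDD)")
    if "@" not in license_key:
        raise ValueError("license key is missing the '@' before the validation key")
    year = int(date_field[:4])
    month = int(date_field[4:6])
    day = int(date_field[6:8])
    datetime.date(year, month, day)  # raises ValueError for impossible dates
    return (year, month, day)


print(parse_license_expiration("20101225-demo@HPDHKMHBIAENBEFLMCNKFGFAABNDGDOB"))
```

For the demo key this gives the year, month, and day of Christmas Day
2010, matching the description above.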
There is no centralized license manager, and you may run chemfp on as
many computers at your site as you wish, within the limits of your
license agreement.
There are two new API functions:
- chemfp.is_licensed() - return True if the license key is valid or no
license key is needed, otherwise return False.
- chemfp.get_license_date() - return the license key expiration date
as a 3-element tuple in the form (year, month, day). If the license
key is not found or does not pass the security check then the
function returns None. If this version of chemfp does not need a
license key then it returns (9999, 12, 25).
Why the change in license policy?
---------------------------------
In 2009 or so I decided to see if I could make a living selling free
software. Most people
who develop open source software for chemistry get their funding from
other sources. Academics might be funded from grants, a company might
use an open source project for business reasons, as a way to lower
overall costs. Some companies sell a proprietary product or access to
a service which uses an open source component, where the income from
the non-free sources funds the free software development. But I can
only think of one or two cases where people tried to make a
living off of the source code itself, and they were not that successful.
I had some ideas of how it might be successful, and tried them
out. While I had some sales, I never made anywhere near what I would
have made for the same effort as a consultant or contractor.
I also ran into some difficulties. Most software companies provide
their software either free or with steep discounts to academic
organizations. If I do that with the most recent version of chemfp, I
take a rather large risk that some grad student will post the source
on GitHub. (Pharmaceutical company employees are much less likely to
do that.)
I charge a lot of money for chemfp, because the few people who need
high performance similarity search are willing to pay for
it. Potential customers want to try it out. Since I either control the
copyright or use components which allow proprietary use, I was able to
make a non-disclosure agreement for the evaluation period. Had I been
using GPL-based components, and thus restricted to a free software
license, that would have been impossible.
I could continue to work at it trying to make a living selling free
software, but after 9 years of trying I decided it's time to switch to
a more standard proprietary licensing scheme.
The chemfp 1.x line will still be available at no cost under the MIT
license.
AVX2 popcount enabled by default
--------------------------------
AVX2 compilation is now enabled by default. It was disabled in earlier
releases because the AVX2 command-line flag was used to compile every
file and I was worried that it might result in a binary which couldn't
be used by older hardware. For this release I figured out how to use
the -mssse3 and -mavx2 flags only for the relevant popcount
calculations.
At run-time chemfp will detect which CPU-specific features are
available and only use the SSSE3 or AVX2 implementations when
appropriate.
What's new in version 3.2
=========================
Released 19 March 2018
This version mostly contains bug fixes and internal improvements. The
biggest additions are support for Dave Cosgrove's 'flush' fingerprint
file format, and support for 'fromAtoms' in some of the RDKit
fingerprints.
The configuration has changed to use setuptools.
Previously the command-line programs were installed as small
scripts. Now they are created and installed using the
"console_scripts" entry_point as part of the install process. This is
more in line with the modern way of installing command-line tools for
Python.
If these scripts are no longer installed correctly, please let me
know.
If you have installed the chemfp_converters package then chemfp will
use it to read and write fingerprint files in flush format. It can be
used as output from the \*2fps programs, as input and output to fpcat,
and as query input to simsearch.
Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and
pair fingerprints. This is primarily useful if you want to generate
the circular environment around specific atoms of a single molecule,
and you know the atom indices. If you pass in multiple molecules then
the same indices will be used for all of them. Out-of-range values are
ignored.
The command-line option is :option:`--from-atoms`, which takes a
comma-separated list of non-negative integer atom indices. For
example::

  --from-atoms 0
  --from-atoms 29,30

The corresponding fingerprint type strings have also been updated. If
fromAtoms is specified then the string ``fromAtoms=i,j,k,...`` is added
to the string. If it is not specified then the fromAtoms term is not
present, in order to maintain compatibility with older type
strings. (The philosophy is that two fingerprint types are equivalent
if and only if their type strings are equivalent.)
The :option:`--from-atoms` option is only useful when there's a single
query and when you have some other mechanism to determine which subset
of the atoms to use. For example, you might parse a SMILES, use a
SMARTS pattern to find the subset, get the indices of the SMARTS
match, and pass the SMILES and indices to rdkit2fps to generate the
fingerprint for that substructure.
Be aware that the union of the fingerprint for :option:`--from-atoms`
X and the fingerprint for :option:`--from-atoms` Y might not be equal
to the fingerprint for :option:`--from-atoms` X,Y. However, if a bit
is present in the union of the X and Y fingerprints then it will be
present in the X,Y fingerprint.
Why? The fingerprint implementation first generates a sparse count
fingerprint, then converts that to a bitstring fingerprint. The
conversion is affected by the feature count. If a feature is present
in both X and Y then the X,Y fingerprint may have additional bits set
over the individual fingerprints.
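A toy model makes the effect easier to see. The sketch assumes, purely
for illustration, that each feature sets one extra bit whenever its
count reaches 1, 2, 4, or 8. The count bounds, the hash function, and
the feature names are all made up here and are not RDKit's or chemfp's
actual scheme.

```python
import zlib
from collections import Counter

COUNT_BOUNDS = (1, 2, 4, 8)   # hypothetical count thresholds, one bit each


def counts_to_bits(feature_counts, num_bits=64):
    """Convert a sparse count fingerprint into a set of 'on' bit positions."""
    bits = set()
    for feature, count in feature_counts.items():
        base = zlib.crc32(feature.encode("ascii")) * len(COUNT_BOUNDS)
        for offset, bound in enumerate(COUNT_BOUNDS):
            if count >= bound:
                bits.add((base + offset) % num_bits)
    return bits


features_x = Counter({"C-O": 1})             # features found around atom set X
features_y = Counter({"C-O": 1, "C-N": 1})   # features found around atom set Y
features_xy = features_x + features_y        # counts add up for X,Y together

union = counts_to_bits(features_x) | counts_to_bits(features_y)
combined = counts_to_bits(features_xy)
print(union <= combined, union == combined)
```

The shared feature has a count of 2 in the combined fingerprint, which
crosses the second bound and sets a bit that neither individual
fingerprint has, while every bit of the union is still present.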
Bug fixes
---------
Fixed a bug in FPB identifier index lookup. When the id's hash didn't
exist, it got stuck in an infinite loop. There is a special token to
identify the end of the hash chain. Unfortunately, that token wasn't
marked as a b"byte string" during the Python 2to3 conversion, so that
token was never found, causing the code to loop over the chain
forever. It is now a byte string, and a check was added to prevent
infinite loops.
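The underlying Python 3 behavior is that a byte string never compares
equal to a text string, so a text-string token can never match data
read as bytes. The sketch below shows that behavior together with a
loop guard. The table layout, token value, and function name are
hypothetical and do not describe the actual FPB format.

```python
END_OF_CHAIN = b"\x00"   # hypothetical end-of-chain token, as a byte string

# A toy wrap-around table of byte-string entries.
TABLE = [b"id3", b"id1", b"\x00", b"id2"]


def find_in_chain(table, start, wanted, end_token=END_OF_CHAIN):
    """Walk the chain from start until wanted or end_token is found."""
    index = start
    for _ in range(len(table)):            # guard against looping forever
        entry = table[index]
        if entry == end_token:
            return None                    # reached the end of the chain
        if entry == wanted:
            return index
        index = (index + 1) % len(table)   # wrap around to the first slot
    raise RuntimeError("end-of-chain token not found")


print(b"\x00" == "\x00")                   # bytes and str never compare equal
print(find_in_chain(TABLE, 1, b"missing"))
```

Passing the text string ``"\x00"`` as ``end_token`` makes the walk
visit every slot without finding the token, and the guard then raises
an error instead of looping forever.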
Fixed a bug where a k=0 similarity search using an FPS file as the
targets caused a segfault. The code assumed that k would be at least
1. If you do a k=0 search, it will currently read the entire file,
checking for format errors, and return no hits.
Chemfp no longer generates Python warnings. That is, the regression
tests all pass under "python -W error unit2 discover". The biggest
problem was the ResourceWarning from all of the files which were never
explicitly closed. They used to depend on the garbage collector to
close the file but are now closed either through a context manager or
with close(). In addition, several strings contained invalid escape
sequences and some regression tests used deprecated APIs.
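As a reminder of the general pattern (plain Python, not chemfp code), a context manager closes the file as soon as the block ends instead of waiting for the garbage collector. The file name suffix and contents here are only illustrative.

```python
import os
import tempfile

# Create a small throwaway file to read back.
fd, path = tempfile.mkstemp(suffix=".fps")
os.close(fd)
with open(path, "w") as outfile:
    outfile.write("00ff\tid1\n")

# Something like open(path).read() leaves the close to the garbage
# collector and can trigger a ResourceWarning. The with-statement
# closes the file deterministically.
with open(path) as infile:
    lines = infile.readlines()

print(infile.closed)
os.remove(path)
```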
The context manager and close() method for the FPBFingerprintArena
now close the underlying file object/mmap rather than depend on the
garbage collector.
The readers and writers which wrap an iterator that may hold a file
object, where that file object was created by chemfp, now close() the
wrapped iterator when processing is over.
Added a check that the threshold and count symmetric arena searches
have a popcount. Unordered arenas caused the code to segfault.
What's new in version 3.1
=========================
Released 17 September 2017
The new specialized POPCNT implementation for PubChem/CACTVS keys
increases search performance for that case by about 15%.
The SearchResults object gained the
:meth:`to_csr() <.SearchResults.to_csr>` method and the :attr:`shape
<.SearchResults.shape>` attribute. The new method returns a SciPy
compressed sparse row matrix containing the similarity scores, which
can be passed into scikit-learn for clustering.
The fall 2017 release of OEChem will accept InChI strings as structure
input. The chemfp wrapper now knows about this, as well as the two new
InChI output flavors "RelativeStereo" and "RacemicStereo".
The fall 2017 release of RDKit will fix a bug in the pattern
fingerprint definitions. The new chemfp fingerprint type is
RDKit-Pattern/4.
Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty
identifiers. Previously the default :option:`--errors` setting of
"ignore" simply skipped those records, without any warning
messages. This caused problems processing the ChEBI SD file. Most of
the records have an empty title line, so only a few fingerprint
records were generated. It wasn't obvious that the resulting data set
was useless. The new code always reports a warning for empty or
missing identifiers, even with "ignore". If :option:`--errors` is
"strict" then the warning becomes an error and processing stops.
Updated the #software line to include "chemfp/3.1" in addition to the
toolkit information. This helps distinguish between, say, two
different programs which generate RDKit Morgan fingerprints. It's also
possible that a chemfp bug can affect the fingerprint output, so the
extra term makes it easier to identify a bad dataset.
There are several small fixes related to memory leaks, the
bytes/Unicode distinction in Python 3, error messages, and error
handling.
Removed chemfp.progressbar and chemfp.futures. These were included in
chemfp 1.1 because I used them in a project for one customer and
thought they might be useful in future chemfp projects. They were
not. Also removed chemfp.argparse because chemfp 3.0 dropped support
for Python 2.6.
What's new in version 3.0.1
===========================
Released 28 August 2017
This is a bug-fix release. This fixes a critical bug in the
general-purpose POPCNT popcount implementation and a bug in the code
to detect the RDKit Pattern fingerprint change in 2017.3.
See the CHANGELOG for details.
What's new in version 3.0
=========================
Released 2 May 2017
Chemfp now supports both Python 2.7 and Python 3.5 or later. It no
longer supports versions before Python 2.7. Chemfp will support Python
2.7 at least until 2020, which is the end-of-life for Python 2.7.
This required extensive changes to distinguish between text/Unicode
strings and byte strings. The biggest user-facing change is that
identifiers are now treated as Unicode strings. Fingerprints are still
treated as byte strings.
This change is not backwards compatible. The API's function parameters
are polymorphic, so in most cases you can pass in either a Unicode
string or a UTF-8 encoded byte string. However, the return type for an
identifier is Unicode, which will likely cause problems with existing
code which expects bytes.
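The general idea of a polymorphic identifier parameter can be sketched as follows. This is an illustration written for Python 3, not chemfp's implementation, and ``normalize_id`` is a hypothetical helper name.

```python
def normalize_id(id_value):
    """Hypothetical helper: accept text or UTF-8 encoded bytes, return text."""
    if isinstance(id_value, bytes):
        return id_value.decode("utf-8")
    return id_value


# Either input form gives the same Unicode identifier.
print(normalize_id("CHEMBL25") == normalize_id(b"CHEMBL25"))

# Code that needs bytes must now encode the returned identifier explicitly.
as_bytes = normalize_id("CHEMBL25").encode("utf-8")
print(as_bytes)
```

Existing code which expects a byte string back will need an explicit ``.encode("utf-8")`` like the one at the end.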
All of the chemistry toolkits have decided to treat files as UTF-8
encoded. Chemfp's "text toolkit" offers limited support for reading
Latin-1 encoded files. This is a tricky topic so contact me if you
have questions or problems.
I have removed the "make_string_creator()" function because it was
hard to explain, hard to maintain, and had little performance
improvement over passing in the arguments to
:func:`chemfp.create_string`. This will break compatibility, but then
again, I don't think anyone used it. If it is a problem, I suggest
creating a function, as in the following::

  >>> from chemfp import rdkit_toolkit as T
  >>> mol = T.parse_molecule("c1ccccc1O", "smistring")
  >>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  u'O-c1:c:c:c:c:c:1'
  >>> def make_string(mol):
  ...     return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  ...
  >>> make_string(mol)
  u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you'll see the other
major backwards incompatibility. The function :func:`chemfp.create_string` now
returns a Unicode string instead of a byte string. This also means its
`format` parameter no longer accepts the ".zlib" or ".gzip" extensions.
Instead, to get the old behavior use the new API function
:func:`chemfp.create_bytes`::

  >>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
  'O-c1:c:c:c:c:c:1'
  >>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
  'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'

There's a similar change between :func:`chemfp.open_molecule_writer_to_string`
and the new function :func:`chemfp.open_molecule_writer_to_bytes`.
There are also some new features in version 3.0 which don't break
compatibility.
Similarity search is faster because there are now specialized popcount
implementations based on the fingerprint length. On one benchmark,
166-bit searches are 35% faster, 1024-bit searches are 25% faster, and
2048-bit searches are 5% faster.
There is a new popcount implementation for processors with the AVX2
instruction set. It is about 15% faster than the POPCNT version for
2048-bit fingerprints. To test it out you will have to compile chemfp
with :option:`--with-avx2` enabled.
Added support for the Avalon fingerprints in RDKit, if RDKit has been
compiled with Avalon support.
What's new in version 2.1
=========================
Released 2 July 2015
Version 2.1 adds Tversky support for every place there was Tanimoto
search (except the handful of deprecated APIs). There are new search
routines for FPS and arena searches, including OpenMP support, and new
bitops functions to compute a Tversky index between two fingerprints.
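For reference, the Tversky index between two byte-string fingerprints can be sketched in pure Python. This is not the chemfp bitops implementation; the function names are hypothetical, and returning 0.0 for a zero denominator is an assumption made for this sketch.

```python
def popcount(fp):
    """Number of bits set in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    """Tversky index: c / (alpha*(a - c) + beta*(b - c) + c).

    a and b are the popcounts of fp1 and fp2, and c is the popcount of
    their intersection. With alpha = beta = 1 this is the Tanimoto.
    """
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    a = popcount(fp1)
    b = popcount(fp2)
    c = popcount(bytes(x & y for x, y in zip(fp1, fp2)))
    denominator = alpha * (a - c) + beta * (b - c) + c
    if denominator == 0:
        return 0.0  # assumption: define the empty case as 0.0
    return c / denominator


fp1 = b"\x0f"  # bits 00001111
fp2 = b"\x3c"  # bits 00111100

print(tversky(fp1, fp2))            # alpha = beta = 1, same as Tanimoto
print(tversky(fp1, fp2, 1.0, 0.0))  # asymmetric: only fp1's extra bits count
```

Setting alpha = beta = 0.5 gives the Dice similarity, which is one reason a general Tversky search is useful.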
The k-nearest arena searches now support OpenMP. Previously they were
single threaded even though the other search functions supported
multiple threads.
The built-in SDF parser saw a couple of improvements, including support
for both "\\n" and "\\r\\n" newlines, instead of only "\\n" newlines.
There were a number of bug fixes that concern edge cases. For example,
some 64-bit double calculations could be off-by-one in the last digit,
and fingerprints with 0 bits set could cause a few problems.
What's new in version 2.0
=========================
Released 8 April 2015
Version 2.0 includes many new features designed for web service
development. The new "FPB" binary fingerprint file format is very fast
to load, which is great for web server reloading during development
and on the command-line. The speed comes from using a memory-mapped
file, which also means that multiple chemfp instances can use the same
file on the same machine without extra memory overhead.
The most extensive improvement is the new portable API for working
with structure files and fingerprint types. The moment you start
working with multiple chemistry toolkits, you realize that they all
have different ways to read and write molecules, and to generate
fingerprints from a molecule. Chemfp tries hard to have a consistent
API for these common tasks, without sacrificing performance, so you
can get your work done. For example, with the new API it's easy to
take an SD record as an input string, compute the MACCS fingerprints
for each available toolkit, add the results as new SD tags, and return
the updated record.
This sounds so easy, doesn't it? It took about a year to develop. The
API is quite extensive, and includes the ability to pass
toolkit-specific options to the underlying parsers, a low-level SDF
parser that can be used to index a file, a way to get a list of
available formats and fingerprint types, and methods to parse
fingerprint arguments from strings.
New with version 2.0 is the ability to handle PubChem-sized
data. Previous versions used 32-bit indexing and had a limit of 4GB,
which is enough for 33M 1024-bit fingerprints, but PubChem has about
twice that many structures.
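The 33M figure follows from simple arithmetic. The sketch below assumes the full 4GB holds only fingerprint bytes and ignores identifiers and other overhead.

```python
BYTES_PER_FINGERPRINT = 1024 // 8  # a 1024-bit fingerprint is 128 bytes
ADDRESSABLE_BYTES = 2 ** 32        # 4GB limit from 32-bit indexing

max_fingerprints = ADDRESSABLE_BYTES // BYTES_PER_FINGERPRINT
print(max_fingerprints)  # 33554432, or about 33M
```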
There are also a lot of improvements, bug fixes, and performance
tweaks. For example, the FPS reader is now almost twice as fast! For
details, see the CHANGELOG file of the release.