chemfp 3.4.1 documentation

chemfp is a set of command-line tools and a Python package for working with cheminformatics fingerprints.

This is the documentation for the commercial version of chemfp, which supports Python 2.7 and Python 3.6 or later. The documentation for chemfp 1.6.1, the most recent release of the no-cost/open source version of chemfp, is available from http://chemfp.readthedocs.io/en/chemfp-1.6.1/. Chemfp 1.6.1 only supports Python 2.7.

Most people will use the command-line programs to generate and search fingerprint files. ob2fps, oe2fps, and rdkit2fps use the Open Babel, OpenEye, and RDKit chemistry toolkits, respectively, to convert structure files into fingerprint files. sdf2fps extracts fingerprints encoded in SD tags to make the fingerprint file. simsearch finds targets in a fingerprint file which are sufficiently similar to the queries. fpcat converts between the FPS and FPB formats and merges multiple fingerprint files into one.

The programs are built using the chemfp Python library API. The public API includes the search capabilities as well as a cross-toolkit API for reading and writing molecules from structure files or strings and for computing molecular fingerprints.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.

Chemfp 3.4.1 was released on 27 August 2020. It supports Python 2.7 and 3.6+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit. See What’s New for a description of the changes.

For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.

To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .


License and advertisement

This program was developed by Andrew Dalke <dalke@dalkescientific.com>, Andrew Dalke Scientific, AB. It is available for purchase under an academic license, a commercial proprietary license, or an open source (MIT) license. A purchase of a license includes free upgrades and support for one year, and a discount on support renewal. (The support for the academic license is more limited than the other two options.)

I also maintain the chemfp-1.x series. Version chemfp-1.6.1 is available at no cost from chemfp.com, or if you know someone with a copy of chemfp 2.x or 3.x under the MIT license, you might be able to get it from them at no cost.

If you have questions about the commercial distribution, or wish to purchase it, send an email to sales@dalkescientific.com. You may also request a demo license for evaluation.

Chemfp may be used without a valid license key under the following license:

                  Chemfp Base License Agreement v1.1
                              18 Jun 2020

This is the default License Agreement for chemfp, a high-performance
similarity search tool for cheminformatics fingerprints. It applies to
anyone who has a copy of a pre-compiled chemfp distribution and who
did not purchase or otherwise acquire an alternate License Agreement
from Andrew Dalke Scientific AB ("Dalke Scientific") or its authorized
redistributors.

This License Agreement, which covers the chemfp source code, is
neither open source nor free software. It is a proprietary License
Agreement for software made available to you at no cost.

1. Reservation of Rights and Ownership

Chemfp is licensed, not sold. Dalke Scientific, its affiliates and
suppliers own and retain all right, title and interest in and to
chemfp, including all copyrights, patents, trade secret rights,
trademarks and other intellectual property rights therein, except as
explicitly described below or explicitly covered under another License
Agreement as stated in the relevant part of the source code.

The chemfp distribution is protected by Swedish copyright laws and
other intellectual property laws and international treaty provisions.

You may make copies for internal use of chemfp, including for use on
third-party hardware such as cloud providers, so long as the users of
chemfp are internal to your organization (i.e. employees,
contractors, interns, agents, and other persons under your control and
direction).

You may not distribute modified copies of chemfp, in whole or in part,
to any third party, nor may you rent, sublicense, or lease, with or
without consideration, chemfp to third parties. You further may not
use chemfp to act as a service bureau or application service provider
or use chemfp for commercial software hosting services.

In addition, you may not publish chemfp for others to use it in any
way that is against the law.

2. Other License Restrictions and Grants

If you develop software for internal use then you may use any chemfp
functionality, except that you may not use chemfp to:

  - generate FPB files
  - create or search in-memory fingerprint arenas with more
     than 50,000 fingerprints
  - perform Tversky searches
  - perform Tanimoto searches of FPS files with
     more than 20 queries at a time.

In the interest of clarity, you are explicitly permitted to use
chemfp's "toolkit" API implementations, fingerprint type API
implementations, and "bitops" functions.

You may modify, reverse-engineer, decompile, or disassemble chemfp.
However, you may not do so for the purpose of circumventing the
license key system or circumventing any of the terms and restrictions
of this license or any other provision of law.

(Look, I know the license key is not hard to break - it's there to
keep honest people honest.)

Modifications must not remove relevant copyright statements and
license information.

Within the restrictions given above, you may use chemfp to validate the
accuracy of your fingerprint generation and search software, including
in the development of for-profit and commercial applications which may
be a direct competitor to chemfp.

Within the restrictions given above, you may use chemfp to generate
fingerprint data sets in FPS format for any internal use, and to
generate fingerprint data sets published at no cost for general public
download.

3. Patent Grant

You are granted a non-exclusive, worldwide, royalty-free license to
any patents that Dalke Scientific may assert on this release of
chemfp.

If you bring a patent claim against Dalke Scientific or any of its
affliates or suppliers over patents that you claim are infringed by
any version of chemfp then your license to use chemfp is terminated as
of the date such litigation is filed.

4. Disclaimers and Limitation of Liability

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

NEITHER DALKE SCIENTIFIC NOR ITS AFFILIATES OR SUPPLIERS MAKE ANY
ASSURANCES WITH REGARD TO THE ACCURACY OF THE RESULTS OR OUTPUT THAT
DERIVES FROM ANY USE OF THIS SOFTWARE.

If your jurisdiction does not allow the exclusion or limitation of the
liability for consequential or incidental damages, then you may not
use chemfp.

NOTWITHSTANDING ANY DAMAGES THAT YOU MIGHT INCUR FOR ANY REASON
WHATSOEVER (INCLUDING, WITHOUT LIMITATION, ALL DAMAGES REFERENCED
ABOVE AND ALL DIRECT OR GENERAL DAMAGES), THE ENTIRE CUMULATIVE
LIABILITY OF DALKE SCIENTIFIC, ITS AFFILIATES AND ANY OF THEIR
SUPPLIERS, WHETHER IN CONTRACT (INCLUDING ANY PROVISION OF THIS
LICENSE AGREEMENT), TORT, OR OTHERWISE, AND YOUR EXCLUSIVE REMEDY FOR
ALL OF THE FOREGOING, SHALL BE LIMITED TO THE GREATER OF DIRECT
DAMAGES IN THE AMOUNT ACTUALLY PAID BY YOU FOR THE SOFTWARE AND/OR
SERVICES OR U.S.$5.00. THE FOREGOING LIMITATIONS, EXCLUSIONS, AND
DISCLAIMERS SHALL APPLY TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, EVEN IF DALKE SCIENTIFIC, ITS AFFILIATES OR SUPPLIERS HAVE BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND EVEN IF ANY REMEDY
FAILS ITS ESSENTIAL PURPOSE.

5. Your Warranty to Dalke Scientific

You warrant that all individuals having access to and/or using chemfp
will observe and perform all the terms and conditions of this License
Agreement. You shall use all reasonable efforts to see that employees,
agents, or other persons under your direction or control who have
access to and/or use the chemfp distribution abide by the terms and
conditions of this License Agreement. You shall, at your own expense,
promptly enforce the restrictions in this License Agreement against
any person who gains access to your copy of chemfp (i.e. the copy you
obtain upon agreeing to this License Agreement or any other lawful
copy you have made from such copy) with your permission or while your
employee or agent and who violates such restrictions, by instituting
and diligently pursuing all legal and equitable remedies against him
or her.

You agree to immediately notify Dalke Scientific in writing of any
misuse, misappropriation or unauthorized use of the chemfp
distribution that may come to your attention. If you authorize,
assist, encourage or facilitate another person or entity to take any
action related to the subject matter of this License Agreement, you
shall be deemed to have taken the action yourself. You agree to
defend, indemnify and hold harmless Dalke Scientific, its affiliates
and their suppliers from any and all claims resulting from or arising
out of any your, including any employee’s or agent’s (a) use or misuse
of chemfp, (b) violation of any law or the rights of any third party,
including but not limited to infringement or misappropriation of any
intellectual or proprietary rights of any third party, or (c) breach
of this License Agreement, including any breach of any warranty or
representation you make to Dalke Scientific.

6. Injunctive Relief

Because of the unique nature of  chemfp, you understand and agree
that Dalke Scientific will suffer irreparable injury in the event you
fail to comply with any of the terms and conditions this License
Agreement and that monetary damages may be inadequate to compensate
Dalke Scientific for such breach. Accordingly, you agree that Dalke
Scientific will, in addition to any other remedies available to it at
law or in equity, be entitled to injunctive relief, without posting a
bond, to enforce the terms and conditions of this License Agreement.

7. Termination

You may terminate this License agreement at any time. Dalke Scientific
may immediately terminate this License Agreement if you breach any
representation, warranty, agreement or obligation contained or
referred to in this License Agreement. Upon termination, you must
dispose of chemfp and all copies or versions of chemfp.

The provisions of Sections 4, 5, 6, 7, and 8 shall survive
termination or expiration of this Agreement for any reason.

8. Venue

In any suit or other action to enforce any right or remedy under or
arising out of this License Agreement, the prevailing party shall be
entitled reasonable attorneys' fees together with expenses and costs
that such prevailing party incurs. This License Agreement shall be
governed by the laws Sweden, provided that Dalke Scientific may pursue
injunctive relief in any forum in order to protect intellectual
property rights. You consent to the personal jurisdiction of the
courts of such venue. This License Agreement will be binding upon, and
inure to the benefit of the parties and their respective successors
and assigns.

The failure by Dalke Scientific to enforce any provision of this
License Agreement shall in no way be construed to be a present or
future waiver of such provision nor in any way affect our right to
enforce such provision thereafter. All waivers by us must be in
writing to be effective. If you have not received a different license
agreement from Dalke Scientific or its authorized redistributors then
this License Agreement, together with any addendum or amendment
included with chemfp, is the complete agreement between Dalke
Scientific and you and supersedes all prior agreements, oral or
written, with respect to the subject matter hereof.

All communications and notices to be made or given pursuant to this
License Agreement shall be in the English language.

9. Copyright Notices

Copyright © 2010-2020 Andrew Dalke Scientific AB, Storgatan 50, 461 30
Trollhättan, Sweden. All rights reserved. Any rights not expressly
granted in this License Agreement are reserved.

Other copyright holders are:
 - Kim Walisch, <kim.walisch@gmail.com> (several popcount implementations,
     under the MIT license)
 - Stanford University (written by Imran S. Haque <ihaque@cs.stanford.edu>,
     under the 3-Clause BSD License)
 - Python Software Foundation (the ascii_buffer_converter, under the Python license)
 - Christopher Swenson (the TimSort code in hits.c, under the MIT license)
 - Daniel Lemire, Nathan Kurz, Owen Kaser, et al. (the AVX2 popcount
     implementation, under the Apache 2 license)
 - Rational Discovery LLC, Greg Landrum, and Julie Penzotti (the MACCS
     pattern definitions in rdmaccs.patterns and rdmaccs2.patterns)

Future

The chemfp code base is solid and in use at many companies, some of which have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there’s plenty left to do in the future. Here’s a mixture of things that are likely and things which are possibilities. Of course funding and feedback would help prioritize things. Let me know if you need something like one of these.

The current FPB format is limited to about 200M fingerprints, while the largest current databases are nearing 1B fingerprints. One workaround is to split the data set into multiple FPB files. A better solution would be a format which handles everything in a single file.
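The multi-file workaround amounts to running the same k-nearest search on each piece and then merging the per-piece hit lists. Here is a minimal sketch of that merge step in plain Python 3. It assumes fingerprints are held as Python integers and that each shard is a list of (id, fingerprint) pairs. The function names are hypothetical, and the code uses neither the chemfp API nor the FPB format.

```python
import heapq

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return bin(a & b).count("1") / union

def knearest_in_shard(query, shard, k):
    """Return the k best (score, id) pairs from one shard, best first."""
    scored = ((tanimoto(query, fp), id_) for id_, fp in shard)
    return heapq.nlargest(k, scored)

def knearest_across_shards(query, shards, k):
    """Search each shard independently, then keep the k best hits overall.

    Every overall top-k hit must also be a top-k hit of its own shard,
    so merging the per-shard lists loses nothing.
    """
    per_shard = (knearest_in_shard(query, shard, k) for shard in shards)
    return heapq.nlargest(k, (hit for hits in per_shard for hit in hits))

# Two tiny made-up shards standing in for separate fingerprint files.
shard_a = [("A1", 0b1111), ("A2", 0b0001)]
shard_b = [("B1", 0b0111), ("B2", 0b1000)]
hits = knearest_across_shards(0b0111, [shard_a, shard_b], k=2)
print(hits)
```

The per-shard searches are independent, so they could run in separate processes or on separate machines, with only the short hit lists sent back for merging.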

Right now you’re limited to the built-in toolkit fingerprint types, plus chemfp’s own SMARTS-based fingerprints. There should be a registration system so you can tell chemfp about user-defined fingerprint types.

I would like some way to select fingerprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.
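To make the subset idea concrete, here is a rough sketch in plain Python 3 of the two selections mentioned above: by identifier and N at random. It assumes the basic FPS layout of `#` header lines followed by records with a hex-encoded fingerprint, a tab, and an identifier. The function names are hypothetical and are not part of chemfp, and the sqlite virtual table idea is not shown.

```python
import io
import random

def iter_fps_records(lines):
    """Yield (hex_fingerprint, identifier) pairs, skipping '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], fields[1]

def select_by_id(records, wanted_ids):
    """Keep only the records whose identifier is in wanted_ids."""
    wanted = set(wanted_ids)
    return [rec for rec in records if rec[1] in wanted]

def sample_records(records, n, seed=None):
    """Reservoir sampling: choose n records uniformly in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = rec
    return reservoir

# A small made-up FPS document, used only for illustration.
fps_text = "#FPS1\n#num_bits=16\n0f0f\tmol1\nffff\tmol2\n00ff\tmol3\n"
records = list(iter_fps_records(io.StringIO(fps_text)))
subset = select_by_id(records, ["mol1", "mol3"])
sample = sample_records(records, 2, seed=0)
```

Reservoir sampling is used because it needs only one pass and does not need to know the number of records in advance, which suits a line-oriented format.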

Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful alternatives.
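For reference, the four measures can be written in a few lines using the usual textbook definitions for binary fingerprints. This plain Python 3 sketch assumes fingerprints are stored as non-negative Python integers and returns 0.0 when a denominator is zero. That convention and the function names are choices made for this example, not a description of chemfp's implementation.

```python
import math

def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")

def tversky(a, b, alpha, beta):
    """Tversky index: c / (alpha*|A only| + beta*|B only| + c)."""
    c = popcount(a & b)
    only_a = popcount(a & ~b)
    only_b = popcount(b & ~a)
    denom = alpha * only_a + beta * only_b + c
    return c / denom if denom else 0.0

def tanimoto(a, b):
    """Tanimoto is the Tversky index with alpha = beta = 1."""
    return tversky(a, b, 1.0, 1.0)

def cosine(a, b):
    """Cosine similarity for bit vectors: c / sqrt(|A| * |B|)."""
    denom = math.sqrt(popcount(a) * popcount(b))
    return popcount(a & b) / denom if denom else 0.0

def hamming(a, b):
    """Hamming distance: the number of bit positions which differ."""
    return popcount(a ^ b)

a, b = 0b1101, 0b0111
print(tanimoto(a, b), tversky(a, b, 1.0, 0.0), cosine(a, b), hamming(a, b))
```

Note that Hamming is a distance rather than a similarity, so a threshold search would keep the hits at or below a cutoff instead of at or above one.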

Chemfp does not currently support Microsoft Windows because the code assumes the LP64 model, where “int” is 32 bits and “long” is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where “int” and “long” are 32 bits and “long long” is 64 bits. Once that’s done, I’ll have to figure out how to make an installer. I’ve decided to put it off until someone (or someones) funds it.
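To see which data model a given machine uses, you can ask the C compiler's view of the integer sizes through Python's standard ctypes module. This is only a small diagnostic sketch. The function names are made up for the example, and the lookup table covers just the common models.

```python
import ctypes

def c_integer_sizes():
    """Return the size in bits of some C types on this platform."""
    return {
        "int": ctypes.sizeof(ctypes.c_int) * 8,
        "long": ctypes.sizeof(ctypes.c_long) * 8,
        "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
        "void *": ctypes.sizeof(ctypes.c_void_p) * 8,
    }

def data_model(sizes):
    """Name the data model for the common cases, else return 'other'."""
    key = (sizes["int"], sizes["long"], sizes["long long"], sizes["void *"])
    known = {
        (32, 64, 64, 64): "LP64",   # 64-bit Linux and macOS
        (32, 32, 64, 64): "LLP64",  # 64-bit Windows
        (32, 32, 64, 32): "ILP32",  # typical 32-bit systems
    }
    return known.get(key, "other")

sizes = c_integer_sizes()
model = data_model(sizes)
print(sizes, model)
```

Code which stores a 64-bit value in a “long” works under LP64 but silently truncates under LLP64, which is why the port needs a careful pass over the low-level code.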

The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (i.e., with a minimum threshold of 0.0) of a large arena, the time and space needed to maintain the sparse data structure become noticeable. In that case you likely want to store the scores in a 2D NumPy matrix.
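The trade-off is easy to see with a toy example. The sketch below builds a generic compressed sparse row structure in plain Python 3 from a small made-up score matrix. It illustrates the general CSR layout only. The function names are hypothetical, and chemfp's internal data structure may differ in its details.

```python
def dense_to_csr(matrix, threshold):
    """Store only the scores >= threshold, in compressed sparse row form.

    Returns (indptr, indices, scores): row i owns the entries in
    indices[indptr[i]:indptr[i+1]] and scores[indptr[i]:indptr[i+1]].
    """
    indptr, indices, scores = [0], [], []
    for row in matrix:
        for col, score in enumerate(row):
            if score >= threshold:
                indices.append(col)
                scores.append(score)
        indptr.append(len(indices))
    return indptr, indices, scores

def csr_row(csr, row):
    """Return the (column, score) hits for one row."""
    indptr, indices, scores = csr
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], scores[start:end]))

# A made-up 3x3 similarity matrix.
full = [
    [1.0, 0.2, 0.8],
    [0.2, 1.0, 0.1],
    [0.8, 0.1, 1.0],
]
sparse = dense_to_csr(full, 0.7)      # 5 hits: compact
everything = dense_to_csr(full, 0.0)  # 9 hits plus 9 column indices
print(sparse)
print(len(everything[1]) + len(everything[2]), "values stored for 9 scores")
```

With a threshold of 0.0 the sparse form stores a column index next to every score, so it takes roughly twice the space of the dense matrix while also paying the bookkeeping cost.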

I’m really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.

If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That’s possible, in that the software was designed with that in mind, but it needs more development and testing.

Chemfp has supported OpenMP since version 1.1. That’s great for shared-memory machines. Are you interested in supporting a distributed computing version?

There are any number of higher-level tools which can be built on the chemfp components. For example, what about a WSGI component which implements a web-based search API for your local network? Wouldn’t it be nice to say:

fpserver filename1.fpb

and have a simple search service?
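No such program exists yet, but the shape of such a service is simple. The following plain Python 3 sketch shows a WSGI application which answers threshold Tanimoto queries over a tiny in-memory data set. Everything in it is an assumption made for illustration: the URL parameters, the JSON output, the hard-coded targets, and the function names. It does not use chemfp, and a real service would load an FPB file and call chemfp's search functions instead.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory targets: identifier -> fingerprint as an int.
TARGETS = {"mol1": 0b1111, "mol2": 0b0011, "mol3": 0b1000}

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def search_app(environ, start_response):
    """WSGI app: ?fp=<hex>&threshold=<float> returns the hits as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        query = int(params["fp"][0], 16)
        threshold = float(params.get("threshold", ["0.7"])[0])
    except (KeyError, ValueError):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"expected ?fp=<hex>&threshold=<float>\n"]
    scored = sorted(
        ((tanimoto(query, fp), id_) for id_, fp in TARGETS.items()),
        reverse=True,
    )
    hits = [{"id": id_, "score": s} for s, id_ in scored if s >= threshold]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps(hits).encode("utf-8")]

def call_app(app, query_string):
    """Call a WSGI app directly, without a web server, for quick checks."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    body = b"".join(app({"QUERY_STRING": query_string}, start_response))
    return captured["status"], body

status, body = call_app(search_app, "fp=f&threshold=0.5")
hits = json.loads(body.decode("utf-8"))
print(status, hits)
```

To try it over HTTP, pass `search_app` to `wsgiref.simple_server.make_server` from the standard library, or to any other WSGI server.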

What about an IPython visualization tool?

There’s a paper (doi:10.1093/bioinformatics/byq067) on using locality-sensitive hashing to find highly similar fingerprints and a more recent one (doi:10.1186/s13321-018-0321-8) on LSH trees. Are there cases where it’s more useful than chemfp’s direct search?

Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That’s not saying I wouldn’t be interested in a GPU implementation, only describing why it’s not at the top of the list.

Thanks

In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, and Brian Cole.

Thanks also to my wife, Sara Marie, for her many years of support.
