.. _intro:

========================
chemfp 3.1 documentation
========================

`chemfp `_ is a set of tools for working with cheminformatics fingerprints. This is the documentation for the commercial version of chemfp. The documentation for chemfp 1.3, the no-cost version of chemfp, is available from `http://chemfp.readthedocs.io/en/chemfp-1.3/ `_.

Most people will use the command-line programs to generate and search fingerprint files. :ref:`ob2fps `, :ref:`oe2fps `, and :ref:`rdkit2fps ` use respectively the `Open Babel `_, `OpenEye `_, and `RDKit `_ chemistry toolkits to convert structure files into fingerprint files. :ref:`sdf2fps ` extracts fingerprints encoded in SD tags to make the fingerprint file. :ref:`simsearch ` finds targets in a fingerprint file which are sufficiently similar to the queries. :ref:`fpcat ` converts between FPS and FPB formats and merges multiple fingerprint files into one.

The programs are built using the :ref:`chemfp Python library API `. The search capabilities are part of the public API, as well as a cross-toolkit API for reading and writing molecules from structure files or strings, and for computing molecular fingerprints.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.

Chemfp 3.1 was released on 18 September 2017. It supports Python 2.7 and 3.5+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit.

List of chapters
================

.. toctree::
   :maxdepth: 3

   installing
   using-tools
   tool-help
   using-api
   fingerprint_types
   toolkit
   text_toolkit
   api

License and advertisement
=========================

This program was developed by Andrew Dalke, Andrew Dalke Scientific, AB. It is distributed under the "MIT" license, shown below.

Further chemfp development depends on funding from people like you. Asking for voluntary contributions almost never works. Instead, starting with chemfp 1.1, the source code is distributed under an incentive program.
You can pay for the commercial distribution, or use the no-cost version.

If you pay for the commercial distribution then you will get the most recent version of chemfp, free upgrades for one year, support, and a discount on renewing participation in the incentive program.

I also maintain the chemfp-1.x series. Version chemfp-1.3 is available at no cost from chemfp.com, or if you know someone with chemfp 2.x or 3.x you might be able to get it from them at no cost. It's free/open source software, after all.

If you have questions about or wish to purchase the commercial distribution, send an email to sales@dalkescientific.com .

.. highlight:: none

::

  Copyright (c) 2010-2017 Andrew Dalke Scientific, AB (Sweden)

  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
  "Software"), to deal in the Software without restriction, including
  without limitation the rights to use, copy, modify, merge, publish,
  distribute, sublicense, and/or sell copies of the Software, and to
  permit persons to whom the Software is furnished to do so, subject to
  the following conditions:

  The above copyright notice and this permission notice shall be
  included in all copies or substantial portions of the Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
  EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
  WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Copyright to portions of the code is held by other people or organizations, and may be under a different license. See the specific code for details.
These are:

- OpenMP, cpuid, POPCNT, and Lauradoux implementations by Kim Walisch, under the MIT license
- SSSE3.2 popcount implementation by Stanford University (written by Imran S. Haque) under the BSD license
- The AVX2 popcount implementation by Daniel Lemire, Nathan Kurz, Owen Kaser, et al. under the Apache 2 license
- heapq and ascii_buffer_converter by the Python Software Foundation under the Python license
- TimSort code by Christopher Swenson under the MIT License
- tests/unittest2 by Steve Purcell, the Python Software Foundation, and others, under the Python license
- chemfp/rdmaccs.patterns and chemfp/rdmaccs2.patterns by Rational Discovery LLC, Greg Landrum, and Julie Penzotti, under the 3-Clause BSD License

What's new in version 3.1
=========================

Released 17 September 2017

The new specialized POPCNT implementation for PubChem/CACTVS keys increases search performance for that case by about 15%.

The SearchResults object gained the :meth:`to_csr() <.SearchResults.to_csr>` method and the :attr:`shape <.SearchResults.shape>` attribute. The new method returns a `SciPy compressed sparse row matrix `_ containing the similarity scores, which can be passed into `scikit-learn `_ for clustering.

The fall 2017 release of OEChem will accept InChI strings as structure input. The chemfp wrapper now knows about this, as well as the two new InChI output flavors "RelativeStereo" and "RacemicStereo".

The fall 2017 release of RDKit will fix a bug in the pattern fingerprint definitions. The new chemfp fingerprint type is RDKit-Pattern/4.

Changed how oe2fps, rdkit2fps, and ob2fps report missing or empty identifiers. Previously the default :option:`--errors` setting of "ignore" simply skipped those records, without any warning messages. This caused problems processing the ChEBI SD file. Most of the records have an empty title line, so only a few fingerprint records were generated. It wasn't obvious that the resulting data set was useless.
The new code always reports a warning for empty or missing identifiers, even with "ignore". If the :option:`--errors` setting is "strict" then the warning becomes an error and processing stops.

Updated the #software line to include "chemfp/1.3" in addition to the toolkit information. This helps distinguish between, say, two different programs which generate RDKit Morgan fingerprints. It's also possible that a chemfp bug can affect the fingerprint output, so the extra term makes it easier to identify a bad dataset.

There are several small fixes related to memory leaks, the bytes/Unicode distinction in Python 3, error messages, and error handling.

Removed chemfp.progressbar and chemfp.futures. These were included in chemfp 1.1 because I used them in a project for one customer and thought they might be useful in future chemfp projects. They were not. Also removed chemfp.argparse because chemfp 3.0 dropped support for Python 2.6.

What's new in version 3.0.1
===========================

Released 28 August 2017

This is a bug-fix release. It fixes a critical bug in the general-purpose POPCNT popcount implementation and a bug in the code to detect the RDKit Pattern fingerprint change in 2017.3. See the CHANGELOG for details.

What's new in version 3.0
=========================

Released 2 May 2017

Chemfp now supports both Python 2.7 and Python 3.5 or later. It no longer supports versions before Python 2.7. Chemfp will support Python 2.7 at least until 2020, which is the end-of-life for Python 2.7.

This required extensive changes to distinguish between text/Unicode strings and byte strings. The biggest user-facing change is that identifiers are now treated as Unicode strings. Fingerprints are still treated as byte strings.

This change is not backwards compatible. The API's function parameters are polymorphic, so in most cases you can pass in either a Unicode string or a UTF-8 encoded byte string.
However, the return type for an identifier is Unicode, which will likely cause problems with existing code which expects bytes.

All of the chemistry toolkits have decided to treat files as UTF-8 encoded. Chemfp's "text toolkit" offers limited support for reading Latin-1 encoded files. This is a tricky topic so contact me if you have questions or problems.

I have removed the "make_string_creator()" function because it was hard to explain, hard to maintain, and offered little performance improvement over passing in the arguments to :func:`chemfp.create_string`. This will break compatibility, but then again, I don't think anyone used it. If it is a problem, I suggest creating a function, as in the following::

  >>> from chemfp import rdkit_toolkit as T
  >>> mol = T.parse_molecule("c1ccccc1O", "smistring")
  >>> T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  u'O-c1:c:c:c:c:c:1'
  >>> def make_string(mol):
  ...     return T.create_string(mol, "smistring", writer_args = {"allBondsExplicit": True})
  ...
  >>> make_string(mol)
  u'O-c1:c:c:c:c:c:1'

If you look carefully at the previous example, you'll see the other major backwards incompatibility. The function :func:`chemfp.create_string` now returns a Unicode string instead of a byte string. This also means its `format` parameter no longer accepts the ".zlib" or ".gzip" extensions. Instead, to get the old behavior use the new API function :func:`chemfp.create_bytes`::

  >>> T.create_bytes(mol, "smistring", writer_args = {"allBondsExplicit": True})
  'O-c1:c:c:c:c:c:1'
  >>> T.create_bytes(mol, "smistring.zlib", writer_args = {"allBondsExplicit": True})
  'x\x9c\xf3\xd7M6\xb4J\x86CC\x00&\xc8\x04\x8d'

There's a similar change between :func:`chemfp.open_molecule_writer_to_string` and the new function :func:`chemfp.open_molecule_writer_to_bytes`.

There are also some new features in version 3.0 which don't break compatibility.

Similarity search is faster because there are now specialized popcount implementations based on the fingerprint length. On one benchmark, 166-bit searches are 35% faster, 1024-bit searches are 25% faster, and 2048-bit searches are 5% faster.

There is a new popcount implementation for processors with the AVX2 instruction set. It is about 15% faster than the POPCNT version for 2048-bit fingerprints. To test it out you will have to compile chemfp with :option:`--with-avx2` enabled.

Added support for the Avalon fingerprints in RDKit, if RDKit has been compiled with Avalon support.

What's new in version 2.1
=========================

Released 2 July 2015

Version 2.1 adds Tversky support for every place there was Tanimoto search (except the handful of deprecated APIs). There are new search routines for FPS and arena searches, including OpenMP support, and new bitops functions to compute a Tversky index between two fingerprints.

The k-nearest arena searches now support OpenMP. Previously they were single threaded even though the other search functions supported multiple threads.

The built-in SDF parser saw a couple of improvements, including support for both "\\n" and "\\r\\n" newlines, instead of only "\\n" newlines.

There were a number of bug fixes that concern edge cases. For example, some 64-bit double calculations could be off-by-one in the last digit, and fingerprints with 0 bits set could cause a few problems.

What's new in version 2.0
=========================

Released 8 April 2015

Version 2.0 includes many new features designed for web service development.

The new "FPB" binary fingerprint file format is very fast to load, which is great for web server reloading during development and on the command-line. The speed comes from using a memory-mapped file, which also means that multiple chemfp instances can use the same file on the same machine without extra memory overhead.

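
The memory-mapping idea can be sketched with Python's standard ``mmap`` module. This is only an illustration, assuming Python 3 and a made-up file of fixed-size records. It is not the FPB format, and ``write_records``, ``read_record``, and ``NUM_BYTES`` are hypothetical names used for the example:

```python
import mmap
import os
import tempfile

NUM_BYTES = 128  # hypothetical record size: one 1024-bit fingerprint


def write_records(path, fingerprints):
    # Write fixed-size byte records back to back.
    # This toy layout is an assumption; it is NOT the real FPB layout.
    with open(path, "wb") as f:
        for fp in fingerprints:
            assert len(fp) == NUM_BYTES
            f.write(fp)


def read_record(mapped, index):
    # Slicing the mapping copies out just this record; the operating
    # system pages in only the part of the file that is touched.
    start = index * NUM_BYTES
    return mapped[start:start + NUM_BYTES]


# Four dummy fingerprints, each filled with a single repeated byte value.
fingerprints = [bytes([i]) * NUM_BYTES for i in range(4)]
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_records(path, fingerprints)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes it read-only.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    third = read_record(mapped, 2)
    mapped.close()
```

Opening the mapping does not read the whole file up front, and several processes that map the same file read-only can share the operating system's cached pages instead of each holding a private copy.
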
The most extensive improvement is the new portable API for working with structure files and fingerprint types. The moment you start working with multiple chemistry toolkits, you realize that they all have different ways to read and write molecules, and to generate fingerprints from a molecule. Chemfp tries hard to have a consistent API for these common tasks, without sacrificing performance, so you can get your work done.

For example, with the new API it's easy to take an SD record as an input string, compute the MACCS fingerprints for each available toolkit, add the results as new SD tags, and return the updated record.

This sounds so easy, doesn't it? It took about a year to develop. The API is quite extensive, and includes the ability to pass toolkit-specific options to the underlying parsers, a low-level SDF parser that can be used to index a file, a way to get a list of available formats and fingerprint types, and methods to parse fingerprint arguments from strings.

New with version 2.0 is the ability to handle PubChem-sized data. Previous versions used 32-bit indexing and had a limit of 4GB, which is enough for 33M 1024-bit fingerprints, but PubChem has about twice that many structures.

There are also a lot of improvements, bug fixes, and performance tweaks. For example, the FPS reader is now almost twice as fast! For details, see the CHANGELOG file of the release.

Future
======

The chemfp code base is solid and in use at many companies, some of whom have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there's plenty left to do in the future.

Here's a mixture of things that are likely and things which are possibilities. Of course funding and feedback would help prioritize things. `Let me know `_ if you need something like one of these.

Right now you're limited to the built-in toolkit fingerprint types, plus chemfp's own SMARTS-based fingerprints.
There should be a registration system so you can tell chemfp about user-defined fingerprint types.

I would like some way to select fingerprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.

Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful alternatives.

Chemfp does not currently support Microsoft Windows computers because the code assumes the LP64 model, where "int" is 32 bits and "long" is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where "int" and "long" are 32 bits and "long long" is 64 bits. Once that's done, I'll have to figure out how to make an installer. I've decided to put it off until someone (or someones) funds it.

The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (i.e., with a minimum threshold of 0.0) of a large arena, the time and space needed to maintain the sparse data structure become noticeable. It's likely in that case that you want to store the scores in a 2D NumPy matrix.

I'm really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.

If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That's possible, in that the software was designed with that in mind, but it needs more development and testing.

Chemfp has supported OpenMP ever since version 1.1. That's great for shared-memory machines. Are you interested in supporting a distributed computing version?

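
For reference, the similarity measures named earlier in this section (Tanimoto and Tversky, plus the possible cosine and Hamming additions) can be written in a few lines of plain Python. This is a slow, illustrative sketch assuming Python 3 and equal-length byte-string fingerprints. It is not chemfp's implementation or API, all function names are hypothetical, and returning 0.0 when a denominator is zero is a choice made for this example:

```python
import math


def popcount(fp):
    # Number of on-bits in a byte-string fingerprint.
    return sum(bin(byte).count("1") for byte in fp)


def intersect_popcount(fp1, fp2):
    # Number of bits that are on in both fingerprints.
    return sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))


def tanimoto(fp1, fp2):
    # |A & B| / |A | B|
    c = intersect_popcount(fp1, fp2)
    union = popcount(fp1) + popcount(fp2) - c
    return c / union if union else 0.0


def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    # |A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)
    c = intersect_popcount(fp1, fp2)
    only1 = popcount(fp1) - c
    only2 = popcount(fp2) - c
    denominator = c + alpha * only1 + beta * only2
    return c / denominator if denominator else 0.0


def cosine(fp1, fp2):
    # |A & B| / sqrt(|A| * |B|)
    product = popcount(fp1) * popcount(fp2)
    return intersect_popcount(fp1, fp2) / math.sqrt(product) if product else 0.0


def hamming(fp1, fp2):
    # Number of bit positions where the fingerprints differ.
    return sum(bin(a ^ b).count("1") for a, b in zip(fp1, fp2))


fp1 = b"\x0f"  # bits 0-3 set
fp2 = b"\x3c"  # bits 2-5 set
scores = (tanimoto(fp1, fp2), tversky(fp1, fp2, 0.5, 0.5),
          cosine(fp1, fp2), hamming(fp1, fp2))
```

With ``alpha = beta = 1`` the Tversky index reduces to Tanimoto, and with ``alpha = beta = 0.5`` it is the Dice coefficient, which is why one parameterized routine can cover several measures.
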
There are any number of higher-level tools which can be built on the chemfp components. For example, what about a wsgi component which implements a web-based search API for your local network? Wouldn't it be nice to say::

  fpserver filename1.fpb

and have a simple search service? What about an IPython visualization tool?

There's a paper (doi:10.1093/bioinformatics/byq067) on using locality-sensitive hashing to find highly similar fingerprints. Are there cases where it's more useful than chemfp's direct search?

Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That's not saying I wouldn't be interested in a GPU implementation, only describing why it's not at the top of the list.

Thanks
======

In no particular order, the following contributed to chemfp in some way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, and Lionel Uran Landaburu.

Thanks also to my wife, Sara Marie, for her many years of support.

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`