.. _intro:

########################
chemfp 3.4 documentation
########################

`chemfp `_ is a set of command-line tools and a Python package for working with cheminformatics fingerprints. This is the documentation for the commercial version of chemfp, which supports Python 2.7 and 3.6 or later.

The documentation for chemfp 1.6, the most recent release of the no-cost/open source version of chemfp, is available from http://chemfp.readthedocs.io/en/chemfp-1.6/. Chemfp 1.6 only supports Python 2.7.

Most people will use the command-line programs to generate and search fingerprint files. :ref:`ob2fps `, :ref:`oe2fps `, and :ref:`rdkit2fps ` use, respectively, the `Open Babel `_, `OpenEye `_, and `RDKit `_ chemistry toolkits to convert structure files into fingerprint files. :ref:`sdf2fps ` extracts fingerprints encoded in SD tags to make the fingerprint file. :ref:`simsearch ` finds targets in a fingerprint file which are sufficiently similar to the queries. :ref:`fpcat ` converts between FPS and FPB formats and merges multiple fingerprint files into one.

The programs are built using the :ref:`chemfp Python library API `. The search capabilities are part of the public API, as well as a cross-toolkit API for reading and writing molecules from structure files or strings, and for computing molecular fingerprints.

Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.

Chemfp 3.4 was released on 24 June 2020. It supports Python 2.7 and 3.6+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit. See :ref:`What's New ` for a description of the changes.

For a different, more scholarly discussion of chemfp see "`The chemfp project `_" in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.

To cite chemfp use: Dalke, A.
The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8.

.. toctree::
   :caption: Table of Contents

   installing
   using-tools
   tool-help
   using-api
   fingerprint_types
   toolkit
   text_toolkit
   api
   whatsnew

*************************
License and advertisement
*************************

This program was developed by Andrew Dalke, Andrew Dalke Scientific, AB. It is available for purchase under an academic license, a commercial proprietary license, or an open source (MIT) license. A purchase of a license includes free upgrades and support for one year, and a discount on support renewal. (The support for the academic license is more limited than the other two options.)

I also maintain the chemfp-1.x series. Version chemfp-1.6 is available at no cost from chemfp.com, or if you know someone with a copy of chemfp 2.x or 3.x under the MIT license, you might be able to get it from them at no cost.

If you have questions about, or wish to purchase, the commercial distribution, send an email to sales@dalkescientific.com. You may also request a demo license for evaluation.

Chemfp may be used without a valid license key under the following license:

.. highlight:: none

::

  Chemfp Base License Agreement v1.1
  18 Jun 2020

  This is the default License Agreement for chemfp, a high-performance
  similarity search tool for cheminformatics fingerprints. It applies
  to anyone who has a copy of a pre-compiled chemfp distribution and
  who did not purchase or otherwise acquire an alternate License
  Agreement from Andrew Dalke Scientific AB ("Dalke Scientific") or
  its authorized redistributors.

  This License Agreement, which covers the chemfp source code, is
  neither open source nor free software. It is a proprietary License
  Agreement for software made available to you at no cost.

  1. Reservation of Rights and Ownership

  Chemfp is licensed, not sold.
  Dalke Scientific, its affiliates and suppliers own and retain all
  right, title and interest in and to chemfp, including all
  copyrights, patents, trade secret rights, trademarks and other
  intellectual property rights therein, except as explicitly described
  below or explicitly covered under another License Agreement as
  stated in the relevant part of the source code. The chemfp
  distribution is protected by Swedish copyright laws and other
  intellectual property laws and international treaty provisions.

  You may make copies for internal use of chemfp, including for use on
  third-party hardware such as cloud providers, so long as the users
  of chemfp are internal to your organization (i.e. employees,
  contractors, interns, agents, and other persons under your control
  and direction).

  You may not distribute modified copies of chemfp, in whole or in
  part, to any third party, nor may you rent, sublicense, or lease,
  with or without consideration, chemfp to third parties. You further
  may not use chemfp to act as a service bureau or application service
  provider or use chemfp for commercial software hosting services. In
  addition, you may not publish chemfp for others to use it in any way
  that is against the law.

  2. Other License Restrictions and Grants

  If you develop software for internal use then you may use any chemfp
  functionality, except that you may not use chemfp to:

    - generate FPB files
    - create or search in-memory fingerprint arenas with more than
      50,000 fingerprints
    - perform Tversky searches
    - perform Tanimoto searches of FPS files with more than 20 queries
      at a time.

  In the interest of clarity, you are explicitly permitted to use
  chemfp's "toolkit" API implementations, fingerprint type API
  implementations, and "bitops" functions.

  You may modify, reverse-engineer, decompile, or disassemble chemfp.
  However, you may not do so for the purpose of circumventing the
  license key system or circumventing any of the terms and
  restrictions of this license or any other provision of law. (Look, I
  know the license key is not hard to break - it's there to keep
  honest people honest.) Modifications must not remove relevant
  copyright statements and license information.

  Within the restrictions given above, you may use chemfp to validate
  the accuracy of your fingerprint generation and search software,
  including in the development of for-profit and commercial
  applications which may be a direct competitor to chemfp.

  Within the restrictions given above, you may use chemfp to generate
  fingerprint data sets in FPS format for any internal use, and to
  generate fingerprint data sets published at no cost for general
  public download.

  3. Patent Grant

  You are granted a non-exclusive, worldwide, royalty-free license to
  any patents that Dalke Scientific may assert on this release of
  chemfp. If you bring a patent claim against Dalke Scientific or any
  of its affliates or suppliers over patents that you claim are
  infringed by any version of chemfp then your license to use chemfp
  is terminated as of the date such litigation is filed.

  4. Disclaimers and Limitation of Liability

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
  EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
  BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
  ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
  CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  SOFTWARE.

  NEITHER DALKE SCIENTIFIC NOR ITS AFFILIATES OR SUPPLIERS MAKE ANY
  ASSURANCES WITH REGARD TO THE ACCURACY OF THE RESULTS OR OUTPUT THAT
  DERIVES FROM ANY USE OF THIS SOFTWARE.

  If your jurisdiction does not allow the exclusion or limitation of
  the liability for consequential or incidental damages, then you may
  not use chemfp.

  NOTWITHSTANDING ANY DAMAGES THAT YOU MIGHT INCUR FOR ANY REASON
  WHATSOEVER (INCLUDING, WITHOUT LIMITATION, ALL DAMAGES REFERENCED
  ABOVE AND ALL DIRECT OR GENERAL DAMAGES), THE ENTIRE CUMULATIVE
  LIABILITY OF DALKE SCIENTIFIC, ITS AFFILIATES AND ANY OF THEIR
  SUPPLIERS, WHETHER IN CONTRACT (INCLUDING ANY PROVISION OF THIS
  LICENSE AGREEMENT), TORT, OR OTHERWISE, AND YOUR EXCLUSIVE REMEDY
  FOR ALL OF THE FOREGOING, SHALL BE LIMITED TO THE GREATER OF DIRECT
  DAMAGES IN THE AMOUNT ACTUALLY PAID BY YOU FOR THE SOFTWARE AND/OR
  SERVICES OR U.S.$5.00. THE FOREGOING LIMITATIONS, EXCLUSIONS, AND
  DISCLAIMERS SHALL APPLY TO THE MAXIMUM EXTENT PERMITTED BY
  APPLICABLE LAW, EVEN IF DALKE SCIENTIFIC, ITS AFFILIATES OR
  SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND
  EVEN IF ANY REMEDY FAILS ITS ESSENTIAL PURPOSE.

  5. Your Warranty to Dalke Scientific

  You warrant that all individuals having access to and/or using
  chemfp will observe and perform all the terms and conditions of this
  License Agreement. You shall use all reasonable efforts to see that
  employees, agents, or other persons under your direction or control
  who have access to and/or use the chemfp distribution abide by the
  terms and conditions of this License Agreement. You shall, at your
  own expense, promptly enforce the restrictions in this License
  Agreement against any person who gains access to your copy of chemfp
  (i.e. the copy you obtain upon agreeing to this License Agreement or
  any other lawful copy you have made from such copy) with your
  permission or while your employee or agent and who violates such
  restrictions, by instituting and diligently pursuing all legal and
  equitable remedies against him or her.
  You agree to immediately notify Dalke Scientific in writing of any
  misuse, misappropriation or unauthorized use of the chemfp
  distribution that may come to your attention. If you authorize,
  assist, encourage or facilitate another person or entity to take any
  action related to the subject matter of this License Agreement, you
  shall be deemed to have taken the action yourself.

  You agree to defend, indemnify and hold harmless Dalke Scientific,
  its affiliates and their suppliers from any and all claims resulting
  from or arising out of any your, including any employee’s or agent’s
  (a) use or misuse of chemfp, (b) violation of any law or the rights
  of any third party, including but not limited to infringement or
  misappropriation of any intellectual or proprietary rights of any
  third party, or (c) breach of this License Agreement, including any
  breach of any warranty or representation you make to Dalke
  Scientific.

  6. Injunctive Relief

  Because of the unique nature of chemfp, you understand and agree
  that Dalke Scientific will suffer irreparable injury in the event
  you fail to comply with any of the terms and conditions this License
  Agreement and that monetary damages may be inadequate to compensate
  Dalke Scientific for such breach. Accordingly, you agree that Dalke
  Scientific will, in addition to any other remedies available to it
  at law or in equity, be entitled to injunctive relief, without
  posting a bond, to enforce the terms and conditions of this License
  Agreement.

  7. Termination

  You may terminate this License agreement at any time. Dalke
  Scientific may immediately terminate this License Agreement if you
  breach any representation, warranty, agreement or obligation
  contained or referred to in this License Agreement. Upon
  termination, you must dispose of chemfp and all copies or versions
  of chemfp. The provisions of Sections 4, 5, 6, 7, and 8 shall
  survive termination or expiration of this Agreement for any reason.

  8. Venue

  In any suit or other action to enforce any right or remedy under or
  arising out of this License Agreement, the prevailing party shall be
  entitled reasonable attorneys' fees together with expenses and costs
  that such prevailing party incurs. This License Agreement shall be
  governed by the laws Sweden, provided that Dalke Scientific may
  pursue injunctive relief in any forum in order to protect
  intellectual property rights. You consent to the personal
  jurisdiction of the courts of such venue.

  This License Agreement will be binding upon, and inure to the
  benefit of the parties and their respective successors and assigns.
  The failure by Dalke Scientific to enforce any provision of this
  License Agreement shall in no way be construed to be a present or
  future waiver of such provision nor in any way affect our right to
  enforce such provision thereafter. All waivers by us must be in
  writing to be effective.

  If you have not received a different license agreement from Dalke
  Scientific or its authorized redistributors then this License
  Agreement, together with any addendum or amendment included with
  chemfp, is the complete agreement between Dalke Scientific and you
  and supersedes all prior agreements, oral or written, with respect
  to the subject matter hereof. All communications and notices to be
  made or given pursuant to this License Agreement shall be in the
  English language.

  9. Copyright Notices

  Copyright © 2010-2020 Andrew Dalke Scientific AB, Storgatan 50,
  461 30 Trollhättan, Sweden. All rights reserved.

  Any rights not expressly granted in this License Agreement are
  reserved.

  Other copyright holders are:

  - Kim Walisch, (several popcount implementations, under the MIT
    license)
  - Stanford University (written by Imran S. Haque, under the 3-Clause
    BSD License)
  - Python Software Foundation (the ascii_buffer_converter, under the
    Python license)
  - Christopher Swenson (the TimSort code in hits.c, under the MIT
    license)
  - Daniel Lemire, Nathan Kurz, Owen Kaser, et al.
    (the AVX2 popcount implementation, under the Apache 2 license)
  - Rational Discovery LLC, Greg Landrum, and Julie Penzotti (the
    MACCS pattern definitions in rdmaccs.patterns and
    rdmaccs2.patterns)

******
Future
******

The chemfp code base is solid and in use at many companies, some of whom have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there's plenty left to do in the future. Here's a mixture of things that are likely and things which are possibilities. Of course, funding and feedback would help prioritize things. `Let me know `_ if you need something like one of these.

The current FPB format is limited to about 200M fingerprints, while the largest current databases are nearing 1B fingerprints. One workaround is to split the data set into multiple FPB files. Better would be to have a format which handles everything in a single file.

Right now you're limited to the built-in toolkit fingerprint types, plus chemfp's own SMARTS-based fingerprints. There should be a registration system so you can tell chemfp about user-defined fingerprint types.

I would like some way to select fingerprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.

Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful alternatives.

Chemfp does not currently support Microsoft Windows because the code assumes the LP64 model, where "int" is 32 bits and "long" is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where "int" and "long" are 32 bits and "long long" is 64 bits. Once that's done, I'll have to figure out how to make an installer.
I've decided to put it off until someone (or someones) funds it.

The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (i.e., with a minimum threshold of 0.0) of a large arena, the time and space needed to maintain the sparse data structure become noticeable. It's likely in that case that you want to store the scores in a 2D NumPy matrix.

I'm really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.

If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That's possible, in that the software was designed with that in mind, but it needs more development and testing.

Chemfp has supported OpenMP since version 1.1. That's great for shared-memory machines. Are you interested in supporting a distributed computing version?

There are any number of higher-level tools which can be built on the chemfp components. For example, what about a WSGI component which implements a web-based search API for your local network? Wouldn't it be nice to say::

    fpserver filename1.fpb

and have a simple search service?

What about an IPython visualization tool?

There's a paper (doi:10.1093/bioinformatics/byq067) on using locality-sensitive hashing to find highly similar fingerprints and a more recent one (doi:10.1186/s13321-018-0321-8) on LSH trees. Are there cases where it's more useful than chemfp's direct search?

Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That's not saying I wouldn't be interested in a GPU implementation, only describing why it's not at the top of the list.
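
To make the similarity measures named earlier in this section concrete, here is a minimal pure-Python sketch of Tanimoto, Tversky, cosine, and Hamming on binary fingerprints. It is not chemfp's implementation and does not use the chemfp API. The function names are illustrative assumptions, fingerprints are represented as plain Python integers, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

.. code-block:: python

    from __future__ import division  # true division on Python 2.7 as well

    from math import sqrt


    def popcount(fp):
        """Count the on-bits in a fingerprint stored as a Python int."""
        return bin(fp).count("1")


    def tanimoto(fp1, fp2):
        """|A & B| / |A | B|, or 0.0 when both fingerprints are empty."""
        union = popcount(fp1 | fp2)
        return popcount(fp1 & fp2) / union if union else 0.0


    def tversky(fp1, fp2, alpha, beta):
        """c / (alpha*(a - c) + beta*(b - c) + c), where c = |A & B|."""
        a, b, c = popcount(fp1), popcount(fp2), popcount(fp1 & fp2)
        denominator = alpha * (a - c) + beta * (b - c) + c
        return c / denominator if denominator else 0.0


    def cosine(fp1, fp2):
        """|A & B| / sqrt(|A| * |B|), or 0.0 if either one is empty."""
        a, b = popcount(fp1), popcount(fp2)
        return popcount(fp1 & fp2) / sqrt(a * b) if a and b else 0.0


    def hamming(fp1, fp2):
        """Number of bit positions where the two fingerprints differ."""
        return popcount(fp1 ^ fp2)


    # Hypothetical 4-bit fingerprints: 3 and 2 on-bits, 1 bit in common.
    query, target = 0b1011, 0b0110
    print(tanimoto(query, target))           # 1 / 4
    print(tversky(query, target, 1.0, 1.0))  # equals Tanimoto here
    print(cosine(query, target))             # 1 / sqrt(3 * 2)
    print(hamming(query, target))            # 3 differing bits

With ``alpha = beta = 1`` the Tversky formula reduces to Tanimoto, which is why the first two values agree. Unlike the other three, Hamming is a distance rather than a similarity, so smaller values mean closer fingerprints.
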

******
Thanks
******

In no particular order, the following contributed to chemfp in some way: Noel O'Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, and Brian Cole.

Thanks also to my wife, Sara Marie, for her many years of support.

*******************
Indices and tables
*******************

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`