chemfp 3.4 documentation
chemfp is a set of command-line tools and a Python package for working with cheminformatics fingerprints.
This is the documentation for the commercial version of chemfp, which supports Python 2.7 and 3.6 or later. The documentation for chemfp 1.6, the most recent release in the no-cost/open source series, is available from http://chemfp.readthedocs.io/en/chemfp-1.6/. Chemfp 1.6 only supports Python 2.7.
Most people will use the command-line programs to generate and search fingerprint files. ob2fps, oe2fps, and rdkit2fps use the Open Babel, OpenEye, and RDKit chemistry toolkits, respectively, to convert structure files into fingerprint files. sdf2fps extracts fingerprints encoded in SD tags to make the fingerprint file. simsearch finds targets in a fingerprint file which are sufficiently similar to the queries. fpcat converts between FPS and FPB formats and merges multiple fingerprint files into one.
The programs are built using the chemfp Python library API. The public API includes the search capabilities as well as a cross-toolkit API for reading and writing molecules from structure files or strings and for computing molecular fingerprints.
Remember: chemfp cannot generate fingerprints from a structure file without a third-party chemistry toolkit.
Chemfp 3.4 was released on 24 June 2020. It supports Python 2.7 and 3.6+ and can be used with any recent version of OEChem/OEGraphSim, Open Babel, or RDKit. See What’s New for a description of the changes.
For a different, more scholarly discussion of chemfp see “The chemfp project” in the Journal of Cheminformatics. That paper covers the purpose of the project, its architecture and design, the FPS and FPB file formats, and the experience in trying to run chemfp as a self-funded open source project.
To cite chemfp use: Dalke, A. The chemfp project. J Cheminform 11, 76 (2019). https://doi.org/10.1186/s13321-019-0398-8 .
Table of Contents
- Installing
- Working with the command-line tools
- Generate fingerprint files from PubChem SD tags
- k-nearest neighbor search
- Threshold search
- Combined k-nearest and threshold search
- NxN (self-similar) searches
- Using a toolkit to process the ChEBI dataset
- Alternate error handlers
- chemfp’s two cross-toolkit substructure fingerprints
- Generate binary FPB files from a structure file
- Convert between FPS and FPB formats
- Specify the fpcat output format
- Alternate fingerprint file formats
- Similarity search with the FPB format
- Converting large data sets to FPB format
- Generate fingerprints in parallel and merge to FPB format
- Help for the command-line tools
- Fingerprints and fingerprint search examples
- Python 2 vs. Python 3
- Unicode and byte strings
- Hex representation of a binary fingerprint
- Byte and hex fingerprints
- Fingerprint reader and metadata
- Working with a FingerprintArena
- Create an arena with user-specified fingerprints
- Save a fingerprint arena
- How to use query fingerprints to search for similar target fingerprints
- How to search an FPS file
- How to do a Tversky search using the Dice weights
- FingerprintArena searches returning indices instead of ids
- Access the fingerprint arena bytes as a NumPy array
- Access the fingerprint bits as a NumPy array
- Computing a distance matrix for clustering
- Convert SearchResults to a SciPy csr matrix
- Taylor-Butina clustering
- MinMax Diversity Selection using RDKit
- Configuring OpenMP threads
- OpenMP and multi-threaded applications
- Fingerprint Substructure Screening (experimental)
- Substructure screening with RDKit
- Reading structure fingerprints using a toolkit
- Select a random fingerprint sample
- Don’t reorder an arena by popcount
- Look up a fingerprint with a given id
- Sorting search results
- Working with raw scores and counts in a range
- Cumulative search result counts and scores
- Writing fingerprints with a fingerprint writer
- Fingerprint readers and writers are context managers
- Write fingerprints to stdout or a file-like object
- Writing fingerprints to an FPB file
- Specify the output fingerprint format
- Merging multiple structure-based fingerprint sources
- Merging multiple fingerprint files
- Check for metadata compatibility problems
- How to write very large FPB files
- FPS fingerprint writer errors
- FPS fingerprint writer location
- MACCS dependency on hydrogens
- Create similarity search web service
- Fingerprint family and type examples
- Fingerprint families and types
- Fingerprint family
- Fingerprint family discovery
- get_fingerprint_type() and get_type()
- Create a fingerprint using text settings
- FingerprintType properties and methods
- Convert a structure record to a fingerprint
- Convert a structure record to an id and fingerprint
- Make a specialized id and molecule fingerprint parser
- Read a structure file and compute fingerprints
- Structure-based fingerprint reader location
- Read fingerprints from a string containing structures
- Structure-based fingerprint reader errors
- Experimental error handler
- Compute a fingerprint for a native toolkit molecule
- Fingerprint many native toolkit molecules
- Make a specialized molecule fingerprinter
- Toolkit API examples
- Get a chemfp toolkit
- Parse and create SMILES
- Canonical, non-isomeric, and arbitrary SMILES
- Use format to create a record in SDF format
- Use zlib record compression
- Use zst record compression
- Get a list of available formats and distinguish between input and output formats
- Determine the format for a given filename
- Parse the id and the molecule at the same time
- Specify alternate error behavior
- Specify a SMILES delimiter through reader_args
- Specify an output SMILES delimiter through writer_args
- RDKit-specific SMILES reader_args and writer_args
- OpenEye-specific SMILES reader_args and writer_args
- OpenEye-specific aromaticity
- Open Babel-specific SMILES reader_args and writer_args
- Get the default reader_args or writer_args for a format
- Convert text settings into reader and writer arguments
- Multi-toolkit reader_args and writer_args
- Qualified reader and writer parameters names
- Qualified parameter priorities
- Qualified names and text settings
- Read molecules from an SD file or stdin
- Read ids and molecules from an SD file at the same time
- Read ids and molecules using an SD tag for the id
- Read from a string instead of a file
- The reader may reuse molecule objects!
- Write molecules to a SMILES file
- Reader and writer context managers
- Write molecules to stdout in a specified format
- Write molecules to a string (and a bit of InChI)
- Handling errors when reading molecules from a string
- Handling errors when reading molecules from a file
- Ignore errors in create_string() and create_bytes()
- Ignore errors when writing molecules
- Reader and writer format metadata
- Location information: filename, record_format, recno and output_recno
- Location information: record position and content
- Writing your own error handler (Experimental)
- A Babel-like structure format converter
- argparse text settings to reader and writer args
- Creating a specialized record parser
- Molecule API: Get and set the molecule id
- Molecule API: Copy a molecule
- Molecule API: Working with SD tags
- Add fingerprints to an SD file using a toolkit
- Text toolkit examples
- Toolkits may modify the molecular structure
- Toolkits may modify SDF syntax
- The text toolkit “molecules”
- The text toolkit implements the toolkit API
- Reading and adding SD tags with the text_toolkit
- Synchronizing readers from different toolkits through the text toolkit
- Add multiple toolkit fingerprints to an SD file
- Text toolkit and SDF files
- Read id and tag value pairs from an SD file
- Extract the id and atom and bond counts from an SD file
- SDF-specific parser parameters
- Working with SD records as strings
- Unicode and other character encoding
- Mixed encodings and raw bytes
- chemfp API
- chemfp top-level API
- is_licensed
- get_license_date
- open
- load_fingerprints
- read_molecule_fingerprints
- read_molecule_fingerprints_from_string
- open_fingerprint_writer
- ChemFPError
- ParseError
- Metadata
- FingerprintReader
- FingerprintIterator
- Fingerprints
- FingerprintWriter
- ChemFPProblem
- check_fingerprint_problems
- check_metadata_problems
- count_tanimoto_hits
- count_tanimoto_hits_symmetric
- threshold_tanimoto_search
- threshold_tanimoto_search_symmetric
- knearest_tanimoto_search
- knearest_tanimoto_search_symmetric
- count_tversky_hits
- count_tversky_hits_symmetric
- threshold_tversky_search
- threshold_tversky_search_symmetric
- knearest_tversky_search
- knearest_tversky_search_symmetric
- get_fingerprint_families
- get_fingerprint_family
- get_fingerprint_family_names
- get_fingerprint_type
- get_fingerprint_type_from_text_settings
- has_fingerprint_family
- get_max_threads
- get_num_threads
- set_num_threads
- get_toolkit
- get_toolkit_names
- has_toolkit
- chemfp.types - fingerprint families and types
- FingerprintFamily
- FingerprintType
- Open Babel fingerprints
- OpenBabelFP2FingerprintType_v1
- OpenBabelFP3FingerprintType_v1
- OpenBabelFP4FingerprintType_v1
- OpenBabelMACCSFingerprintType_v1
- OpenBabelMACCSFingerprintType_v2
- OpenBabelECFP0FingerprintType_v1
- OpenBabelECFP2FingerprintType_v1
- OpenBabelECFP4FingerprintType_v1
- OpenBabelECFP6FingerprintType_v1
- OpenBabelECFP8FingerprintType_v1
- OpenBabelECFP10FingerprintType_v1
- SubstructOpenBabelFingerprinter_v1
- RDMACCSOpenBabelFingerprinter_v1
- RDMACCSOpenBabelFingerprinter_v2
- OpenEye fingerprints
- OpenEyeCircularFingerprintType_v2
- OpenEyeMACCSFingerprintType_v2
- OpenEyeMACCSFingerprintType_v3
- OpenEyePathFingerprintType_v2
- OpenEyeTreeFingerprintType_v2
- OpenEyeMoleculeScreenFingerprintType_v1
- OpenEyeSMARTSScreenFingerprintType_v1
- OpenEyeMDLScreenFingerprintType_v1
- SubstructOpenEyeFingerprinter_v1
- RDMACCSOpenEyeFingerprinter_v1
- RDMACCSOpenEyeFingerprinter_v2
- RDKit fingerprints
- RDKitFingerprintType_v1
- RDKitFingerprintType_v2
- RDKitMACCSFingerprintType_v1
- RDKitMACCSFingerprintType_v2
- RDKitMorganFingerprintType_v1
- RDKitAtomPairFingerprint_v1
- RDKitAtomPairFingerprint_v2
- RDKitTorsionFingerprintType_v1
- RDKitTorsionFingerprintType_v2
- RDKitPatternFingerprint_v1
- RDKitPatternFingerprint_v2
- RDKitPatternFingerprint_v3
- RDKitSECFPFingerprintType_v1
- RDKitAvalonFingerprintType_v1
- SubstructRDKitFingerprintType_v1
- RDMACCSRDKitFingerprinter_v1
- RDMACCSRDKitFingerprinter_v2
- chemfp.arena module
- chemfp.search module
- count_tanimoto_hits_fp
- count_tanimoto_hits_arena
- count_tanimoto_hits_symmetric
- partial_count_tanimoto_hits_symmetric
- count_tversky_hits_fp
- count_tversky_hits_arena
- count_tversky_hits_symmetric
- partial_count_tversky_hits_symmetric
- threshold_tanimoto_search_fp
- threshold_tanimoto_search_arena
- threshold_tanimoto_search_symmetric
- partial_threshold_tanimoto_search_symmetric
- fill_lower_triangle
- threshold_tversky_search_fp
- threshold_tversky_search_arena
- threshold_tversky_search_symmetric
- partial_threshold_tversky_search_symmetric
- knearest_tanimoto_search_fp
- knearest_tanimoto_search_arena
- knearest_tanimoto_search_symmetric
- knearest_tversky_search_fp
- knearest_tversky_search_arena
- knearest_tversky_search_symmetric
- contains_fp
- contains_arena
- SearchResults
- SearchResult
- chemfp.bitops module
- chemfp.encodings
- chemfp.fps_io module
- chemfp.fpb_io module
- chemfp toolkit API
- is_licensed
- get_formats
- get_input_formats
- get_output_formats
- get_format
- get_input_format
- get_output_format
- get_input_format_from_source
- get_output_format_from_destination
- read_molecules
- read_molecules_from_string
- read_ids_and_molecules
- read_ids_and_molecules_from_string
- make_id_and_molecule_parser
- parse_molecule
- parse_id_and_molecule
- create_string
- create_bytes
- open_molecule_writer
- open_molecule_writer_to_string
- open_molecule_writer_to_bytes
- copy_molecule
- add_tag
- get_tag
- get_tag_pairs
- get_id
- set_id
- chemfp.base_toolkit
- Toolkit readers
- Toolkit writers
- chemfp.openbabel_toolkit module
- name
- software
- is_licensed (openbabel_toolkit)
- get_formats (openbabel_toolkit)
- get_input_formats (openbabel_toolkit)
- get_output_formats (openbabel_toolkit)
- get_format (openbabel_toolkit)
- get_input_format (openbabel_toolkit)
- get_output_format (openbabel_toolkit)
- get_input_format_from_source (openbabel_toolkit)
- get_output_format_from_destination (openbabel_toolkit)
- read_molecules (openbabel_toolkit)
- read_molecules_from_string (openbabel_toolkit)
- read_ids_and_molecules (openbabel_toolkit)
- read_ids_and_molecules_from_string (openbabel_toolkit)
- make_id_and_molecule_parser (openbabel_toolkit)
- parse_molecule (openbabel_toolkit)
- parse_id_and_molecule (openbabel_toolkit)
- create_string (openbabel_toolkit)
- create_bytes (openbabel_toolkit)
- open_molecule_writer (openbabel_toolkit)
- open_molecule_writer_to_string (openbabel_toolkit)
- open_molecule_writer_to_bytes (openbabel_toolkit)
- copy_molecule (openbabel_toolkit)
- add_tag (openbabel_toolkit)
- get_tag (openbabel_toolkit)
- get_tag_pairs (openbabel_toolkit)
- get_id (openbabel_toolkit)
- set_id (openbabel_toolkit)
- chemfp.openeye_toolkit module
- name
- software
- is_licensed (openeye_toolkit)
- get_formats (openeye_toolkit)
- get_input_formats (openeye_toolkit)
- get_output_formats (openeye_toolkit)
- get_format (openeye_toolkit)
- get_input_format (openeye_toolkit)
- get_output_format (openeye_toolkit)
- get_input_format_from_source (openeye_toolkit)
- get_output_format_from_destination (openeye_toolkit)
- read_molecules (openeye_toolkit)
- read_molecules_from_string (openeye_toolkit)
- read_ids_and_molecules (openeye_toolkit)
- read_ids_and_molecules_from_string (openeye_toolkit)
- make_id_and_molecule_parser (openeye_toolkit)
- parse_molecule (openeye_toolkit)
- parse_id_and_molecule (openeye_toolkit)
- create_string (openeye_toolkit)
- create_bytes (openeye_toolkit)
- open_molecule_writer (openeye_toolkit)
- open_molecule_writer_to_string (openeye_toolkit)
- open_molecule_writer_to_bytes (openeye_toolkit)
- copy_molecule (openeye_toolkit)
- add_tag (openeye_toolkit)
- get_tag (openeye_toolkit)
- get_tag_pairs (openeye_toolkit)
- get_id (openeye_toolkit)
- set_id (openeye_toolkit)
- chemfp.rdkit_toolkit module
- name
- software
- is_licensed (rdkit_toolkit)
- get_formats (rdkit_toolkit)
- get_input_formats (rdkit_toolkit)
- get_output_formats (rdkit_toolkit)
- get_format (rdkit_toolkit)
- get_input_format (rdkit_toolkit)
- get_output_format (rdkit_toolkit)
- get_input_format_from_source (rdkit_toolkit)
- get_output_format_from_destination (rdkit_toolkit)
- read_molecules (rdkit_toolkit)
- read_molecules_from_string (rdkit_toolkit)
- read_ids_and_molecules (rdkit_toolkit)
- read_ids_and_molecules_from_string (rdkit_toolkit)
- make_id_and_molecule_parser (rdkit_toolkit)
- parse_molecule (rdkit_toolkit)
- parse_id_and_molecule (rdkit_toolkit)
- create_string (rdkit_toolkit)
- create_bytes (rdkit_toolkit)
- open_molecule_writer (rdkit_toolkit)
- open_molecule_writer_to_string (rdkit_toolkit)
- open_molecule_writer_to_bytes (rdkit_toolkit)
- copy_molecule (rdkit_toolkit)
- add_tag (rdkit_toolkit)
- get_tag (rdkit_toolkit)
- get_tag_pairs (rdkit_toolkit)
- get_id (rdkit_toolkit)
- set_id (rdkit_toolkit)
- chemfp.text_toolkit module
- name
- software
- is_licensed (text_toolkit)
- get_formats (text_toolkit)
- get_input_formats (text_toolkit)
- get_output_formats (text_toolkit)
- get_format (text_toolkit)
- get_input_format (text_toolkit)
- get_output_format (text_toolkit)
- get_input_format_from_source (text_toolkit)
- get_output_format_from_destination (text_toolkit)
- read_molecules (text_toolkit)
- read_molecules_from_string (text_toolkit)
- read_ids_and_molecules (text_toolkit)
- read_ids_and_molecules_from_string (text_toolkit)
- make_id_and_molecule_parser (text_toolkit)
- parse_molecule (text_toolkit)
- parse_id_and_molecule (text_toolkit)
- create_string (text_toolkit)
- create_bytes (text_toolkit)
- open_molecule_writer (text_toolkit)
- open_molecule_writer_to_string (text_toolkit)
- open_molecule_writer_to_bytes (text_toolkit)
- copy_molecule (text_toolkit)
- add_tag (text_toolkit)
- get_tag (text_toolkit)
- get_tag_pairs (text_toolkit)
- get_id (text_toolkit)
- set_id (text_toolkit)
- read_sdf_records (text_toolkit)
- read_sdf_ids_and_records (text_toolkit)
- read_sdf_ids_and_values (text_toolkit)
- read_sdf_records_from_string (text_toolkit)
- read_sdf_ids_and_records_from_string (text_toolkit)
- read_sdf_ids_and_values_from_string (text_toolkit)
- get_sdf_tag (text_toolkit)
- add_sdf_tag (text_toolkit)
- get_sdf_tag_pairs (text_toolkit)
- get_sdf_id (text_toolkit)
- set_sdf_id (text_toolkit)
- chemfp._text_toolkit module (private)
- chemfp.io module
- chemfp top-level API
- What’s New / CHANGELOG
- What’s new in 3.4 (24 June 2020)
- What’s new in 3.4b3 (18 June 2020)
- What’s new in 3.4b2 (12 June 2020)
- What’s new in 3.4b1 (24 April 2020)
- What’s new in 3.4a4 (18 March 2020)
- What’s new in version 3.4a2
- What’s new in version 3.4a1
- What’s new in version 3.3
- What’s new in version 3.2.1
- What’s new in version 3.2
- What’s new in version 3.1
- What’s new in version 3.0.1
- What’s new in version 3.0
- What’s new in version 2.1
- What’s new in version 2.0
License and advertisement
This program was developed by Andrew Dalke <dalke@dalkescientific.com>, Andrew Dalke Scientific, AB. It is available for purchase under an academic license, a commercial proprietary license, or an open source (MIT) license. A purchase of a license includes free upgrades and support for one year, and a discount on support renewal. (The support for the academic license is more limited than the other two options.)
I also maintain the chemfp-1.x series. Version chemfp-1.6 is available at no cost from chemfp.com, or if you know someone with a copy of chemfp 2.x or 3.x under the MIT license, you might be able to get it from them at no cost.
If you have questions about or wish to purchase the commercial distribution, send an email to sales@dalkescientific.com. You may also request a demo license for evaluation.
Chemfp may be used without a valid license key under the following license:
Chemfp Base License Agreement v1.1
18 Jun 2020
This is the default License Agreement for chemfp, a high-performance
similarity search tool for cheminformatics fingerprints. It applies to
anyone who has a copy of a pre-compiled chemfp distribution and who
did not purchase or otherwise acquire an alternate License Agreement
from Andrew Dalke Scientific AB ("Dalke Scientific") or its authorized
redistributors.
This License Agreement, which covers the chemfp source code, is
neither open source nor free software. It is a proprietary License
Agreement for software made available to you at no cost.
1. Reservation of Rights and Ownership
Chemfp is licensed, not sold. Dalke Scientific, its affiliates and
suppliers own and retain all right, title and interest in and to
chemfp, including all copyrights, patents, trade secret rights,
trademarks and other intellectual property rights therein, except as
explicitly described below or explicitly covered under another License
Agreement as stated in the relevant part of the source code.
The chemfp distribution is protected by Swedish copyright laws and
other intellectual property laws and international treaty provisions.
You may make copies for internal use of chemfp, including for use on
third-party hardware such as cloud providers, so long as the users of
chemfp are internal to your organization (i.e. employees,
contractors, interns, agents, and other persons under your control and
direction).
You may not distribute modified copies of chemfp, in whole or in part,
to any third party, nor may you rent, sublicense, or lease, with or
without consideration, chemfp to third parties. You further may not
use chemfp to act as a service bureau or application service provider
or use chemfp for commercial software hosting services.
In addition, you may not publish chemfp for others to use it in any
way that is against the law.
2. Other License Restrictions and Grants
If you develop software for internal use then you may use any chemfp
functionality, except that you may not use chemfp to:
- generate FPB files
- create or search in-memory fingerprint arenas with more
than 50,000 fingerprints
- perform Tversky searches
- perform Tanimoto searches of FPS files with
more than 20 queries at a time.
In the interest of clarity, you are explicitly permitted to use
chemfp's "toolkit" API implementations, fingerprint type API
implementations, and "bitops" functions.
You may modify, reverse-engineer, decompile, or disassemble chemfp.
However, you may not do so for the purpose of circumventing the
license key system or circumventing any of the terms and restrictions
of this license or any other provision of law.
(Look, I know the license key is not hard to break - it's there to
keep honest people honest.)
Modifications must not remove relevant copyright statements and
license information.
Within the restrictions given above, you may use chemfp to validate the
accuracy of your fingerprint generation and search software, including
in the development of for-profit and commercial applications which may
be a direct competitor to chemfp.
Within the restrictions given above, you may use chemfp to generate
fingerprint data sets in FPS format for any internal use, and to
generate fingerprint data sets published at no cost for general public
download.
3. Patent Grant
You are granted a non-exclusive, worldwide, royalty-free license to
any patents that Dalke Scientific may assert on this release of
chemfp.
If you bring a patent claim against Dalke Scientific or any of its
affliates or suppliers over patents that you claim are infringed by
any version of chemfp then your license to use chemfp is terminated as
of the date such litigation is filed.
4. Disclaimers and Limitation of Liability
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
NEITHER DALKE SCIENTIFIC NOR ITS AFFILIATES OR SUPPLIERS MAKE ANY
ASSURANCES WITH REGARD TO THE ACCURACY OF THE RESULTS OR OUTPUT THAT
DERIVES FROM ANY USE OF THIS SOFTWARE.
If your jurisdiction does not allow the exclusion or limitation of the
liability for consequential or incidental damages, then you may not
use chemfp.
NOTWITHSTANDING ANY DAMAGES THAT YOU MIGHT INCUR FOR ANY REASON
WHATSOEVER (INCLUDING, WITHOUT LIMITATION, ALL DAMAGES REFERENCED
ABOVE AND ALL DIRECT OR GENERAL DAMAGES), THE ENTIRE CUMULATIVE
LIABILITY OF DALKE SCIENTIFIC, ITS AFFILIATES AND ANY OF THEIR
SUPPLIERS, WHETHER IN CONTRACT (INCLUDING ANY PROVISION OF THIS
LICENSE AGREEMENT), TORT, OR OTHERWISE, AND YOUR EXCLUSIVE REMEDY FOR
ALL OF THE FOREGOING, SHALL BE LIMITED TO THE GREATER OF DIRECT
DAMAGES IN THE AMOUNT ACTUALLY PAID BY YOU FOR THE SOFTWARE AND/OR
SERVICES OR U.S.$5.00. THE FOREGOING LIMITATIONS, EXCLUSIONS, AND
DISCLAIMERS SHALL APPLY TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE
LAW, EVEN IF DALKE SCIENTIFIC, ITS AFFILIATES OR SUPPLIERS HAVE BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND EVEN IF ANY REMEDY
FAILS ITS ESSENTIAL PURPOSE.
5. Your Warranty to Dalke Scientific
You warrant that all individuals having access to and/or using chemfp
will observe and perform all the terms and conditions of this License
Agreement. You shall use all reasonable efforts to see that employees,
agents, or other persons under your direction or control who have
access to and/or use the chemfp distribution abide by the terms and
conditions of this License Agreement. You shall, at your own expense,
promptly enforce the restrictions in this License Agreement against
any person who gains access to your copy of chemfp (i.e. the copy you
obtain upon agreeing to this License Agreement or any other lawful
copy you have made from such copy) with your permission or while your
employee or agent and who violates such restrictions, by instituting
and diligently pursuing all legal and equitable remedies against him
or her.
You agree to immediately notify Dalke Scientific in writing of any
misuse, misappropriation or unauthorized use of the chemfp
distribution that may come to your attention. If you authorize,
assist, encourage or facilitate another person or entity to take any
action related to the subject matter of this License Agreement, you
shall be deemed to have taken the action yourself. You agree to
defend, indemnify and hold harmless Dalke Scientific, its affiliates
and their suppliers from any and all claims resulting from or arising
out of any your, including any employee’s or agent’s (a) use or misuse
of chemfp, (b) violation of any law or the rights of any third party,
including but not limited to infringement or misappropriation of any
intellectual or proprietary rights of any third party, or (c) breach
of this License Agreement, including any breach of any warranty or
representation you make to Dalke Scientific.
6. Injunctive Relief
Because of the unique nature of chemfp, you understand and agree
that Dalke Scientific will suffer irreparable injury in the event you
fail to comply with any of the terms and conditions this License
Agreement and that monetary damages may be inadequate to compensate
Dalke Scientific for such breach. Accordingly, you agree that Dalke
Scientific will, in addition to any other remedies available to it at
law or in equity, be entitled to injunctive relief, without posting a
bond, to enforce the terms and conditions of this License Agreement.
7. Termination
You may terminate this License agreement at any time. Dalke Scientific
may immediately terminate this License Agreement if you breach any
representation, warranty, agreement or obligation contained or
referred to in this License Agreement. Upon termination, you must
dispose of chemfp and all copies or versions of chemfp.
The provisions of Sections 4, 5, 6, 7, and 8 shall survive
termination or expiration of this Agreement for any reason.
8. Venue
In any suit or other action to enforce any right or remedy under or
arising out of this License Agreement, the prevailing party shall be
entitled reasonable attorneys' fees together with expenses and costs
that such prevailing party incurs. This License Agreement shall be
governed by the laws Sweden, provided that Dalke Scientific may pursue
injunctive relief in any forum in order to protect intellectual
property rights. You consent to the personal jurisdiction of the
courts of such venue. This License Agreement will be binding upon, and
inure to the benefit of the parties and their respective successors
and assigns.
The failure by Dalke Scientific to enforce any provision of this
License Agreement shall in no way be construed to be a present or
future waiver of such provision nor in any way affect our right to
enforce such provision thereafter. All waivers by us must be in
writing to be effective. If you have not received a different license
agreement from Dalke Scientific or its authorized redistributors then
this License Agreement, together with any addendum or amendment
included with chemfp, is the complete agreement between Dalke
Scientific and you and supersedes all prior agreements, oral or
written, with respect to the subject matter hereof.
All communications and notices to be made or given pursuant to this
License Agreement shall be in the English language.
9. Copyright Notices
Copyright © 2010-2020 Andrew Dalke Scientific AB, Storgatan 50, 461 30
Trollhättan, Sweden. All rights reserved. Any rights not expressly
granted in this License Agreement are reserved.
Other copyright holders are:
- Kim Walisch, <kim.walisch@gmail.com> (several popcount implementations,
under the MIT license)
- Stanford University (written by Imran S. Haque <ihaque@cs.stanford.edu>,
under the 3-Clause BSD License)
- Python Software Foundation (the ascii_buffer_converter, under the Python license)
- Christopher Swenson (the TimSort code in hits.c, under the MIT license)
- Daniel Lemire, Nathan Kurz, Owen Kaser, et al. (the AVX2 popcount
implementation, under the Apache 2 license)
- Rational Discovery LLC, Greg Landrum, and Julie Penzotti (the MACCS
pattern definitions in rdmaccs.patterns and rdmaccs2.patterns)
Future
The chemfp code base is solid and in use at many companies, some of whom have paid for the commercial version. It has great support for fingerprint generation, fast similarity search, and toolkit portability, but there’s plenty left to do in the future. Here’s a mixture of things that are likely and things which are possibilities. Of course funding and feedback would help prioritize things. Let me know if you need something like one of these.
The current FPB format is limited to about 200M fingerprints, while the largest current databases are nearing 1B fingerprints. One workaround is to split the data set into multiple FPB files. Better would be to have a format which handles everything in a single file.
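The multi-file workaround implies a merge step: each FPB file is searched on its own and the per-file hits are then combined. Here is a minimal sketch of that merge for a k-nearest search in plain Python. The shard contents and ids are made up for illustration, and the sketch does not call chemfp's own search functions.

```python
import heapq

def merge_knearest(per_shard_hits, k):
    """Combine k-nearest hits from several shards into one top-k list.

    per_shard_hits is an iterable of lists of (score, id) pairs, one list
    per shard. Higher scores are better.
    """
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # Order by descending score, then by id so that ties are deterministic.
    return heapq.nsmallest(k, all_hits, key=lambda hit: (-hit[0], hit[1]))

# Hypothetical hits from searching two shards with the same query.
shard_a = [(0.91, "CHEMBL1"), (0.75, "CHEMBL7")]
shard_b = [(0.88, "CHEMBL42"), (0.80, "CHEMBL9")]

top3 = merge_knearest([shard_a, shard_b], k=3)
print(top3)
```

For the merged list to be exact, each shard has to report its own top k hits (or all of its hits if it has fewer than k).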
Right now you’re limited to the built-in toolkit fingerprint types, plus chemfp’s own SMARTS-based fingerprints. There should be a registration system so you can tell chemfp about user-defined fingerprint types.
I would like some way to select fingerprint subsets. My original thought was something like an awk for the FPS format, with the ability to select N fingerprints at random, or those matching a given set of identifiers, etc. My current thought is to implement it as a sqlite virtual table.
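To make the idea concrete, here is a rough sketch of subset selection over FPS text using only the Python standard library. It assumes the usual FPS layout, with header lines starting with "#" and one record per line holding a hex fingerprint, a tab, and an id. The function name and parameters are hypothetical and are not part of chemfp.

```python
import random

def record_id(line):
    """Return the id, which is the second tab-separated field of a record."""
    return line.rstrip("\n").split("\t")[1]

def select_fps_lines(lines, ids=None, sample_size=None, seed=None):
    """Return the header lines followed by the selected record lines.

    ids: keep only the records whose id is in this collection.
    sample_size: keep at most this many records, chosen at random.
    """
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines
               if line.strip() and not line.startswith("#")]
    if ids is not None:
        wanted = set(ids)
        records = [line for line in records if record_id(line) in wanted]
    if sample_size is not None:
        rng = random.Random(seed)
        records = rng.sample(records, min(sample_size, len(records)))
    return header + records

# A tiny made-up FPS file with 16-bit fingerprints.
FPS_TEXT = "#FPS1\n#num_bits=16\n0f00\tA\n00f0\tB\nff00\tC\n000f\tD\n"
lines = FPS_TEXT.splitlines(keepends=True)

by_id = select_fps_lines(lines, ids={"A", "C"})
sampled = select_fps_lines(lines, sample_size=2, seed=0)
print("".join(by_id), end="")
```

A sqlite virtual table would express the same selections as SQL queries. The sketch only shows what the two selection modes mean for the file contents.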
Chemfp supports Tanimoto and Tversky similarity. I could also add support for other measures; cosine and Hamming seem like the most useful other alternatives.
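For reference, the four measures can be written in terms of bit counts. The following is a minimal sketch that stores each fingerprint as a Python integer. It is not chemfp's implementation, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
import math

def popcount(x):
    """Number of on-bits in a non-negative integer."""
    return bin(x).count("1")

def tversky(a, b, alpha, beta):
    """Tversky index: c / (alpha*|A only| + beta*|B only| + c)."""
    c = popcount(a & b)
    only_a = popcount(a) - c
    only_b = popcount(b) - c
    denom = alpha * only_a + beta * only_b + c
    return c / denom if denom else 0.0

def tanimoto(a, b):
    """Tanimoto is Tversky with alpha = beta = 1."""
    return tversky(a, b, 1.0, 1.0)

def cosine(a, b):
    """Cosine similarity for bit vectors: c / sqrt(|A| * |B|)."""
    na, nb = popcount(a), popcount(b)
    return popcount(a & b) / math.sqrt(na * nb) if na and nb else 0.0

def hamming(a, b):
    """Hamming distance: the number of bit positions which differ."""
    return popcount(a ^ b)

fp1 = 0b11110000
fp2 = 0b00111100
print(tanimoto(fp1, fp2))           # 2 shared bits out of 6 in the union
print(tversky(fp1, fp2, 0.5, 0.5))  # Dice weights
print(cosine(fp1, fp2))
print(hamming(fp1, fp2))
```

Note that Hamming is a distance rather than a similarity, so smaller values mean more similar fingerprints.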
Chemfp does not currently support Microsoft Windows because the code assumes the LP64 model, where “int” is 32 bits and “long” is 64 bits. It will require a lot of low-level work to tweak everything correctly for the Windows LLP64 model, where “int” and “long” are 32 bits and “long long” is 64 bits. Once that’s done, I’ll have to figure out how to make an installer. I’ve decided to put it off until someone (or someones) funds it.
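One way to see which data model a given Python build uses is to ask ctypes for the sizes of the C integer types. This is only an illustrative check, not something chemfp does, and the function name is made up.

```python
import ctypes

def c_data_model():
    """Classify the C data model from the sizes of int, long, long long, and pointers."""
    sizes = (
        ctypes.sizeof(ctypes.c_int),
        ctypes.sizeof(ctypes.c_long),
        ctypes.sizeof(ctypes.c_longlong),
        ctypes.sizeof(ctypes.c_void_p),
    )
    if sizes == (4, 8, 8, 8):
        return "LP64"    # typical for 64-bit Linux and macOS
    if sizes == (4, 4, 8, 8):
        return "LLP64"   # typical for 64-bit Windows
    if sizes == (4, 4, 8, 4):
        return "ILP32"   # typical for 32-bit platforms
    return "other"

print(c_data_model())
```

Code which stores a 64-bit value in a “long” works under LP64 but silently truncates under LLP64, which is why the port requires auditing the low-level code.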
The threshold and k-nearest arena search results store hits using compressed sparse rows. These work well for sparse results, but when you want the entire similarity matrix (i.e., with a minimum threshold of 0.0) of a large arena, the time and space needed to maintain the sparse data structure become noticeable. In that case you likely want to store the scores in a 2D NumPy matrix.
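The trade-off is easier to see with the layout spelled out. Below is a small pure-Python sketch of a compressed sparse row structure and its dense equivalent. The hit lists are made up, and plain lists stand in for the NumPy and SciPy arrays a real implementation would use.

```python
def to_csr(rows):
    """Build CSR arrays from per-row hit lists of (column, score) pairs."""
    indptr, indices, data = [0], [], []
    for hits in rows:
        for column, score in hits:
            indices.append(column)
            data.append(score)
        # indptr[i]:indptr[i+1] is the slice of indices/data for row i
        indptr.append(len(indices))
    return indptr, indices, data

def csr_to_dense(indptr, indices, data, num_cols):
    """Expand CSR arrays into a list of dense rows, with 0.0 for non-hits."""
    dense = []
    for i in range(len(indptr) - 1):
        row = [0.0] * num_cols
        for j in range(indptr[i], indptr[i + 1]):
            row[indices[j]] = data[j]
        dense.append(row)
    return dense

# Hypothetical threshold search results for three fingerprints.
rows = [[(1, 0.8)], [(0, 0.8), (2, 0.7)], [(1, 0.7)]]
indptr, indices, data = to_csr(rows)
dense = csr_to_dense(indptr, indices, data, num_cols=3)
print(indptr, indices, data)
print(dense)
```

CSR stores a column index alongside every score. With a threshold of 0.0 every pair is a hit, so the sparse form holds N×N indices in addition to the N×N scores, while the dense matrix needs only the scores.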
I’m really interested in using chemfp to handle different sorts of clustering. Let me know if there are things I can add to the API which would help you do that.
If you are not a Python programmer then you might prefer that the core search routines be made accessible through a C API. That’s possible, in that the software was designed with that in mind, but it needs more development and testing.
Chemfp has supported OpenMP ever since version 1.1. That’s great for shared-memory machines. Are you interested in supporting a distributed computing version?
There are any number of higher-level tools which can be built on the chemfp components. For example, what about a wsgi component which implements a web-based search API for your local network? Wouldn’t it be nice to say:
fpserver filename1.fpb
and have a simple search service?
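As a rough idea of what such a component might look like, here is a minimal WSGI application written with only the standard library. Everything in it is an assumption made for illustration: the in-memory targets, the "fp" and "threshold" query parameters, and the JSON response are hypothetical, the hex string is simply read as an integer, and a real service would load an FPB file and use chemfp's search functions instead.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory targets: id -> fingerprint stored as an integer.
TARGETS = {"mol1": 0b11110000, "mol2": 0b00111100, "mol3": 0b00001111}

def _tanimoto(a, b):
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

def search_app(environ, start_response):
    """WSGI app: GET ?fp=<hex>&threshold=<float> returns the hits as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        query = int(params["fp"][0], 16)
        threshold = float(params.get("threshold", ["0.7"])[0])
    except (KeyError, ValueError):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"expected ?fp=<hex>&threshold=<float>\n"]
    hits = [[target_id, _tanimoto(query, fp)] for target_id, fp in TARGETS.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    hits.sort(key=lambda hit: (-hit[1], hit[0]))  # best score first
    body = json.dumps(hits).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]

def call_app(query_string):
    """Invoke the app directly, without a server, and return (status, body)."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    chunks = search_app({"QUERY_STRING": query_string}, start_response)
    return captured["status"], b"".join(chunks)

status, body = call_app("fp=f0&threshold=0.3")
print(status, json.loads(body.decode("utf-8")))
```

To try it over HTTP, the callable can be passed to wsgiref.simple_server.make_server() from the standard library or to any other WSGI server.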
What about an IPython visualization tool?
There’s a paper (doi:10.1093/bioinformatics/byq067) on using locality-sensitive hashing to find highly similar fingerprints and a more recent one (doi:10.1186/s13321-018-0321-8) on LSH trees. Are there cases where it’s more useful than chemfp’s direct search?
Several people have asked about GPU implementations. My feeling is that the CPU is fast enough, and much easier to deploy. That’s not saying I wouldn’t be interested in a GPU implementation, only describing why it’s not at the top of the list.
Thanks
In no particular order, the following contributed to chemfp in some way: Noel O’Boyle, Geoff Hutchison, the Open Babel developers, Greg Landrum, OpenEye, Roger Sayle, Phil Evans, Evan Bolton, Wolf-Dietrich Ihlenfeldt, Rajarshi Guha, Dmitry Pavlov, Roche, Kim Walisch, Daniel Lemire, Nathan Kurz, Chris Morely, Jörg Kurt Wegner, Björn Grüning, Andrew Henry, Brian McClain, Pat Walters, Brian Kelley, Lionel Uran Landaburu, Sereina Riniker, and Brian Cole.
Thanks also to my wife, Sara Marie, for her many years of support.