.. highlight: none
.. _using_tools:
Working with the command-line tools
================================================
The sections in this chapter describe examples of using the
command-line tools to generate and work with fingerprint files.
.. _pubchem_fingerprints:
Generate fingerprint files from PubChem SD tags
----------------------------------------------------------------
In this section you'll learn how to create a fingerprint file from an
SD file which contains pre-computed CACTVS fingerprints. You do not
need a chemistry toolkit for this section.
These files will be re-used in many parts of the documentation.
`PubChem `_ is a great resource of
publically available chemistry information. The data is available for
`download `_ (and `ftp download
`_). We'll use some of their `SD formatted
`_ files. Each
record has a PubChem/CACTVS fingerprint field, which we'll extract to
generate an FPS file.
Start by downloading the files Compound_099000001_099500000.sdf.gz
(from
https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_099000001_099500000.sdf.gz
) and Compound_048500001_049000000.sdf.gz (from
https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_048500001_049000000.sdf.gz
). At the time of writing they contain 10,740 and 14,478 records,
respectively. (I chose some of the smallest files so they would be
easier to open and review.)
Next, convert the files into fingerprint files. On the command line
do the following two commands::
sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_048500001_049000000.sdf.gz -o pubchem_targets.fps
You'll see a progress bar for each command, which looks like:
Compound_099000001_099500000.sdf.gz: 100%|█████████████| 6.77M/6.77M [00:00<00:00, 20.1Mbytes/s]
Add the `--no-progress` option to turn off the progess bar, as in::
sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o
pubchem_queries.fps --no-progress
Congratulations, that was it!
If you're curious about what an FPS file looks like, here are the
first 10 lines of pubchem_queries.fps, with some of the lengthy
fingerprint lines replaced with an ellipsis::
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#date=2022-03-01T10:58:36
07de0d00000000 ... 38d1017100000000204000000000000010200000000000000000 99000039
07de1c00020000 ... 398a405000010000000000008000000000000000000000000000 99000230
074e1c00000000 ... 0103057000000000000000000000000000000000000000000000 99001517
07de0c00000000 ... 1913097110008000000008000800400000000400000000000000 99002251
How does this work? Each PubChem record contains the precomputed
CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. Here's
what it looks like for record 99000039, which is the first record in
Compound_099000001_099500000.sdf.gz::
>
AAADceB7sAAAAAAAAAAAAAAAAAAAAAAAAAA8YIAABYAAAACx9AAAHgAQAA
AADCjBngQ8wPLIEACoAzV3VACCgCA1AiAI2KG4ZNgIYPrA1fGUJYhglgDI
yccci4COAAAAAAQCAAAAAAAACAQAAAAAAAAAAA==
The :option:`--pubchem` flag tells :ref:`sdf2fps ` to get the value of that
tag and decode it to get the fingerprint. It also adds a few metadata
fields to the fingerprint file header.
The order of the FPS fingerprints are the same as the order of the
corresponding record in the SDF. You can see that in the output, where
99000039 is the first record in the FPS fingerprints.
If you store records in an SD file then you almost certainly don't use
the same fingerprint encoding as PubChem. :ref:`sdf2fps ` can
decode from a number of encodings, like hex and base64. Use
:option:`--help` to see the list of available decoders.
The example uses :option:`-o` to have sdf2fps write the output to a
file instead of to stdout. By default, filenames ending in ".fps" are
saved in FPS format. Use ".fps.gz" for the gzip-compressed FPS format
and ".fps.zst" for the zstandard-compressed FPS format.
Filenames ending with ".fpb" are saved in FPB format. This is a binary
format which is significantly faster to load.
k-nearest neighbor search
------------------------------------
In this section you'll learn how to search a fingerprint file to find
the k-nearest neighbors. You will need the FPS fingerprint files
generated in :ref:`pubchem_fingerprints` but you do not need a
chemistry toolkit.
We'll use the pubchem_queries.fps as the queries for a k=2 nearest
neighor similarity search of the target file puchem_targets.gps::
simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps
That's all! You should get output which starts::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.0
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2 99000039 48503376 0.878453 48503380 0.872928
2 99000230 48563034 0.858824 48731730 0.852273
2 99001517 48675145 0.569620 48662654 0.554217
2 99002251 48798046 0.810976 48625236 0.810651
Here's how to interpret the output. The lines starting with '#' are
header lines. They contain metadata information describing the
similarity search. You can see the search parameters, the name of the
tool which did the search, and the filenames which went into the
search.
After the '#' header lines come the search results, with one query
result per line. There are in the same order as the query
fingerprints. Each result line contains tab-delimited columns. The
first column is the number of hits. The second column is the query
identifier used. The remaining columns contain the hit data, with
alternating target id and its score.
For example, the first result line contains the 2 hits for the
query 99000039. The first hit is the target id 48503376 with score
0.878453 and the second hit is 48503380 with score 0.872928. Since
this is a k-nearest neighor search, the hits are sorted by score,
starting with the highest score. Do be aware that ties are broken
arbitrarily. There may be additional hits with the score 0.8729 which
are not reported.
simsearch CSV output
----------------------------
In this section you'll learn how to change the simsearch output format
to CSV or TSV. You will need the FPS fingerprint files generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
The default simsearch output format is unique to chemfp, and therefore
not so easy for other tools to parse directly. Use the :option:`--out`
option to change the output format to "csv" or "tsv" formats::
.. highlight: console
% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out csv | head -8
query_id,target_id,score
99000039,48503376,0.878453
99000039,48503380,0.872928
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,48675145,0.569620
99001517,48662654,0.554217
99002251,48798046,0.810976
% simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps --out tsv | head -8
query_id target_id score
99000039 48503376 0.878453
99000039 48503380 0.872928
99000230 48563034 0.858824
99000230 48731730 0.852273
99001517 48675145 0.569620
99001517 48662654 0.554217
99002251 48798046 0.810976
These alternatives have one line for each hit, and no metadata.
NOTE: There are many variants of CSV and TSV output, especially with
how to handle spaces and embedded commas and tabs. Chemfp uses the
"excel" and "excel-tab" formats in Python's `csv module
`_.
Threshold search
----------------------------
In this section you'll learn how to search a fingerprint file to find
all of the neighbors at or above a given threshold. You will need the
FPS fingerprint files generated in :ref:`pubchem_fingerprints` but you
do not need a chemistry toolkit.
Let's do a threshold search and find all hits which are at least 0.85
similar to the queries::
.. highlight: none
simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps
The first 15 lines of output from this are::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.85
#software=chemfp/4.0
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
4 99000039 48732162 0.859551 48503380 0.872928 48503376 0.878453 48520532 0.854054
2 99000230 48563034 0.858824 48731730 0.852273
0 99001517
0 99002251
4 99003537 48566113 0.872449 48998000 0.853535 48997697 0.898477 48997075 0.903553
4 99003538 48566113 0.872449 48998000 0.853535 48997697 0.898477 48997075 0.903553
0 99005028
Take a look at the first result line, which contains the 4 hits for
the query id 99000039. As before, the hit information alternates
between the target ids and the target scores, but unlike the k-nearest
search, the hits are not in a particular order. You can see that here
where the scores are 0.859551, 0.872928, 0.878453, and 0.854054.
You might be wondering why I chose the 0.85 threshold. Quite simply,
it was for presentation. With a threshold of 0.8, the first record has
41 hits, which requires rather overwhelming 84 columns to show.
simsearch CSV output when no hits
-------------------------------------------
In this section you'll learn how to change the simsearch csv output
behavior when a query has no hits. You will need the FPS fingerprint
files generated in :ref:`pubchem_fingerprints` but you do not need a
chemistry toolkit.
The CSV output format writes one output line for each query hit. What
happens if a query has no hits? The previous section shows queries
99001517 and 99002251 have no hits with a threshold of 0.85, so let's
see what happens::
.. highlight: console
% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps --out csv | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,*,NaN
99002251,*,NaN
99003537,48566113,0.872449
You can see the queries with no hits get a synthetic output, by
default with the target id "*" and the score of "NaN". This makes it
possible to identify queries with no hits.
Use :option:`--empty-target-id` and :option:`--empty-score` to change
these::
% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps \
--out csv --empty-target-id MISSING --empty-score N/A | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99001517,MISSING,N/A
99002251,MISSING,N/A
99003537,48566113,0.872449
Alternatively, use :option:`--no-include-empty` to not generate an
output line when there are not hits::
% simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps \
--out csv --no-include-empty | head
query_id,target_id,score
99000039,48732162,0.859551
99000039,48503380,0.872928
99000039,48503376,0.878453
99000039,48520532,0.854054
99000230,48563034,0.858824
99000230,48731730,0.852273
99003537,48566113,0.872449
99003537,48998000,0.853535
99003537,48997697,0.898477
Combined k-nearest and threshold search
--------------------------------------------------------
In this section you'll learn how to search a fingerprint file to find
the k-nearest neighbors, where all of the hits must be at or above
given threshold. You will need the fingerprint files generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
You can combine the :option:`-k` and :option:`--threshold` queries to
find the k-nearest neighbors which are all at or above a given threshold:
.. highlight: none
simsearch -k 2 --threshold 0.7 -q pubchem_queries.fps pubchem_targets.fps
This finds the nearest 2 structures, which all must be at least 0.7
similar to the query fingerprint. The output from the above starts::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.7
#software=chemfp/4.0
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_source=Compound_099000001_099500000.sdf.gz
#target_source=Compound_048500001_049000000.sdf.gz
2 99000039 48503376 0.878453 48503380 0.872928
2 99000230 48563034 0.858824 48731730 0.852273
0 99001517
2 99002251 48798046 0.810976 48625236 0.810651
2 99003537 48997075 0.903553 48997697 0.898477
2 99003538 48997075 0.903553 48997697 0.898477
2 99005028 48651160 0.828829 48848576 0.816667
2 99005031 48651160 0.828829 48848576 0.816667
2 99006292 48945841 0.965217 48737522 0.879310
2 99006293 48945841 0.965217 48737522 0.879310
0 99006597
2 99006753 48655580 0.931034 48662591 0.924855
Because this is a k-nearest search, the hits are sorted from highest
score to lowest.
NxN (self-similar) searches
------------------------------------------
In this section you'll learn how to use the same fingerprints as both
the queries and targets, that is, a self-similarity search. You will
need the pubchem_queries.fps fingerprint file generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
Use the :option:`--NxN` option if you want to use the same set of fingerprints
as both the queries and targets. Using the pubchem_queries.fps from
the previous sections::
simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps
This code is very fast because there are so few fingerprints. For
larger files the :option:`--NxN` will be about twice as fast and use
half as much memory compared to::
simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps
In addition, the :option:`--NxN` option excludes matching a
fingerprint to itself (the diagonal term).
.. _chebi_fingerprints:
Using a toolkit to process the ChEBI dataset
------------------------------------------------------------
In this section you'll learn how to create a fingerprint file from a
structure file. The structure processing and fingerprint generation
are done with a third-party chemisty toolkit. chemfp supports Open
Babel, OpenEye, RDKit and CDK. (OpenEye users please note that you
will need an OEGraphSim license to use the OpenEye-specific
fingerprinters.)
We'll work with data from `ChEBI `_,
which are "Chemical Entities of Biological Interest". They distribute
their structures in several formats, including as an SD file. For this
section, download the "lite" version from
https://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz . It
contains the same structure data as the complete version but many
fewer tag data fields. For ChEBI 208 this file contains 146,491 records
and the compressed file is 51M.
ChEBI record titles don't contain the id
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Strangely, the ChEBI dataset does not use the title line of the SD
file to store the record id. A simple examination shows that 68,457 of
the title lines are empty, 39,478 have the title "null", 4,262 have
the title " " (with a single space), 1,959 have the title "ChEBI", 56
of them are labeled "Structure #1", and the others are usually
compound names like "fluocortolone" and "bkas#30-CoA(4-)", or
identifiers like "Compound 92" or "145453870".
(I've asked ChEBI to fix this, to no success after many years. Perhaps
you have more influence?)
Instead, the record id is stored as value of the "ChEBI ID" tag, which
looks like::
>
CHEBI:90
By default the toolkit-based fingerprint generation tools use the
title as the identifier, and print a warning and skip the record if
the identifier is missing. Here's an example with :ref:`rdkit2fps
` ::
.. highlight: console
% rdkit2fps ChEBI_lite.sdf.gz
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
... keeps on going ...
ERROR: Empty title in SD record after cleanup, file 'ChEBI_lite.sdf.gz', line 2093, record #33: first line is ' '. Skipping.
... keeps on going ...
These error messages come from records where no identifier could be
found. The default is to report the problem to stderr, skip processing
the record, and continue on to the next record. Use the
:option:`--errors` option to change the default behavior.
(If the first 100 records have no identifiers then the command-line
tools will exit even if :option:`--errors` is ignore. This is a safety
mechanism. Let me know if it's a problem.)
The `--id-tag` option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the identifier isn't in the title line but is in one of the SD data
items, then use :option:`--id-tag` option to specify of the name of
the data tag containing the id. For this data set you'll need to write
it as:
.. highlight: none
--id-tag "ChEBI ID"
The quotes are important because of the space in the tag name.
Here's what that looks like:
.. highlight: console
% rdkit2fps ChEBI_lite.sdf.gz --id-tag "ChEBI ID" | head -8 | fold
#FPS1
#num_bits=2048
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2021.09.4 chemfp/4.0
#source=ChEBI_lite.sdf.gz
#date=2022-03-01T12:33:54
10208220141258c184490038b4124609db0030024a0765883c62c9e1288a1dc224de62f445743b8b
30ad542718468104d521a214227b29ba3822fbf20e15491802a051532cd10d902c39b02b51648981
9c87eb41142811026d510a890a711cb02f2090ddacd990c5240cc282090640103d0a0a8b460184f5
11114e2a8060200804529804532313bb03912d5e2857a6028960189e370100052c63474748a1c000
8079f49c484ca04c0d0bcb2c64b72401042a1f82002b097e852830e5898302021a1203e412064814
a598741c014e9210bc30ab180f0162029d4c446aa01c34850071e4ff037a60e732fd85014344f82a
344aa98398654481b003a84f201f518f CHEBI:90
00000000080200412008000008000004000010100022008000400002000020100020006000800001
01000100080001000010000002002200000200000008000000400002100000000080000004401000
80200020800200002000001400022064000004244810000000000080000a80012002020004198002
00080200020020120040203001000802010100024211000004400000000100200003000001000100
0100021000a200601080002a00002020048004030000884084000008000002040200010800000000
2000010022000800002000020001400020800100025040000000200a080244000060008000000802
8100c801108000000041c00200800002 CHEBI:165
In addition to "ChEBI ID" there's also a "ChEBI Name" tag which
includes data values like "tropic acid" and
"(+)-guaia-6,9-diene". Every ChEBI record has a unique name so the
names could also be used as the primary identifier instead of its id.
To use the ChEBI Name as the primary chemfp identifier, specify:
.. highlight: none
--id-tag "ChEBI Name"
The FPS fingerprint file format allows identifiers with a space, or
comma, or anything other tab, newline, and a couple of other bytes, so
it's no problem using those names directly.
Generate fingerprints with Open Babel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have the Open Babel Python library installed then you can use
:ref:`ob2fps ` to generate fingerprints::
ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
This takes about 4 minutes on my 2019-era laptop to process all of the
records, and generates messages like::
==============================
*** Open Babel Warning in Expand
Alias R was not chemically interpreted
==============================
*** Open Babel Warning in ReadMolecule
WARNING: Problem interpreting the valence field of an atom
The valence field specifies a valence 3 that is
less than the observed explicit valence 4.
==============================
*** Open Babel Warning in ReadMolecule
Failed to kekulize aromatic bonds in MOL file
==============================
*** Open Babel Warning in ReadMolecule
Invalid line: M RGP must only refer to pseudoatoms
M RGP 2 12 1 15 2
The default generates FP2 fingerprints, so the above is the same as::
ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
ob2fps can generate several other types of fingerprints. (Use
:option:`--help` for a list.) For example, to generate the Open Babel
implementation of the MACCS definition specify::
ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
By default ob2fps shows a progress bar which looks like::
ChEBI_lite.sdf.gz: 90365 recs [02:21, 507.25 recs/s]
Use :option:`--no-progress` to not use a progress bar.
The underlying Open Babel toolkit does not have a way to get the
current location in the file, so the progress bar is only able to show
the number of records processed, and not the percentage complete.
ob2fps has an alternative implementation which uses chemfp's text
toolkit to parse each record as a string, which is then passed to Open
Babel. The "chemfp" implementation is able to get the current file
position, letting ob2fps show a percentage complete progress bar. Use
the :option:`-R` option to set the "implementation" reader argument to
"chemfp", as in the following:
% ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi2.fps \
-R implementation=chemfp
which shows:
ChEBI_lite.sdf.gz: 58%|███████ | 31.2M/53.6M [02:41<01:41, 220kbytes/s]
The chemfp implementation takes about 5-10 seconds longer than the
native Open Babel implementation.
Generate fingerprints with OpenEye
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have the OEChem Python library installed, with licenses for
OEChem and OEGraphSim, then you can use :ref:`oe2fps ` to
generate fingerprints::
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps
This takes about 1 minute on my laptop and generates a number of
warnings like "Stereochemistry corrected on atom number 17 of",
"Unsupported Sgroup information ignored", and "Invalid stereochemistry
specified for atom number 9 of". Normally the record title comes after
the "... of", but the title is blank for most of the records.
As an historical note, in older ChEBI releases records CHEBI:147324 By
default OEChem's SDF reader skips empty structure records. If you
really need those records, add the SuppressEmptyMolSkip flag to the
default 'flavor' reader argument, like this::
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps \
-R flavor=Default,SuppressEmptyMolSkip
The default settings generate OEGraphSim path fingerprint with the
values::
numbits=4096 minbonds=0 maxbonds=5
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
Each of these can be changed through command-line options. Use
:option:`--help` for details.
oe2fps can generate several other types of fingerprints. For example,
to generate the OpenEye implementation of the MACCS definition specify::
oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
Use :option:`--help` for a list of available oe2fps fingerprints or to
see more configuration details.
By default oe2fps shows a progress bar which looks like::
ChEBI_lite.sdf.gz: 24%|██ | 13.0M/53.6M [00:14<00:50, 808kbytes/s]
Use :option:`--no-progress` to not use a progress bar.
Generate fingerprints with RDKit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have the RDKit Python library installed then you can use
:ref:`rdkit2fps ` to generate fingerprints. Based on the
previous examples you probably guessed that the command-line is::
rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps
This takes about 8 minutes and 13 seconds on my laptop, and RDKit did
not generate fingerprints for 229 of the 146,491 records. RDKit logs
warning and error messages to stderr. They look like::
[14:06:39] WARNING: not removing hydrogen atom without neighbors
[14:06:39] Explicit valence for atom # 7 O, 3, is greater than permitted
[14:06:40] Explicit valence for atom # 0 He greater than permitted
[14:06:08]
****
Post-condition Violation
Element 'hv' not found
Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2021_09_4/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[14:06:08] Element 'hv' not found
For example, RDKit is careful to check that structures make chemical
sense. It rejects 3-valent oxygens and refuses to process that those
structures, which is the reason for the first line of that output.
The default generates RDKit's path fingerprints with parameters::
minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
Each of those can be changed through command-line options. See rdkit2fps
:option:`--help` for details, where you'll also see a list of the
other available fingerprint types.
For example, to generate the RDKit implementation of the MACCS
definition use::
rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
while the following generates the Morgan/circular fingerprint with
radius 3::
rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz
By default rdkit2fps shows a progress bar which looks like::
ChEBI_lite.sdf.gz: 52%|███ | 27.9M/53.6M [04:29<05:22, 79.8kbytes/s]
Use :option:`--no-progress` to not use a progress bar.
Generate fingerprints with CDK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have the CDK Java JAR file on your CLASSPATH and you have
installed the JPype1 package (see the [:ref:`installation guide `
for help]) then you can use :ref:`cdk2fps ` to generate
fingerprints. Based on the previous examples you can probably guess
that the command-line is::
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps
This takes about 2 minutes and 18 seconds on my laptop, and CDK did
not generate fingerprints for 17 of the 146,491 structures. CDK
generated two warnings::
org.openscience.cdk.config.IsotopeFactory ERROR: Could not find major isotope for: 88
org.openscience.cdk.config.IsotopeFactory ERROR: Could not find major isotope for: 88
Chemfp lets you choose an :ref:`alternate error handler
` (see the next section), which can help you
figure out which structures could not be processed. I'll enable the
``report`` error handler::
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps --errors report
This generates 17 lines of the form::
ERROR: Cannot generate fingerprint: java.lang.NullPointerException,
file 'ChEBI_lite.sdf.gz', record #8612. Skipping.
That means the record caused the CDK fingerprinter function to fail,
by raising a Java NullPointerException, which chemfp catches and
reports. For reference, that record is ChEBI ID CHEBI:5015.
It's a bit tricky to figure out that record #8612 is CHEBI:5015
because the record's initial line number isn't shown, in turn because
it isn't available from the CDK API. cdk2fps has an alternative
implementation which uses chemfp's text toolkit to parse each record
as a string, which is then passed to the CDK. The "chemfp"
implementation is able to report the current line number. It is
enabled with the cdk.sdf.implementation value "chemfp", like this::
cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi2.fps \
--errors report -R implementation=chemfp
This adds line number information to the report::
ERROR: Cannot generate fingerprint: java.lang.NullPointerException,
file 'ChEBI_lite.sdf.gz', line 551247, record #8612. Skipping.
This variant implementation took 2 minutes and 22 seconds, so adds a
small bit of overhead. Let me know if you find it useful.
The default ``cdk2fps`` fingerprint type is ``CDK-Daylight`` with parameters::
size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0
Each of those can be changed through command-line options. See cdk2fps
:option:`--help` for details, where you'll also see a list of the
other available fingerprint types.
For example, to generate the CDK implementation of the MACCS
definition use::
cdk2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz \
-o chebi_maccs.fps --errors report
This generates 22 report lines of the form::
ERROR: Cannot generate fingerprint: java.lang.NullPointerException:
Aromaticity model requires implicit hydrogen count is set., file
'ChEBI_lite.sdf.gz', record #2918. Skipping.
By default cdk2fps shows a progress bar which looks like::
ChEBI_lite.sdf.gz: 25%|███ | 13.5M/53.6M [00:43<01:37, 409kbytes/s]
Use :option:`--no-progress` to not use a progress bar.
.. simsearch_with_structure_queries
Use structures as input to simsearch
----------------------------------------------
In this section you'll learn how to use structures as queries to
simsearch queries instead of fingerprints. You'll need a chemistry
toolkit and a fingerprint file generated from that toolkit. This
section assumes you have one of the ``chebi_maccs.fps`` ChEBI
fingerprint files generated in the previous section. My example will
use the CDK-generated file created using::
cdk2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
I'll search the ChEBI data set for phenol with the SMILES "c1ccccc1O".
My ``chebi_maccs.fps`` starts:
.. highlight: console
% head -3 chebi_maccs.fps
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
The type information gives a hint of how to generate a fingerprint
query for that data set, which you can do manually using cdk2fps::
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
#software=CDK/2.7.1 chemfp/4.0
#date=2022-03-01T13:57:02
00000000000000000000000000000140004480101e phenol
I could then use a pipe to pass the cdk2fps output as input to
simsearch:
% echo "c1ccccc1O phenol" | cdk2fps --MACCS | simsearch chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.0
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
That's a bit clumsy, because I have to look at the fingerprint file
and figure out which command-line tools and options to use.
A simpler way is to pass the structures directly to ``simsearch``,
either as a command-line option or from an input file. In the
following I'll pass in the SMILES using the ``--query`` parameter::
% simsearch chebi_maccs.fps --threshold 1.0 --query "c1ccccc1O phenol"
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.0
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
How does it work? Chemp opens the targets targets file to read the
metadata section. It then uses the type string to figure out how to
generate the fingerprints for that type, as well as figure out which
toolkit to use for structure processing.
If the query structure is not a SMILES String then use
``--query-format`` to specify the format name. Use ``--query-id`` to
specify the query id instead of using the id from the input record.
For example, the following uses the InChI for proline as the input,
sets the query id to "proline", and finds the two nearest neighbors in
ChEBI:
% simsearch chebi_maccs.fps --query-format inchi --query \
'InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H2,(H,7,8)/t4-/m0/s1' \
--query-id proline -k 2
#Simsearch/1
#num_bits=166
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.0
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
2 proline CHEBI:17203 0.96429 CHEBI:16313 0.96429
CHEBI:17203 is "L-proline" and CHEBI:16313 is "D-proline". I wonder
why I didn't get an exact match ... but not enough to investigate.
The ``--query`` simsearch option takes the structure on the
command-line. Use ``--queries`` to read queries from a file, which may
be a fingerprint file or a structure file.
I'll demonstrate with a SMILES file containing two records::
% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
which I'll use to search chebi_maccs.fps::
% simsearch --queries simple.smi chebi_maccs.fps --threshold 1.0
#Simsearch/1
#num_bits=166
#type=Tanimoto k=all threshold=1.0
#software=chemfp/4.0
#queries=simple.smi
#targets=chebi_maccs.fps
#query_source=queries.smi
#target_source=ChEBI_lite.sdf.gz
1 phenol CHEBI:15882 1.00000
1 caffeine CHEBI:27732 1.00000
By default simsearch uses the filename to figure out the format type
and compression. Use ``--query-format`` to specify a different
format. For example, if neither ``--query`` nor ``--queries`` are
specified then the default reads FPS queries from stdin. I'll use
``--query-format sdf.gz`` to have it read gzip-compressed SD records
from stdin, in this case from a PubChem file::
% cat Compound_099000001_099500000.sdf.gz | \
chemfp simsearch --query-format sdf.gz chebi_maccs.fps -k 1 | head -10
#Simsearch/1
#num_bits=166
#type=Tanimoto k=1 threshold=0.0
#software=chemfp/4.0
#targets=chebi_maccs.fps
#target_source=ChEBI_lite.sdf.gz
1 99000039 CHEBI:116650 0.80000
1 99000230 CHEBI:127468 0.88372
1 99001517 CHEBI:134851 0.64583
1 99002251 CHEBI:109790 0.70909
.. using_fingerprint_file_metadata
Make new fingerprints matching the type in an existing file
------------------------------------------------------------------------
In this section you'll learn how to generate fingerprints that match
the fingerprint type of an existing file. You'll need a chemistry
toolkit and a fingerprint file generated from that toolkit. This
section assumes you have one of the ``chebi_maccs.fps`` ChEBI
fingerprint files generated in an earlier section.
In the previous section you learned how to use structures as input to
simsearch; simsearch uses the target fingerprint file metadata to
generate the appropriate fingerprints for each structure. What if you
want to generate fingerprints appropriate for the target data set but
don't immediately want to use them for a search?
The ``--using FILENAME`` option to cdk2fps, oe2fps, rdkit2fps, and
ob2fps opens the named fingerprint file to get the appropriate
type. I'll demonstrate with a simple SMILES file, where the
fingerprint types comes from ``chebi_maccs.fps``::
% cat simple.smi
c1ccccc1O phenol
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
% cdk2fps simple.smi --using chebi_maccs.fps
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
#software=CDK/2.7.1 chemfp/4.0
#source=simple.smi
#date=2022-03-01T14:13:24
00000000000000000000000000000140004480101e phenol
000000003000000001d414d90323914380f138ea1f caffeine
How do you know to use ``cdk2fps`` instead of one of the othe
programs? You don't, but all of the chemfp programs will try to
process fingerprint types from other toolkits::
% ob2fps simple.smi --using chebi_maccs.fps
WARNING: Fingerprint type 'CDK-MACCS/2.0' from the --using file 'chebi_maccs.fps' is based on the cdk toolkit.
WARNING: ob2fps expects fingerprints based on the openbabel toolkit.
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
#software=CDK/2.7.1 chemfp/4.0
#source=simple.smi
#date=2022-03-01T14:15:27
00000000000000000000000000000140004480101e phenol
000000003000000001d414d90323914380f138ea1f caffeine
This cross-toolkit functionality is part of the long-term chemfp
design but this specific code path hasn't yet been fully tested for
all the possible error conditions, which is why it prints the two
"WARNING" lines to stderr.
If you know the chemfp fingerprint type string then you could pass
that in on the command-line via the ``--type`` option, as in::
% ob2fps --type "OpenBabel-ECFP2 nBits=128" simple.smi
#FPS1
#num_bits=128
#type=OpenBabel-ECFP2/1 nBits=128
#software=OpenBabel/3.1.0 chemfp/4.0
#source=simple.smi
#date=2022-03-01T14:15:55
00080020000a00000c00000000000000 phenol
000c4448000000880404000221002040 caffeine
.. _alternate_error_handlers:
Alternate error handlers
------------------------------------
In this section you'll learn how to change the error handler for
rdkit2fps using the :option:`--errors` option.
By default the "2fps" programs "ignore" structures which
could not be parsed into a molecule option. There are two other
options. They can "report" more information about the failure case and
keep on processing, or they can be "strict" and exit after reporting
the error.
This is configured with the :option:`--errors` option.
Here's the rdkit2fps output using :option:`--errors report`::
[12:21:03] WARNING: not removing hydrogen atom without neighbors
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Skipping.
[12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24338, record #381. Skipping.
The first two lines come from RDKit. The third line is from chemfp,
reporting which record could not be parsed. (The record starts at line
24228 of the file.) The fourth line is another RDKit error message, and
the last line is another chemfp error message.
Here's the rdkit2fps output using :option:`--errors strict`::
[12:24:24] WARNING: not removing hydrogen atom without neighbors
[12:24:24] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Exiting.
Because this is strict mode, processing exits at the first failure.
The ob2fps and oe2fps tools implement the :option:`--errors` option,
but they aren't as useful as rdkit2fps because the underlying APIs
don't give useful feedback to chemfp about which records failed. For
example, the standard OEChem file reader automatically skips records
that it cannot parse. Chemfp can't report anything when it doesn't
know there was a failure.
The default error handler in chemfp 1.1 was "strict". In practice this
proved more annoying than useful because most people want to skip the
records which could not be processed. They would then contact me
asking what was wrong, or doing some pre-processing to remove the
failure cases.
One of the few times when it is useful is for records which contain no
identifiers. When I changed the default from "strict" to "ignore" and
tried to process ChEBI, I was confused at first about why the output
file was so small. Then I realized that it's because the many records
without a title were skipped, and there was no feedback about skipping
those records.
I changed the code so missing identifiers are always reported, even if
the error setting is "ignore". Missing identifiers will still stop
processing if the error setting is "strict".
chemfp's two cross-toolkit substructure fingerprints
---------------------------------------------------------------------------
In this section you'll learn how to generate the two
substructure-based fingerprints which come as part of chemfp. These
are based on cross-toolkit SMARTS pattern definitions and can be used
with Open Babel, OpenEye, and RDKit. (For OpenEye users, these
fingerprints use the base OEChem library but do not use the separately
licensed OEGraphSim library.)
chemfp implements two platform-independent fingerprints where were
originally designed for substructure filters but which are also used
for similarity searches. One is based on the 166-bit MACCS
implementation in RDKit and the other comes from the 881-bit
PubChem/CACTVS substructure fingerprints.
The chemfp MACCS definition is called "rdmaccs" because it closely
derives from the MACCS SMARTS patterns used in RDKit. (These pattern
definitions are also used in Open Babel and the CDK, while OpenEye
has a completely independent implementation.)
Here are example of the respective rdmaccs fingerprint for phenol
using each of the toolkits.
Open Babel::
% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenBabel/2
#software=OpenBabel/3.1.0 chemfp/4.0
#date=2022-03-01T14:17:37
00000000000000000000000000000140004480101e phenol
OpenEye QQQQQ::
% echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenEye/2
#software=OEChem/2.3.0 (20191016) chemfp/3.5
#date=2021-01-27T14:46:03
00000000000000000000000000000140004480101e phenol
RDKit::
% echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-RDKit/2
#software=RDKit/2021.09.4 chemfp/4.0
#date=2022-03-01T14:18:11
00000000000000000000000000000140004480101e phenol
CDK::
% echo "c1ccccc1O phenol" | cdk2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-CDK/2
#software=CDK/2.7.1 chemfp/4.0
#date=2022-03-01T14:18:31
00000000000000000000000000000140004480101e phenol
For more complex molecules it's possible that different toolkits
produce different fingerprint rdmaccs, even though the toolkits use
the same SMARTS definitions. Each toolkit has a different
understanding of chemistry. The most notable is the different
definition of aromaticity, so the bit for "two or more aromatic rings"
will be toolkit dependent.
substruct fingerprints
----------------------
chemp also includes a "substruct" substructure fingerprint. This is an
881 bit fingerprint derived from the PubChem/CACTVS substructure
keys. They do not match the CACTVS fingerprints exactly, in part due
to differences in ring perception. Some of the substruct bits will
always be 0. With that caution in mind, if you want to try them out,
use the :option:`--substruct` option.
The term "substruct" is a horribly generic name. If you can think of a
better one then let me know. Until chemfp 3.0 I said these
fingerprints were "experimental", in that I hadn't fully validated
them against PubChem/CACTVS and could not tell you the error rate. I
still haven't done that.
What's changed is that I've found out over the years that people are
using the substruct fingerprints, even without full validatation. That
surprised me, but use is its own form of validation. I still would
like to validate the fingerprints, but it's slow, tedious work which I
am not really interested in doing. Nor does it earn me any
money. Plus, if the validation does lead to any changes, it's easy to
simply change the version number.
.. _pubchem_fpb_fingerprints:
Generate binary FPB files from a structure file
-----------------------------------------------------------
In this section you'll learn how to generate an FPB file instead of an
FPS file. You will need the the ChEBI file from
:ref:`chebi_fingerprints` and a chemistry toolkit. The FPB format was
introduced with chemfp-2.0.
.. NOTE::
Several chemfp features, like creating FPB files, require a valid
license key. If you are using chemfp under the Base License
Agreement then contact sales@dalkescientific.com to purchase a
license key or request an evaluation license.
The FPB format was designed so the fingerprints can be memory-mapped
directly to chemfp's internal data structures. This makes it very fast
to load, but unlike the FPS format, it's not so easy to write with
your own code. You should think of the FPB format as an binary
application format, for chemfp-based tools, while the FPS format is a
text-based format for data exchange between diverse programs.
The easiest way to generate an FPB file from the command line is to
use the ".fpb" extension instead of ".fps" or ".fps.gz". Here are
examples using each of the toolkits.
Open Babel::
% ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb
OpenEye::
% oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb
RDKit::
% rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb
CDK::
% cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fpb
The binary format isn't human-readable. Use :ref:`fpcat` to see what's
inside::
% fpcat oe_chebi.fpb
#FPS1
#num_bits=4096
#type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.4.3 (20191016) chemfp/3.4
0000000 ... many zeros ...00000000000000 CHEBI:15378
0000000 ... many zeros ...00000000000000 CHEBI:16042
0000000 ... many zeros ...00000000000000 CHEBI:17792
....
182b528 ... many hex values ... a8c10c0c CHEBI:60493
By default the fingerprints are ordered from smallest popcount to
largest, which you can see in the output. A pre-ordered index is
faster to search because the target popcounts are pre-computed and
because it often reduces the search space.
If you want to preserve the input order then you'll need to pipe the
FPS output to :ref:`fpcat ` and use its :option:`--preserve-order`
flag. See the next section for an example.
Convert between FPS and FPB formats
-----------------------------------
In this section you'll learn how to convert an FPS file into an FPB
file and back, and you'll learn how to control the fingerprint
ordering. You will need the FPS files generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry
toolkit. The FPB format was introduced with chemfp-2.0.
If you already have an FPS file then you can convert it directly into
an FPB file, and without using a chemistry toolkit. The :ref:`fpcat
` program converts from one format to the other.
In an earlier section I generated the files pubchem_queries.fps and
pubchem_targets.fps . I'll convert each to FPB format::
% fpcat pubchem_targets.fps -o pubchem_targets.fpb
% fpcat pubchem_queries.fps -o pubchem_queries.fpb
The FPB format is a binary format which is difficult to read
directly. The easiest way to see what's inside is to use fpcat. If you
don't specify an output filename then it sends the results to stdout
in FPS format::
% fpcat pubchem_queries.fpb | head -5 | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
00028000e00000000000000000000000000000000000000000000000000000000000009840000000
0000c001000300000000000000000000000000000000000000000200000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000 99116624
The keen-eyed reader might have noticed that the conversion does not
have a "source" or "date" field. I haven't figured out if this is a
bug. Should I keep the original date and structure file source, or use
the current date and FPS file source? Let me know if this is important
to you.
By default when fpcat generates an FPB file it reorders the
fingerprints by population count and creates a popcount index. This
improves the similarity search performance, but it means that the
order of the FPB file is likely different than the original FPS
format. You can get a sense of this by looking at the first
fingerprint in the original pubchem_queries.fps file::
% grep -v # pubchem_queries.fps | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000 99000039
and confirming that it isn't the same as the first fingerpritn in pubchem_queries.fpb.
If you want the FPB file to store the fingerprints in input order
instead of the popcount order needed for optimized similarity search,
then use the :option:`--preserve-order` flag::
% fpcat pubchem_queries.fps --preserve-order -o input_order.fpb
% fpcat input_order.fpb | grep -v # | head -1 | fold
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000 99000039
On the flip side, fpcat by default preserves the input order when it
creates FPS output. If you instead want to the output FPS file to be
in popcount order then use the :option:`--reorder` flag::
% fpcat --reorder pubchem_queries.fps | grep -v # | head -1 | fold
00028000e00000000000000000000000000000000000000000000000000000000000009840000000
0000c001000300000000000000000000000000000000000000000200000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000 99116624
Specify the fpcat output format
-----------------------------------
In this section you'll learn how to specify the output format for
fpcat using a command-line option instead of the filename
extension. You will need the pubchem_queries.fpb file from
:ref:`pubchem_fingerprints`.
If you do not specify an output filename then fpcat will output the
fingerprints in FPS format to stdout. If you specify a filename then
by default it will look at the extension to determine if the output
should be an FPB (".fpb"), FPS (".fps"), or gzip or Zstandard
compressed FPS (".fps.gz" or ".fps.zst") file. The FPS format is used
for unrecognized extensions.
In a few rare cases you may want to use a format which doesn't match
the default. To be honest, the examples I can think of aren't that
realistic, but let's suppose you want to output the contents of an FPB
file to stdout in gzip'ed FPS format, and count the number of bytes in
compressed output. I'll use the use the --out flag to change the
format to 'fps.gz' from the default of 'fps', then compare the
resulting size with the uncompressed form::
% fpcat pubchem_queries.fpb --out fps | wc -c
2511714
% fpcat pubchem_queries.fpb --out fps.gz | wc -c
314393
It's not that useful because you could pipe the uncompressed output to
gzip, which is also likely faster::
% fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c
11921
In case you're wondering, chemfp 3.4 added support for zstandard
compression, if the "zstandard" Python module is available.
% fpcat pubchem_queries.fpb --out fps.zst | wc -c
293806
Chemfp cannot write an FPB file to stdout. In fact, the output file
must be seek-able, which means it can't be a named pipe either.
Alternate fingerprint file formats
------------------------------------
In this section you'll learn about chemfp's support for other
fingerprint file formats.
Chemfp started as a way to promote the FPS file format for fingerprint
exchange. Chemfp 2.0 added the FPB format, which is a binary format
designed around chemfp's internal search data structure so it can be
loaded quickly.
There are many other fingerprint formats. Perhaps the best
known is the Open Babel `FastSearch
`_ format. Two others are Dave
Cosgrove's `flush `_ format,
and OpenEye's "fpbin" format.
The `chemfp_converters package
`_ contains utilities
to convert between the chemfp formats and these other formats.::
# Convert from/to Dave Cosgrove Flush format
flush2fps drugs.flush
fps2flush drugs.fps -o drugs.flush
# Convert from/to OpenEye's fpbin format
fpbin2fps drugs.fpbin --moldb drugs.sdf
fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin
# Convert from/to Open Babel's FastSearch format
fs2fps drugs.fs --datafile drugs.sdf
fps2fs drugs_openbabel_FP2.fps --datafile drugs.sdf -o drugs.fs
Of the three formats, the flush format is closest to the FPS data
model. That is, it stores fingerprint records as an identifier and the
fingerprint bytes. By comparison, the FastSearch and fpbin formats
store the fingerprint bytes and an index into another file containing
the structure and identifier. It's impossible for chemfp to get the
data it needs without reading both files.
Chemfp has special support for the flush format. If chemfp_converters
is installed, chemfp will use it to read and write flush files nearly
everywhere that it accepts FPS files. You can use it at the output to
oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch,
and as both input and output to fpcat. (You cannot use it as the
simsearch targets because that code has been optimized for FPS and FPB
search, and I haven't spent the time to optimize flush file support.)
This means that if chemfp_converters is installed then you can use
:ref:`fpcat ` to convert between FPS, FPB, and and flush file
formats. For examples::
fpcat drugs.flush -o drugs.fps
fpcat drugs.fps -o drugs.flush
In addition, you can use it at the API level in :func:`chemfp.open`,
:func:`chemfp.load_fingerprints`,
:func:`chemfp.open_fingerprint_writer`, and
:meth:`.FingerprintArena.save`.
Note that the flush format does not support the FPS metadata fields,
like the fingerprint type, and it only support fingerprints which are
a multiple of 32 bits long. Also, compressed flush files are not
supported.
.. _fpb_format:
The FPB format
-----------------------------
In this section you'll learn about the FPB format.
The `FPS format `_ is a human-readable
text format. It's meant to be easy to create software to read or write
FPS files so it can be used as a way to exchange fingerprint data
between diffferent programs. The downside is it's relatively slow to
process. Chemfp can search about 1M 2048-bit FPS fingerprints in one
second, or load about 250K 2048-bit fingerprints/second into memory in
the same time.
The `FPB format `_ is a binary format
which is much faster to load, though internally more complex. It can
search 1M fingerprints in about 10 millisecond, and the load time
mostly depends on the file system performance.
This makes a big difference in web development, where the web app
might restart every time a file is edited, and in command-line tools,
where the load time might be far greater than the analysis time.
Note though that the Base License Agreement does not permit people to
create FPB files, and the chemfp binary distributions contain a
license manager which restricts access to that feature without an
authorized license key. See the `chemfp licensing page
`_ for licensing options and for how to
request a license key and/or quote for license.
Internally the FPB file contains an 8-byte signature followed by a set
of chunks. Each chunk contains an 8-byte length field followed a
4-byte identifier followed by n bytes of data, where n is the value of
the length field. Those familiar with PNG or other `FourCC
`_ format will find this
familiar.
Different chucks contain different types of data. For example, the
"META" chunk contains the metadata, and the "AREN" chuck contains the
fingerprint data, organized in a way that makes them easy to load into
a chemfp arena, hence the name.
... _licensed_fpb_files
Licensed FPB files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While the Base License Agreement does not permit people to create an
FPB file, it does allow people to load an FPB file, including FPB
files generated by third-party tools.
However, the Base License Agreement also does not generally permit in-memory
searches of fingerprint data sets with more than 50,000
fingerprints. If you try you'll get a message like the following::
A valid chemfp license key is required to search an arena with more than 50,000 fingerprints.
The license key check failed. Neither CHEMFP_LICENSE nor CHEMFP_LICENSE_FILE environment variables are defined.
Email sales@dalkescientific.com to purchase a license key or request a demo license key.
Cannot run simsearch. Exiting.
The exception is if the FPB file is a *licensed* FPB file.
A licensed FPB file has an embedded license key (stored in the "CFPL"
chunk) which, if valid, unlocks all of the license manager
restrictions for using that file.
.. _chembl_as_licensed_fpb:
Get licensed FPB files containing ChEMBL 29 fingerprints
-----------------------------------------------------------------
In this section you'll learn how to get the ChEMBL 29 fingerprints as
a chemfp licensed FPB file. For background you should read the
previous section on `The FPB format `.
Short version: use the following to download and uncompress the file:
curl -O https://chemfp.com/datasets/chembl_29.fpb.gz
gunzip chembl_29.fpb.gz
If curl isn't installed, try::
wget https://chemfp.com/datasets/chembl_29.fpb.gz
The result is the licensed FPB file ``chemfp_29.fpb``.
.. NOTE::
The "licensed" in "licensed FPB file" only refers to the presence
of the embedded chemfp license key. The data in the file is
distributed under the terms of the `ChEMBL license
`_
and includes the `required attribution
`_.
Long version: The `ChEMBL 29 distribution
`_
includes `pre-computed RDKit Morgan2 fingerprints
`_
in FPS format. These can be searched directly, like::
% simsearch --query c1ccccc1CN -k 2 chembl_29.fps.gz --times
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/4.0
#targets=chembl_29.fps.gz
#target_source=chembl_29.fps.gz
2 Query1 CHEMBL522 1.0000000 CHEMBL14186 0.9333333
open 0.01 read 0.00 search 3.85 output 0.00 total 3.60
or decompressed first, using "gunzip", to skip the decompression overhead::
% gunzip chembl_29.fps.gz
% simsearch --query c1ccccc1CN -k 2 chembl_29.fps --times
... lines removed for clarity ...
2 Query1 CHEMBL522 1.0000000 CHEMBL14186 0.9333333
open 0.02 read 0.00 search 2.15 output 0.00 total 1.87
The chemfp project also distributes ChEMBL fingerprints which have
been `reformatted to the FPB format `_.
Many of the examples in the documentation will use ``chembl_29.fpb``.
Here's how to get that file.
Step 1: Download `chembl_29.fpb.gz
`.
Step 2: Decompress it with::
gunzip chembl_29.fpb.gz
Step 3: (optional) View the license terms::
chemfp fpb_text chembl_29.fpb.gz
Similarity search with the FPB format
----------------------------------------------------------------
In this section you'll learn how to do a similarity search using an
FPB file as the target. You will need `ChEMBL 29 as a chemfp-licensed
FPB file `_ and you will need the RDKit
chemistry toolkit.
The file ``chembl_29.fpb`` contains the ChEMBL-generated RDKit Morgan
circular fingerprints for ChEMBL 29, reformatted in FPB format, and
containing a license key which unlocks restrictions on using chemfp to
work with that file.
All of the chemfp tools support FPB files as input or output formats,
including simsearch. Here's an example using first FPS format then FPB
format (run three times each, reporting only the fastest time::
% time simsearch --query c1ccccc1CN -k 2 chembl_29.fps > /dev/null
2.508u 0.426s 0:03.25 89.8% 0+0k 0+0io 0pf+0w
% time simsearch --query c1ccccc1CN -k 2 chembl_29.fpb > /dev/null
0.648u 0.175s 0:00.83 97.5% 0+0k 0+0io 0pf+0w
These times are reported as "user time" (the CPU time spent by the
program), "system time" (the time spent by the operating system, in
this case, to read and transfer data from disk), and "wall clock
time", which is the overall elapsed time. In this cae, it took 3.25
second to search the FPS file and 0.83 second to search the FPB file.
That's a factor of 4, which is pretty good, but perhaps not that
impressive.
Performance breakdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
That's because the total time includes the time needed to load RDKit,
parse the query SMILES, and generate the query fingerprint. To start,
it takes 0.04 seconds for Python to start working::
% time python -c 'pass'
0.027u 0.013s 0:00.04 75.0% 0+0k 0+0io 0pf+0w
It takes Python about 0.28 seconds to load RDKit::
% time python -c 'from rdkit.Chem import rdMolDescriptors'
0.376u 0.081s 0:00.28 160.7% 0+0k 0+0io 0pf+0w
Finally, chemfp's wrapper to the RDKit toolkit adds another 0.2
seconds::
% time python -c 'from chemfp import rdkit_toolkit, rdkit_types'
0.522u 0.123s 0:00.47 136.1% 0+0k 0+0io 0pf+0w
.. note::
The times depend very much on your operating system, the location
and type of the file system, the file cache, and more.
To demonstrate this overhead, I'll pre-compute the fingerprints so
chemfp doesn't need to import RDKit::
% cat benzylamine.smi
c1ccccc1CN benzylamine
% rdkit2fps benzylamine.smi -o benzylamine.fps
--using chembl_29.fpb --no-progress
% tail -1 benzylamine.fps | fold -w 70
0000000000000000000001000000000000000000000000000000000000000000000000
0000000000000000000000000020000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000200000000000008000000004000001000000
0000000000000800008000000000000000000000000000000000000000000000100000
0000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000500400000000000000000000000000000200002000000000000000
0000000000000000000000 benzylamine
then use the fingerprint file as input::
% time simsearch --queries benzylamine.fps -k 2 chembl_29.fpb > /dev/null
0.245u 0.084s 0:00.38 84.2% 0+0k 0+0io 0pf+0w
The 0.47 seconds of startup time, plus the 0.38 seconds of search
time, minus the doubly-counted Python startup time of 0.04 seconds, is
0.81 seconds, which is quite close to the time of 0.83 seconds I
measured earlier.
To give a direct comparison, I'll use the same query file to search
the FPS file::
% time simsearch -q benzylamine.fps -k 2 chembl_29.fps > /dev/null
1.810u 0.300s 0:02.42 87.1% 0+0k 0+0io 0pf+0w
FPB search is now about 6 times faster than FPS search. And of that
0.38 seconds, 0.08 seconds or 20% is in file I/O.
Multi-core similarity search
------------------------------------------------
In this section you'll learn how chemfp parallelizes multiple queries
and the various configuration options. You will need `ChEMBL 29 as a
chemfp-licensed FPB file `, the `PubChem file
` Compound_099000001_099500000.sdf.gz, and the
RDKit toolkit.
Suppose you want the ChEMBL 29 record which is most common to each
record in `Compound_099000001_099500000.sdf.gz`. This can be done with
simsearch as a k=1 nearest-neighbor search, with the PubChem
fingerprints as queries, and the ChEMBL file as targets.
This means converting the PubChem structure file into a fingerprint
file using the same fingerprint type that ChEMBL uses::
rdkit2fps Compound_099000001_099500000.sdf.gz \
--using chembl_29.fpb -o Compound_099000001_099500000.fps
Then do the search, saving the results to "results.txt"::
time simsearch -k 1 -q Compound_099000001_099500000.fps \
chembl_29.fpb -o results.txt
.. NOTE:
This will take about 4 minutes to run. While it's doing that,
perhaps take a look at CPU core load.
Sadly, simsearch does not yet implement progress bars. That feature is
on the TODO list for a future release.
Once simsearch finished, "time" reported::
625.245u 6.399s 3:58.51 264.8% 0+0k 0+0io 0pf+0w
The "264.8%" refers to the total CPU time divided by the wall-clock
time. It can be greater than 100% when using multiple cores in
parallel.
Converting large data sets to FPB format
------------------------------------------
In this section you'll learn how to generate an FPB file on computers
with relatively limited memory. To be realistic, this example uses the
complete `PubChem data set `_,
and extracts the CACTVS/PubChem fingerprints which are in each
record. You do not need a chemistry toolkit for this section.
The most direct way to extract the PubChem fingerprints from a PubChem
distribution is to use :ref:`sdf2fps `::
sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb
This uses the default FPB writer options, which stores all of the
fingerprints in memory, sorts them, and saves the result to the output
file. This may use several times as much memory as the final FPB
output size, which is a bit unfortunate if you want to generate a 7 GB
FPB file on a 12 GB machine.
When I updated this section in June 2020, it took around 25GB of
memory to create an FPB file with 102,768,482 PubChem fingerprints,
and the final file was about 14GB.
(Note: see :ref:`the next section `
for a two-stage solution that lets you parallelize fingerprint
generation.)
The "\*2fps" command-line tools do not have a way to change the
default writer options, although :ref:`fpcat ` does. The
:option:`--max-spool-size` option sets a rough upper bound to the
amount of memory to use. When enabled, the writer breaks the input
into parts and creates a temporary FPB file for each part. At the end,
it merges the sorted data from the temporary FPB files to get the
final FPB file. Be aware that the specified spool size is only
approximate and is not a hard limit on the maximum amount of memory to
use. You may need to experiment a bit if you have tight constraints,
and this option might not be as useful as I thought it was.
The value must be a size in bytes, though suffixes like M or MB for
megabyte and T or TB for terabyte are also allowed. These are in
base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed
between the number and the suffix, so "200MB" is okay but "200 MB" is
not. The size must be at least 20 MB.
Here is an example of how to convert the CACTVS fingerprints from all
of PubChem to an FPB file, using a relatively small limit of 200 MB::
sdf2fps --pubchem pubchem/Compound_*.sdf.gz | fpcat --max-spool-size 200MB -o pubchem.fpb
This will take a while! The sdf2fps alone takes almost 45 minutes on a
ca. 2017-era Haswell machine.
If I save the intermediate results to an FPS file then the in-memory
fpcat conversion from FPS to FPB takes 5½ minutes and requires 25GB of
memory.
With spool of 200MB, the conversion takes nearly 10 minutes. According
to ``htop``, the spooled conversion required, near the peak, 13.3G of
virtual memory, a resident set size of 12G, and 10.6G of shared shared
pages. The shared pages are from memory-mapping the intermediate FPB
files, so this probably required only 2GB of real memory.
If I use a 1GB spool size, the conversion time decreases from 10 to 8
minutes, and uses about the same amount of peak memory.
The temporary files will be placed under the appropriate temporary
directory for your operating system. If that disk isn't large enough
for the intermediate files then use the :option:`--tmpdir` option of
fpcat to specify an alternate directory::
fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb --tmpdir /usr/tmp
Another option is to specify the directory location using the TMPDIR,
TEMP, or TMP environment variables, which are resolved in that
order. The details are described in the Python documentation for
`tempfile.tempdir `_.
.. _external_gzcat:
Faster gzip decompression
------------------------------------
In this section you'll learn how to use an external program to improve
the performance of reading gzip files. You will need PubChem file
`Compound_016000001_016500000.sdf.gz
`_.
The PubChem file Compound_016000001_016500000.sdf.gz is one of the
largest files from PubChem, measured by compressed size. The copy I
have is 543M compressed and 3.9G uncompressed.
It takes about 17 seconds to extract the PubChem fingerprints from
the compressed file:
% sdf2fps --pubchem Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 100%|████████████| 569M/569M [00:17<00:00, 32.4Mbytes/s]
In comparison, it takes about 10 seconds to extract the fingerprints
from the uncompressed file:
% sdf2fps --pubchem Compound_016000001_016500000.sdf -o pubchem.fps
Compound_016000001_016500000.sdf: 100%|████████████████| 4.19G/4.19G [00:09<00:00, 434Mbytes/s]
For this case, gzip adds 70% overhead to the processing step!
There are two sources for the overhead. First, a fair amount of
chemfp's internal gzip processing uses Python. Second, it's
single-threaded, that is, a block of text is decompressed, then passes
to the FPS parser.
Chemfp has the option to run an external program to handle
decompression. Use the environment "CHEMFP_GZCAT" to specify the
program to use. If called with no arguments, it must read from
stdin. If called with one argument, it must read from the named
file. The decompressed data must be written to stdout, so it can be
read by chemfp.
In most cases you'll use "zcat" (for Linux-based OSes) or "gzcat" (for
macOS).
Here's an example::
% env CHEMFP_GZCAT=gzcat sdf2fps --pubchem \
Compound_016000001_016500000.sdf.gz -o pubchem.fps
Compound_016000001_016500000.sdf.gz: 442852 recs [00:15, 28125.27 recs/s]
(The progress bar can't give a progress bar because chemfp doesn't
know the current read position relative to the entire file.)
This took about 15 seconds, which is 2 seconds faster, or only about
50% overhead.
An improvement of two seconds may not sound like much, but 10%
performance gain is nice when processing many gzip-compressed files.
.. _generate_fingerprints_in_parallel:
Generate fingerprints in parallel and merge to FPB format
------------------------------------------------------------------------------
In this section you'll learn how to merge multiple sorted fingerprints
into a single FPB file.
The previous section used a single shell command to extract the
PubChem/CACTVS fingerprints from PubChem and generate an FPB
file. This is easy to write and understand, but more complex versions
may be more appropriate.
For one, I have four cores on my desktop computer, and I want to use
them to process the PubChem files in parallel. The previous section
was only single threaded.
I have all my PubChem files in ``~/pubchem/``. For each
"Compound_*.sdf.gz" file in that directory I want to extract the
CACTVS/PubChem fingerprints and create an intermediate FPS file in the
local directory. That's equivalent to running the following commands::
sdf2fps --pubchem ~/pubchem/Compound_000000001_000500000.sdf.gz \\
-o Compound_000000001_000500000.fps
sdf2fps --pubchem ~/pubchem/Compound_000500001_001000000.sdf.gz \\
-o Compound_000500001_001000000.fps
... 291 more lines ...
except that I want to run four at a time.
This is what `GNU Parallel `_
was designed for. It's a command-line tool which can parallelize the
execution of other command-lines.
I'll start by explaining the core command-line substitution pattern::
sdf2fps --pubchem {} -o {/..}.fps'
The ``{}`` will be replaced with a filename, and ``{/..}`` will be
replaced with the base filename, without the directory path prefix or
the two suffixes. That is, when ``{}`` is
"/Users/dalke/pubchem/Compound_000000001_000500000.sdf.gz" then
``{/..}`` will be "Compound_000000001_000500000.fps".
Since I want to generate an FPS file, I added the ".fps" as a suffix
to the second substitution parameter.
I then tell GNU parallel which command-line to use, along with a few
other parameters. Here's the full line, which I split over two lines
to make it more readable::
parallel --plus --no-notice --bar 'sdf2fps --pubchem {}
-o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz
The :option:`--plus` tells GNU parallel to recognize an expanded set
of replacement strings. ("{/..}" is not part of the standard set of
patterns.)
The :option:`--no-notice` tells it to not display the message about
citing GNU parallel in scientific papers.
The :option:`--bar` enables a progress bar, which looks like this::
30% 88:205=11m17s /Users/dalke/pubchem/Compound_045500001_046000000.sdf.gz
This status line shows that processing is 30% complete, which is file
88 out of 205, and there's an estimated 11 minutes and 17 seconds
remaining.
Finally, the ":::" indicates that the remaining options are the list
of parameters to pass to the command-line template for
parallelization.
After about 21 minutes, using 4 CPUs on my laptop (with an effective
scaling of 2.8), I now have a large number of FPS files, which I want
to merge into a single FPB file. I'll use :ref:`fpcat `::
fpcat --max-spool-size 1GB Compound*.fps -o pubchem.fpb
Unfortunately my laptop ran out of disk space, so I'll just leave it a
that; re-doing the same command on a server machine won't provide you
any new information.