.. highlight:: none
===================================
Working with the command-line tools
===================================
The sections in this chapter describe examples of using the
command-line tools to generate fingerprint files and to do similarity
searches of those files.
.. _pubchem_fingerprints:
Generate fingerprint files from PubChem SD tags
===============================================
In this section you'll learn how to create a fingerprint file from an
SD file which contains pre-computed CACTVS fingerprints. You do not
need a chemistry toolkit for this section.
`PubChem `_ is a great resource
of publically available chemistry information. The data is available
for `ftp download `_. We'll use some of
their `SD formatted
`_ files.
Each record has a PubChem/CACTVS fingerprint field, which we'll extract
to generate an FPS file.
Start by downloading the files `Compound_027575001_027600000.sdf.gz
`_
and `Compound_014550001_014575000.sdf.gz
`_. At
the time of writing (April 2017) they contain 384 and 5167 records,
respectively. (I chose smaller than average files so they would be
easier to open and review.)
Next, convert the files into fingerprint files. On the command line
do the following two commands::
sdf2fps --pubchem Compound_027575001_027600000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_014550001_014575000.sdf.gz -o pubchem_targets.fps
Congratulations, that was it!
If you're curious about what an FPS file looks like, here are the
first 10 lines of pubchem_queries.fps, with some of the lengthy
fingerprint lines replaced with an ellipsis::
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_027575001_027600000.sdf.gz
#date=2017-09-16T21:25:08
075e1c00020800000000 ... 1fd7e91913047100000402002001000000020100900000000000000000 27575190
035e1c00620000000000 ... 1f97e11913047100000800402000080000040020100004000000000000 27575192
075e1c00020000000000 ... 1f97e11913057101000002006800000000000100340000000000000000 27575198
075e1c00024000000000 ... 1f97e11913047100000000002000000000000000100000000000000000 27575208
How does this work? Each PubChem record contains the precomputed
CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. Here's
what it looks like for record 27575190, which is the first record in
Compound_027575001_027600000.sdf.gz::
>
AAADceB6OABAEAAAAAAAAAAAAAAAAAAAAAAwYAAAAAAAAAABQAAAHgRQAAABrAil2AKyyYLABAqIAiXS
WHLCAAAlChQIiBlAbOgKJjLgtZ2HMQhk1AH465eYyCCOAAAgQAAEgAAAAECAAAkAAAAAAAAAAA==
The :option:`--pubchem` flag tells :ref:`sdf2fps ` to get the value of that
tag and decode it to get the fingerprint. It also adds a few metadata
fields to the fingerprint file header.
The order of the FPS fingerprints are the same as the order of the
corresponding record in the SDF. You can see that in the output, where
27575190 is the first record in the FPS fingerprints.
If you store records in an SD file then you almost certainly don't use
the same fingerprint encoding as PubChem. :ref:`sdf2fps ` can
decode from a number of encodings, like hex and base64. Use
:option:`--help` to see the list of available decoders.
k-nearest neighbor search
=========================
In this section you'll learn how to search a fingerprint file to find
the k-nearest neighbors. You will need the FPS fingerprint files
generated in :ref:`pubchem_fingerprints` but you do not need a
chemistry toolkit.
We'll use the pubchem_queries.fps as the queries for a k=2 nearest
neighor similarity search of the target file puchem_targets.gps::
simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps
That's all! You should get output which starts::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=2 threshold=0.0
#software=chemfp/3.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_sources=Compound_027575001_027600000.sdf.gz
#target_sources=Compound_014550001_014575000.sdf.gz
2 27575190 14555201 0.7236 14566941 0.7105
2 27575192 14555203 0.7158 14555201 0.7114
2 27575198 14555201 0.7286 14569555 0.7259
2 27575208 14555201 0.7701 14566941 0.7584
Here's how to interpret the output. The lines starting with '#' are
header lines. It contains metadata information describing that this is
a similarity search report. You can see the search parameters, the
name of the tool which did the search, and the filenames which went
into the search.
After the '#' header lines come the search results, with one result
per line. There are in the same order as the query fingerprints. Each
result line contains tab-delimited columns. The first column is the
number of hits. The second column is the query identifier used. The
remaining columns contain the hit data, with alternating target id and
its score.
For example, the first result line contains the 2 hits for the
query 27575190. The first hit is the target id 14555201 with score
0.7236 and the second hit is 14566941 with score 0.7105. Since this is
a k-nearest neighor search, the hits are sorted by score, starting
with the highest score. Do be aware that ties are broken arbitrarily.
Threshold search
================
In this section you'll learn how to search a fingerprint file to find
all of the neighbors at or above a given threshold. You will need the
FPS fingerprint files generated in :ref:`pubchem_fingerprints` but you
do not need a chemistry toolkit.
Let's do a threshold search and find all hits which are at least 0.738
similar to the queries::
simsearch --threshold 0.738 -q pubchem_queries.fps pubchem_targets.fps
The first 14 lines of output from this are::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=all threshold=0.738
#software=chemfp/3.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_sources=Compound_027575001_027600000.sdf.gz
#target_sources=Compound_014550001_014575000.sdf.gz
0 27575190
0 27575192
0 27575198
3 27575208 14566941 0.7584 14566938 0.7542 14555201 0.7701
3 27575221 14566941 0.7473 14566938 0.7432 14555201 0.7592
3 27575223 14566941 0.7473 14566938 0.7432 14555201 0.7592
Take a look at the last line, which contains the 3 hits for the query
id 27575223. As before, the hit information alternates between the
target ids and the target scores, but unlike the k-nearest search, the
threshold search hits are not in a particular order. You can see that
here with the scores 0.7473, 0.7432, 0.7592, which are in neither
increasing nor decreasing order.
Combined k-nearest and threshold search
=======================================
In this section you'll learn how to search a fingerprint file to find
the k-nearest neighbors, where all of the hits must be at or above
given threshold. You will need the fingerprint files generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
You can combine the :option:`-k` and :option:`--threshold` queries to
find the k-nearest neighbors which are all above a given threshold::
simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_targets.fps
This find the nearest 3 structures, which all must be at least 0.7
similar to the query fingerprint. The output from the above starts::
#Simsearch/1
#num_bits=881
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/3.2
#queries=pubchem_queries.fps
#targets=pubchem_targets.fps
#query_sources=Compound_027575001_027600000.sdf.gz
#target_sources=Compound_014550001_014575000.sdf.gz
3 27575190 14555201 0.7236 14566941 0.7105 14566938 0.7068
2 27575192 14555203 0.7158 14555201 0.7114
3 27575198 14555201 0.7286 14569555 0.7259 14553070 0.7065
3 27575208 14555201 0.7701 14566941 0.7584 14566938 0.7542
3 27575221 14555201 0.7592 14566941 0.7473 14566938 0.7432
3 27575223 14555201 0.7592 14566941 0.7473 14566938 0.7432
2 27575240 14555201 0.7150 14566941 0.7016
2 27575250 14555203 0.7128 14555201 0.7085
3 27575257 14572463 0.7468 14563588 0.7250 14561245 0.7219
The output format is identical to the previous two search examples,
and because this is a k-nearest search, the hits are sorted from
highest score to lowest.
NxN (self-similar) searches
===========================
In this section you'll learn how to use the same fingerprints as both
the queries and targets, that is, a self-similarity search. You will
need the pubchem_queries.fps fingerprint file generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
Use the :option:`--NxN` option if you want to use the same set of fingerprints
as both the queries and targets. Using the pubchem_queries.fps from
the previous sections::
simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps
This code is very fast because there are so few fingerprints. For
larger files the :option:`--NxN` will be about twice as fast and use
half as much memory compared to::
simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps
In addition, the :option:`--NxN` option excludes matching a
fingerprint to itself (the diagonal term).
.. _chebi_fingerprints:
Using a toolkit to process the ChEBI dataset
============================================
In this section you'll learn how to create a fingerprint file from a
structure file. The structure processing and fingerprint generation
are done with a third-party chemisty toolkit. chemfp supports Open
Babel, OpenEye, and RDKit. (OpenEye users please note that you will
need an OEGraphSim license to use the OpenEye-specific
fingerprinters.)
We'll work with data from `ChEBI `_,
which are "Chemical Entities of Biological Interest". They distribute
their structures in several formats, including as an SD file. For this
section, download the "lite" version from
ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz . It
contains the same structure data as the complete version but many
fewer tag data fields. For ChEBI 155 this file contains 95,955 records
and the compressed file is 28MB.
Unlike the PubChem data set, the ChEBI data set does not contain
fingerprints so we'll need to generate them using a toolkit.
ChEBI record titles don't contain the id
----------------------------------------
Strangely, the ChEBI dataset does not use the title line of the SD
file to store the record id. A simple examination shows that 47,376 of
the title lines are empty, 39,615 have the title "null", 4,499 have
the title " ", 2,033 have the title "ChEBI", 45 of them are labeled
"Structure #1", and the others are usually compound names.
(I've asked ChEBI to fix this, to no success. Perhaps you have more
influence?)
Instead, the record id is stored as value of the "ChEBI ID" tag, which
looks like::
>
CHEBI:776
By default the toolkit-based fingerprint generation tools use the
title as the identifier, and print a warning and skip the record if
the identifier is missing. Here's an example with :ref:`rdkit2fps
`::
% rdkit2fps ChEBI_lite.sdf.gz
#FPS1
#num_bits=2048
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=ChEBI_lite.sdf.gz
#date=2017-09-14T21:17:44
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 201, record #5. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 236, record #6. Skipping.
[23:17:44] S group MUL ignored on line 103
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 264, record #7. Skipping.
[23:17:44] Unhandled CTAB feature: S group SRU on line: 31. Molecule skipped.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 435, record #9. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 519, record #10. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 581, record #11. Skipping.
031087be231150242e714400920000a193c1080c02858a1116a68100a58806342840405253004080c8cc3c4811
4101b25081a10c025e634c08a1c00088102c0400121040a2080505188a9c0a150000028211219c1001000981c4
804417180aca0401408500180182210716db1580708a0b8a0802820532854411200c1101040404001118600d0a
518402385dc00011290602205a070480c148f240421000c321801922c7808740cd0b10ea4c40000403dc180121
94d8d120020150b3d00043a24370000201042881d15018c0e0901442881d68604c4a83808110c772a824051948
003c801360600221040010e20418381668404b0424ec130f05a090c94960e0 ChEBI
00008000000000000000002880000000000000000200000004008000000000000000200040000002000c000000
000000000080080000000200400100000000000000001000000400001000000000000000800000000000000100
00000801002000000001000000400004c000000000000000800004000000001102000000200004000000100300
08000000000000000000000000000000000820000404000000800000400000200c000008040000000000000000
200101008000000000000000000202000002008000000000000002000000000008000400000000000000000100
40000100020080000001000300280000002002000000000000000000000000 ChEBI
....
That output contains only two fingerprint records, both with the id
"ChEBI". The other records had no title and were skipped, with a
message sent to stderr describing the problem and the location of the
record containing the problem.
(If the first 100 records have no identifiers then the command-line
tools will exit even if :option:`--errors` is ignore. This is a safety
mechanism. Let me know if it's a problem.)
Instead, use the :option:`--id-tag` option to specify of the name of
the data tag containing the id. For this data set you'll need to write
it as::
--id-tag "ChEBI ID"
The quotes are important because of the space in the tag name.
Here's what that looks like::
% rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz | fold | head -20
[23:26:39] S group MUL ignored on line 103
[23:26:39] Unhandled CTAB feature: S group SRU on line: 31. Molecule skipped.
#FPS1
#num_bits=2048
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2017.09.1.dev1 chemfp/3.2
#source=ChEBI_lite.sdf.gz
#date=2017-09-14T21:26:39
10208220141258c184490038b4124609db0030024a0765883c62c9e1288a1dc224de62f445743b8b
30ad542718468104d521a214227b29ba3822fbf20e15491802a051532cd10d902c39b02b51648981
9c87eb41142811026d510a890a711cb02f2090ddacd990c5240cc282090640103d0a0a8b460184f5
11114e2a8060200804529804532313bb03912d5e2857a6028960189e370100052c63474748a1c000
8079f49c484ca04c0d0bcb2c64b72401042a1f82002b097e852830e5898302021a1203e412064814
a598741c014e9210bc30ab180f0162029d4c446aa01c34850071e4ff037a60e732fd85014344f82a
344aa98398654481b003a84f201f518f CHEBI:90
00000000080200412008000008000004000010100022008000400002000020100020006000800001
01000100080001000010000002002200000200000008000000400002100000000080000004401000
80200020800200002000001400022064000004244810000000000080000a80012002020004198002
00080200020020120040203001000802010100024211000004400000000100200003000001000100
0100021000a200601080002a00002020048004030000884084000008000002040200010800000000
2000010022000800002000020001400020800100025040000000200a080244000060008000000802
8100c801108000000041c00200800002 CHEBI:165
In addition to "ChEBI ID" there's also a "ChEBI Name" tag which
includes data values like "tropic acid" and
"(+)-guaia-6,9-diene". Every ChEBI record has a unique name so the
names could also be used as the primary identifier instead of its id.
The FPS fingerprint file format allows identifiers with a space, or
comma, or anything other tab, newline, and a couple of other bytes, so
it's no problem using those names directly.
To use the ChEBI Name as the primary chemfp identifier, specify::
--id-tag "ChEBI Name"
Generate fingerprints with Open Babel
-------------------------------------
If you have the Open Babel Python library installed then you can use
:ref:`ob2fps ` to generate fingerprints::
ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
This takes just under 3 minutes on my 7 year old desktop to process
all of the records.
The default generates FP2 fingerprints, so the above is the same as::
ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps
ob2fps can generate several other types of fingerprints. (Use
:option:`--help` for a list.) For example, to generate the Open Babel
implementation of the MACCS definition specify::
ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
Generate fingerprints with OpenEye
----------------------------------
If you have the OEChem Python library installed, with licenses for
OEChem and OEGraphSim, then you can use :ref:`oe2fps ` to
generate fingerprints::
oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps
This takes about 40 seconds on my desktop and generates a number of
warnings like "Stereochemistry corrected on atom number 17 of",
"Unsupported Sgroup information ignored", and "Invalid stereochemistry
specified for atom number 9 of". Normally the record title comes after
the "... of", but the title is blank for most of the records.
OEChem could not parse 7 of the 95,955 records. I looked at the
failing records and noticed that all of them had 0 atoms and 0 bonds.
The default settings generate OEGraphSim path fingerprint with the
values::
numbits=4096 minbonds=0 maxbonds=5
atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
Each of these can be changed through command-line options. Use
:option:`--help` for details.
oe2fps can generate several other types of fingerprints. For example,
to generate the OpenEye implementation of the MACCS definition specify::
oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
Use :option:`--help` for a list of available oe2fps fingerprints or to
see more configuration details.
Generate fingerprints with RDKit
--------------------------------
If you have the RDKit Python library installed then you can use
:ref:`rdkit2fps ` to generate fingerprints. Based on the
previous examples you probably guessed that the command-line is::
rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps
This takes 5.5 minutes on my desktop, and RDKit did not generate
fingerprints for 1,101 of the 95,955 records. RDKit logs warning and
error messages to stderr. They look like::
[23:29:49] Explicit valence for atom # 6 N, 4, is greater than permitted
[23:29:49]
****
Post-condition Violation
Element '.' not found
Violation occurred on line 90 in file /Users/dalke/cvses/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[23:29:49] Unhandled CTAB feature: S group SRU on line: 52. Molecule skipped.
For example, RDKit is careful to check that structures make chemical
sense. It rejects 4-valent nitrogen and refuses to process that those
structures, which is the reason for the first line of that output.
The default generates RDKit's path fingerprints with parameters::
minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
Each of those can be changed through command-line options. See rdkit2fps
:option:`--help` for details, where you'll also see a list of the
other available fingerprint types.
For example, to generate the RDKit implementation of the MACCS
definition use::
rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps
while the following generates the Morgan/circular fingerprint with
radius 3::
rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz
Alternate error handlers
========================
In this section you'll learn how to change the error handler for
rdkit2fps using the :option:`--errors` option.
By default the "2fps" programs "ignore" structures which
could not be parsed into a molecule option. There are two other
options. They can "report" more information about the failure case and
keep on processing, or they can be "strict" and exit after reporting
the error.
This is configured with the :option:`--errors` option.
Here's the rdkit2fps output using :option:`--errors report`::
[00:52:39] S group MUL ignored on line 103
[00:52:39] Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Skipping.
[00:52:39] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 16213, record #265. Skipping.
The first two lines come from RDKit. The third line is from chemfp,
reporting which record could not be parsed. (The record starts at line
12036 of the file and the SRU is on line 36 of the record, so the SRU
is at line 12072.) The fourth line is another RDKit error message, and
the last line is another chemfp error message.
Here's the rdkit2fps output using :option:`--errors strict`::
[00:54:30] S group MUL ignored on line 103
[00:54:30] Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Exiting.
Because this is strict mode, processing exits at the first failure.
The ob2fps and oe2fps tools implement the :option:`--errors` option,
but they aren't as useful as rdkit2fps because the underlying APIs
don't give useful feedback to chemfp about which records failed. For
example, the standard OEChem file reader automatically skips records
that it cannot parse. Chemfp can't report anything when it doesn't
know there was a failure.
The default error handler in chemfp 1.1 was "strict". In practice this
proved more annoying than useful because most people want to skip the
records which could not be processed. They would then contact me
asking what was wrong, or doing some pre-processing to remove the
failure cases.
One of the few times when it is useful is for records which contain no
identifier. When I changed the default from "strict" to "ignore" and
tried to process ChEBI, I was confused at first about why the output
file was so small. Then I realized that it's because the many records
without a title were skipped, and there was no feedback about skipping
those records.
I changed the code so missing identifiers are always reported, even if
the error setting is "ignore". Missing identifiers will still stop
processing if the error setting is "strict".
chemfp's two cross-toolkit substructure fingerprints
====================================================
In this section you'll learn how to generate the two
substructure-based fingerprints which come as part of chemfp. These
are based on cross-toolkit SMARTS pattern definitions and can be used
with Open Babel, OpenEye, and RDKit. (For OpenEye users, these
fingerprints use the base OEChem library but do not use the separately
licensed OEGraphSim library.)
chemfp implements two platform-independent fingerprints where were
originally designed for substructure filters but which are also used
for similarity searches. One is based on the 166-bit MACCS
implementation in RDKit and the other comes from the 881-bit
PubChem/CACTVS substructure fingerprints.
The chemfp MACCS definition is called "rdmaccs" because it closely
derives from the MACCS SMARTS patterns used in RDKit. (These pattern
definitions are also used in Open Babel and the CDK, while OpenEye
has a completely independent implementation.)
Here are example of the respective rdmaccs fingerprint for phenol
using each of the toolkits.
Open Babel::
% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#FPS1
#num_bits=166
#type=RDMACCS-OpenBabel/2
#software=OpenBabel/2.4.1 chemfp/3.2
#date=2017-09-09T00:40:48
00000000000000000000000000000140004480101e phenol
OpenEye::
#FPS1
#num_bits=166
#type=RDMACCS-OpenEye/2
#software=OEChem/2.1.3 (20170828) chemfp/3.2
#date=2017-09-09T00:41:21
00000000000000000000000000000140004480101e phenol
RDKit::
#FPS1
#num_bits=166
#type=RDMACCS-RDKit/2
#software=RDKit/2017.09.1 chemfp/3.2
#date=2017-09-09T00:42:32
00000000000000000000000000000140004480101e phenol
For more complex molecules it's possible that different toolkits
produce different fingerprint rdmaccs, even though the toolkits use
the same SMARTS definitions. Each toolkit has a different
understanding of chemistry. The most notable is the different
definition of aromaticity, so the bit for "two or more aromatic rings"
will be toolkit dependent.
substruct fingerprints
----------------------
chemp also includes a "substruct" substructure fingerprint. This is an
881 bit fingerprint derived from the PubChem/CACTVS substructure
keys. They do not match the CACTVS fingerprints exactly, in part due
to differences in ring perception. Some of the substruct bits will
always be 0. With that caution in mind, if you want to try them out,
use the :option:`--substruct` option.
The term "substruct" is a horribly generic name. If you can think of a
better one then let me know. Until chemfp 3.0 I said these
fingerprints were "experimental", in that I hadn't fully validated
them against PubChem/CACTVS and could not tell you the error rate. I
still haven't done that.
What's changed is that I've found out over the years that people are
using the substruct fingerprints, even without full validatation. That
surprised me, but use is its own form of validation. I still would
like to validate the fingerprints, but it's slow, tedious work which I
am not really interested in doing. Nor does it earn me any
money. Plus, if the validation does lead to any changes, it's easy to
simply change the version number.
.. _pubchem_fpb_fingerprints:
Generate binary FPB files from a structure file
===============================================
In this section you'll learn how to generate an FPB file instead of an
FPS file. You will need the the ChEBI file from
:ref:`chebi_fingerprints` and a chemistry toolkit. The FPB format was
introduced with chemfp-2.0.
The FPB format was designed so the fingerprints can be memory-mapped
directly to chemfp's internal data structures. This makes it very fast
to load, but unlike the FPS format, it's not so easy to write with
your own code. You should think of the FPB format as an binary
application format, for chemfp-based tools, while the FPS format is a
text-based format for data exchange between diverse programs.
The easiest way to generate an FPB file from the command line is to
use the ".fpb" extension instead of ".fps" or ".fps.gz". Here are
examples using each of the toolkits.
Open Babel::
% ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb
OpenEye::
% oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb
RDKit::
% rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb
The binary format isn't human-readable. Use :ref:`fpcat` to see what's
inside::
% fpcat oe_chebi.fpb | head -8
#FPS1
#num_bits=4096
#type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.3.1 (20170828) chemfp/3.2
000 ... many zeros ... 000 CHEBI:15378
000 ... many zeros ... 000 CHEBI:16042
000 ... many zeros ... 000 CHEBI:17792
000 ... many zeros ... 000 CHEBI:18140
....
182 ... hex values ... c0c CHEBI:60493
By default the fingerprints are ordered from smallest popcount to
largest, which you can see in the output. A pre-ordered index is
faster to search because the target popcounts are pre-computed and
because it enables sublinear search.
If you want to preserve the input order then you'll need to pipe the
FPS output to :ref:`fpcat ` and use its :option:`--preserve-order`
flag. See the next section for an example.
Convert between FPS and FPB formats
===================================
In this section you'll learn how to convert an FPS file into an FPB
file and back, and you'll learn how to control the fingerprint
ordering. You will need the FPS files generated in
:ref:`pubchem_fingerprints` but you do not need a chemistry
toolkit. The FPB format was introduced with chemfp-2.0.
If you already have an FPS file then you can convert it directly into
an FPB file, and without using a chemistry toolkit. The :ref:`fpcat
` program converts from one format to the other.
In an earlier section I generated the files pubchem_queries.fps and
pubchem_targets.fps . I'll convert each to FPB format::
% fpcat pubchem_targets.fps -o pubchem_targets.fpb
% fpcat pubchem_queries.fps -o pubchem_queries.fpb
The FPB format is a binary format which is difficult to read
directly. The easiest way to see what's inside is to use fpcat. If you
don't specify an output filename then it sends the results to stdout
in FPS format::
% fpcat pubchem_queries.fpb | head -5
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
071e0c000000000000000000000000000080c16030000c0600000000000000000000005800000000
00f02001010040000000000010000108000000000000000000008000000000004800000000000000
0000000080901103f101000000000000000100200000040080000010000000 27581954
The keen-eyed reader might have noticed that the conversion does not
have a "source" or "date" field. I haven't figured out if this is a
bug. Should I keep the original date and structure file source, or use
the current date and FPS file source? Let me know if this is important
to you.
By default fpcat reorders the fingerprints in the FPB file by
population count. This improves the similarity search performance, but
it means that the order of the FPB file is likely different than the
original FPS format. You can get a sense of this by looking at the
first fingerprint in the original pubchem_queries.fps file::
% grep -v # pubchem_queries.fps | head -1
075e1c000208000000000000000000000000000000000c06000000000000008002000078200a0000
803510a51b404d93410320501140a44b1a4e430000a4502810119802361750644c07adb9e18c1026
2b801fd7e91913047100000402002001000000020100900000000000000000 27575190
If you want the FPB file to store the fingerprints in input order
instead of the popcount order needed for optimized similarity search,
then use the :option:`--preserve-order` flag::
% fpcat pubchem_queries.fps --preserve-order -o input_order.fpb
% fpcat input_order.fpb | grep -v # | head -1
075e1c000208000000000000000000000000000000000c06000000000000008002000078200a0000
803510a51b404d93410320501140a44b1a4e430000a4502810119802361750644c07adb9e18c1026
2b801fd7e91913047100000402002001000000020100900000000000000000 27575190
On the flip side, fpcat by default preserves the input order when it
creates FPS output. If you instead want to the output FPS file to be
in popcount order then use the :option:`--reorder` flag::
% fpcat --reorder pubchem_queries.fps | grep -v # | head -1
071e0c000000000000000000000000000080c16030000c0600000000000000000000005800000000
00f02001010040000000000010000108000000000000000000008000000000004800000000000000
0000000080901103f101000000000000000100200000040080000010000000 27581954
Specify the fpcat output format
===============================
In this section you'll learn how to specify the output format for
fpcat using a command-line option instead of the filename
extension. You will need the pubchem_queries.fpb file from
:ref:`pubchem_fingerprints`.
If you do not specify an output filename then fpcat will output the
fingerprints in FPS format to stdout. If you specify a filename then
by default it will look at the extension to determine if the output
should be an FPB (".fpb"), FPS (".fps"), or gzip compressed FPS
(".fps.gz") file. The FPS format is used for unrecognized extensions.
In a few rare cases you may want to use a format which doesn't match
the default. To be honest, the examples I can think of aren't that
realistic, but let's suppose you want to output the contents of an FPB
file to stdout in gzip'ed FPS format, and count the number of bytes in
compressed output. I'll use the use the --out flag to change the
format to 'fps.gz' from the default of 'fps', then compare the
resulting size with the uncompressed form::
% fpcat pubchem_queries.fpb --out fps.gz | wc -c
11930
% fpcat pubchem_queries.fpb --out fps | wc -c
89170
It's not that useful because you could pipe the uncompressed output to
gzip, which is also likely faster::
% fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c
11921
By the way, it is not possible to write an FPB file to stdout. In
fact, the output file must be seek-able, which means it can't be a
named pipe either.
Alternate fingerprint file formats
==================================
In this section you'll learn about chemfp's support for other
fingerprint file formats.
Chemfp started as a way to promote the FPS file format for fingerprint
exchange. Chemfp 2.0 added the FPB format, which is a binary format
designed around chemfp's internal search data structure so it can be
loaded quickly.
There are many other fingerprint formats. Perhaps the best
known is the Open Babel `FastSearch
`_ format. Two others are Dave
Cosgrove's `flush `_ format,
and OpenEye's "fpbin" format.
The `chemfp_converters package
`_ contains utilities
to convert between the chemfp formats and these other formats.::
# Convert from/to Dave Cosgrove Flush format
flush2fps drugs.flush
fps2flush drugs.fps -o drugs.flush
# Convert from/to OpenEye's fpbin format
fpbin2fps drugs.fpbin --moldb drugs.sdf
fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin
# Convert from/to Open Babel's FastSearch format
fs2fps drugs.fs --datafile drugs.sdf
fps2fs drugs_openbabel_FP2.fps --datafile drugs.sdf -o drugs.fs
Of the three formats, the flush format is closest to the FPS data
model. That is, it stores fingerprint records as an identifier and the
fingerprint bytes. By comparison, the FastSearch and fpbin formats
store the fingerprint bytes and an index into another file containing
the structure and identifier. It's impossible for chemfp to get the
data it needs without reading both files.
Chemfp has special support for the flush format. If chemfp_converters
is installed, chemfp will use it to read and write flush files nearly
everywhere that it accepts FPS files. You can use it at the output to
oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch,
and as both input and output to fpcat. (You cannot use it as the
simsearch targets because that code has been optimized for FPS and FPB
search, and I haven't spent the time to optimize flush file support.)
This means that if chemfp_converters is installed then you can use
:ref:`fpcat ` to convert between FPS, FPB, and and flush file
formats. For examples::
fpcat drugs.flush -o drugs.fps
fpcat drugs.fps -o drugs.flush
In addition, you can use it at the API level in :func:`chemfp.open`,
:func:`chemfp.load_fingerprints`,
:func:`chemfp.open_fingerprint_writer`, and
:meth:`.FingerprintArena.save`.
Note that the flush format does not support the FPS metadata fields,
like the fingerprint type, and it only support fingerprints which are
a multiple of 32 bits long.
Similarity search with the FPB format
=====================================
In this section you'll learn how to do a similarity search using an
FPB as the target. You will need the FPB files from
:ref:`pubchem_fingerprints` but you do not need a chemistry toolkit.
:ref:`Simsearch `, like all of the tools starting with
chemfp-2.0, understands both FPS and FPB files::
% simsearch -k 3 --threshold 0.7 -q pubchem_queries.fpb pubchem_targets.fpb | head
#Simsearch/1
#num_bits=881
#type=Tanimoto k=3 threshold=0.7
#software=chemfp/3.2
#queries=pubchem_queries.fpb
#targets=pubchem_targets.fpb
3 27581954 14565747 0.7833 14563541 0.7333 14573233 0.7258
3 27581957 14565747 0.7833 14563541 0.7333 14573233 0.7258
3 27580389 14568366 0.8468 14568369 0.8393 14560737 0.8374
2 27584917 14563095 0.7795 14563096 0.7795
By default simsearch uses the query and target filename extensions to
figure out if the file is in FPS or FPB format.
If you don't want it to auto-detect the format then use the
:option:`--query-format` and :option:`--target-format` options to tell
it the format to use. The values can be one of "fps", "fps.gz" and
"fpb".
Converting large data sets to FPB format
========================================
In this section you'll learn how to generate an FPB file on computers
with relatively limited memory. To be realistic, this example uses the
complete `PubChem data set `_,
and extracts the CACTVS/PubChem fingerprints which are in each
record. You do not need a chemistry toolkit for this section.
The most direct way to extract the PubChem fingerprints from a PubChem
distribution is to use :ref:`sdf2fps `::
sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb
This uses the default FPB writer options, which stores all of the
fingerprints in memory, sorts them, and saves the result to the output
file. This may use about 2-3 times as much memory as the final FPB
output size, which is a bit unfortunate if you want to generate a 7 GB
FPB file on a 12 GB machine.
(Note: see :ref:`the next section `
for a two-stage solution that lets you parallelize fingerprint
generation.)
The "\*2fps" command-line tools do not have a way to change the
default writer options, although :ref:`fpcat ` does. The
:option:`--max-spool-size` option sets a rough upper bound to the
amount of memory to use. When enabled, the writer breaks the input
into parts and creates a temporary FPB file for each part. At the end,
it merges the sorted data from the temporary FPB files to get the
final FPB file. Note that the specified spool size is only approximate
and is not a hard limit on the maximum amount of memory to use. You
may need to experiment a bit if you have tight constraints.
The value must be a size in bytes, though suffixes like M or MB for
megabyte and T or TB for terabyte are also allowed. These are in
base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed
between the number and the suffix, so "200MB" is okay but "200 MB" is
not. The size must be at least 20 MB.
Here is an example of how to convert the CACTVS fingerprints from all
of PubChem to an FPB file, using a relatively small limit of 200 MB::
sdf2fps --pubchem pubchem/Compound_*.sdf.gz | fpcat --max-spool-size 200MB -o pubchem.fpb
This will take a while! The sdf2fps alone takes almost 45 minutes on
my desktop, of which 50% of the time was to decompress the files.
The temporary files will be placed under the appropriate temporary
directory for your operating system. If that disk isn't large enough
for the intermediate files then use the :option:`--tmpdir` option of
fpcat to specify an alternate directory::
fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb
Another option is to specify the directory location using the TMPDIR,
TEMP, or TMP environment variables, which are resolved in that
order. The details are described in the Python documentation for
`tempfile.tempdir `_.
.. _generate_fingerprints_in_parallel:
Generate fingerprints in parallel and merge to FPB format
=========================================================
In this section you'll learn how to merge multiple sorted fingerprints
into a single FPB file.
The previous section used a single shell command to extract the
PubChem/CACTVS fingerprints from PubChem and generate an FPB
file. This is easy to write and understand, but more complex versions
may be more appropriate.
For one, I have four cores on my desktop computer, and I want to use
them to process the PubChem files in parallel. The previous section
was only single threaded.
I have all my PubChem files in ``~/pubchem/``. For each
"Compound_*.sdf.gz" file in that directory I want to extract the
CACTVS/PubChem fingerprints and create an intermediate FPS file in the
local directory. That's equivalent to running the following commands::
sdf2fps --pubchem ~/pubchem/Compound_000000001_000025000.sdf.gz \\
-o Compound_000000001_000025000.fps
sdf2fps --pubchem ~/pubchem/Compound_000025001_000050000.sdf.gz \\
-o Compound_000025001_000050000.fps
... 2146 more lines ...
except that I want to run four at a time.
This is what `GNU Parallel `_
was designed for. It's a command-line tool which can parallelize the
exection of other command-lines.
I'll start by explaining the core command-line substitution pattern::
sdf2fps --pubchem {} -o {/..}.fps'
The ``{}`` will be replaced with a filename, and ``{/..}`` will be
replaced with the base filename, without the directory path prefix or
the two suffixes. That is, when ``{}`` is
"/Users/dalke/pubchem/Compound_000000001_000025000.sdf.gz" then
``{/..}`` will be "Compound_000000001_000025000".
Since I want to generate an FPS file, I added the ".fps" as a suffix
to the second substitution parameter.
I then tell GNU parallel which command-line to use, along with a few
other parameters. Here's the full line, which I split over two lines
to make it more readable::
parallel --plus --no-notice --bar 'sdf2fps --pubchem {}
-o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz
The :option:`--plus` tells GNU parallel to recognize an expanded set
of replacement strings. ("{/..}" is not part of the standard set of
patterns.)
The :option:`--no-notice` tells it to not display the message about
citing GNU parallel in scientific papers.
The :option:`--bar` enables a progress bar, which looks like this::
26% 763:2148=1696s /Users/dalke/pubchem/Compound_019150001_019175000.sdf.gz
It's 26% through processing the filenames, which is file 763 out of
2148, and there's an estimated 1696 seconds remaining.
Finally, the ":::" indicates that the remaining options are the list
of parameters to pass to the command-line template for
parallelization.
After about 30 minutes, I now have a large number of FPS files, which
I want to merge into a single FPB file. I'll use :ref:`fpcat `::
fpcat --max-spool-size 2GB Compound*.fps -o pubchem.fpb
This took about 15 minutes. (It's a bit odd that the overall
performance wasn't that much better than the single-threaded code. It
would probably be more clear with compute-intenstive fingerprints,
instead of simple text extraction from an SD tag.)
Note: I started this section as an example of when to use the
:option:`--merge` option to fpcat. When the fingerprints are in
popcount order then popcount sorted output is a merge sort of the
inputs. This doesn't need RAM or temporary disk space for an
intermediate sort. My thought was to save the intermediate
fingerprints in FPB format instead of FPS, which has a side-effect of
sorting the fingerprints. Then I could simply merge the results.
I did this, and ran into two problems. There are 2912 files, and fpcat
will open all of them in order to do the parallel merge. I ran out of
file descriptors, and had to increase the limit to 6000 (3000 is too
small) before it would work. In the future I'll have to implement some
sort of multi-layer merge for when there are too many files. However,
even with 6000 available descriptors, iterating over the FPB-backed
FingerprintArena proved to be rather slow, and I'm not yet sure way. I
think it's simply that I didn't design that code for fast iteration.
Take home message? Use FPB files for now only as the last file format
in your pipeline.