Working with the command-line tools

The sections in this chapter describe examples of using the command-line tools to generate fingerprint files and to do similarity searches of those files.

Generating fingerprint files from PubChem SD files

In this section you’ll learn how to create a fingerprint file from an SD file which contains pre-computed CACTVS fingerprints. You do not need a chemistry toolkit for this section.

PubChem is a great resource of publically available chemistry information. The data is available for ftp download. We’ll use some of their SD formatted files. Each record has a PubChem/CACTVS fingerprint field, which we’ll used.

Start by downloading the files Compound_027575001_027600000.sdf.gz (from ) and Compound_014550001_014575000.sdf.gz (from ). At the time of writing they contain 213 and 5208 records, respectively. (I chose smaller than average files so they would be easier to open and review.)

Next, convert the files into fingerprint files. On the command line do the following two commands:

sdf2fps --pubchem Compound_027575001_027600000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_014550001_014575000.sdf.gz -o pubchem_targets.fps

Congratulations, that was it!

How does this work? Each PubChem record contains the precomputed CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. The --pubchem flag tells sdf2fps to get the value of that tag and decode it to get the fingerprint. It also adds a few metadata fields to the fingerprint file header.

The order of the fingerprints are the same as the order of the corresponding record in the SDF, although unconvertable records might be skipped, depending on the --errors flag.

If you store records in an SD file then you almost certainly don’t use the same fingerprint encoding as PubChem. sdf2ps can decode from a number of encodings. Use --help to see the list of available decoders.

NxN (self-similar) searches

Use the –NxN option if you want to use the same fingerprints as both the queries and targets:

simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps

This is about twice as fast and uses half as much memory compared to:

simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps

Plus, the –NxN option excludes matching a fingerprint to itself (the diagonal term).

Using a toolkit to process the ChEBI dataset

In this section you’ll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, and RDKit. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.)

We’ll work with data from ChEBI which contains “Chemical Entities of Biological Interest”. They distribute their structures in several formats, including as an SD file. For this section, download the “lite” version from . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 155 this file contains 95,955 records and the compressed file is 28MB.

Unlike the PubChem data set, the ChEBI data set does not contain fingerprints so we’ll need to generate them using a toolkit.

ChEBI record titles don’t contain the id

Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 47,376 of the title lines are empty, 39,615 have the title “null”, 4,499 have the title ” “, 2,033 have the title “ChEBI”, 45 of them are labeled “Structure #1”, and the others are usually compound names.

(I’ve asked ChEBI to fix this, to no success. Perhaps you have more influence?)

Instead, the id is stored as the value of the “ChEBI ID” tag, which in the SD file looks like:

> <ChEBI ID>

By default the toolkit-based fingerprint generation tools use the title as the identifier, and print a warning and skip the record if the identifier is missing. Here’s an example with rdkit2fps:

ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 201, record #5. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 236, record #6. Skipping.
[22:53:43]  S group MUL ignored on line 103
     ... skipping many lines ...
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 22392, record #343. Skipping.
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2018.03.1.dev1 chemfp/1.4
003c801360600221040010e20418381668404b0424ec130f05a090c94960e0      ChEBI
40000100020080000001000300280000002002000000000000000000000000    ChEBI
410291002200024002a1100b5038410206a0000900404400001150000a020a null
    ... and more ...

That output I showed contains only three fingerprint records, the first two with the id “ChEBI” and the last with the id of ‘null’. The earlier records had no title or the title was a space character, so they were skipped, with a message sent to stderr describing the problem and the location of the record containing the problem.

(If the first 100 records have no identifiers then the command-line tools will exit even if --errors is ignore. This is a safety mechanism. Let me know if it’s a problem.)

Instead, use the --id-tag option to specify of the name of the data tag containing the id. For this data set you’ll need to write it as:

--id-tag "ChEBI ID"

The quotes are important because of the space in the tag name.

Here’s what the first few lines of that output looks like:

[22:58:35]  S group MUL ignored on line 103
[22:58:35]  Unhandled CTAB feature: S group SRU on line: 31. Molecule skipped.
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2018.03.1.dev1 chemfp/1.4
344aa98398654481b003a84f201f518f    CHEBI:90
8100c801108000000041c00200800002    CHEBI:165

In addition to “ChEBI ID” there’s also a “ChEBI Name” tag which includes data values like “tropic acid” and “(+)-guaia-6,9-diene”. Every ChEBI record has a unique name so the names could also be used as the primary identifier.

The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other special bytes, so it’s no problem using those names directly.

To use the ChEBI Name as the primary chemfp identifier, specify:

--id-tag "ChEBI Name"

Generating fingerprints with Open Babel

If you have the Open Babel Python library installed then you can use ob2fps to generate fingerprints:

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

This takes just under 3 minutes on my 7 year old desktop to process all of the records.

The default uses the FP2 fingerprints, so the above is the same as:

ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

ob2fps can generate several other types of fingerprints. (Use --help for a list.) For example, to generate the Open Babel implementation of the MACCS definition use:

ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Generating fingerprints with OpenEye

If you have the OEChem Python library installed, with licenses for OEChem and OEGraphSim, then you can use oe2fps to generate fingerprints:

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps

This takes about 40 seconds on my desktop and generates a number of warnings like “Stereochemistry corrected on atom number 17 of”, “Unsupported Sgroup information ignored”, and “Invalid stereochemistry specified for atom number 9 of”. Normally the record title comes after the “… of”, but the title is blank for most of the records.

OEChem could not parse 7 of the 95,955 records. I looked at the failing records and noticed that all of them had 0 atoms and 0 bonds.

The default settings produce OEGraphSim path fingerprint with the values:

numbits=4096 minbonds=0 maxbonds=5
   atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral

Each of these can be changed through command-line options.

oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition specify:

oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Use --help for a list of available oe2fps fingerprints or to see more configuration details.

Generating fingerprints with RDKit

If you have the RDKit Python library installed then you can use rdkit2fps to generate fingerprints. Based on the previous examples you probably guessed that the command-line is:

rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps

This takes just under 6 minutes on my desktop, and RDKit did not generate fingerprints for 1,101 of the 95,955 records.

You can see some of the RDKit error messages in the output, like:

[00:47:02] Explicit valence for atom # 12 N, 4, is greater than permitted
[00:47:02]  S group DAT ignored on line 102

These come from RDKit’s error log. RDKit is careful to check that structures make chemical sense, and in this case it didn’t like the 4-valent nitrogen. It refuses to process this molecule.

The default generates RDKit’s path fingerprints with parameters:

minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1

(NOTE! In chemfp 1.1 the default nBitsPerHash was 4. The RDKit default nBitsPerHash is 2.)

Each of those can be changed through command-line options. See rdkit2fps --help for details, where you’ll also see a list of the other available fingerprint types.

For example, to generate the RDKit implementation of the MACCS definition use:

rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

while the following generates the Morgan/circular fingerprint with radius 3:

rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz

Alternate error handlers

In this section you’ll learn how to change the error handler for rdkit2fps using the --errors option.

By default the “<toolkit>2fps” programs “ignore” structures which could not be parsed into a molecule option. There are two other options. They can “report” more information about the failure case and keep on processing, or they can be “strict” and exit after reporting the error.

This is configured with the --errors option.

Here’s the rdkit2fps output using --errors report:

[00:52:39]  S group MUL ignored on line 103
[00:52:39]  Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Skipping.
[00:52:39] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 16213, record #265. Skipping.

The first two lines come from RDKit. The third line is from chemfp, reporting which record could not be parsed. (The record starts at line 12036 of the file and the SRU is on line 36 of the record, so the SRU is at line 12072.) The fourth line is another RDKit error message, and the last line is another chemfp error message.

Here’s the rdkit2fps output using --errors strict:

[00:54:30]  S group MUL ignored on line 103
[00:54:30]  Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Exiting.

Because this is strict mode, processing exits at the first failure.

The ob2fps and oe2fps tools implement the --errors option, but they aren’t as useful as rdkit2fps because the underlying APIs don’t give useful feedback to chemfp about which records failed. For example, the standard OEChem file reader automatically skips records that it cannot parse. Chemfp can’t report anything when it doesn’t know there was a failure.

The default error handler in chemfp 1.1 was “strict”. In practice this proved more annoying than useful because most people want to skip the records which could not be processed. They would then contact me asking what was wrong, or doing some pre-processing to remove the failure cases.

One of the few times when it is useful is for records which contain no identifier. When I changed the default from “strict” to “ignore” and tried to process ChEBI, I was confused at first about why the output file was so small. Then I realized that it’s because the many records without a title were skipped, and there was no feedback about skipping those records.

I changed the code so missing identifiers are always reported, even if the error setting is “ignore”. Missing identifiers will still stop processing if the error setting is “strict”.

Alternate fingerprint file formats

In this section you’ll learn about chemfp’s support for other fingerprint file formats.

Chemfp started as a way to promote the FPS file format for fingerprint exchange. Chemfp 2.0 added the FPB format, which is a binary format designed around chemfp’s internal search data structure so it can be loaded quickly. (For FPB support you will need to get a copy of the commercial version of chemfp.)

There are many other fingerprint formats. Perhaps the best known is the Open Babel FastSearch format. Two others are Dave Cosgrove’s flush format, and OpenEye’s “fpbin” format.

The chemfp_converters package contains utilities to convert between the chemfp formats and these other formats.:

# Convert from/to Dave Cosgrove Flush format
flush2fps drugs.flush
fps2flush drugs.fps -o drugs.flush

# Convert from/to OpenEye's fpbin format
fpbin2fps drugs.fpbin --moldb drugs.sdf
fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin

# Convert from/to Open Babel's FastSearch format
fs2fps drugs.fs --datafile drugs.sdf
fps2fs drugs_openbabel_FP2.fps  --datafile drugs.sdf  -o drugs.fs

Of the three formats, the flush format is closest to the FPS data model. That is, it stores fingerprint records as an identifier and the fingerprint bytes. By comparison, the FastSearch and fpbin formats store the fingerprint bytes and an index into another file containing the structure and identifier. It’s impossible for chemfp to get the data it needs without reading both files.

Chemfp has special support for the flush format. If chemfp_converters is installed, chemfp will use it to read and write flush files nearly everywhere that it accepts FPS files. You can use it at the output to oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch. (You cannot use it as the simsearch targets because that code has been optimized for FPS and FPB search, and I haven’t spent the time to optimize flush file support.)

This means that if chemfp_converters is installed then you can use fpcat (see also the next section) to convert between FPS and flush file formats.

In addition, you can use it at the API level in, chemfp.load_fingerprints(), chemfp.open_fingerprint_writer(), and

Note that the flush format does not support the FPS metadata fields, like the fingerprint type, and it only support fingerprints which are a multiple of 32 bits long.

Convert formats with fpcat

In this section you’ll learn how to use the command-line tool fpcat to convert between fingerprint file formats.

Chemfp 1.4 included a backport of fpcat from the commercial version of chemfp. In the commerical version, the fpcat program is often used to convert from the text-based FPS files into the binary FPB format, and vice versa.

The no-cost version of chemfp does not include the FPB format, but it does include support for Dave Cosgrove’s flush file format (see also the previous section). The fpcat program can be used to convert flush files to FPS format and vice-versa:

fpcat drugs.flush -o drugs.fps
fpcat drugs.fps -o drugs.flush

For more control over the conversion, use flush2fps and fps2flush respectively, from the chemfp_converters package.

Merge multiple fingerprint files with fpcat

In this section you’ll learn how to merge multiple fingerprint files into one using the command-line tool fpcat, and how to get slightly faster FPS arena load times by reordering the fingerprints.

The previous section showed how use fpcat to convert from one fingerprint format to another.

You can also use the fpcat program to merge multiple fingerprint files. It’s based on the general idea of the Unix ‘cat’ program. In the following example, I’ll give it three filenames, and have it save the concatenated fingerprints to an fps.gz file:

fpcat filename1.fps filename2.fps filename3.fps -o output.fps.gz

Note: fpcat uses the metadata from the first file to generate the metadata for the output. The output metadata does not currently include the ‘sources’ metadata lines because that would require opening all of the files first to get that information, then closing the files, and reopening them to get the fingerprint data. A future version of chemfp may support this option, and/or some way to specify the source line(s) directly.

For example, if you generate fingerprints for a lot of structures, you might split them up into multiple files, process them in parallel, and use fpcat to merge the results into a single file.

More concretely, I used RDKit to convert the ChEMBL 23 SD file into a SMILES file, which I want to process to get the MACCS fingerprints. I’ll break it up into three parts, so lines 1, 4, 7, etc. go into one file, lines 2, 5, 8, etc. go into another, and lines 3, 6, 9, etc. go into a third:

% awk 'NR % 3 == 0' chembl_23.rdkit.smi > subset0.smi
% awk 'NR % 3 == 1' chembl_23.rdkit.smi > subset1.smi
% awk 'NR % 3 == 2' chembl_23.rdkit.smi > subset2.smi

I’ll have rdkit2fps process each subset independently in the background (my laptop has more than 3 cores, so each job will get its own core):

% rdkit2fps --maccs166 subset0.smi -o subset0.fps &
[1] 13935
% rdkit2fps --maccs166 subset1.smi -o subset1.fps &
[2] 13943
% rdkit2fps --maccs166 subset2.smi -o subset2.fps &
[3] 13952

You may want to use something like GNU parallel for a more automated solution.

Once those are done, I’ll merge them using fpcat:

% fpcat subset0.fps subset1.fps subset2.fps -o chembl_23.maccs.fps

By default the output fingerprints contain the fingerprints from the first file, in the order they appear in the file, followed by the fingerprints from the second file, and so on.

Chemfp goes through several steps to load an FPS file into an arena. It loads the fingerprints into memory, it sorts them by population count, so that fingerprints with 0 bits set come first, then those with 1 bit set, etc., and finally it creates an index describing the offset to each of those popcount boundaries.

As an optimization, if the fingerprints are already ordered, then there’s no need to sort them, so it skips that step. Here’s an example of the time needed to load the 1.7M ChEMBL 23 MACCS fingerprints:

% time python -c 'import chemfp; chemfp.load_fingerprints("chembl_23.maccs.fps")'
7.762u 0.251s 0:08.01 100.0%  0+0k 0+0io 0pf+0w

(This was the best of 3 times.)

I can ask fpcat to reorder the fingerprints by population count. This loads all of the fingerprints into memory, sorts them, and then saves the fingerprints in sorted order.:

% fpcat subset0.fps subset1.fps subset2.fps -o chembl_23.maccs.fps --reorder

As a result, the load time decreases by about 10-15%:

% time python -c 'import chemfp; chemfp.load_fingerprints("chembl_23.maccs.fps")'
6.681u 0.246s 0:06.94 99.7%   0+0k 0+0io 0pf+0w

Of course, if you really want fast load performance, you should use the FPB format:

% time python -c 'import chemfp; print(len(chemfp.load_fingerprints("chembl_23.maccs.fpb")))'
0.078u 0.013s 0:00.09 88.8%   0+0k 0+0io 0pf+0w

About half of the 0.09 seconds is the startup overhead for Python itself.

chemfp’s two cross-toolkit substructure fingerprints

In this section you’ll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, and RDKit. (For OpenEye users, these fingerprints use the base OEChem library and not the separately licensed OEGraphSim add-on.)

Chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other is derived from the 881-bit PubChem/CACTVS substructure fingerprints.

The chemfp MACCS definition is called “rdmaccs” because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, but are completely independent from the OpenEye implementation.)

Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits.

Open Babel:

% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#software=OpenBabel/2.4.1 chemfp/1.4
00000000000000000000000000000140004480101e  phenol


% echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs
#software=OEChem/2.1.3.b.1_debug (20170816) chemfp/1.4
00000000000000000000000000000140004480101e  phenol


% echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs
#software=RDKit/2018.03.1.dev1 chemfp/1.4
00000000000000000000000000000140004480101e  phenol

For more complex molecules it’s possible that different toolkits produce different fingerprint rdmaccs, even though the toolkits use the same SMARTS definitions. Each toolkit has a different understanding of chemistry. The most notable is the different definition of aromaticity, so the bit for “two or more aromatic rings” will be toolkit dependent.

substruct fingerprints

chemp also includes a “substruct” substructure fingerprint. This is an 881 bit fingerprint derived from the PubChem/CACTVS substructure keys. They do not match the CACTVS fingerprints exactly, in part due to differences in ring perception. Some of the substruct bits will always be 0. With that caution in mind, if you want to try them out, use the --substruct option.

The term “substruct” is a horribly generic name, but I couldn’t think of a better one. Until chemfp 3.0 I said these fingerprints were “experimental”, in that I hadn’t fully validated them against PubChem/CACTVS and could not tell you the error rate. I still haven’t done that.

What’s changed is that I’ve found out over the years that people are using the substruct fingerprints, even without full validatation. That surprised me, but use is its own form of validation. I still would like to validate the fingerprints, but it’s slow, tedious work which I am not really interested in doing. Nor does it earn me any money. Plus, if the validation does lead to any changes, it’s easy to simply change the version number.