Working with the command-line tools

The sections in this chapter describe examples of using the command-line tools to generate fingerprint files and to do similarity searches of those files.

Generate fingerprint files from PubChem SD tags

In this section you’ll learn how to create a fingerprint file from an SD file which contains pre-computed CACTVS fingerprints. You do not need a chemistry toolkit for this section.

PubChem is a great resource of publically available chemistry information. The data is available for ftp download. We’ll use some of their SD formatted files. Each record has a PubChem/CACTVS fingerprint field, which we’ll extract to generate an FPS file.

Start by downloading the files Compound_027575001_027600000.sdf.gz and Compound_014550001_014575000.sdf.gz. At the time of writing (April 2017) they contain 384 and 5167 records, respectively. (I chose smaller than average files so they would be easier to open and review.)

Next, convert the files into fingerprint files. On the command line do the following two commands:

sdf2fps --pubchem Compound_027575001_027600000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_014550001_014575000.sdf.gz -o pubchem_targets.fps

Congratulations, that was it!

If you’re curious about what an FPS file looks like, here are the first 10 lines of pubchem_queries.fps, with some of the lengthy fingerprint lines replaced with an ellipsis:

#type=CACTVS-E_SCREEN/1.0 extended=2
075e1c00020800000000 ... 1fd7e91913047100000402002001000000020100900000000000000000  27575190
035e1c00620000000000 ... 1f97e11913047100000800402000080000040020100004000000000000  27575192
075e1c00020000000000 ... 1f97e11913057101000002006800000000000100340000000000000000  27575198
075e1c00024000000000 ... 1f97e11913047100000000002000000000000000100000000000000000  27575208

How does this work? Each PubChem record contains the precomputed CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. Here’s what it looks like for record 27575190, which is the first record in Compound_027575001_027600000.sdf.gz:


The --pubchem flag tells sdf2fps to get the value of that tag and decode it to get the fingerprint. It also adds a few metadata fields to the fingerprint file header.

The order of the FPS fingerprints are the same as the order of the corresponding record in the SDF. You can see that in the output, where 27575190 is the first record in the FPS fingerprints.

If you store records in an SD file then you almost certainly don’t use the same fingerprint encoding as PubChem. sdf2fps can decode from a number of encodings, like hex and base64. Use --help to see the list of available decoders.

NxN (self-similar) searches

In this section you’ll learn how to use the same fingerprints as both the queries and targets, that is, a self-similarity search. You will need the pubchem_queries.fps fingerprint file generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

Use the --NxN option if you want to use the same set of fingerprints as both the queries and targets. Using the pubchem_queries.fps from the previous sections:

simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps

This code is very fast because there are so few fingerprints. For larger files the --NxN will be about twice as fast and use half as much memory compared to:

simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps

In addition, the --NxN option excludes matching a fingerprint to itself (the diagonal term).

Using a toolkit to process the ChEBI dataset

In this section you’ll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, and RDKit. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.)

We’ll work with data from ChEBI, which are “Chemical Entities of Biological Interest”. They distribute their structures in several formats, including as an SD file. For this section, download the “lite” version from . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 155 this file contains 95,955 records and the compressed file is 28MB.

Unlike the PubChem data set, the ChEBI data set does not contain fingerprints so we’ll need to generate them using a toolkit.

ChEBI record titles don’t contain the id

Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 47,376 of the title lines are empty, 39,615 have the title “null”, 4,499 have the title ” “, 2,033 have the title “ChEBI”, 45 of them are labeled “Structure #1”, and the others are usually compound names.

(I’ve asked ChEBI to fix this, to no success. Perhaps you have more influence?)

Instead, the record id is stored as value of the “ChEBI ID” tag, which looks like:

> <ChEBI ID>

By default the toolkit-based fingerprint generation tools use the title as the identifier, and print a warning and skip the record if the identifier is missing. Here’s an example with rdkit2fps:

% rdkit2fps ChEBI_lite.sdf.gz
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2017.09.1.dev1 chemfp/3.2
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 201, record #5. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 236, record #6. Skipping.
[23:17:44]  S group MUL ignored on line 103
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 264, record #7. Skipping.
[23:17:44]  Unhandled CTAB feature: S group SRU on line: 31. Molecule skipped.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 435, record #9. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 519, record #10. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 581, record #11. Skipping.
003c801360600221040010e20418381668404b0424ec130f05a090c94960e0      ChEBI
40000100020080000001000300280000002002000000000000000000000000      ChEBI

That output contains only two fingerprint records, both with the id “ChEBI”. The other records had no title and were skipped, with a message sent to stderr describing the problem and the location of the record containing the problem.

(If the first 100 records have no identifiers then the command-line tools will exit even if --errors is ignore. This is a safety mechanism. Let me know if it’s a problem.)

Instead, use the --id-tag option to specify of the name of the data tag containing the id. For this data set you’ll need to write it as:

--id-tag "ChEBI ID"

The quotes are important because of the space in the tag name.

Here’s what that looks like:

% rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz  | fold | head -20
[23:26:39]  S group MUL ignored on line 103
[23:26:39]  Unhandled CTAB feature: S group SRU on line: 31. Molecule skipped.
#type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1
#software=RDKit/2017.09.1.dev1 chemfp/3.2
344aa98398654481b003a84f201f518f    CHEBI:90
8100c801108000000041c00200800002    CHEBI:165

In addition to “ChEBI ID” there’s also a “ChEBI Name” tag which includes data values like “tropic acid” and “(+)-guaia-6,9-diene”. Every ChEBI record has a unique name so the names could also be used as the primary identifier instead of its id.

The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other bytes, so it’s no problem using those names directly.

To use the ChEBI Name as the primary chemfp identifier, specify:

--id-tag "ChEBI Name"

Generate fingerprints with Open Babel

If you have the Open Babel Python library installed then you can use ob2fps to generate fingerprints:

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

This takes just under 3 minutes on my 7 year old desktop to process all of the records.

The default generates FP2 fingerprints, so the above is the same as:

ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

ob2fps can generate several other types of fingerprints. (Use --help for a list.) For example, to generate the Open Babel implementation of the MACCS definition specify:

ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Generate fingerprints with OpenEye

If you have the OEChem Python library installed, with licenses for OEChem and OEGraphSim, then you can use oe2fps to generate fingerprints:

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps

This takes about 40 seconds on my desktop and generates a number of warnings like “Stereochemistry corrected on atom number 17 of”, “Unsupported Sgroup information ignored”, and “Invalid stereochemistry specified for atom number 9 of”. Normally the record title comes after the “… of”, but the title is blank for most of the records.

OEChem could not parse 7 of the 95,955 records. I looked at the failing records and noticed that all of them had 0 atoms and 0 bonds.

The default settings generate OEGraphSim path fingerprint with the values:

numbits=4096 minbonds=0 maxbonds=5
   atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral

Each of these can be changed through command-line options. Use --help for details.

oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition specify:

oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Use --help for a list of available oe2fps fingerprints or to see more configuration details.

Generate fingerprints with RDKit

If you have the RDKit Python library installed then you can use rdkit2fps to generate fingerprints. Based on the previous examples you probably guessed that the command-line is:

rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps

This takes 5.5 minutes on my desktop, and RDKit did not generate fingerprints for 1,101 of the 95,955 records. RDKit logs warning and error messages to stderr. They look like:

[23:29:49] Explicit valence for atom # 6 N, 4, is greater than permitted

Post-condition Violation
Element '.' not found
Violation occurred on line 90 in file /Users/dalke/cvses/rdkit/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1

[23:29:49]  Unhandled CTAB feature: S group SRU on line: 52. Molecule skipped.

For example, RDKit is careful to check that structures make chemical sense. It rejects 4-valent nitrogen and refuses to process that those structures, which is the reason for the first line of that output.

The default generates RDKit’s path fingerprints with parameters:

minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1

Each of those can be changed through command-line options. See rdkit2fps --help for details, where you’ll also see a list of the other available fingerprint types.

For example, to generate the RDKit implementation of the MACCS definition use:

rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

while the following generates the Morgan/circular fingerprint with radius 3:

rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz

Alternate error handlers

In this section you’ll learn how to change the error handler for rdkit2fps using the --errors option.

By default the “<toolkit>2fps” programs “ignore” structures which could not be parsed into a molecule option. There are two other options. They can “report” more information about the failure case and keep on processing, or they can be “strict” and exit after reporting the error.

This is configured with the --errors option.

Here’s the rdkit2fps output using --errors report:

[00:52:39]  S group MUL ignored on line 103
[00:52:39]  Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Skipping.
[00:52:39] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 16213, record #265. Skipping.

The first two lines come from RDKit. The third line is from chemfp, reporting which record could not be parsed. (The record starts at line 12036 of the file and the SRU is on line 36 of the record, so the SRU is at line 12072.) The fourth line is another RDKit error message, and the last line is another chemfp error message.

Here’s the rdkit2fps output using --errors strict:

[00:54:30]  S group MUL ignored on line 103
[00:54:30]  Unhandled CTAB feature: S group SRU on line: 36. Molecule skipped.
ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 12036, record #179. Exiting.

Because this is strict mode, processing exits at the first failure.

The ob2fps and oe2fps tools implement the --errors option, but they aren’t as useful as rdkit2fps because the underlying APIs don’t give useful feedback to chemfp about which records failed. For example, the standard OEChem file reader automatically skips records that it cannot parse. Chemfp can’t report anything when it doesn’t know there was a failure.

The default error handler in chemfp 1.1 was “strict”. In practice this proved more annoying than useful because most people want to skip the records which could not be processed. They would then contact me asking what was wrong, or doing some pre-processing to remove the failure cases.

One of the few times when it is useful is for records which contain no identifier. When I changed the default from “strict” to “ignore” and tried to process ChEBI, I was confused at first about why the output file was so small. Then I realized that it’s because the many records without a title were skipped, and there was no feedback about skipping those records.

I changed the code so missing identifiers are always reported, even if the error setting is “ignore”. Missing identifiers will still stop processing if the error setting is “strict”.

chemfp’s two cross-toolkit substructure fingerprints

In this section you’ll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, and RDKit. (For OpenEye users, these fingerprints use the base OEChem library but do not use the separately licensed OEGraphSim library.)

chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other comes from the 881-bit PubChem/CACTVS substructure fingerprints.

The chemfp MACCS definition is called “rdmaccs” because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, while OpenEye has a completely independent implementation.)

Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits.

Open Babel:

% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
#software=OpenBabel/2.4.1 chemfp/3.2
00000000000000000000000000000140004480101e  phenol


#software=OEChem/2.1.3 (20170828) chemfp/3.2
00000000000000000000000000000140004480101e  phenol


#software=RDKit/2017.09.1 chemfp/3.2
00000000000000000000000000000140004480101e  phenol

For more complex molecules it’s possible that different toolkits produce different fingerprint rdmaccs, even though the toolkits use the same SMARTS definitions. Each toolkit has a different understanding of chemistry. The most notable is the different definition of aromaticity, so the bit for “two or more aromatic rings” will be toolkit dependent.

substruct fingerprints

chemp also includes a “substruct” substructure fingerprint. This is an 881 bit fingerprint derived from the PubChem/CACTVS substructure keys. They do not match the CACTVS fingerprints exactly, in part due to differences in ring perception. Some of the substruct bits will always be 0. With that caution in mind, if you want to try them out, use the --substruct option.

The term “substruct” is a horribly generic name. If you can think of a better one then let me know. Until chemfp 3.0 I said these fingerprints were “experimental”, in that I hadn’t fully validated them against PubChem/CACTVS and could not tell you the error rate. I still haven’t done that.

What’s changed is that I’ve found out over the years that people are using the substruct fingerprints, even without full validatation. That surprised me, but use is its own form of validation. I still would like to validate the fingerprints, but it’s slow, tedious work which I am not really interested in doing. Nor does it earn me any money. Plus, if the validation does lead to any changes, it’s easy to simply change the version number.

Generate binary FPB files from a structure file

In this section you’ll learn how to generate an FPB file instead of an FPS file. You will need the the ChEBI file from Using a toolkit to process the ChEBI dataset and a chemistry toolkit. The FPB format was introduced with chemfp-2.0.

The FPB format was designed so the fingerprints can be memory-mapped directly to chemfp’s internal data structures. This makes it very fast to load, but unlike the FPS format, it’s not so easy to write with your own code. You should think of the FPB format as an binary application format, for chemfp-based tools, while the FPS format is a text-based format for data exchange between diverse programs.

The easiest way to generate an FPB file from the command line is to use the “.fpb” extension instead of “.fps” or “.fps.gz”. Here are examples using each of the toolkits.

Open Babel:

% ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb


% oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb


% rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb

The binary format isn’t human-readable. Use fpcat command-line options to see what’s inside:

% fpcat oe_chebi.fpb | head -8
#type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
#software=OEGraphSim/2.3.1 (20170828) chemfp/3.2
000 ... many zeros ... 000  CHEBI:15378
000 ... many zeros ... 000  CHEBI:16042
000 ... many zeros ... 000  CHEBI:17792
000 ... many zeros ... 000  CHEBI:18140
182 ... hex values ... c0c  CHEBI:60493

By default the fingerprints are ordered from smallest popcount to largest, which you can see in the output. A pre-ordered index is faster to search because the target popcounts are pre-computed and because it enables sublinear search.

If you want to preserve the input order then you’ll need to pipe the FPS output to fpcat and use its --preserve-order flag. See the next section for an example.

Convert between FPS and FPB formats

In this section you’ll learn how to convert an FPS file into an FPB file and back, and you’ll learn how to control the fingerprint ordering. You will need the FPS files generated in Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit. The FPB format was introduced with chemfp-2.0.

If you already have an FPS file then you can convert it directly into an FPB file, and without using a chemistry toolkit. The fpcat program converts from one format to the other.

In an earlier section I generated the files pubchem_queries.fps and pubchem_targets.fps . I’ll convert each to FPB format:

% fpcat pubchem_targets.fps -o pubchem_targets.fpb
% fpcat pubchem_queries.fps -o pubchem_queries.fpb

The FPB format is a binary format which is difficult to read directly. The easiest way to see what’s inside is to use fpcat. If you don’t specify an output filename then it sends the results to stdout in FPS format:

% fpcat pubchem_queries.fpb | head -5
#type=CACTVS-E_SCREEN/1.0 extended=2
0000000080901103f101000000000000000100200000040080000010000000      27581954

The keen-eyed reader might have noticed that the conversion does not have a “source” or “date” field. I haven’t figured out if this is a bug. Should I keep the original date and structure file source, or use the current date and FPS file source? Let me know if this is important to you.

By default fpcat reorders the fingerprints in the FPB file by population count. This improves the similarity search performance, but it means that the order of the FPB file is likely different than the original FPS format. You can get a sense of this by looking at the first fingerprint in the original pubchem_queries.fps file:

% grep -v # pubchem_queries.fps  | head -1
2b801fd7e91913047100000402002001000000020100900000000000000000      27575190

If you want the FPB file to store the fingerprints in input order instead of the popcount order needed for optimized similarity search, then use the --preserve-order flag:

% fpcat pubchem_queries.fps --preserve-order -o input_order.fpb
% fpcat input_order.fpb | grep -v # | head -1
2b801fd7e91913047100000402002001000000020100900000000000000000      27575190

On the flip side, fpcat by default preserves the input order when it creates FPS output. If you instead want to the output FPS file to be in popcount order then use the --reorder flag:

% fpcat --reorder pubchem_queries.fps | grep -v # | head -1
0000000080901103f101000000000000000100200000040080000010000000      27581954

Specify the fpcat output format

In this section you’ll learn how to specify the output format for fpcat using a command-line option instead of the filename extension. You will need the pubchem_queries.fpb file from Generate fingerprint files from PubChem SD tags.

If you do not specify an output filename then fpcat will output the fingerprints in FPS format to stdout. If you specify a filename then by default it will look at the extension to determine if the output should be an FPB (“.fpb”), FPS (“.fps”), or gzip compressed FPS (“.fps.gz”) file. The FPS format is used for unrecognized extensions.

In a few rare cases you may want to use a format which doesn’t match the default. To be honest, the examples I can think of aren’t that realistic, but let’s suppose you want to output the contents of an FPB file to stdout in gzip’ed FPS format, and count the number of bytes in compressed output. I’ll use the use the –out flag to change the format to ‘fps.gz’ from the default of ‘fps’, then compare the resulting size with the uncompressed form:

% fpcat pubchem_queries.fpb --out fps.gz | wc -c
% fpcat pubchem_queries.fpb --out fps | wc -c

It’s not that useful because you could pipe the uncompressed output to gzip, which is also likely faster:

% fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c

By the way, it is not possible to write an FPB file to stdout. In fact, the output file must be seek-able, which means it can’t be a named pipe either.

Alternate fingerprint file formats

In this section you’ll learn about chemfp’s support for other fingerprint file formats.

Chemfp started as a way to promote the FPS file format for fingerprint exchange. Chemfp 2.0 added the FPB format, which is a binary format designed around chemfp’s internal search data structure so it can be loaded quickly.

There are many other fingerprint formats. Perhaps the best known is the Open Babel FastSearch format. Two others are Dave Cosgrove’s flush format, and OpenEye’s “fpbin” format.

The chemfp_converters package contains utilities to convert between the chemfp formats and these other formats.:

# Convert from/to Dave Cosgrove Flush format
flush2fps drugs.flush
fps2flush drugs.fps -o drugs.flush

# Convert from/to OpenEye's fpbin format
fpbin2fps drugs.fpbin --moldb drugs.sdf
fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin

# Convert from/to Open Babel's FastSearch format
fs2fps drugs.fs --datafile drugs.sdf
fps2fs drugs_openbabel_FP2.fps  --datafile drugs.sdf  -o drugs.fs

Of the three formats, the flush format is closest to the FPS data model. That is, it stores fingerprint records as an identifier and the fingerprint bytes. By comparison, the FastSearch and fpbin formats store the fingerprint bytes and an index into another file containing the structure and identifier. It’s impossible for chemfp to get the data it needs without reading both files.

Chemfp has special support for the flush format. If chemfp_converters is installed, chemfp will use it to read and write flush files nearly everywhere that it accepts FPS files. You can use it at the output to oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch, and as both input and output to fpcat. (You cannot use it as the simsearch targets because that code has been optimized for FPS and FPB search, and I haven’t spent the time to optimize flush file support.)

This means that if chemfp_converters is installed then you can use fpcat to convert between FPS, FPB, and and flush file formats. For examples:

fpcat drugs.flush -o drugs.fps
fpcat drugs.fps -o drugs.flush

In addition, you can use it at the API level in, chemfp.load_fingerprints(), chemfp.open_fingerprint_writer(), and

Note that the flush format does not support the FPS metadata fields, like the fingerprint type, and it only support fingerprints which are a multiple of 32 bits long.

Similarity search with the FPB format

In this section you’ll learn how to do a similarity search using an FPB as the target. You will need the FPB files from Generate fingerprint files from PubChem SD tags but you do not need a chemistry toolkit.

Simsearch, like all of the tools starting with chemfp-2.0, understands both FPS and FPB files:

% simsearch -k 3 --threshold 0.7 -q pubchem_queries.fpb pubchem_targets.fpb | head
#type=Tanimoto k=3 threshold=0.7
3   27581954        14565747        0.7833  14563541        0.7333  14573233        0.7258
3   27581957        14565747        0.7833  14563541        0.7333  14573233        0.7258
3   27580389        14568366        0.8468  14568369        0.8393  14560737        0.8374
2   27584917        14563095        0.7795  14563096        0.7795

By default simsearch uses the query and target filename extensions to figure out if the file is in FPS or FPB format.

If you don’t want it to auto-detect the format then use the --query-format and --target-format options to tell it the format to use. The values can be one of “fps”, “fps.gz” and “fpb”.

Converting large data sets to FPB format

In this section you’ll learn how to generate an FPB file on computers with relatively limited memory. To be realistic, this example uses the complete PubChem data set, and extracts the CACTVS/PubChem fingerprints which are in each record. You do not need a chemistry toolkit for this section.

The most direct way to extract the PubChem fingerprints from a PubChem distribution is to use sdf2fps:

sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb

This uses the default FPB writer options, which stores all of the fingerprints in memory, sorts them, and saves the result to the output file. This may use about 2-3 times as much memory as the final FPB output size, which is a bit unfortunate if you want to generate a 7 GB FPB file on a 12 GB machine.

(Note: see the next section for a two-stage solution that lets you parallelize fingerprint generation.)

The “*2fps” command-line tools do not have a way to change the default writer options, although fpcat does. The --max-spool-size option sets a rough upper bound to the amount of memory to use. When enabled, the writer breaks the input into parts and creates a temporary FPB file for each part. At the end, it merges the sorted data from the temporary FPB files to get the final FPB file. Note that the specified spool size is only approximate and is not a hard limit on the maximum amount of memory to use. You may need to experiment a bit if you have tight constraints.

The value must be a size in bytes, though suffixes like M or MB for megabyte and T or TB for terabyte are also allowed. These are in base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed between the number and the suffix, so “200MB” is okay but “200 MB” is not. The size must be at least 20 MB.

Here is an example of how to convert the CACTVS fingerprints from all of PubChem to an FPB file, using a relatively small limit of 200 MB:

sdf2fps --pubchem pubchem/Compound_*.sdf.gz | fpcat --max-spool-size 200MB -o pubchem.fpb

This will take a while! The sdf2fps alone takes almost 45 minutes on my desktop, of which 50% of the time was to decompress the files.

The temporary files will be placed under the appropriate temporary directory for your operating system. If that disk isn’t large enough for the intermediate files then use the --tmpdir option of fpcat to specify an alternate directory:

fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb

Another option is to specify the directory location using the TMPDIR, TEMP, or TMP environment variables, which are resolved in that order. The details are described in the Python documentation for tempfile.tempdir.

Generate fingerprints in parallel and merge to FPB format

In this section you’ll learn how to merge multiple sorted fingerprints into a single FPB file.

The previous section used a single shell command to extract the PubChem/CACTVS fingerprints from PubChem and generate an FPB file. This is easy to write and understand, but more complex versions may be more appropriate.

For one, I have four cores on my desktop computer, and I want to use them to process the PubChem files in parallel. The previous section was only single threaded.

I have all my PubChem files in ~/pubchem/. For each “Compound_*.sdf.gz” file in that directory I want to extract the CACTVS/PubChem fingerprints and create an intermediate FPS file in the local directory. That’s equivalent to running the following commands:

sdf2fps --pubchem ~/pubchem/Compound_000000001_000025000.sdf.gz \\
     -o Compound_000000001_000025000.fps
sdf2fps --pubchem ~/pubchem/Compound_000025001_000050000.sdf.gz \\
     -o Compound_000025001_000050000.fps
 ... 2146 more lines ...

except that I want to run four at a time.

This is what GNU Parallel was designed for. It’s a command-line tool which can parallelize the exection of other command-lines.

I’ll start by explaining the core command-line substitution pattern:

sdf2fps --pubchem {} -o {/..}.fps'

The {} will be replaced with a filename, and {/..} will be replaced with the base filename, without the directory path prefix or the two suffixes. That is, when {} is “/Users/dalke/pubchem/Compound_000000001_000025000.sdf.gz” then {/..} will be “Compound_000000001_000025000”.

Since I want to generate an FPS file, I added the “.fps” as a suffix to the second substitution parameter.

I then tell GNU parallel which command-line to use, along with a few other parameters. Here’s the full line, which I split over two lines to make it more readable:

parallel --plus --no-notice --bar 'sdf2fps --pubchem {}
   -o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz

The --plus tells GNU parallel to recognize an expanded set of replacement strings. (“{/..}” is not part of the standard set of patterns.)

The --no-notice tells it to not display the message about citing GNU parallel in scientific papers.

The --bar enables a progress bar, which looks like this:

26% 763:2148=1696s /Users/dalke/pubchem/Compound_019150001_019175000.sdf.gz

It’s 26% through processing the filenames, which is file 763 out of 2148, and there’s an estimated 1696 seconds remaining.

Finally, the “:::” indicates that the remaining options are the list of parameters to pass to the command-line template for parallelization.

After about 30 minutes, I now have a large number of FPS files, which I want to merge into a single FPB file. I’ll use fpcat:

fpcat --max-spool-size 2GB Compound*.fps -o pubchem.fpb

This took about 15 minutes. (It’s a bit odd that the overall performance wasn’t that much better than the single-threaded code. It would probably be more clear with compute-intenstive fingerprints, instead of simple text extraction from an SD tag.)

Note: I started this section as an example of when to use the --merge option to fpcat. When the fingerprints are in popcount order then popcount sorted output is a merge sort of the inputs. This doesn’t need RAM or temporary disk space for an intermediate sort. My thought was to save the intermediate fingerprints in FPB format instead of FPS, which has a side-effect of sorting the fingerprints. Then I could simply merge the results.

I did this, and ran into two problems. There are 2912 files, and fpcat will open all of them in order to do the parallel merge. I ran out of file descriptors, and had to increase the limit to 6000 (3000 is too small) before it would work. In the future I’ll have to implement some sort of multi-layer merge for when there are too many files. However, even with 6000 available descriptors, iterating over the FPB-backed FingerprintArena proved to be rather slow, and I’m not yet sure way. I think it’s simply that I didn’t design that code for fast iteration.

Take home message? Use FPB files for now only as the last file format in your pipeline.