Working with the command-line tools

The sections in this chapter describe examples of using the command-line tools to generate fingerprint files and to do similarity searches of those files.

Generating fingerprint files from PubChem SD files

In this section you’ll learn how to create a fingerprint file from an SD file which contains pre-computed CACTVS fingerprints. You do not need a chemistry toolkit for this section.

PubChem is a great resource of publically available chemistry information. The data is available for ftp download. We’ll use some of their SD formatted files. Each record has a PubChem/CACTVS fingerprint field, which we’ll used.

Start by downloading the files Compound_027575001_027600000.sdf.gz (from ) and Compound_014550001_014575000.sdf.gz (from ). At the time of writing they contain 224 and 3119 records, respectively. (I chose smaller than average files so they would be easier to open and review.)

Next, convert the files into fingerprint files. On the command line do the following two commands:

sdf2fps --pubchem Compound_027575001_027600000.sdf.gz -o pubchem_queries.fps
sdf2fps --pubchem Compound_014550001_014575000.sdf.gz -o pubchem_targets.fps

Congratulations, that was it!

How does this work? Each PubChem record contains the precomputed CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. The --pubchem flag tells sdf2fps to get the value of that tag and decode it to get the fingerprint. It also adds a few metadata fields to the fingerprint file header.

The order of the fingerprints are the same as the order of the corresponding record in the SDF, although unconvertable records might be skipped, depending on the --errors flag.

If you store records in an SD file then you almost certainly don’t use the same fingerprint encoding as PubChem. sdf2ps can decode from a number of encodings. Use --help to see the list of available decoders.

NxN (self-similar) searches

Use the –NxN option if you want to use the same fingerprints as both the queries and targets:

simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps

This is about twice as fast and uses half as much memory compared to:

simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps

and the –NxN option excludes matching a fingerprint to itself (the diagonal term).

Using a toolkit to process the ChEBI dataset

In this section you’ll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, and RDKit. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.)

We’ll work with data from ChEBI which contains “Chemical Entities of Biological Interest”. They distribute their structures in several formats, including as an SD file. For this section, download the “lite” version from . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 82 this file contains 19640 records and the compressed file is 5MB.

Unlike the PubChem data set, the ChEBI data set does not contain fingerprints so we’ll need to generate them using a toolkit.

ChEBI record titles don’t contain the id

Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 16498 of the title lines are empty, 2119 of them have the title “ChEBI”, and 45 of them are labeled “Structure #1.”

Instead, the id is the value of the “ChEBI ID” tag, which looks like:

> <ChEBI ID>

By default the toolkit-based fingerprint generation tools use the title as the identifier, and exits with an error if the identifier is missing. (Use the --errors option to change the behaviour). If you try one of them with this data file you will get the error message:

ERROR: Missing title for record #1 of 'ChEBI_lite.sdf.gz'. Exiting.

Instead, use the --id-tag option to specify of the name of the data tag containing the id. For this data set you’ll need to write it as:

--id-tag "ChEBI ID"

The quotes are important because of the space in the tag name.

There’s also a “ChEBI Name” tag which includes data values like “tropic acid” and “(+)-guaia-6,9-diene”. Every record has a unique name so the names could be used as the primary identifier. The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other bytes, so it’s no problem using those names directly.

To use the ChEBI Name as the primary chemfp identifier, use:

--id-tag "ChEBI Name"

Generating fingerprints with Open Babel

If you have the Open Babel Python library installed then you can use ob2fps to generate fingerprints:

ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

This takes about 30 seconds on my laptop.

The default uses the FP2 fingerprints, so the above is the same as:

ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps

ob2fps can generate several other types of fingerprints. (See XXX for details). For example, to generate the Open Babel implementation of the MACCS definition use:

ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Generating fingerprints with OpenEye

If you have the OEChem Python library installed then you can use oe2fps to generate fingerprints:

oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps

This takes about 10 seconds on my laptop and generates a number of stereochemistry warnings.

The default settings produce OEGraphSim path fingerprint with the values:

numbits=4096 minbonds=0 maxbonds=5 atype=DefaultAtom btype=DefaultBond

Each of these can be changed through command-line options. See XXX for details. There are also options to use an alternate aromaticity model.

oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition use:

oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

Generating fingerprints with RDKit

If you have the RDKit Python library installed then you can use rdkit2fps to generate fingerprints. Based on the previous examples you probably guessed that the command-line is:

rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps

Unfortunately, this isn’t enough. If you do this you’ll get the message:

[18:54:07] Explicit valence for atom # 12 N, 4, is greater than permitted
ERROR: Could not parse molecule block at line 14840 of 'ChEBI_lite.sdf.gz'. Exiting.

The first line comes from RDKit’s error log. RDKit is careful to check that structures make chemical sense, and in this case it didn’t like the 4-valent nitrogen. It refuses to process this molecule.

The second line comes from rdkit2fps. By default it complains and exits with an error if RDKit cannot process a record. Basically it highlights the source of the problem and demands that you do something about it.

In most cases it’s okay to skip a few records which can’t be processed. You can tell rdkit2fps to report the error but continue processing by using the --errors option:

rdkit2fps --id-tag "ChEBI ID" --errors report ChEBI_lite.sdf.gz -o rdkit_chebi.fps

Four minutes later I see that 403 records out of the 19640 could not be processed.

The previous command-line created RDKit’s path fingerprints with parameters:

minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1

Each of those can be changed through command-line options. See XXX for details.

rdkit2fps can generate several other types of fingerprints. For example, to generate the RDKit implementation of the MACCS definition use:

rdkitfps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps

chemfp supports neither count fingerprints nor sparse fingerprints so cannot generate RDKit’s circular fingerprints.

chemfp’s two cross-toolkit substructure fingerprints

In this section you’ll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, and RDKit. (For OpenEye users, these fingerprints use the base OEChem library and not the separately licensed OEGraphSim add-on.)

chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other comes from the 881-bit PubChem/CACTVS substructure fingerprints.

The chemfp MACCS definition is called “rdmaccs” because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, but are completely independent from the OpenEye implementation.)

Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits.

Open Babel:

% echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs
00000000000000000000000000000140004480101e  phenol


% echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs
#software=OEChem/1.7.4 (20100809)
00000000000000000000000000000140004480101e  phenol


% echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs
echo "c1ccccc1O phenol" | python2.7 rdkit2fps --in smi --rdmaccs
00000000000000000000000000000140004480101e  phenol

You might be wondering why --rdmaccs produces different fingerprint types even if the toolkits use the same SMARTS definitions. Each toolkit perceives chemistry differently. Open Babel before 2.3 didn’t support chirality so chiral-based bits will never be set. Each toolkit uses a different definition of aromaticity, so a bit which is set when there are “two or more aromatic rings” will be toolkit dependent.

substruct fingerprints

(This is a horribly generic name. If you can think of a better one then let me know.)

chemp also includes an experimental “substruct” substructure fingerprint. This is an 881 bit fingerprint derived from the PubChem/CACTVS substructure keys. They are still being tested and validated, but you you want to try them out, use the --substruct option.