.. highlight:: none .. _using_tools: ################################### Working with the command-line tools ################################### The sections in this chapter describe examples of using the command-line tools to generate fingerprint files and to do similarity searches of those files. .. _pubchem_fingerprints: Generate fingerprint files from PubChem SD tags =============================================== In this section you'll learn how to create a fingerprint file from an SD file which contains pre-computed CACTVS fingerprints. You do not need a chemistry toolkit for this section. `PubChem `_ is a great resource of publically available chemistry information. The data is available for `ftp download `_. We'll use some of their `SD formatted `_ files. Each record has a PubChem/CACTVS fingerprint field, which we'll extract to generate an FPS file. Start by downloading the files Compound_099000001_099500000.sdf.gz (from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_099000001_099500000.sdf.gz ) and Compound_048500001_049000000.sdf.gz (from ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_048500001_049000000.sdf.gz ). At the time of writing they contain 10,826 and 14,967 records, respectively. (I chose some of the smallest files so they would be easier to open and review.) Next, convert the files into fingerprint files. On the command line do the following two commands:: sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o pubchem_queries.fps sdf2fps --pubchem Compound_048500001_049000000.sdf.gz -o pubchem_targets.fps Congratulations, that was it! If you're curious about what an FPS file looks like, here are the first 10 lines of pubchem_queries.fps, with some of the lengthy fingerprint lines replaced with an ellipsis:: #FPS1 #num_bits=881 #type=CACTVS-E_SCREEN/1.0 extended=2 #software=CACTVS/unknown #source=Compound_099000001_099500000.sdf.gz #date=2020-05-11T14:35:08 07de0d00000000000000 ... 393e338d1017100000000204000000000000010200000000000000000 99000039 07de1c00020000000000 ... 995e1398a405000010000000000008000000000000000000000000000 99000230 07de0c00000000000000 ... b1be31913097110008000000008000800400000000400000000000000 99002251 07de0500000000000000 ... 313e43891037901000000004000040000000000200002000000000000 99003537 How does this work? Each PubChem record contains the precomputed CACTVS substructure keys in the PUBCHEM_CACTVS_SUBSKEYS tag. Here's what it looks like for record 99000039, which is the first record in Compound_099000001_099500000.sdf.gz:: > AAADceB7sAAAAAAAAAAAAAAAAAAAAAAAAAA8YIAABYAAAACx9AAAHgAQAAAADCjBngQ8wPLIEACoAzV3 VACCgCA1AiAI2KG4ZNgIYPrA1fGUJYhglgDIyccci4COAAAAAAQCAAAAAAAACAQAAAAAAAAAAA== The :option:`--pubchem` flag tells :ref:`sdf2fps ` to get the value of that tag and decode it to get the fingerprint. It also adds a few metadata fields to the fingerprint file header. The order of the FPS fingerprints are the same as the order of the corresponding record in the SDF. You can see that in the output, where 99000039 is the first record in the FPS fingerprints. If you store records in an SD file then you almost certainly don't use the same fingerprint encoding as PubChem. :ref:`sdf2fps ` can decode from a number of encodings, like hex and base64. Use :option:`--help` to see the list of available decoders. The example uses :option:`-o` to have sdf2fps write the output to a file instead of to stdout. By default, filenames ending in ".fps" are saved in FPS format. Use ".fps.gz" for the gzip-compressed FPS format and ".fps.zst" for the zstandard-compressed FPS format. k-nearest neighbor search ========================= In this section you'll learn how to search a fingerprint file to find the k-nearest neighbors. You will need the FPS fingerprint files generated in :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. We'll use the pubchem_queries.fps as the queries for a k=2 nearest neighor similarity search of the target file puchem_targets.gps:: simsearch -k 2 -q pubchem_queries.fps pubchem_targets.fps That's all! You should get output which starts:: #Simsearch/1 #num_bits=881 #type=Tanimoto k=2 threshold=0.0 #software=chemfp/3.4 #queries=pubchem_queries.fps #targets=pubchem_targets.fps #query_source=Compound_099000001_099500000.sdf.gz #target_source=Compound_048500001_049000000.sdf.gz 2 99000039 48503376 0.8785 48503380 0.8729 2 99000230 48563034 0.8588 48731730 0.8523 2 99002251 48798046 0.8110 48625236 0.8107 2 99003537 48997075 0.9036 48997697 0.8985 Here's how to interpret the output. The lines starting with '#' are header lines. It contains metadata information describing that this is a similarity search report. You can see the search parameters, the name of the tool which did the search, and the filenames which went into the search. After the '#' header lines come the search results, with one result per line. There are in the same order as the query fingerprints. Each result line contains tab-delimited columns. The first column is the number of hits. The second column is the query identifier used. The remaining columns contain the hit data, with alternating target id and its score. For example, the first result line contains the 2 hits for the query 99000039. The first hit is the target id 48503376 with score 0.8785 and the second hit is 48503380 with score 0.8729. Since this is a k-nearest neighor search, the hits are sorted by score, starting with the highest score. Do be aware that ties are broken arbitrarily. There may be additional hits with the score 0.8729 which are not reported. Threshold search ================ In this section you'll learn how to search a fingerprint file to find all of the neighbors at or above a given threshold. You will need the FPS fingerprint files generated in :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. Let's do a threshold search and find all hits which are at least 0.85 similar to the queries:: simsearch --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fps The first 15 lines of output from this are:: #Simsearch/1 #num_bits=881 #type=Tanimoto k=all threshold=0.85 #software=chemfp/3.4 #queries=pubchem_queries.fps #targets=pubchem_targets.fps #query_source=Compound_099000001_099500000.sdf.gz #target_source=Compound_048500001_049000000.sdf.gz 4 99000039 48732162 0.8596 48503380 0.8729 48503376 0.8785 48520532 0.8541 2 99000230 48563034 0.8588 48731730 0.8523 0 99002251 4 99003537 48566113 0.8724 48998000 0.8535 48997697 0.8985 48997075 0.9036 4 99003538 48566113 0.8724 48998000 0.8535 48997697 0.8985 48997075 0.9036 0 99005028 0 99005031 Take a look at the first result line, which contains the 4 hits for the query id 99000039. As before, the hit information alternates between the target ids and the target scores, but unlike the k-nearest search, the hits are not in a particular order. You can see that here where the scores are 0.8596, 0.8729, 0.8785, and 0.8541. You might be wondering why I chose the 0.85 threshold, or decided to show only the first 15 lines of output. Quite simply, it was for presentation. With a threshold of 0.8, the first record has 41 hits, which requires 84 columns to show, which is a bit overwhelming. Combined k-nearest and threshold search ======================================= In this section you'll learn how to search a fingerprint file to find the k-nearest neighbors, where all of the hits must be at or above given threshold. You will need the fingerprint files generated in :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. You can combine the :option:`-k` and :option:`--threshold` queries to find the k-nearest neighbors which are all at or above a given threshold:: simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_targets.fps This find the nearest 3 structures, which all must be at least 0.7 similar to the query fingerprint. The output from the above starts:: #Simsearch/1 #num_bits=881 #type=Tanimoto k=3 threshold=0.7 #software=chemfp/3.4 #queries=pubchem_queries.fps #targets=pubchem_targets.fps #query_source=Compound_099000001_099500000.sdf.gz #target_source=Compound_048500001_049000000.sdf.gz 3 99000039 48503376 0.8785 48503380 0.8729 48732162 0.8596 3 99000230 48563034 0.8588 48731730 0.8523 48583483 0.8412 3 99002251 48798046 0.8110 48625236 0.8107 48500395 0.7927 3 99003537 48997075 0.9036 48997697 0.8985 48566113 0.8724 3 99003538 48997075 0.9036 48997697 0.8985 48566113 0.8724 3 99005028 48651160 0.8288 48848576 0.8167 48660867 0.8000 3 99005031 48651160 0.8288 48848576 0.8167 48660867 0.8000 3 99006292 48945841 0.9652 48737522 0.8793 48575758 0.8537 3 99006293 48945841 0.9652 48737522 0.8793 48575758 0.8537 0 99006597 3 99006753 48655580 0.9310 48662591 0.9249 48654553 0.9096 3 99009085 48561250 0.8503 48588162 0.8027 48675288 0.7973 The output format is identical to the previous two search examples, and because this is a k-nearest search, the hits are sorted from highest score to lowest. NxN (self-similar) searches =========================== In this section you'll learn how to use the same fingerprints as both the queries and targets, that is, a self-similarity search. You will need the pubchem_queries.fps fingerprint file generated in :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. Use the :option:`--NxN` option if you want to use the same set of fingerprints as both the queries and targets. Using the pubchem_queries.fps from the previous sections:: simsearch -k 3 --threshold 0.7 --NxN pubchem_queries.fps This code is very fast because there are so few fingerprints. For larger files the :option:`--NxN` will be about twice as fast and use half as much memory compared to:: simsearch -k 3 --threshold 0.7 -q pubchem_queries.fps pubchem_queries.fps In addition, the :option:`--NxN` option excludes matching a fingerprint to itself (the diagonal term). .. _chebi_fingerprints: Using a toolkit to process the ChEBI dataset ============================================ In this section you'll learn how to create a fingerprint file from a structure file. The structure processing and fingerprint generation are done with a third-party chemisty toolkit. chemfp supports Open Babel, OpenEye, RDKit and CDK. (OpenEye users please note that you will need an OEGraphSim license to use the OpenEye-specific fingerprinters.) We'll work with data from `ChEBI `_, which are "Chemical Entities of Biological Interest". They distribute their structures in several formats, including as an SD file. For this section, download the "lite" version from ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz . It contains the same structure data as the complete version but many fewer tag data fields. For ChEBI 187 this file contains 107,207 records and the compressed file is 34MB. Unlike the PubChem data set, the ChEBI data set does not contain fingerprints so we'll need to generate them using a toolkit. ChEBI record titles don't contain the id ---------------------------------------- Strangely, the ChEBI dataset does not use the title line of the SD file to store the record id. A simple examination shows that 58,288 of the title lines are empty, 39,524 have the title "null", 4,345 have the title " " (with a single space), 1,983 have the title "ChEBI", 57 of them are labeled "Structure #1", and the others are usually compound names like 'fluprednidene acetate', 'bkas#30-CoA(4-)', and 'Compound 92'. (I've asked ChEBI to fix this, to no success after many years. Perhaps you have more influence?) Instead, the record id is stored as value of the "ChEBI ID" tag, which looks like:: > CHEBI:776 By default the toolkit-based fingerprint generation tools use the title as the identifier, and print a warning and skip the record if the identifier is missing. Here's an example with :ref:`rdkit2fps `:: % rdkit2fps ChEBI_lite.sdf.gz ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 1, record #1. Skipping. ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 62, record #2. Skipping. ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 100, record #3. Skipping. ERROR: Missing title in SD record, file 'ChEBI_lite.sdf.gz', line 135, record #4. Skipping. ... skipping many lines ... ERROR: Empty title in SD record after cleanup, file 'ChEBI_lite.sdf.gz', line 2019, record #32: first line is ' '. Skipping. ... skipping a lot more lines ... #FPS1 #num_bits=2048 #type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1 #software=RDKit/2020.03.1 chemfp/3.4 #source=ChEBI_lite.sdf.gz #date=2020-05-12T09:36:52 031087be231150242e714400920000a193c1080c02858a1116a68100a588063428404052 53004080c8cc3c48114101b25081a10c025e634c08a1c00088102c0400121040a2080505 188a9c0a150000028211219c1001000981c4804417180aca0401408500180182210716db 1580708a0b8a0802820532854411200c1101040404001118600d0a518402385dc0001129 0602205a070480c148f240421000c321801922c7808740cd0b10ea4c40000403dc180121 94d8d120020150b3d00043a24370000201042881d15018c0e0901442881d68604c4a8380 8110c772a824051948003c801360600221040010e20418381668404b0424ec130f05a090 c94960e0 ChEBI 000080000000000000000028800000000000000002000000040080000000000000002000 40000002000c000000000000000080080000000200400100000000000000001000000400 00100000000000000080000000000000010000000801002000000001000000400004c000 000000000000800004000000001102000000200004000000100300080000000000000000 00000000000000000820000404000000800000400000200c000008040000000000000000 200101008000000000000000000202000002008000000000000002000000000008000400 000000000000000100400001000200800000010003002800000020020000000000000000 00000000 ChEBI 210809600d11180010010200820108302804406016040100a4019100001204a12800000c 400202200286000491800080c00019050000630a8222b4a10c10450170048100a0020600 200093020522088a9005040028100000890048004af130e280000445000526496044c228 0413804030000062060804c520002200030064114f2001803401af120100043248000c20 02008092020c6a042925c0800008c140848448541a42205c0305584810788441610a0400 000c8100088c4064000105128a824284300648008900000100c00201c41027400c8a2090 8700440a0012012180410291002200024002a1100b5038410206a0000900404400001150 000a020a null .... more lines omitted ... That output shown contains three fingerprint records; the first two with the id "ChEBI" and the third with the id "ChEBI". The other records had no title and were skipped, with a message sent to stderr describing the problem and the location of the record containing the problem. (The "Empty title after cleanup" is because chemfp removes trailing whitespace on the title line. If nothing is left after cleanup then chemfp will report the problem.) (If the first 100 records have no identifiers then the command-line tools will exit even if :option:`--errors` is ignore. This is a safety mechanism. Let me know if it's a problem.) Instead, use the :option:`--id-tag` option to specify of the name of the data tag containing the id. For this data set you'll need to write it as:: --id-tag "ChEBI ID" The quotes are important because of the space in the tag name. Here's what that looks like:: % rdkit2fps ChEBI_lite.sdf.gz --id-tag "ChEBI ID" | head -8 | fold #FPS1 #num_bits=2048 #type=RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1 #software=RDKit/2020.03.1 chemfp/3.4 #source=ChEBI_lite.sdf.gz #date=2020-05-12T09:44:29 10208220141258c184490038b4124609db0030024a0765883c62c9e1288a1dc224de62f445743b8b 30ad542718468104d521a214227b29ba3822fbf20e15491802a051532cd10d902c39b02b51648981 9c87eb41142811026d510a890a711cb02f2090ddacd990c5240cc282090640103d0a0a8b460184f5 11114e2a8060200804529804532313bb03912d5e2857a6028960189e370100052c63474748a1c000 8079f49c484ca04c0d0bcb2c64b72401042a1f82002b097e852830e5898302021a1203e412064814 a598741c014e9210bc30ab180f0162029d4c446aa01c34850071e4ff037a60e732fd85014344f82a 344aa98398654481b003a84f201f518f CHEBI:90 00000000080200412008000008000004000010100022008000400002000020100020006000800001 01000100080001000010000002002200000200000008000000400002100000000080000004401000 80200020800200002000001400022064000004244810000000000080000a80012002020004198002 00080200020020120040203001000802010100024211000004400000000100200003000001000100 0100021000a200601080002a00002020048004030000884084000008000002040200010800000000 2000010022000800002000020001400020800100025040000000200a080244000060008000000802 8100c801108000000041c00200800002 CHEBI:165 In addition to "ChEBI ID" there's also a "ChEBI Name" tag which includes data values like "tropic acid" and "(+)-guaia-6,9-diene". Every ChEBI record has a unique name so the names could also be used as the primary identifier instead of its id. To use the ChEBI Name as the primary chemfp identifier, specify:: --id-tag "ChEBI Name" The FPS fingerprint file format allows identifiers with a space, or comma, or anything other tab, newline, and a couple of other bytes, so it's no problem using those names directly. Generate fingerprints with Open Babel ------------------------------------- If you have the Open Babel Python library installed then you can use :ref:`ob2fps ` to generate fingerprints:: ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps This takes about 2m45s on my 2019-era laptop to process all of the records, and generates messages like:: ============================== *** Open Babel Warning in Expand Alias R was not chemically interpreted ============================== *** Open Babel Warning in ReadMolecule WARNING: Problem interpreting the valence field of an atom The valence field specifies a valence 3 that is less than the observed explicit valence 4. ============================== *** Open Babel Warning in ReadMolecule Failed to kekulize aromatic bonds in MOL file ============================== *** Open Babel Warning in ReadMolecule Invalid line: M RGP must only refer to pseudoatoms M RGP 2 12 1 15 2 The default generates FP2 fingerprints, so the above is the same as:: ob2fps --FP2 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fps ob2fps can generate several other types of fingerprints. (Use :option:`--help` for a list.) For example, to generate the Open Babel implementation of the MACCS definition specify:: ob2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps Generate fingerprints with OpenEye ---------------------------------- If you have the OEChem Python library installed, with licenses for OEChem and OEGraphSim, then you can use :ref:`oe2fps ` to generate fingerprints:: oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps This takes about 33 seconds on my laptop and generates a number of warnings like "Stereochemistry corrected on atom number 17 of", "Unsupported Sgroup information ignored", and "Invalid stereochemistry specified for atom number 9 of". Normally the record title comes after the "... of", but the title is blank for most of the records. OEChem could not parse 2 of the 107,207 records. I looked at the failing records (CHEBI:147324 and CHEBI:147325) and noticed that they have 0 atoms and 0 bonds. By default OEChem's SDF reader skips empty records. If you really need those records, add the SuppressEmptyMolSkip flag to the default 'flavor' reader argument, like this:: oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fps \ -R flavor=Default,SuppressEmptyMolSkip The default settings generate OEGraphSim path fingerprint with the values:: numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral Each of these can be changed through command-line options. Use :option:`--help` for details. oe2fps can generate several other types of fingerprints. For example, to generate the OpenEye implementation of the MACCS definition specify:: oe2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps Use :option:`--help` for a list of available oe2fps fingerprints or to see more configuration details. Generate fingerprints with RDKit -------------------------------- If you have the RDKit Python library installed then you can use :ref:`rdkit2fps ` to generate fingerprints. Based on the previous examples you probably guessed that the command-line is:: rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fps This takes about 5 minutes and 20 seconds on my laptop, and RDKit did not generate fingerprints for 242 of the 106,965 records. RDKit logs warning and error messages to stderr. They look like:: [11:48:30] WARNING: not removing hydrogen atom without neighbors [11:48:30] Explicit valence for atom # 12 N, 4, is greater than permitted [11:48:30] **** Post-condition Violation Element 'X' not found Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2020_03_1/Code/GraphMol/PeriodicTable.h Failed Expression: anum > -1 **** [11:48:30] Element 'X' not found For example, RDKit is careful to check that structures make chemical sense. It rejects 4-valent nitrogen and refuses to process that those structures, which is the reason for the first line of that output. The default generates RDKit's path fingerprints with parameters:: minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1 Each of those can be changed through command-line options. See rdkit2fps :option:`--help` for details, where you'll also see a list of the other available fingerprint types. For example, to generate the RDKit implementation of the MACCS definition use:: rdkit2fps --maccs166 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps while the following generates the Morgan/circular fingerprint with radius 3:: rdkit2fps --morgan --radius 3 --id-tag "ChEBI ID" ChEBI_lite.sdf.gz Generate fingerprints with CDK ------------------------------ If you have the CDK Java JAR file on your CLASSPATH and you have installed the JPype1 package (see the [:ref:`installation guide ` for help]) then you can use :ref:`cdk2fps ` to generate fingerprints. Based on the previous examples you can probably guess that the command-line is:: cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps This takes about 1 minute and 20 seconds on my laptop, and CDK did not generate fingerprints for 16 of the 107,207 structures. (This section of the documentation was written in just before the chemfp 3.5 release, with a different ChEBI download than the rest of this document.) CDK generated two warnings:: org.openscience.cdk.config.IsotopeFactory ERROR: Could not find major isotope for: 88 org.openscience.cdk.config.IsotopeFactory ERROR: Could not find major isotope for: 88 Chemfp lets you choose an :ref:`alternate error handler ` (see the next section), which can help you figure out what happened. I'll enable the ``report`` error handler:: cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fps --errors report This generates 16 lines of the form:: ERROR: Cannot generate fingerprint: java.lang.NullPointerException, file 'ChEBI_lite.sdf.gz', record #8513. Skipping. That means the record caused the CDK fingerprinter function to fail, by raising a Java NullPointerException, which chemfp catches and reports. For reference, that record is ChEBI ID CHEBI:5015. (There is experimental support for letting chemfp's text toolkit parse the SD records and passing the text to CDK. This is enabled with the cdk.sdf.implementation value "chemfp", like this:: ... --errors report -R implementation=chemfp This adds line number information to the report:: ERROR: Cannot generate fingerprint: java.lang.NullPointerException, file 'ChEBI_lite.sdf.gz', line 539706, record #8513. Skipping. This variant implementation took 1 minute and 30 seconds, so adds some overhead. Let me know if you find it useful.) The default ``cdk2fps`` fingerprint type is ``CDK-Daylight`` with parameters:: size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0 Each of those can be changed through command-line options. See cdk2fps :option:`--help` for details, where you'll also see a list of the other available fingerprint types. For example, to generate the CDK implementation of the MACCS definition use:: cdk2fps --MACCS --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o chebi_maccs.fps --errors report This generates 22 report lines of the form:: ERROR: Cannot generate fingerprint: java.lang.NullPointerException: Aromaticity model requires implicit hydrogen count is set., file 'ChEBI_lite.sdf.gz', record #2918. Skipping. .. simsearch_with_structure_queries Use structures as input to simsearch ==================================== In this section you'll learn how to use structures as queries to simsearch queries instead of fingerprints. You'll need a chemistry toolkit and a fingerprint file generated from that toolkit. This section assumes you have one of the ``chebi_maccs.fps`` ChEBI fingerprint files generated in the previous section. I'll search the ChEBI data set for phenol with the SMILES "c1ccccc1O". In the previous section I generated the file ``chebi_maccs.fps``, which starts:: % head -3 chebi_maccs.fps #FPS1 #num_bits=166 #type=CDK-MACCS/2.0 The type information gives a hint for how to generate a fingerprint query for that data set:: % echo "c1ccccc1O phenol" | cdk2fps --MACCS #num_bits=166 #type=CDK-MACCS/2.0 #software=CDK/2.3 chemfp/3.5.1 #date=2021-02-04T14:28:08 00000000000000000000000000000140004480101e phenol I'll pass that fingerprint simsearch with another pipe:: % echo "c1ccccc1O phenol" | cdk2fps --MACCS | simsearch chebi_maccs.fps --threshold 1.0 #Simsearch/1 #num_bits=166 #type=Tanimoto k=all threshold=1.0 #software=chemfp/3.5.1 #targets=chebi_maccs.fps #target_source=ChEBI_lite.sdf.gz 1 phenol CHEBI:15882 1.0000 That's a bit clumsy, because I have to look at the fingerprint file and figure out which command-line tools and options to use. A simpler way is to pass the structures directly to ``simsearch``, either as a command-line option or from an input file. In the following I'll pass in the SMILES using the ``--query`` parameter:: % simsearch chebi_maccs.fps--threshold 1.0 --query "c1ccccc1O phenol" #Simsearch/1 #num_bits=166 #type=Tanimoto k=all threshold=1.0 #software=chemfp/3.5.1 #targets=chebi_maccs.fps #target_source=ChEBI_lite.sdf.gz 1 phenol CHEBI:15882 1.0000 This works by opening the targets file to get the type string from the metadata section. The type string is used to figure out how to generate the fingerprints for that type, as well as how to handle structure processing. If the query is in some other format then use ``--query-format`` to specify the format name. Use ``--query-id`` to specify the query id instead of using the id from the input record. The ``--query`` simsearch option takes the structure on the command-line. Use ``--queries`` to read queries from a file, which may be a fingerprint file or a structure file. I'll demonstrate with a SMILES file containing two records:: % cat queries.smi c1ccccc1O phenol CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine which I'll use to search chebi_maccs.fps:: % simsearch --queries queries.smi chebi_maccs.fps --threshold 1.0 #Simsearch/1 #num_bits=166 #type=Tanimoto k=all threshold=1.0 #software=chemfp/3.5.1 #queries=queries.smi #targets=chebi_maccs.fps #query_source=queries.smi #target_source=ChEBI_lite.sdf.gz 1 phenol CHEBI:15882 1.0000 1 caffeine CHEBI:27732 1.0000 By default simsearch uses the filename to figure out the format type and compression. Use ``--query-format`` to specify a different format. For example, if neither ``--query`` nor ``--queries`` are specified then the default reads FPS queries from stdin. I'll use ``--query-format sdf.gz`` to have it read gzip-compressed SD records from stdin, in this case from a PubChem file:: % cat Compound_099000001_099500000.sdf.gz | \ ? python -m chemfp simsearch --query-format sdf.gz chebi_maccs.fps -k 1 | head -10 #Simsearch/1 #num_bits=166 #type=Tanimoto k=1 threshold=0.0 #software=chemfp/3.5.1 #targets=chebi_maccs.fps #target_source=ChEBI_lite.sdf.gz 1 99000039 CHEBI:116650 0.8000 1 99000230 CHEBI:127468 0.8837 1 99002251 CHEBI:109790 0.7091 1 99003537 CHEBI:131613 0.7458 .. using_fingerprint_file_metadata Make new fingerprints matching the type in an existing file =========================================================== In this section you'll learn how to generate fingerprints that match the fingerprint type of an existing file. You'll need a chemistry toolkit and a fingerprint file generated from that toolkit. This section assumes you have one of the ``chebi_maccs.fps`` ChEBI fingerprint files generated in an earlier section. In the previous section you learned how to use structures as input to simsearch; simsearch uses the target fingerprint file metadata to generate the appropriate fingerprints for each structure. What if you want to generate fingerprints appropriate for the target data set but don't immediately want to use them for a search? The ``--using FILENAME`` option to cdk2fps, oe2fps, rdkit2fps, and ob2fps opens the named fingerprint file to get the appropriate type. I'll demonstrate with a simple SMILES file, where the fingerprint types comes from ``chebi_maccs.fps``:: % cat simple.smi c1ccccc1O phenol CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine % cdk2fps simple.smi --using chebi_maccs.fps #FPS1 #num_bits=166 #type=CDK-MACCS/2.0 #software=CDK/2.4-SNAPSHOT chemfp/3.5.1 #source=simple.smi #date=2021-02-04T15:19:49 00000000000000000000000000000140004480101e phenol 000000003000000001d414d90323914380f138ea1f caffeine How do you know to use ``cdk2fps``? You don't, but the programs will try to process fingerprint types from other toolkits:: % oe2fps simple.smi --using chebi_maccs.fps WARNING: Fingerprint type 'CDK-MACCS/2.0' from the --using file 'chebi_maccs.fps' is based on the cdk toolkit. WARNING: oe2fps expects fingerprints based on the openeye toolkit. #FPS1 #num_bits=166 #type=CDK-MACCS/2.0 #software=CDK/2.4-SNAPSHOT chemfp/3.5.1 #source=simple.smi #date=2021-02-04T15:20:56 00000000000000000000000000000140004480101e phenol 000000003000000001d414d90323914380f138ea1f caffeine This cross-toolkit functionality is part of the long-term chemfp design but this specific code path hasn't yet been fully tested for all the possible error conditions, which is why it prints the two "WARNING" lines to stderr. If you know the chemfp fingerprint type string then you could pass that in on the command-line via the ``--type`` option, as in:: % ob2fps --type "OpenBabel-ECFP2 nBits=128" simple.smi #FPS1 #num_bits=128 #type=OpenBabel-ECFP2/1 nBits=128 #software=OpenBabel/3.0.0 chemfp/3.5.1 #source=simple.smi #date=2021-02-04T15:34:50 00080020000a00000c00000000000000 phenol 000c4448000000880404000221002040 caffeine .. _alternate_error_handlers: Alternate error handlers ======================== In this section you'll learn how to change the error handler for rdkit2fps using the :option:`--errors` option. By default the "2fps" programs "ignore" structures which could not be parsed into a molecule option. There are two other options. They can "report" more information about the failure case and keep on processing, or they can be "strict" and exit after reporting the error. This is configured with the :option:`--errors` option. Here's the rdkit2fps output using :option:`--errors report`:: [12:21:03] WARNING: not removing hydrogen atom without neighbors [12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Skipping. [12:21:03] Explicit valence for atom # 12 N, 4, is greater than permitted ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24338, record #381. Skipping. The first two lines come from RDKit. The third line is from chemfp, reporting which record could not be parsed. (The record starts at line 24228 of the file.) The fourth line is another RDKit error message, and the last line is another chemfp error message. Here's the rdkit2fps output using :option:`--errors strict`:: [12:24:24] WARNING: not removing hydrogen atom without neighbors [12:24:24] Explicit valence for atom # 12 N, 4, is greater than permitted ERROR: Could not parse molecule block, file 'ChEBI_lite.sdf.gz', line 24228, record #380. Exiting. Because this is strict mode, processing exits at the first failure. The ob2fps and oe2fps tools implement the :option:`--errors` option, but they aren't as useful as rdkit2fps because the underlying APIs don't give useful feedback to chemfp about which records failed. For example, the standard OEChem file reader automatically skips records that it cannot parse. Chemfp can't report anything when it doesn't know there was a failure. The default error handler in chemfp 1.1 was "strict". In practice this proved more annoying than useful because most people want to skip the records which could not be processed. They would then contact me asking what was wrong, or doing some pre-processing to remove the failure cases. One of the few times when it is useful is for records which contain no identifier. When I changed the default from "strict" to "ignore" and tried to process ChEBI, I was confused at first about why the output file was so small. Then I realized that it's because the many records without a title were skipped, and there was no feedback about skipping those records. I changed the code so missing identifiers are always reported, even if the error setting is "ignore". Missing identifiers will still stop processing if the error setting is "strict". chemfp's two cross-toolkit substructure fingerprints ==================================================== In this section you'll learn how to generate the two substructure-based fingerprints which come as part of chemfp. These are based on cross-toolkit SMARTS pattern definitions and can be used with Open Babel, OpenEye, and RDKit. (For OpenEye users, these fingerprints use the base OEChem library but do not use the separately licensed OEGraphSim library.) chemfp implements two platform-independent fingerprints where were originally designed for substructure filters but which are also used for similarity searches. One is based on the 166-bit MACCS implementation in RDKit and the other comes from the 881-bit PubChem/CACTVS substructure fingerprints. The chemfp MACCS definition is called "rdmaccs" because it closely derives from the MACCS SMARTS patterns used in RDKit. (These pattern definitions are also used in Open Babel and the CDK, while OpenEye has a completely independent implementation.) Here are example of the respective rdmaccs fingerprint for phenol using each of the toolkits. Open Babel:: % echo "c1ccccc1O phenol" | ob2fps --in smi --rdmaccs #FPS1 #num_bits=166 #type=RDMACCS-OpenBabel/2 #software=OpenBabel/3.0.0 chemfp/3.5 #date=2021-01-27T14:45:40 00000000000000000000000000000140004480101e phenol OpenEye:: % echo "c1ccccc1O phenol" | oe2fps --in smi --rdmaccs #FPS1 #num_bits=166 #type=RDMACCS-OpenEye/2 #software=OEChem/2.3.0 (20191016) chemfp/3.5 #date=2021-01-27T14:46:03 00000000000000000000000000000140004480101e phenol RDKit:: % echo "c1ccccc1O phenol" | rdkit2fps --in smi --rdmaccs #FPS1 #num_bits=166 #type=RDMACCS-RDKit/2 #software=RDKit/2020.03.1 chemfp/3.5 #date=2021-01-27T14:46:22 00000000000000000000000000000140004480101e phenol CDK:: % echo "c1ccccc1O phenol" | cdk2fps --in smi --rdmaccs #FPS1 #num_bits=166 #type=RDMACCS-CDK/2 #software=CDK/2.3 chemfp/3.5 #date=2021-01-27T14:47:34 00000000000000000000000000000140004480101e phenol For more complex molecules it's possible that different toolkits produce different fingerprint rdmaccs, even though the toolkits use the same SMARTS definitions. Each toolkit has a different understanding of chemistry. The most notable is the different definition of aromaticity, so the bit for "two or more aromatic rings" will be toolkit dependent. substruct fingerprints ---------------------- chemp also includes a "substruct" substructure fingerprint. This is an 881 bit fingerprint derived from the PubChem/CACTVS substructure keys. They do not match the CACTVS fingerprints exactly, in part due to differences in ring perception. Some of the substruct bits will always be 0. With that caution in mind, if you want to try them out, use the :option:`--substruct` option. The term "substruct" is a horribly generic name. If you can think of a better one then let me know. Until chemfp 3.0 I said these fingerprints were "experimental", in that I hadn't fully validated them against PubChem/CACTVS and could not tell you the error rate. I still haven't done that. What's changed is that I've found out over the years that people are using the substruct fingerprints, even without full validatation. That surprised me, but use is its own form of validation. I still would like to validate the fingerprints, but it's slow, tedious work which I am not really interested in doing. Nor does it earn me any money. Plus, if the validation does lead to any changes, it's easy to simply change the version number. .. _pubchem_fpb_fingerprints: Generate binary FPB files from a structure file =============================================== In this section you'll learn how to generate an FPB file instead of an FPS file. You will need the the ChEBI file from :ref:`chebi_fingerprints` and a chemistry toolkit. The FPB format was introduced with chemfp-2.0. .. NOTE:: Several chemfp features, like creating FPB files, require a valid license key. If you are using chemfp under the Base License Agreement then contact sales@dalkescientific.com to purchase a license key or request an evaluation license. The FPB format was designed so the fingerprints can be memory-mapped directly to chemfp's internal data structures. This makes it very fast to load, but unlike the FPS format, it's not so easy to write with your own code. You should think of the FPB format as an binary application format, for chemfp-based tools, while the FPS format is a text-based format for data exchange between diverse programs. The easiest way to generate an FPB file from the command line is to use the ".fpb" extension instead of ".fps" or ".fps.gz". Here are examples using each of the toolkits. Open Babel:: % ob2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o ob_chebi.fpb OpenEye:: % oe2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o oe_chebi.fpb RDKit:: % rdkit2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o rdkit_chebi.fpb CDK:: % cdk2fps --id-tag "ChEBI ID" ChEBI_lite.sdf.gz -o cdk_chebi.fpb The binary format isn't human-readable. Use :ref:`fpcat` to see what's inside:: % fpcat oe_chebi.fpb #FPS1 #num_bits=4096 #type=OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral #software=OEGraphSim/2.4.3 (20191016) chemfp/3.4 0000000 ... many zeros ...00000000000000 CHEBI:15378 0000000 ... many zeros ...00000000000000 CHEBI:16042 0000000 ... many zeros ...00000000000000 CHEBI:17792 .... 182b528 ... many hex values ... a8c10c0c CHEBI:60493 By default the fingerprints are ordered from smallest popcount to largest, which you can see in the output. A pre-ordered index is faster to search because the target popcounts are pre-computed and because it often reduces the search space. If you want to preserve the input order then you'll need to pipe the FPS output to :ref:`fpcat ` and use its :option:`--preserve-order` flag. See the next section for an example. Convert between FPS and FPB formats =================================== In this section you'll learn how to convert an FPS file into an FPB file and back, and you'll learn how to control the fingerprint ordering. You will need the FPS files generated in :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. The FPB format was introduced with chemfp-2.0. If you already have an FPS file then you can convert it directly into an FPB file, and without using a chemistry toolkit. The :ref:`fpcat ` program converts from one format to the other. In an earlier section I generated the files pubchem_queries.fps and pubchem_targets.fps . I'll convert each to FPB format:: % fpcat pubchem_targets.fps -o pubchem_targets.fpb % fpcat pubchem_queries.fps -o pubchem_queries.fpb The FPB format is a binary format which is difficult to read directly. The easiest way to see what's inside is to use fpcat. If you don't specify an output filename then it sends the results to stdout in FPS format:: % fpcat pubchem_queries.fpb | head -5 | fold #FPS1 #num_bits=881 #type=CACTVS-E_SCREEN/1.0 extended=2 #software=CACTVS/unknown 00028000e00000000000000000000000000000000000000000000000000000000000009840000000 0000c001000300000000000000000000000000000000000000000200000000000000000000000000 00000000000000000000000000000000000000000000000000000000000000 99116624 The keen-eyed reader might have noticed that the conversion does not have a "source" or "date" field. I haven't figured out if this is a bug. Should I keep the original date and structure file source, or use the current date and FPS file source? Let me know if this is important to you. By default when fpcat generates an FPB file it reorders the fingerprints by population count and creates a popcount index. This improves the similarity search performance, but it means that the order of the FPB file is likely different than the original FPS format. You can get a sense of this by looking at the first fingerprint in the original pubchem_queries.fps file:: % grep -v # pubchem_queries.fps | head -1 | fold 07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000 0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106 69001393e338d1017100000000204000000000000010200000000000000000 99000039 and confirming that it isn't the same as the first fingerpritn in pubchem_queries.fpb. If you want the FPB file to store the fingerprints in input order instead of the popcount order needed for optimized similarity search, then use the :option:`--preserve-order` flag:: % fpcat pubchem_queries.fps --preserve-order -o input_order.fpb % fpcat input_order.fpb | grep -v # | head -1 | fold 07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000 0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106 69001393e338d1017100000000204000000000000010200000000000000000 99000039 On the flip side, fpcat by default preserves the input order when it creates FPS output. If you instead want to the output FPS file to be in popcount order then use the :option:`--reorder` flag:: % fpcat --reorder pubchem_queries.fps | grep -v # | head -1 | fold 00028000e00000000000000000000000000000000000000000000000000000000000009840000000 0000c001000300000000000000000000000000000000000000000200000000000000000000000000 00000000000000000000000000000000000000000000000000000000000000 99116624 Specify the fpcat output format =============================== In this section you'll learn how to specify the output format for fpcat using a command-line option instead of the filename extension. You will need the pubchem_queries.fpb file from :ref:`pubchem_fingerprints`. If you do not specify an output filename then fpcat will output the fingerprints in FPS format to stdout. If you specify a filename then by default it will look at the extension to determine if the output should be an FPB (".fpb"), FPS (".fps"), or gzip or Zstandard compressed FPS (".fps.gz" or ".fps.zst") file. The FPS format is used for unrecognized extensions. In a few rare cases you may want to use a format which doesn't match the default. To be honest, the examples I can think of aren't that realistic, but let's suppose you want to output the contents of an FPB file to stdout in gzip'ed FPS format, and count the number of bytes in compressed output. I'll use the use the --out flag to change the format to 'fps.gz' from the default of 'fps', then compare the resulting size with the uncompressed form:: % fpcat pubchem_queries.fpb --out fps | wc -c 2511714 % fpcat pubchem_queries.fpb --out fps.gz | wc -c 314393 It's not that useful because you could pipe the uncompressed output to gzip, which is also likely faster:: % fpcat pubchem_queries.fpb --out fps | gzip -c -9 | wc -c 11921 In case you're wondering, chemfp 3.4 added support for zstandard compression, if the "zstandard" Python module is available. % fpcat pubchem_queries.fpb --out fps.zst | wc -c 293806 Chemfp cannot write an FPB file to stdout. In fact, the output file must be seek-able, which means it can't be a named pipe either. Alternate fingerprint file formats ================================== In this section you'll learn about chemfp's support for other fingerprint file formats. Chemfp started as a way to promote the FPS file format for fingerprint exchange. Chemfp 2.0 added the FPB format, which is a binary format designed around chemfp's internal search data structure so it can be loaded quickly. There are many other fingerprint formats. Perhaps the best known is the Open Babel `FastSearch `_ format. Two others are Dave Cosgrove's `flush `_ format, and OpenEye's "fpbin" format. The `chemfp_converters package `_ contains utilities to convert between the chemfp formats and these other formats.:: # Convert from/to Dave Cosgrove Flush format flush2fps drugs.flush fps2flush drugs.fps -o drugs.flush # Convert from/to OpenEye's fpbin format fpbin2fps drugs.fpbin --moldb drugs.sdf fps2fpbin drugs_openeye_path.fps --moldb drugs.sdf -o drugs.fpbin # Convert from/to Open Babel's FastSearch format fs2fps drugs.fs --datafile drugs.sdf fps2fs drugs_openbabel_FP2.fps --datafile drugs.sdf -o drugs.fs Of the three formats, the flush format is closest to the FPS data model. That is, it stores fingerprint records as an identifier and the fingerprint bytes. By comparison, the FastSearch and fpbin formats store the fingerprint bytes and an index into another file containing the structure and identifier. It's impossible for chemfp to get the data it needs without reading both files. Chemfp has special support for the flush format. If chemfp_converters is installed, chemfp will use it to read and write flush files nearly everywhere that it accepts FPS files. You can use it at the output to oe2fps, rdkit2fps, and ob2fps, and as the input queries to simsearch, and as both input and output to fpcat. (You cannot use it as the simsearch targets because that code has been optimized for FPS and FPB search, and I haven't spent the time to optimize flush file support.) This means that if chemfp_converters is installed then you can use :ref:`fpcat ` to convert between FPS, FPB, and and flush file formats. For examples:: fpcat drugs.flush -o drugs.fps fpcat drugs.fps -o drugs.flush In addition, you can use it at the API level in :func:`chemfp.open`, :func:`chemfp.load_fingerprints`, :func:`chemfp.open_fingerprint_writer`, and :meth:`.FingerprintArena.save`. Note that the flush format does not support the FPS metadata fields, like the fingerprint type, and it only support fingerprints which are a multiple of 32 bits long. Also, compressed flush files are not supported. Similarity search with the FPB format ===================================== In this section you'll learn how to do a similarity search using an FPB file as the target. You will need the fingerprint files from :ref:`pubchem_fingerprints` but you do not need a chemistry toolkit. NOTE: The Chemfp Base License does not let you generate FPB files. Contact sales@dalkescientific.com to learn about other licensing options. :ref:`Simsearch `, like all of the tools starting with chemfp-2.0, understands both FPS and FPB files:: % simsearch -k 3 --threshold 0.85 -q pubchem_queries.fps pubchem_targets.fpb | head #Simsearch/1 #num_bits=881 #type=Tanimoto k=3 threshold=0.85 #software=chemfp/3.4 #queries=pubchem_queries.fps #targets=pubchem_targets.fpb #query_source=Compound_099000001_099500000.sdf.gz 3 99000039 48503376 0.8785 48503380 0.8729 48732162 0.8596 2 99000230 48563034 0.8588 48731730 0.8523 0 99002251 You can also use an FPB file as the queries. The pubchem_queries.fpb file are indexed, which means the queries with the fewest bits set come first. These will likely be less similar to the targets, so I've lowered the threshold quite considerably:: % simsearch -k 3 --threshold 0.15 -q pubchem_queries.fpb pubchem_targets.fpb | head #Simsearch/1 #num_bits=881 #type=Tanimoto k=3 threshold=0.15 #software=chemfp/3.4 #queries=pubchem_queries.fpb #targets=pubchem_targets.fpb 1 99116624 48637532 0.1607 1 99116625 48637532 0.1607 3 99116667 48656359 0.2727 48656867 0.2667 48839868 0.2642 3 99116668 48656359 0.2727 48656867 0.2667 48839868 0.2642 By default simsearch uses the query and target filename extensions to figure out if the file is in FPS, FPB, or flush format. If you don't want it to auto-detect the format then use the :option:`--query-format` and :option:`--target-format` options to tell it the format to use. The values can be one of "fps", "fps.gz", "fps.zst", "fpb", "fpb.gz", "fpb.zst", or "flush". Converting large data sets to FPB format ======================================== In this section you'll learn how to generate an FPB file on computers with relatively limited memory. To be realistic, this example uses the complete `PubChem data set `_, and extracts the CACTVS/PubChem fingerprints which are in each record. You do not need a chemistry toolkit for this section. The most direct way to extract the PubChem fingerprints from a PubChem distribution is to use :ref:`sdf2fps `:: sdf2fps --pubchem pubchem/Compound_*.sdf.gz -o pubchem.fpb This uses the default FPB writer options, which stores all of the fingerprints in memory, sorts them, and saves the result to the output file. This may use several times as much memory as the final FPB output size, which is a bit unfortunate if you want to generate a 7 GB FPB file on a 12 GB machine. When I updated this section in June 2020, it took around 25GB of memory to create an FPB file with 102,768,482 PubChem fingerprints, and the final file was about 14GB. (Note: see :ref:`the next section ` for a two-stage solution that lets you parallelize fingerprint generation.) The "\*2fps" command-line tools do not have a way to change the default writer options, although :ref:`fpcat ` does. The :option:`--max-spool-size` option sets a rough upper bound to the amount of memory to use. When enabled, the writer breaks the input into parts and creates a temporary FPB file for each part. At the end, it merges the sorted data from the temporary FPB files to get the final FPB file. Be aware that the specified spool size is only approximate and is not a hard limit on the maximum amount of memory to use. You may need to experiment a bit if you have tight constraints, and this option might not be as useful as I thought it was. The value must be a size in bytes, though suffixes like M or MB for megabyte and T or TB for terabyte are also allowed. These are in base-10 units, so 1 MB = 1,000,000 bytes. Spaces are not allowed between the number and the suffix, so "200MB" is okay but "200 MB" is not. The size must be at least 20 MB. Here is an example of how to convert the CACTVS fingerprints from all of PubChem to an FPB file, using a relatively small limit of 200 MB:: sdf2fps --pubchem pubchem/Compound_*.sdf.gz | fpcat --max-spool-size 200MB -o pubchem.fpb This will take a while! The sdf2fps alone takes almost 45 minutes on a ca. 2017-era Haswell machine. If I save the intermediate results to an FPS file then the in-memory fpcat conversion from FPS to FPB takes 5½ minutes and requires 25GB of memory. With spool of 200MB, the conversion takes nearly 10 minutes. According to ``htop``, the spooled conversion required, near the peak, 13.3G of virtual memory, a resident set size of 12G, and 10.6G of shared shared pages. The shared pages are from memory-mapping the intermediate FPB files, so this probably required only 2GB of real memory. If I use a 1GB spool size, the conversion time decreases from 10 to 8 minutes, and uses about the same amount of peak memory. The temporary files will be placed under the appropriate temporary directory for your operating system. If that disk isn't large enough for the intermediate files then use the :option:`--tmpdir` option of fpcat to specify an alternate directory:: fpcat --max-spool-size 1GB pubchem.fps -o pubchem.fpb --tmpdir /usr/tmp Another option is to specify the directory location using the TMPDIR, TEMP, or TMP environment variables, which are resolved in that order. The details are described in the Python documentation for `tempfile.tempdir `_. .. _generate_fingerprints_in_parallel: Generate fingerprints in parallel and merge to FPB format ========================================================= In this section you'll learn how to merge multiple sorted fingerprints into a single FPB file. The previous section used a single shell command to extract the PubChem/CACTVS fingerprints from PubChem and generate an FPB file. This is easy to write and understand, but more complex versions may be more appropriate. For one, I have four cores on my desktop computer, and I want to use them to process the PubChem files in parallel. The previous section was only single threaded. I have all my PubChem files in ``~/pubchem/``. For each "Compound_*.sdf.gz" file in that directory I want to extract the CACTVS/PubChem fingerprints and create an intermediate FPS file in the local directory. That's equivalent to running the following commands:: sdf2fps --pubchem ~/pubchem/Compound_000000001_000500000.sdf.gz \\ -o Compound_000000001_000500000.fps sdf2fps --pubchem ~/pubchem/Compound_000500001_001000000.sdf.gz \\ -o Compound_000500001_001000000.fps ... 291 more lines ... except that I want to run four at a time. This is what `GNU Parallel `_ was designed for. It's a command-line tool which can parallelize the execution of other command-lines. I'll start by explaining the core command-line substitution pattern:: sdf2fps --pubchem {} -o {/..}.fps' The ``{}`` will be replaced with a filename, and ``{/..}`` will be replaced with the base filename, without the directory path prefix or the two suffixes. That is, when ``{}`` is "/Users/dalke/pubchem/Compound_000000001_000500000.sdf.gz" then ``{/..}`` will be "Compound_000000001_000500000.fps". Since I want to generate an FPS file, I added the ".fps" as a suffix to the second substitution parameter. I then tell GNU parallel which command-line to use, along with a few other parameters. Here's the full line, which I split over two lines to make it more readable:: parallel --plus --no-notice --bar 'sdf2fps --pubchem {} -o {/..}.fps' ::: ~/pubchem/Compound_*.sdf.gz The :option:`--plus` tells GNU parallel to recognize an expanded set of replacement strings. ("{/..}" is not part of the standard set of patterns.) The :option:`--no-notice` tells it to not display the message about citing GNU parallel in scientific papers. The :option:`--bar` enables a progress bar, which looks like this:: 30% 88:205=11m17s /Users/dalke/pubchem/Compound_045500001_046000000.sdf.gz This status line shows that processing is 30% complete, which is file 88 out of 205, and there's an estimated 11 minutes and 17 seconds remaining. Finally, the ":::" indicates that the remaining options are the list of parameters to pass to the command-line template for parallelization. After about 21 minutes, using 4 CPUs on my laptop (with an effective scaling of 2.8), I now have a large number of FPS files, which I want to merge into a single FPB file. I'll use :ref:`fpcat `:: fpcat --max-spool-size 1GB Compound*.fps -o pubchem.fpb Unfortunately my laptop ran out of disk space, so I'll just leave it a that; re-doing the same command on a server machine won't provide you any new information.