===============================
Help for the command-line tools
===============================

The chemfp command-line tools are:

* :ref:`fpcat <fpcat>` - merge multiple fingerprint files into one
* :ref:`ob2fps <ob2fps>` - use Open Babel to generate fingerprints
* :ref:`oe2fps <oe2fps>` - use OEChem/OEGraphSim to generate fingerprints
* :ref:`rdkit2fps <rdkit2fps>` - use RDKit to generate fingerprints
* :ref:`sdf2fps <sdf2fps>` - extract fingerprints from an SD file
* :ref:`simsearch <simsearch>` - search a fingerprint file for similar fingerprints

.. _fpcat:

fpcat command-line options
==========================

The following comes from ``fpcat --help``:

.. code-block:: none

    usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
                 [--reorder] [--preserve-order] [--alignment N]
                 [--show-progress] [--max-spool-size SIZE] [--tmpdir DIRNAME]
                 [--version]
                 [filename [filename ...]]

    Combine multiple fingerprint files into a single file.

    positional arguments:
      filename              input fingerprint filenames (default: use stdin)

    optional arguments:
      -h, --help            show this help message and exit
      --in FORMAT           input fingerprint format. One of fps, fps.gz, or fpb.
                            (default guesses from filename or is fps)
      --merge               assume the input fingerprint files are in popcount
                            order and do a merge sort
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output fingerprint format. One of fps, fps.gz, or fpb.
                            (default guesses from output filename, or is 'fps')
      --reorder             reorder the output fingerprints by popcount (default
                            for FPB output)
      --preserve-order      save the output fingerprints in the same order as the
                            input (default for FPS output)
      --alignment N         alignment size when saving a FPB file (default=8)
      --show-progress       show progress
      --max-spool-size SIZE
                            use temporary files for extra storage space for huge
                            FPB files (default uses RAM)
      --tmpdir DIRNAME      directory for the temporary files (default uses the
                            system temp directory)
      --version             show program's version number and exit

    Examples:

    fpcat can be used to convert between FPS and FPB formats. This is
    handy if you want to see what's inside of an FPB file:

        fpcat fingerprints.fpb

    You can also use fpcat to make an FPB file from an FPS file:

        fpcat fingerprints.fps -o fingerprints.fpb

    You might have generated a set of FPS files which you want to merge
    into a single FPB. (For example, you might have used GNU parallel to
    generate FPS files for each of the PubChem files, which you want to
    merge into a single file.):

        fpcat Compound_*.fps -o pubchem.fpb

    By default the FPB format sorts the fingerprints by popcount. (Use
    --preserve-order if you really want to preserve the input order.) The
    sort overhead for PubChem uses about 10 GB of RAM. If you don't have
    that much memory then ask fpcat to use less memory:

        fpcat --max-spool-size 1GB Compound_*.fps -o pubchem.fpb

    This will use about 2 GB of RAM and the --tmpdir for the rest. (Yes,
    it would be nice if I could get those two memory size numbers to
    match.)

    The --merge option is experimental. Use it if the input fingerprints
    are in popcount order, because sorted output is a simple merge sort of
    the individual sorted inputs. However, this option opens all input
    files at the same time, which may exceed your resource limit on file
    descriptors. The current implementation also requires a lot of disk
    seeks so is slow for many files.

.. _ob2fps:

ob2fps command-line options
===========================

The following comes from ``ob2fps --help``:

.. code-block:: none

    usage: ob2fps [-h]
                  [--FP2 | --FP3 | --FP4 | --MACCS | --substruct | --rdmaccs | --rdmaccs/1]
                  [--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
                  [--errors {strict,report,ignore}] [-R NAME=VALUE]
                  [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                  [--version]
                  [filenames [filenames ...]]

    Generate FPS or FPB fingerprints from a structure file using OpenBabel

    positional arguments:
      filenames             input structure files (default is stdin)

    optional arguments:
      -h, --help            show this help message and exit
      --FP2                 linear fragments up to 7 atoms
      --FP3                 SMARTS patterns specified in the file patterns.txt
      --FP4                 SMARTS patterns specified in the file
                            SMARTS_InteLigand.txt
      --MACCS               Open Babel's implementation of the MACCS 166 keys
      --substruct           ChemFP substructure fingerprints
      --rdmaccs, --rdmaccs/2
                            166 bit RDKit/MACCS fingerprints (version 2)
      --rdmaccs/1           use the version 1 definition for --rdmaccs
      --id-tag NAME         tag name containing the record id (SD files only)
      --in FORMAT           input structure format (default autodetects from the
                            filename extension)
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output structure format (default guesses from output
                            filename, or is 'fps')
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=ignore)
      -R NAME=VALUE         specify a reader argument
      --delimiter {tab,whitespace,to-eol,space}
                            delimiter style for SMILES and InChI files. Alias for
                            '-R delimiter=VALUE'.
      --has-header          Skip the first line of a SMILES or InChI file. Alias
                            for '-R has_header=1'.
      --version             show program's version number and exit

    By default the Open Babel structure reader determines the file format
    and compression type based on the filename extension. Unknown filename
    extensions are treated as uncompressed SMILES files.

    If the data comes from stdin, or the guess based on extension name is
    wrong, then use the "--in FORMAT" option to change the default input
    format. For example:

      --in smi
      --in sdf.gz

    The most common format names are:

      File Type      Valid FORMAT names
      ---------      ------------------
      SMILES         smi, can, usm - append ".gz" for gzip'ed files
      InChI          inchi - append ".gz" for gzip'ed files
      SDF (native)   sdf  - gzip compression is handled automatically
      SDF (chemfp)   sdf  - append ".gz" suffix for gzip'ed files
      MOL2           mol2 - gzip compression is handled automatically
      PDB            pdb  -  "        "       "    "          "
      MacroModel     mmod -  "        "       "    "          "

    For a full list of formats, see
    http://openbabel.org/wiki/List_of_extensions .

    Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

    chemfp uses its own parsers to find SMILES and InChI records, which
    are passed on to Open Babel for processing. These give chemfp better
    error reporting and control. However, unlike the normal Open Babel
    parsers, they do not automatically recognize gzip files, so the format
    name must include the ".gz" suffix to read compressed formats.

    By default chemfp uses Open Babel's native SDF reader. It also
    supports an alternate implementation using chemfp's low-level SDF
    record parser. To use chemfp's record parser, use the 'implementation'
    reader argument:

      -R implementation=chemfp

    All formats support Open Babel's 'options' OBConversion argument. This
    is a compact string like 'ab"btext"', which in this case sets option
    'a' to True, and option 'b' to the text "btext". You will need to
    consult the Open Babel documentation and implementation for details on
    the options available to each format.

.. _oe2fps:

oe2fps command-line options
===========================

The following comes from ``oe2fps --help``:

.. code-block:: none

    usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT]
                  [--minbonds INT] [--maxbonds INT] [--minradius INT]
                  [--maxradius INT] [--atype ATYPE] [--btype BTYPE]
                  [--maccs166] [--substruct] [--rdmaccs] [--rdmaccs/1]
                  [--aromaticity NAME] [--id-tag NAME] [--in FORMAT]
                  [-o FILENAME] [--out FORMAT]
                  [--errors {strict,report,ignore}] [-R NAME=VALUE]
                  [--delimiter {tab,whitespace,to-eol,space}] [--version]
                  [filenames [filenames ...]]

    Generate FPS or FPB fingerprints from a structure file using OEChem

    positional arguments:
      filenames             input structure files (default is stdin)

    optional arguments:
      -h, --help            show this help message and exit
      --aromaticity NAME    use the named aromaticity model (same as '-R
                            aromaticity=NAME')
      --id-tag NAME         tag name containing the record id (SD files only)
      --in FORMAT           input structure format (default guesses from filename)
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output structure format (default guesses from output
                            filename, or is 'fps')
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=ignore)
      -R NAME=VALUE         specify a reader argument
      --delimiter {tab,whitespace,to-eol,space}
                            delimiter style for SMILES and InChI files. Alias for
                            '-R delimiter=VALUE'.
      --version             show program's version number and exit

    path, circular, and tree fingerprints:
      --path                generate path fingerprints (default)
      --circular            generate circular fingerprints
      --tree                generate tree fingerprints
      --numbits INT         number of bits in the fingerprint (default=4096)
      --minbonds INT        minimum number of bonds in the path or tree
                            fingerprint (default=0)
      --maxbonds INT        maximum number of bonds in the path or tree
                            fingerprint (path default=5, tree default=4)
      --minradius INT       minimum radius for the circular fingerprint
                            (default=0)
      --maxradius INT       maximum radius for the circular fingerprint
                            (default=5)
      --atype ATYPE         atom type flags, described below (default=Default)
      --btype BTYPE         bond type flags, described below (default=Default)

    166 bit MACCS substructure keys:
      --maccs166            generate MACCS fingerprints

    881 bit ChemFP substructure keys:
      --substruct           generate ChemFP substructure fingerprints

    ChemFP version of the 166 bit RDKit/MACCS keys:
      --rdmaccs, --rdmaccs/2
                            generate 166 bit RDKit/MACCS fingerprints (version 2)
      --rdmaccs/1           use the version 1 definition for --rdmaccs

    ATYPE is one or more of the following, separated by the '|' character

      Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
      Hyb InRing

    The following shorthand terms and expansions are also available:

      DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
      DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
      DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb

    and 'Default' selects the correct value for the specified fingerprint.

    Examples:
      --atype Default
      --atype "Arom|AtmNum|FCharge|HCount"
      --atype Arom,AtmNum,FCharge,HCount

    BTYPE is one or more of the following, separated by the '|' character

      Chiral InRing Order

    The following shorthand terms and expansions are also available:

      DefaultPathBond = Order|Chiral
      DefaultCircularBond = Order
      DefaultTreeBond = Order

    and 'Default' selects the correct value for the specified fingerprint.

    Examples:
      --btype Default
      --btype Order|InRing

    To simplify command-line use, a comma may be used instead of a '|' to
    separate different fields. Example:

      --atype AtmNum,HvyDegree

    OEChem guesses the input structure format based on the filename
    extension and assumes SMILES for structures read from stdin.
    Use "--in FORMAT" to select an alternative, where FORMAT is one of:

      File Type      Valid FORMATs (use gz if compressed)
      ---------      ------------------------------------
      SMILES         smi, can, usm, smi.gz, can.gz, usm.gz
      SDF            sdf, mol, sdf.gz, mol.gz
      SKC            skc, skc.gz
      CDK            cdk, cdk.gz
      MOL2           mol2, mol2.gz
      PDB            pdb, pdb.gz
      MacroModel     mmod, mmod.gz
      OEBinary v2    oeb, oeb.gz
      InChI          inchi, inchi.gz

    Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

    Use the '-R' reader arguments option to pass in format-specific
    structure reader arguments. The details depend on the specific format.
    All formats handle the following two reader arguments:

      aromaticity - one of 'openeye', 'daylight', 'tripos', 'mdl', or 'mmff'
          (this can also be set via the older '--aromaticity' command-line
          option)
      flavor - a '|' or ',' separated list of flavor names, or a numeric
          value. A leading '-' means to remove the given flavor. Examples
          include:
            o  Canon,Strict -- the bitwise merger of the format's Canon and
                 Strict values
            o  DEFAULT|-Kekule -- the format's DEFAULT flavor but without the
                 Kekule bits (every flavor has a DEFAULT)
            o  42 -- the specific OEChem flavor value 42

      Format     Reader arguments
      ------     ----------------
      smi,       flavor using 'Canon', 'Strict', and 'DEFAULT'
      can,       delimiter -- one of 'to-eol', 'tab', 'whitespace', or 'space'
      & usm
      sdf        the only flavor is 'DEFAULT'
      skc        the only flavor is 'DEFAULT'
      mol2       flavor using 'M2H'
      mol2h      flavor using 'M2H'
      mmod       flavor using 'FormalCrg'
      pdb        flavor using 'ALL', 'BondOrder', 'CHARGE', 'Connect', 'DATA',
                 'END', 'ENDM', 'FORMALCHARGE', 'FormalCrg', 'ImplicitH',
                 'RADIUS', 'Rings', 'SecStruct', and 'TER'
      xyz        flavor using 'BondOrder', 'Connect', 'FormalCrg',
                 'ImplicitH', and 'Rings'
      cdx        flavor using 'SuperAtom'
      oeb        the only flavor is 'DEFAULT'

    See
    http://docs.eyesopen.com/toolkits/cpp/oechemtk/molreadwrite.html#flavored-input-and-output
    for a description of available flavors for each format.

.. _rdkit2fps:

rdkit2fps command-line options
==============================

The following comes from ``rdkit2fps --help``:

.. code-block:: none

    usage: rdkit2fps [-h] [--fpSize FPSIZE] [--RDK] [--minPath INT]
                     [--maxPath INT] [--nBitsPerHash INT] [--useHs 0|1]
                     [--morgan] [--radius INT] [--useFeatures 0|1]
                     [--useChirality 0|1] [--useBondTypes 0|1] [--torsions]
                     [--targetSize INT] [--pairs] [--minLength INT]
                     [--maxLength INT] [--maccs166] [--avalon]
                     [--isQuery 0_or_1] [--bitFlags INT] [--pattern]
                     [--substruct] [--rdmaccs] [--rdmaccs/1] [--id-tag NAME]
                     [--in FORMAT] [-o FILENAME] [--out FORMAT]
                     [--errors {strict,report,ignore}] [-R NAME=VALUE]
                     [--delimiter {tab,whitespace,to-eol,space}]
                     [--has-header] [--version]
                     [filenames [filenames ...]]

    Generate FPS or FPB fingerprints from a structure file using RDKit

    positional arguments:
      filenames             input structure files (default is stdin)

    optional arguments:
      -h, --help            show this help message and exit
      --fpSize FPSIZE       number of bits in the fingerprint. Default of 2048 for
                            RDK, Morgan, topological torsion, atom pair, and
                            pattern fingerprints, and 512 for Avalon fingerprints
      --id-tag NAME         tag name containing the record id (SD files only)
      --in FORMAT           input structure format (default guesses from filename)
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output structure format (default guesses from output
                            filename, or is 'fps')
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=ignore)
      -R NAME=VALUE         specify a reader argument
      --delimiter {tab,whitespace,to-eol,space}
                            delimiter style for SMILES and InChI files. Alias for
                            '-R delimiter=VALUE'.
      --has-header          Skip the first line of a SMILES or InChI file. Alias
                            for '-R has_header=1'.
      --version             show program's version number and exit

    RDKit topological fingerprints:
      --RDK                 generate RDK fingerprints (default)
      --minPath INT         minimum number of bonds to include in the subgraph
                            (default=1)
      --maxPath INT         maximum number of bonds to include in the subgraph
                            (default=7)
      --nBitsPerHash INT    number of bits to set per path (default=2)
      --useHs 0|1           include information about the number of hydrogens on
                            each atom (default=1)

    RDKit Morgan fingerprints:
      --morgan              generate Morgan fingerprints
      --radius INT          radius for the Morgan algorithm (default=2)
      --useFeatures 0|1     use chemical-feature invariants (default=0)
      --useChirality 0|1    include chirality information (default=0)
      --useBondTypes 0|1    include bond type information (default=1)

    RDKit Topological Torsion fingerprints:
      --torsions            generate Topological Torsion fingerprints
      --targetSize INT      number of bonds per torsion (default=4)

    RDKit Atom Pair fingerprints:
      --pairs               generate Atom Pair fingerprints
      --minLength INT       minimum bond count for a pair (default=1)
      --maxLength INT       maximum bond count for a pair (default=30)

    166 bit MACCS substructure keys:
      --maccs166            generate MACCS fingerprints

    Avalon fingerprints:
      --avalon              generate Avalon fingerprints
      --isQuery 0_or_1      is the fingerprint for a query structure? (1 if yes, 0
                            if no) (default=0)
      --bitFlags INT        bit flags, SSSBits are 32767 and similarity bits are
                            15761407 (default=15761407)

    RDKit Pattern fingerprints:
      --pattern             generate (substructure) pattern fingerprints

    ChemFP's version of the 881 bit PubChem substructure keys:
      --substruct           generate ChemFP substructure fingerprints

    ChemFP version of the 166 bit RDKit/MACCS keys:
      --rdmaccs, --rdmaccs/2
                            generate 166 bit RDKit/MACCS fingerprints (version 2)
      --rdmaccs/1           use the version 1 definition for --rdmaccs

    This program guesses the input structure format and the compression
    based on the filename extension. If the guess fails then it assumes
    the input is an uncompressed SMILES file. If the data comes from
    stdin, or the guess based on extension name is wrong, then use "--in"
    to change the default input format. The supported format extensions
    are:

      File Type      Valid FORMATs (use gz if compressed)
      ---------      ------------------------------------
      SMILES         smi, can, usm, smi.gz, can.gz, ism.gz
      SDF            sdf, sdf.gz
      InChI          inchi, inchi.gz

    Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

    Use the '-R' reader arguments option to pass in format-specific
    structure reader arguments. The details depend on the specific format.

      * All of the input formats implement the 'sanitize' option. Use
        "-R sanitize=false" to disable the default sanitization.

      * The SMILES formats use the 'delimiter' option to specify the
        delimiter type. The default is 'to-eol'. The other values are
        "tab", "whitespace", and "space". Use "-R delimiter=whitespace"
        to match RDKit's native delimiter style.

      * The SDF format supports two additional reader arguments:
          * 'strictParsing'; use "-R strictParsing=false" to disable
            strict parsing
          * 'removeHs'; use "-R removeHs=false" to keep all of the
            hydrogens

      * The InChI format supports four additional reader arguments:
          * 'delimiter' works the same as it does for the SMILES formats
          * 'removeHs' works the same as it does for the SDF format
          * 'treatWarningAsError'; use "-R treatWarningAsError=true" to
            convert all warnings into errors
          * 'logLevel' specifies the RDKit/InChI library log level, as an
            integer

.. _sdf2fps:

sdf2fps command-line options
============================

The following comes from ``sdf2fps --help``:

.. code-block:: none

    usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
                   [--num-bits INT] [--errors {strict,report,ignore}]
                   [-o FILENAME] [--out FORMAT] [--software TEXT]
                   [--type TEXT] [--version] [--binary] [--binary-msb]
                   [--hex] [--hex-lsb] [--hex-msb] [--base64] [--cactvs]
                   [--daylight] [--decoder DECODER] [--pubchem]
                   [filenames [filenames ...]]

    Extract a fingerprint tag from an SD file and generate FPS or FPB
    fingerprints

    positional arguments:
      filenames             input SD files (default is stdin)

    optional arguments:
      -h, --help            show this help message and exit
      --id-tag TAG          get the record id from TAG instead of the first line
                            of the record
      --fp-tag TAG          get the fingerprint from tag TAG (required)
      --in FORMAT           Specify if the input SD file is uncompressed or gzip
                            compressed
      --num-bits INT        use the first INT bits of the input. Use only when the
                            last 1-7 bits of the last byte are not part of the
                            fingerprint. Unexpected errors will occur if these
                            bits are not all zero.
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=strict)
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output structure format (default guesses from output
                            filename, or is 'fps')
      --software TEXT       use TEXT as the software description
      --type TEXT           use TEXT as the fingerprint type description
      --version             show program's version number and exit

    Fingerprint decoding options:
      --binary              Encoded with the characters '0' and '1'. Bit #0 comes
                            first. Example: 00100000 encodes the value 4
      --binary-msb          Encoded with the characters '0' and '1'. Bit #0 comes
                            last. Example: 00000100 encodes the value 4
      --hex                 Hex encoded. Bit #0 is the first bit (1<<0) of the
                            first byte. Example: 01f2 encodes the value \x01\xf2
                            = 498
      --hex-lsb             Hex encoded. Bit #0 is the eighth bit (1<<7) of the
                            first byte. Example: 804f encodes the value \x01\xf2
                            = 498
      --hex-msb             Hex encoded. Bit #0 is the first bit (1<<0) of the
                            last byte. Example: f201 encodes the value \x01\xf2
                            = 498
      --base64              Base-64 encoded. Bit #0 is first bit (1<<0) of first
                            byte. Example: AfI= encodes value \x01\xf2 = 498
      --cactvs              CACTVS encoding, based on base64 and includes a
                            version and bit length
      --daylight            Daylight encoding, which is a base64 variant
      --decoder DECODER     import and use the DECODER function to decode the
                            fingerprint

    shortcuts:
      --pubchem             decode CACTVS substructure keys used in PubChem. Same
                            as --software=CACTVS/unknown --type 'CACTVS-
                            E_SCREEN/1.0 extended=2' --fp-
                            tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs

.. _simsearch:

simsearch command-line options
==============================

The following comes from ``simsearch --help``:

.. code-block:: none

    usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--alpha ALPHA]
                     [--beta BETA] [--queries QUERIES] [--NxN] [--query QUERY]
                     [--hex-query HEX_QUERY] [--query-id QUERY_ID]
                     [--query-format FORMAT] [--target-format FORMAT]
                     [-o FILENAME] [-c] [-b BATCH_SIZE] [--scan] [--memory]
                     [--times] [--version]
                     target_filename

    Search an FPS or FPB file for similar fingerprints

    positional arguments:
      target_filename       target filename

    optional arguments:
      -h, --help            show this help message and exit
      -k K_NEAREST, --k-nearest K_NEAREST
                            select the k nearest neighbors (use 'all' for all
                            neighbors)
      -t THRESHOLD, --threshold THRESHOLD
                            minimum similarity score threshold
      --alpha ALPHA         Tversky alpha parameter (default: 1.0)
      --beta BETA           Tversky beta parameter (default: the value of --alpha)
      --queries QUERIES, -q QUERIES
                            filename containing the query fingerprints
      --NxN                 use the targets as the queries, and exclude the
                            self-similarity term
      --query QUERY         query as a structure record (default format: 'smi')
      --hex-query HEX_QUERY
                            query in hex
      --query-id QUERY_ID   id for the query or hex-query (default: 'Query1')
      --query-format FORMAT, --in FORMAT
                            input query format (default uses the file extension,
                            else 'fps')
      --target-format FORMAT
                            input target format (default uses the file extension,
                            else 'fps')
      -o FILENAME, --output FILENAME
                            output filename (default is stdout)
      -c, --count           report counts
      -b BATCH_SIZE, --batch-size BATCH_SIZE
                            batch size
      --scan                scan the file to find matches (low memory overhead)
      --memory              build and search an in-memory data structure (faster
                            for multiple queries)
      --times               report load and execution times to stderr
      --version             show program's version number and exit
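
The ``--alpha`` and ``--beta`` options listed above are the Tversky parameters, and ``--hex-query`` takes a fingerprint written in hex. To make the role of those two parameters concrete, here is a minimal pure-Python sketch that scores two hex-encoded fingerprints. It is not chemfp's implementation. It assumes the commonly used Tversky form ``c / (alpha*(a-c) + beta*(b-c) + c)``, where ``a`` and ``b`` are the popcounts of the query and target and ``c`` is the popcount of their intersection. The function names ``popcount`` and ``tversky`` are hypothetical, and returning 0.0 for a zero denominator is a choice made for this sketch. As in the help text, ``beta`` defaults to the value of ``alpha``, and with both equal to 1.0 the score is the Tanimoto similarity.

```python
from typing import Optional


def popcount(data: bytes) -> int:
    """Count the number of on-bits in a byte string."""
    return sum(bin(byte).count("1") for byte in data)


def tversky(hex_query: str, hex_target: str,
            alpha: float = 1.0, beta: Optional[float] = None) -> float:
    """Tversky score of two hex-encoded fingerprints (illustrative sketch).

    Assumes the form c / (alpha*(a-c) + beta*(b-c) + c); this is an
    assumption of the sketch, not a statement about chemfp internals.
    """
    if beta is None:
        # Mirror the documented default: beta takes the value of alpha.
        beta = alpha
    query = bytes.fromhex(hex_query)
    target = bytes.fromhex(hex_target)
    if len(query) != len(target):
        raise ValueError("fingerprints must have the same number of bytes")

    a = popcount(query)
    b = popcount(target)
    # Bits set in both fingerprints.
    c = popcount(bytes(x & y for x, y in zip(query, target)))

    denominator = alpha * (a - c) + beta * (b - c) + c
    if denominator == 0:
        # Sketch-level convention for the 0/0 case.
        return 0.0
    return c / denominator


if __name__ == "__main__":
    # alpha = beta = 1.0 gives the Tanimoto similarity.
    print(tversky("01f2", "01f0"))
    # alpha = 0, beta = 1 ignores bits found only in the query.
    print(tversky("01f2", "01f0", alpha=0.0, beta=1.0))
```

Setting ``alpha`` and ``beta`` to different values makes the score asymmetric, which is why the two parameters are exposed as separate options.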