Help for the command-line tools

The chemfp command-line tools are:

  • fpcat - merge multiple fingerprint files into one
  • ob2fps - use Open Babel to generate fingerprints
  • oe2fps - use OEChem/OEGraphSim to generate fingerprints
  • rdkit2fps - use RDKit to generate fingerprints
  • sdf2fps - extract fingerprints from an SD file
  • simsearch - search a fingerprint file for similar fingerprints

fpcat command-line options

The following comes from fpcat --help:

usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
             [--reorder] [--preserve-order] [--alignment N]
             [--show-progress] [--max-spool-size SIZE] [--tmpdir DIRNAME]
             [--version]
             [filename [filename ...]]

Combine multiple fingerprint files into a single file.

positional arguments:
  filename              input fingerprint filenames (default: use stdin)

optional arguments:
  -h, --help            show this help message and exit
  --in FORMAT           input fingerprint format. One of fps, fps.gz, or fpb.
                        (default guesses from filename or is fps)
  --merge               assume the input fingerprint files are in popcount
                        order and do a merge sort
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format. One of fps, fps.gz, or fpb.
                        (default guesses from output filename, or is 'fps')
  --reorder             reorder the output fingerprints by popcount (default
                        for FPB output)
  --preserve-order      save the output fingerprints in the same order as the
                        input (default for FPS output)
  --alignment N         alignment size when saving a FPB file (default=8)
  --show-progress       show progress
  --max-spool-size SIZE
                        use temporary files for extra storage space for huge
                        FPB files (default uses RAM)
  --tmpdir DIRNAME      directory for the temporary files (default uses the
                        system temp directory)
  --version             show program's version number and exit

Examples:

fpcat can be used to convert between FPS and FPB formats. This is
handy if you want to see what's inside of an FPB file:

    fpcat fingerprints.fpb

You can also use fpcat to make an FPB file from an FPS file:

    fpcat fingerprints.fps -o fingerprints.fpb

You might have generated a set of FPS files which you want to merge
into a single FPB file. For example, you might have used GNU parallel
to generate an FPS file for each of the PubChem files:

    fpcat Compound_*.fps -o pubchem.fpb

By default the FPB format sorts the fingerprints by popcount. (Use
--preserve-order if you really want to preserve the input order.)  The
sort overhead for PubChem uses about 10 GB of RAM. If you don't have
that much memory then ask fpcat to use less memory:

    fpcat --max-spool-size 1GB Compound_*.fps -o pubchem.fpb

This will use about 2 GB of RAM and the --tmpdir for the rest. (Yes,
it would be nice if I could get those two memory size numbers to
match.)

The --merge option is experimental. Use it if every input file is
already in popcount order, because the sorted output is then a simple
merge of the individual sorted inputs. However, this option opens all
input files at the same time, which may exceed your resource limit on
file descriptors. The current implementation also requires a lot of
disk seeks, so it is slow for many files.
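
The merge step described above can be sketched in a few lines of
Python. This is only an illustration of merging popcount-sorted
streams, not fpcat's implementation; the record layout (id,
fingerprint bytes) and the helper names are assumptions made for the
example.

```python
import heapq


def popcount(fp):
    """Count the bits set in a fingerprint stored as a byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def merge_by_popcount(*sorted_inputs):
    """Lazily merge (id, fingerprint) streams that are each in popcount order."""
    return heapq.merge(*sorted_inputs, key=lambda record: popcount(record[1]))


# Two hypothetical inputs, each already sorted by popcount.
file_a = [("a1", bytes.fromhex("01")), ("a2", bytes.fromhex("0f"))]
file_b = [("b1", bytes.fromhex("03")), ("b2", bytes.fromhex("ff"))]

merged = list(merge_by_popcount(file_a, file_b))
print([record_id for record_id, _ in merged])
```

Because the merge only ever looks at the next record of each input, it
needs every input open at once, which is why the number of open files
becomes the limiting resource.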

ob2fps command-line options

The following comes from ob2fps --help:

usage: ob2fps [-h]
              [--FP2 | --FP3 | --FP4 | --MACCS | --substruct | --rdmaccs | --rdmaccs/1]
              [--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
              [--errors {strict,report,ignore}] [-R NAME=VALUE]
              [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
              [--version]
              [filenames [filenames ...]]

Generate FPS or FPB fingerprints from a structure file using OpenBabel

positional arguments:
  filenames             input structure files (default is stdin)

optional arguments:
  -h, --help            show this help message and exit
  --FP2                 linear fragments up to 7 atoms
  --FP3                 SMARTS patterns specified in the file patterns.txt
  --FP4                 SMARTS patterns specified in the file
                        SMARTS_InteLigand.txt
  --MACCS               Open Babel's implementation of the MACCS 166 keys
  --substruct           ChemFP substructure fingerprints
  --rdmaccs, --rdmaccs/2
                        166 bit RDKit/MACCS fingerprints (version 2)
  --rdmaccs/1           use the version 1 definition for --rdmaccs
  --id-tag NAME         tag name containing the record id (SD files only)
  --in FORMAT           input structure format (default autodetects from the
                        filename extension)
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format (default guesses from output
                        filename, or is 'fps')
  --errors {strict,report,ignore}
                        how should structure parse errors be handled?
                        (default=ignore)
  -R NAME=VALUE         specify a reader argument
  --delimiter {tab,whitespace,to-eol,space}
                        delimiter style for SMILES and InChI files. Alias for
                        '-R delimiter=VALUE'.
  --has-header          Skip the first line of a SMILES or InChI file. Alias
                        for '-R has_header=1'
  --version             show program's version number and exit

By default the Open Babel structure reader determines the file format
and compression type based on the filename extension. Unknown
filename extensions are treated as uncompressed SMILES files.

If the data comes from stdin, or the guess based on extension name is
wrong, then use the "--in FORMAT" option to change the default input
format. For example:

   --in smi
   --in sdf.gz

The most common format names are:

  File Type      Valid FORMAT names
  ---------      ------------------
   SMILES        smi, can, usm  - append ".gz" for gzip'ed files
   InChI         inchi          - append ".gz" for gzip'ed files
   SDF (native)  sdf            - gzip compression is handled automatically
   SDF (chemfp)  sdf            - append ".gz"  suffix for gzip'ed files
   MOL2          mol2           - gzip compression is handled automatically
   PDB           pdb            - gzip compression is handled automatically
   MacroModel    mmod           - gzip compression is handled automatically

For a full list of formats, see http://openbabel.org/wiki/List_of_extensions .

Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

chemfp uses its own parsers to find SMILES and InChI records, which are
passed on to Open Babel for processing. These give chemfp better error
reporting and control. However, unlike the normal Open Babel parsers, they
do not automatically recognize gzip files, so the format name must include
the ".gz" suffix to read compressed formats.

By default chemfp uses Open Babel's native SDF reader. It also supports
an alternate implementation using chemfp's low-level SDF record parser.
To use chemfp's record parser, use the 'implementation' reader argument:

   -R implementation=chemfp

All formats support Open Babel's 'options' OBConversion argument. This is a
compact string like 'ab"btext"', which in this case sets option 'a' to
True, and option 'b' to text "btext".

You will need to consult the Open Babel documentation and implementation
for details on the options available to each format.

oe2fps command-line options

The following comes from oe2fps --help:

usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT]
              [--minbonds INT] [--maxbonds INT] [--minradius INT]
              [--maxradius INT] [--atype ATYPE] [--btype BTYPE] [--maccs166]
              [--substruct] [--rdmaccs] [--rdmaccs/1] [--aromaticity NAME]
              [--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
              [--errors {strict,report,ignore}] [-R NAME=VALUE]
              [--delimiter {tab,whitespace,to-eol,space}] [--version]
              [filenames [filenames ...]]

Generate FPS or FPB fingerprints from a structure file using OEChem

positional arguments:
  filenames             input structure files (default is stdin)

optional arguments:
  -h, --help            show this help message and exit
  --aromaticity NAME    use the named aromaticity model (same as '-R
                        aromaticity=NAME')
  --id-tag NAME         tag name containing the record id (SD files only)
  --in FORMAT           input structure format (default guesses from filename)
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format (default guesses from output
                        filename, or is 'fps')
  --errors {strict,report,ignore}
                        how should structure parse errors be handled?
                        (default=ignore)
  -R NAME=VALUE         specify a reader argument
  --delimiter {tab,whitespace,to-eol,space}
                        delimiter style for SMILES and InChI files. Alias for
                        '-R delimiter=VALUE'.
  --version             show program's version number and exit

path, circular, and tree fingerprints:
  --path                generate path fingerprints (default)
  --circular            generate circular fingerprints
  --tree                generate tree fingerprints
  --numbits INT         number of bits in the fingerprint (default=4096)
  --minbonds INT        minimum number of bonds in the path or tree
                        fingerprint (default=0)
  --maxbonds INT        maximum number of bonds in the path or tree
                        fingerprint (path default=5, tree default=4)
  --minradius INT       minimum radius for the circular fingerprint
                        (default=0)
  --maxradius INT       maximum radius for the circular fingerprint
                        (default=5)
  --atype ATYPE         atom type flags, described below (default=Default)
  --btype BTYPE         bond type flags, described below (default=Default)

166 bit MACCS substructure keys:
  --maccs166            generate MACCS fingerprints

881 bit ChemFP substructure keys:
  --substruct           generate ChemFP substructure fingerprints

ChemFP version of the 166 bit RDKit/MACCS keys:
  --rdmaccs, --rdmaccs/2
                        generate 166 bit RDKit/MACCS fingerprints (version 2)
  --rdmaccs/1           use the version 1 definition for --rdmaccs

ATYPE is one or more of the following, separated by the '|' character

  Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
  Hyb InRing

The following shorthand terms and expansions are also available:
 DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
 DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
 DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb
and 'Default' selects the correct value for the specified fingerprint.

Examples:
  --atype Default
  --atype "Arom|AtmNum|FCharge|HCount"
  --atype Arom,AtmNum,FCharge,HCount

BTYPE is one or more of the following, separated by the '|' character

  Chiral InRing Order

The following shorthand terms and expansions are also available:
 DefaultPathBond = Order|Chiral
 DefaultCircularBond = Order
 DefaultTreeBond = Order
and 'Default' selects the correct value for the specified fingerprint.

Examples:
   --btype Default
   --btype "Order|InRing"

To simplify command-line use, a comma may be used instead of a '|' to
separate different fields. Example:
  --atype AtmNum,HvyDeg
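
As an illustration of the '|' and ',' separators described above, here
is a minimal Python sketch that splits an ATYPE string and checks each
term against the names listed in this help text. The function name and
the error handling are assumptions for the example; this is not how
oe2fps itself parses the option.

```python
# Atom type names and shorthand terms copied from the help text above.
ATYPE_NAMES = {
    "Arom", "AtmNum", "Chiral", "EqArom", "EqHBAcc", "EqHBDon",
    "EqHalo", "FCharge", "HCount", "HvyDeg", "Hyb", "InRing",
}
ATYPE_SHORTHAND = {
    "DefaultPathAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo",
    "DefaultCircularAtom": "AtmNum|Arom|Chiral|FCharge|HCount|EqHalo",
    "DefaultTreeAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb",
}


def split_flags(text, valid_names, shorthand=None):
    """Split a flag string on '|' or ',' and expand any shorthand terms."""
    shorthand = shorthand or {}
    names = []
    for term in text.replace(",", "|").split("|"):
        if term in shorthand:
            names.extend(shorthand[term].split("|"))
        elif term in valid_names:
            names.append(term)
        else:
            raise ValueError("unknown flag name: %r" % (term,))
    return names


print(split_flags("Arom|AtmNum|FCharge|HCount", ATYPE_NAMES))
print(split_flags("Arom,AtmNum,FCharge,HCount", ATYPE_NAMES))
print(split_flags("DefaultTreeAtom", ATYPE_NAMES, ATYPE_SHORTHAND))
```

On a shell command line the '|' form must be quoted, which is the main
reason the comma form exists.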

OEChem guesses the input structure format based on the filename
extension and assumes SMILES for structures read from stdin.
Use "--in FORMAT" to select an alternative, where FORMAT is one of:

  File Type      Valid FORMATs (use gz if compressed)
  ---------      ------------------------------------
   SMILES        smi, can, usm, smi.gz, can.gz, usm.gz
   SDF           sdf, mol, sdf.gz, mol.gz
   SKC           skc, skc.gz
   CDK           cdk, cdk.gz
   MOL2          mol2, mol2.gz
   PDB           pdb, pdb.gz
   MacroModel    mmod, mmod.gz
   OEBinary v2   oeb, oeb.gz
   InChI         inchi, inchi.gz

Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format. All formats
handle the following two reader arguments:

  aromaticity - one of 'openeye', 'daylight', 'tripos', 'mdl', or 'mmff'
      (this can also be set via the older '--aromaticity' command-line option)

  flavor - a '|' or ',' separated list of flavor names, or a numeric value.
       A leading '-' means to remove the given flavor. Examples include:

       o  Canon,Strict  -- the bitwise merger of the format's Canon and Strict values
       o  DEFAULT|-Kekule -- the format's DEFAULT flavor but without the Kekule bits
                      (every flavor has a DEFAULT)
       o  42  -- the specific OEChem flavor value 42

  Format    Reader arguments
  ------    ----------------
    smi,    flavor using 'Canon', 'Strict', and 'DEFAULT'
    can,    delimiter -- one of 'to-eol', 'tab', 'whitespace', or 'space'
  & usm

    sdf     the only flavor is 'DEFAULT'
    skc     the only flavor is 'DEFAULT'
    mol2    flavor using 'M2H'
   mol2h    flavor using 'M2H'
    mmod    flavor using 'FormalCrg'
    pdb     flavor using 'ALL', 'BondOrder', 'CHARGE', 'Connect', 'DATA',
                'END', 'ENDM', 'FORMALCHARGE', 'FormalCrg', 'ImplicitH',
                'RADIUS', 'Rings', 'SecStruct', and 'TER'
    xyz     flavor using 'BondOrder', 'Connect', 'FormalCrg', 'ImplicitH',
                and 'Rings'
    cdx     flavor using 'SuperAtom'
    oeb     the only flavor is 'DEFAULT'

See http://docs.eyesopen.com/toolkits/cpp/oechemtk/molreadwrite.html#flavored-input-and-output
for a description of available flavors for each format.
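
The flavor syntax described above (names merged bitwise, a leading '-'
to remove bits, or a plain number) can be sketched as follows. The bit
values in the table are made up for the example and are not OEChem's
real flavor constants, and treating a numeric term as bits to merge is
an assumption.

```python
# Hypothetical bit values, for illustration only.
HYPOTHETICAL_FLAVOR_BITS = {
    "Canon": 0b001,
    "Strict": 0b010,
    "Kekule": 0b100,
    "DEFAULT": 0b101,
}


def combine_flavors(text, flavor_bits):
    """Combine a '|' or ',' separated flavor string into one integer."""
    value = 0
    for term in text.replace(",", "|").split("|"):
        if term.startswith("-"):
            # A leading '-' clears the bits of the named flavor.
            value &= ~flavor_bits[term[1:]]
        elif term.isdigit():
            # A numeric term is used as a raw flavor value.
            value |= int(term)
        else:
            value |= flavor_bits[term]
    return value


print(combine_flavors("Canon,Strict", HYPOTHETICAL_FLAVOR_BITS))
print(combine_flavors("DEFAULT|-Kekule", HYPOTHETICAL_FLAVOR_BITS))
print(combine_flavors("42", HYPOTHETICAL_FLAVOR_BITS))
```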

rdkit2fps command-line options

The following comes from rdkit2fps --help:

usage: rdkit2fps [-h] [--fpSize FPSIZE] [--RDK] [--minPath INT]
                 [--maxPath INT] [--nBitsPerHash INT] [--useHs 0|1] [--morgan]
                 [--radius INT] [--useFeatures 0|1] [--useChirality 0|1]
                 [--useBondTypes 0|1] [--torsions] [--targetSize INT]
                 [--pairs] [--minLength INT] [--maxLength INT] [--maccs166]
                 [--avalon] [--isQuery 0_or_1] [--bitFlags INT] [--pattern]
                 [--substruct] [--rdmaccs] [--rdmaccs/1] [--id-tag NAME]
                 [--in FORMAT] [-o FILENAME] [--out FORMAT]
                 [--errors {strict,report,ignore}] [-R NAME=VALUE]
                 [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                 [--version]
                 [filenames [filenames ...]]

Generate FPS or FPB fingerprints from a structure file using RDKit

positional arguments:
  filenames             input structure files (default is stdin)

optional arguments:
  -h, --help            show this help message and exit
  --fpSize FPSIZE       number of bits in the fingerprint. Default of 2048 for
                        RDK, Morgan, topological torsion, atom pair, and
                        pattern fingerprints, and 512 for Avalon fingerprints
  --id-tag NAME         tag name containing the record id (SD files only)
  --in FORMAT           input structure format (default guesses from filename)
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format (default guesses from output
                        filename, or is 'fps')
  --errors {strict,report,ignore}
                        how should structure parse errors be handled?
                        (default=ignore)
  -R NAME=VALUE         specify a reader argument
  --delimiter {tab,whitespace,to-eol,space}
                        delimiter style for SMILES and InChI files. Alias for
                        '-R delimiter=VALUE'.
  --has-header          Skip the first line of a SMILES or InChI file. Alias
                        for '-R has_header=1'
  --version             show program's version number and exit

RDKit topological fingerprints:
  --RDK                 generate RDK fingerprints (default)
  --minPath INT         minimum number of bonds to include in the subgraph
                        (default=1)
  --maxPath INT         maximum number of bonds to include in the subgraph
                        (default=7)
  --nBitsPerHash INT    number of bits to set per path (default=2)
  --useHs 0|1           include information about the number of hydrogens on
                        each atom (default=1)

RDKit Morgan fingerprints:
  --morgan              generate Morgan fingerprints
  --radius INT          radius for the Morgan algorithm (default=2)
  --useFeatures 0|1     use chemical-feature invariants (default=0)
  --useChirality 0|1    include chirality information (default=0)
  --useBondTypes 0|1    include bond type information (default=1)

RDKit Topological Torsion fingerprints:
  --torsions            generate Topological Torsion fingerprints
  --targetSize INT      number of bonds per torsion (default=4)

RDKit Atom Pair fingerprints:
  --pairs               generate Atom Pair fingerprints
  --minLength INT       minimum bond count for a pair (default=1)
  --maxLength INT       maximum bond count for a pair (default=30)

166 bit MACCS substructure keys:
  --maccs166            generate MACCS fingerprints

Avalon fingerprints:
  --avalon              generate Avalon fingerprints
  --isQuery 0_or_1      is the fingerprint for a query structure? (1 if yes, 0
                        if no) (default=0)
  --bitFlags INT        bit flags, SSSBits are 32767 and similarity bits are
                        15761407 (default=15761407)

RDKit Pattern fingerprints:
  --pattern             generate (substructure) pattern fingerprints

ChemFP's version of the 881 bit PubChem substructure keys:
  --substruct           generate ChemFP substructure fingerprints

ChemFP version of the 166 bit RDKit/MACCS keys:
  --rdmaccs, --rdmaccs/2
                        generate 166 bit RDKit/MACCS fingerprints (version 2)
  --rdmaccs/1           use the version 1 definition for --rdmaccs

This program guesses the input structure format and the compression
based on the filename extension. If the guess fails then it assumes
the input is an uncompressed SMILES file.

If the data comes from stdin, or the guess based on extension name is
wrong, then use "--in" to change the default input format. The
supported format extensions are:

  File Type      Valid FORMATs (use gz if compressed)
  ---------      ------------------------------------
   SMILES        smi, can, usm, smi.gz, can.gz, usm.gz
   SDF           sdf, sdf.gz
   InChI         inchi, inchi.gz

Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.

Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format.

  * All of the input formats implement the 'sanitize' option. Use
    "-R sanitize=false" to disable the default sanitization.

  * The SMILES formats use the 'delimiter' option to specify the
    delimiter type. The default is 'to-eol'. The other values are
    "tab", "whitespace", and "space". Use "-R delimiter=whitespace"
    to match RDKit's native delimiter style.

  * The SDF format supports two additional reader arguments:
     * 'strictParsing'; use "-R strictParsing=false" to disable strict parsing
     * 'removeHs'; use "-R removeHs=false" to keep all of the hydrogens

  * The InChI format supports four additional reader arguments:
     * 'delimiter' works the same as it does for the SMILES formats
     * 'removeHs' works the same as it does for the SDF format
     * 'treatWarningAsError'; use "-R treatWarningAsError=true" to convert all warnings into errors
     * 'logLevel' specifies the RDKit/InChI library log level, as an integer

sdf2fps command-line options

The following comes from sdf2fps --help:

usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
               [--num-bits INT] [--errors {strict,report,ignore}]
               [-o FILENAME] [--out FORMAT] [--software TEXT] [--type TEXT]
               [--version] [--binary] [--binary-msb] [--hex] [--hex-lsb]
               [--hex-msb] [--base64] [--cactvs] [--daylight]
               [--decoder DECODER] [--pubchem]
               [filenames [filenames ...]]

Extract a fingerprint tag from an SD file and generate FPS or FPB fingerprints

positional arguments:
  filenames             input SD files (default is stdin)

optional arguments:
  -h, --help            show this help message and exit
  --id-tag TAG          get the record id from TAG instead of the first line
                        of the record
  --fp-tag TAG          get the fingerprint from tag TAG (required)
  --in FORMAT           Specify if the input SD file is uncompressed or gzip
                        compressed
  --num-bits INT        use the first INT bits of the input. Use only when the
                        last 1-7 bits of the last byte are not part of the
                        fingerprint. Unexpected errors will occur if these
                        bits are not all zero.
  --errors {strict,report,ignore}
                        how should structure parse errors be handled?
                        (default=strict)
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format (default guesses from output
                        filename, or is 'fps')
  --software TEXT       use TEXT as the software description
  --type TEXT           use TEXT as the fingerprint type description
  --version             show program's version number and exit

Fingerprint decoding options:
  --binary              Encoded with the characters '0' and '1'. Bit #0 comes
                        first. Example: 00100000 encodes the value 4
  --binary-msb          Encoded with the characters '0' and '1'. Bit #0 comes
                        last. Example: 00000100 encodes the value 4
  --hex                 Hex encoded. Bit #0 is the first bit (1<<0) of the
                        first byte. Example: 01f2 encodes the value \x01\xf2 =
                        498
  --hex-lsb             Hex encoded. Bit #0 is the eighth bit (1<<7) of the
                        first byte. Example: 804f encodes the value \x01\xf2 =
                        498
  --hex-msb             Hex encoded. Bit #0 is the first bit (1<<0) of the
                        last byte. Example: f201 encodes the value \x01\xf2 =
                        498
  --base64              Base-64 encoded. Bit #0 is first bit (1<<0) of first
                        byte. Example: AfI= encodes value \x01\xf2 = 498
  --cactvs              CACTVS encoding, based on base64 and includes a
                        version and bit length
  --daylight            Daylight encoding, which is a base64 variant
  --decoder DECODER     import and use the DECODER function to decode the
                        fingerprint

shortcuts:
  --pubchem             decode CACTVS substructure keys used in PubChem. Same
                        as --software=CACTVS/unknown --type 'CACTVS-E_SCREEN/1.0 extended=2'
                        --fp-tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs
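
To make the bit-order conventions of the simpler decoding options
concrete, here is a small Python sketch that reproduces the examples
given in the option descriptions above. The function names are chosen
for this example and are not chemfp's decoder API; the CACTVS and
Daylight encodings are not covered.

```python
import base64


def reverse_bits(byte):
    """Reverse the bit order within one byte (0x80 becomes 0x01)."""
    return int(f"{byte:08b}"[::-1], 2)


def from_binary(text):
    """--binary: '0' and '1' characters, bit #0 first."""
    # Each 8-character chunk spells one byte from its lowest bit to its highest.
    return bytes(int(text[i:i + 8][::-1], 2) for i in range(0, len(text), 8))


def from_binary_msb(text):
    """--binary-msb: '0' and '1' characters, bit #0 last."""
    return from_binary(text[::-1])


def from_hex(text):
    """--hex: bit #0 is the first bit (1<<0) of the first byte."""
    return bytes.fromhex(text)


def from_hex_lsb(text):
    """--hex-lsb: bit #0 is the eighth bit (1<<7) of the first byte."""
    return bytes(reverse_bits(byte) for byte in bytes.fromhex(text))


def from_hex_msb(text):
    """--hex-msb: bit #0 is the first bit (1<<0) of the last byte."""
    return bytes.fromhex(text)[::-1]


def from_base64(text):
    """--base64: standard base64, same bit order as --hex."""
    return base64.b64decode(text)


print(from_binary("00100000"), from_binary_msb("00000100"))
for decoded in (from_hex("01f2"), from_hex_lsb("804f"),
                from_hex_msb("f201"), from_base64("AfI=")):
    print(decoded, int.from_bytes(decoded, "big"))
```

All four of the two-byte examples decode to the same byte string, which
is the point of the table: only the text encoding differs, not the
fingerprint.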

simsearch command-line options

The following comes from simsearch --help:

usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--alpha ALPHA]
                 [--beta BETA] [--queries QUERIES] [--NxN] [--query QUERY]
                 [--hex-query HEX_QUERY] [--query-id QUERY_ID]
                 [--query-format FORMAT] [--target-format FORMAT]
                 [-o FILENAME] [-c] [-b BATCH_SIZE] [--scan] [--memory]
                 [--times] [--version]
                 target_filename

Search an FPS or FPB file for similar fingerprints

positional arguments:
  target_filename       target filename

optional arguments:
  -h, --help            show this help message and exit
  -k K_NEAREST, --k-nearest K_NEAREST
                        select the k nearest neighbors (use 'all' for all
                        neighbors)
  -t THRESHOLD, --threshold THRESHOLD
                        minimum similarity score threshold
  --alpha ALPHA         Tversky alpha parameter (default: 1.0)
  --beta BETA           Tversky beta parameter (default: the value of --alpha)
  --queries QUERIES, -q QUERIES
                        filename containing the query fingerprints
  --NxN                 use the targets as the queries, and exclude the self-
                        similarity term
  --query QUERY         query as a structure record (default format: 'smi')
  --hex-query HEX_QUERY
                        query in hex
  --query-id QUERY_ID   id for the query or hex-query (default: 'Query1')
  --query-format FORMAT, --in FORMAT
                        input query format (default uses the file extension,
                        else 'fps')
  --target-format FORMAT
                        input target format (default uses the file extension,
                        else 'fps')
  -o FILENAME, --output FILENAME
                        output filename (default is stdout)
  -c, --count           report counts
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size
  --scan                scan the file to find matches (low memory overhead)
  --memory              build and search an in-memory data structure (faster
                        for multiple queries)
  --times               report load and execution times to stderr
  --version             show program's version number and exit
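
The --alpha, --beta, -k, and -t options fit together as shown in the
following Python sketch of a Tversky search over byte-string
fingerprints. It uses the usual Tversky index, which reduces to
Tanimoto when alpha and beta are both 1.0. The function names, the
choice of which fingerprint alpha applies to, and the score of 0.0 for
two empty fingerprints are assumptions for the example rather than a
description of simsearch's internals.

```python
import heapq


def popcount(fp):
    """Count the bits set in a fingerprint stored as a byte string."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query, target, alpha=1.0, beta=None):
    """Tversky index; beta defaults to alpha, as with --beta."""
    if beta is None:
        beta = alpha
    both = sum(bin(q & t).count("1") for q, t in zip(query, target))
    only_query = popcount(query) - both
    only_target = popcount(target) - both
    denominator = alpha * only_query + beta * only_target + both
    # Assumption: define the score as 0.0 when no bits are set at all.
    return both / denominator if denominator else 0.0


def k_nearest(query, targets, k, threshold, alpha=1.0, beta=None):
    """Return up to k (score, id) hits with score >= threshold, best first."""
    hits = ((tversky(query, fp, alpha, beta), target_id)
            for target_id, fp in targets)
    return heapq.nlargest(k, (hit for hit in hits if hit[0] >= threshold))


# Hypothetical one-byte fingerprints.
query_fp = bytes.fromhex("0f")
target_fps = [("t1", bytes.fromhex("0f")),
              ("t2", bytes.fromhex("3c")),
              ("t3", bytes.fromhex("07"))]

print(k_nearest(query_fp, target_fps, k=2, threshold=0.5))
```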