rdkit2fps command-line options

The following comes from rdkit2fps --help:

usage: rdkit2fps [-h] [--fpSize INT] [--radius INT] [--nBitsPerEntry INT]
                 [--includeChirality 0|1] [--from-atoms INT,INT,...] [--RDK]
                 [--minPath INT] [--maxPath INT] [--nBitsPerHash INT]
                 [--useHs 0|1] [--branchedPaths 0|1] [--useBondOrder 0|1]
                 [--morgan] [--useFeatures 0|1] [--useChirality 0|1]
                 [--useBondTypes 0|1] [--includeRedundantEnvironments 0|1]
                 [--torsions] [--targetSize INT] [--pairs] [--minLength INT]
                 [--maxLength INT] [--use2D 0|1] [--maccs166] [--avalon]
                 [--isQuery 0_or_1] [--bitFlags INT] [--secfp] [--rings 0|1]
                 [--isomeric 0|1] [--kekulize 0|1] [--min_radius INT]
                 [--pattern] [--substruct] [--rdmaccs] [--rdmaccs/1]
                 [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME]
                 [--in FORMAT] [-o FILENAME] [--out FORMAT]
                 [--errors {strict,report,ignore}]
                 [--progress | --no-progress] [--help-formats] [-R NAME=VALUE]
                 [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                 [--version]
                 [filenames ...]

Generate FPS or FPB fingerprints from a structure file using RDKit

positional arguments:
  filenames             input structure files (default is stdin)

options:
  -h, --help            show this help message and exit
  --id-tag NAME         tag name containing the record id (SD files only)
  --type TYPE_STRING    Specify a chemfp type string
  --using FILENAME      Get the fingerprint type from the metadata of a
                        fingerprint file
  --in FORMAT           input structure format (default guesses from filename)
  -o FILENAME, --output FILENAME
                        save the fingerprints to FILENAME (default=stdout)
  --out FORMAT          output fingerprint format (default guesses from output
                        filename, or is 'fps')
  --errors {strict,report,ignore}
                        how should structure parse errors be handled?
                        (default=ignore)
  --progress, --no-progress
                        Show a progress bar (default: show unless the output
                        is a terminal)
  --help-formats        list the available formats and reader arguments
  -R NAME=VALUE         specify a reader argument
  --delimiter {tab,whitespace,to-eol,space}
                        delimiter style for SMILES and InChI files. Alias for
                        '-R delimiter=VALUE'.
  --has-header          Skip the first line of a SMILES or InChI file. Alias
                        for '-R has_header=1'.
  --version             show program's version number and exit

Common Parameters (used by more than one fingerprint type):
  --fpSize INT          number of bits in the fingerprint. Default of 2048 for
                        RDK, Morgan, topological torsion, atom pair, pattern
                        and SECFP fingerprints, and 512 for Avalon
                        fingerprints
  --radius INT          radius for the Morgan or SECFP fingerprints. Default
                        of 2 for Morgan, 3 for SECFP
  --nBitsPerEntry INT   number of bits per entry
  --includeChirality 0|1
                        include chirality information in the atom invariants
  --from-atoms INT,INT,...
                        fingerprint generation must use these atom indices
                        (out of range indices are ignored)

RDKit topological fingerprints:
  Branched or linear hash fingerprint.
  Uses --fpSize and --from-atoms plus:

  --RDK                 generate RDK fingerprints (default)
  --minPath INT         minimum number of bonds to include in the subgraph
                        (default=1)
  --maxPath INT         maximum number of bonds to include in the subgraph
                        (default=7)
  --nBitsPerHash INT    number of bits to set per path (default=2)
  --useHs 0|1           include information about the number of hydrogens on
                        each atom (default=1)
  --branchedPaths 0|1   if set, both branched and unbranched paths will be
                        used in the fingerprint (default=1)
  --useBondOrder 0|1    if set, bond orders will be used in the path hashes
                        (default=1)

RDKit Morgan fingerprints:
  Circular fingerprints similar to ECFP or FCFP fingerprints.
  Uses --fpSize, --radius, and --from-atoms plus:

  --morgan              generate Morgan fingerprints
  --useFeatures 0|1     use chemical-feature invariants (default=0)
  --useChirality 0|1    include chirality information (default=0)
  --useBondTypes 0|1    include bond type information (default=1)
  --includeRedundantEnvironments 0|1
                        if set, the check for redundant atom environments will
                        not be done (default=0)

RDKit Topological Torsion fingerprints:
  See Nilakantan et al., JCICS 27, 82-85 (1987).
  Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:

  --torsions            generate Topological Torsion fingerprints
  --targetSize INT      number of bonds per torsion (default=4)

RDKit Atom Pair fingerprints:
  See Carhart et al., JCICS 25, 64-73 (1985).
  Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:

  --pairs               generate Atom Pair fingerprints
  --minLength INT       minimum bond count for a pair (default=1)
  --maxLength INT       maximum bond count for a pair (default=30)
  --use2D 0|1           use 2D instead of 3D distance matrix (default=1)

166 bit MACCS substructure keys:
  --maccs166            generate MACCS fingerprints

Avalon fingerprints:
  Fingerprints from the Avalon toolkit.
  Uses --fpSize plus:

  --avalon              generate Avalon fingerprints
  --isQuery 0_or_1      is the fingerprint for a query structure? (1 if yes, 0
                        if no) (default=0)
  --bitFlags INT        bit flags, SSSBits are 32767 and similarity bits are
                        15761407 (default=15761407)

SECFP fingerprints:
  A circular fingerprint based on fragment SMILES instead of hashing.
  Uses --fpSize and --radius plus:

  --secfp               generate SECFP fingerprints
  --rings 0|1           if 1, add SSSR rings to the fingerprint (default=1)
  --isomeric 0|1        if 1, use isomeric SMILES instead of non-isomeric
                        SMILES (default=0)
  --kekulize 0|1        if 1, use Kekule SMILES instead of aromatic SMILES
                        (default=0)
  --min_radius INT      minimum radius used to extract n-grams (default=1)

RDKit Pattern fingerprints:
  Fingerprints for substructure search screening.

  --pattern             generate (substructure) pattern fingerprints

chemfp's version of the 881 bit PubChem substructure keys:
  --substruct           generate ChemFP substructure fingerprints

chemfp's version of the 166 bit RDKit/MACCS keys:
  --rdmaccs, --rdmaccs/2
                        generate 166 bit RDKit/MACCS fingerprints (version 2)
  --rdmaccs/1           use the version 1 definition for --rdmaccs

This program guesses the input structure format and the compression
based on the filename extension. If the guess fails then it assumes
the input is an uncompressed SMILES file.

If the data comes from stdin, or the guess based on extension name is
wrong, then use "--in" to change the default input format.

Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format.

Use the command-line option `--help-formats` to display a list of
available formats and reader arguments.
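
As a rough illustration of how the options above combine, the following Python sketch assembles an rdkit2fps command line for Morgan fingerprints. It only builds and prints the argument list and does not run the program. The helper name `build_rdkit2fps_command` and the filenames are hypothetical; the flags themselves are taken from the help text above.

```python
import shlex


def build_rdkit2fps_command(input_path, output_path, *, radius=2,
                            fp_size=2048, input_format=None,
                            errors="ignore"):
    """Return an argv list for a Morgan-fingerprint run of rdkit2fps.

    Hypothetical helper: it only assembles flags listed in the help text.
    """
    argv = [
        "rdkit2fps", "--morgan",
        "--radius", str(radius),
        "--fpSize", str(fp_size),
        "--errors", errors,
    ]
    if input_format is not None:
        # Useful when the extension is missing or misleading (e.g. stdin).
        argv += ["--in", input_format]
    argv += ["-o", output_path, input_path]
    return argv


# Hypothetical filenames, used only for illustration.
argv = build_rdkit2fps_command("compounds.sdf.gz", "compounds.fps", radius=3)
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` on a system where rdkit2fps is installed.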

Supported rdkit2fps formats

The following comes from rdkit2fps --help-formats:

These are the structure file formats that chemfp can read when using
the RDKit toolkit.

By default, chemfp uses the filename extension to determine the format
type. If the filename ends with ".gz" or ".zst" then it is interpreted
as a gzip or Zstandard compressed file, and the second-to-last
extension is used to determine the format type. Unknown or unsupported
extensions are interpreted as a SMILES file.

You may instead specify the file format by name (see below), which is
especially important when reading from stdin, which has no associated
filename extension.

The supported filename extensions are:

   File Type    Extension(s)
   ==========   =============
     SMILES     can, ism, isosmi, smi, usm
      SDF       mdl, sd, sdf
     InChI      inchi
  Tripos Mol2   mol2
      PDB       ent, pdb
    Maestro     mae, maegz
     FASTA      faa, fasta
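
The extension rules described above can be sketched in Python as follows. This is not chemfp's implementation, only a minimal model of the documented behavior: strip a trailing ".gz" or ".zst", look up the remaining extension in the table, and fall back to SMILES. The function name `guess_format` is hypothetical, and the sketch assumes exact (case-sensitive) extension matches, which the text above does not specify.

```python
import os

# Extension table as listed above, mapped to the "File Type" column.
EXTENSION_TO_FILE_TYPE = {
    "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
    "smi": "SMILES", "usm": "SMILES",
    "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
    "inchi": "InChI",
    "mol2": "Tripos Mol2",
    "ent": "PDB", "pdb": "PDB",
    # "maegz" is listed as a Maestro extension; this sketch does not
    # model any compression implied by that extension.
    "mae": "Maestro", "maegz": "Maestro",
    "faa": "FASTA", "fasta": "FASTA",
}

COMPRESSION_SUFFIXES = {"gz": "gzip", "zst": "zstd"}


def guess_format(filename):
    """Return (file_type, compression) guessed from the filename."""
    extensions = os.path.basename(filename).split(".")[1:]
    compression = None
    if extensions and extensions[-1] in COMPRESSION_SUFFIXES:
        # Compressed file: the second-to-last extension decides the format.
        compression = COMPRESSION_SUFFIXES[extensions.pop()]
    extension = extensions[-1] if extensions else ""
    # Unknown or unsupported extensions are treated as SMILES.
    return EXTENSION_TO_FILE_TYPE.get(extension, "SMILES"), compression


for name in ("compounds.sdf.gz", "ligands.mol2", "data.unknown.zst"):
    print(name, guess_format(name))
```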

The format can also be specified by name using the '--in' option:

   File Type    Format name (append .gz or .zst if compressed)
   ==========   ==============================================
     SMILES     smi, can, usm
      SDF       sdf
     InChI      inchi
  Tripos Mol2   mol2
      PDB       pdb
    Maestro     mae
     FASTA      fasta

The input format parsers can be configured with the "-R" option. For
example, the following reader arguments tell the SMILES readers that
the fields are whitespace delimited and the first line is a header.

   -R delimiter=whitespace -R has_header=true
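
When the command line is generated from a script, the reader arguments can be kept in a dictionary and expanded into repeated "-R NAME=VALUE" options. The sketch below is one way to do that; the helper name `reader_args_to_flags` is hypothetical, and it assumes that booleans should be spelled "true" and "false", as in the examples in this help text.

```python
def reader_args_to_flags(reader_args):
    """Expand a dict of reader arguments into repeated -R NAME=VALUE flags."""
    flags = []
    for name, value in reader_args.items():
        if isinstance(value, bool):
            # Assumption: use the "true"/"false" spelling shown in the help.
            value = "true" if value else "false"
        flags += ["-R", f"{name}={value}"]
    return flags


flags = reader_args_to_flags({"delimiter": "whitespace", "has_header": True})
print(" ".join(flags))
```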

All of the input formats implement the 'sanitize' option, which is
enabled by default. Use "-R sanitize=false" to disable sanitization.

The SMILES format parsers use two additional reader arguments:
   * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
     The other values are 'tab', 'whitespace', 'space' and 'native'.
     Use "-R delimiter=native" to match RDKit's native delimiter
     style, which is 'whitespace'.
   * 'has_header', if true will skip the first line
     of the SMILES file (because it is a header line).

The SDF format parser supports two additional reader arguments:
   * 'strictParsing', if false will disable strict parsing
   * 'removeHs', if false will keep all of the hydrogens

The InChI format parser supports four additional reader arguments:
   * 'delimiter' works the same as it does for the SMILES formats
   * 'removeHs' works the same as it does for the SDF format
   * 'treatWarningAsError', if true treats all warnings as errors
   * 'logLevel' specifies the RDKit/InChI library log level, as an integer

The Tripos Mol2 format parser supports two additional reader arguments:
   * 'removeHs' works the same as it does for the SDF format
   * 'cleanupSubstructures' if false disables standardizing
      some substructures found in Mol2 files

The PDB format parser supports three additional reader arguments:
   * 'removeHs' works the same as it does for the SDF format
   * 'flavor', an input parameter with no documented meaning
   * 'proximityBonding', if false will disable automatic
       proximity bonding

The Maestro format parser supports one additional reader argument:
   * 'removeHs' works the same as it does for the SDF format

The FASTA format parser supports one additional reader argument:
   * 'flavor', an integer from 0 to 9. The values mean:
       0 - the sequence contains L-amino acids
       1 - allow lowercase for D-amino acids
       2 - RNA with no cap        6 - DNA with no cap
       3 - RNA with 5' cap        7 - DNA with 5' cap
       4 - RNA with 3' cap        8 - DNA with 3' cap
       5 - RNA with both caps     9 - DNA with both caps
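
The FASTA flavor values follow a regular pattern: RNA starts at 2 and DNA at 6, with an offset of 1 for a 5' cap, 2 for a 3' cap, and 3 for both. The following Python sketch encodes the table above; the function `fasta_flavor` and its parameter names are hypothetical and exist only to illustrate the numbering.

```python
def fasta_flavor(polymer, *, five_prime_cap=False, three_prime_cap=False,
                 allow_d_amino_acids=False):
    """Return the FASTA 'flavor' integer (0 to 9) from the table above."""
    polymer = polymer.lower()
    if polymer == "protein":
        # 0: L-amino acids only; 1: lowercase letters mean D-amino acids.
        return 1 if allow_d_amino_acids else 0
    bases = {"rna": 2, "dna": 6}
    if polymer not in bases:
        raise ValueError(f"unknown polymer type: {polymer!r}")
    # Offsets: no cap = 0, 5' cap = 1, 3' cap = 2, both caps = 3.
    offset = int(five_prime_cap) + 2 * int(three_prime_cap)
    return bases[polymer] + offset


flavor = fasta_flavor("DNA", five_prime_cap=True, three_prime_cap=True)
print(f"-R flavor={flavor}")
```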