###############################
Help for the command-line tools
###############################


The chemfp command-line tools are:

  * :ref:`fpcat <fpcat>` - merge multiple fingerprint files into one
  * :ref:`ob2fps <ob2fps>` - use Open Babel to generate fingerprints
  * :ref:`oe2fps <oe2fps>` - use OEChem/OEGraphSim to generate fingerprints
  * :ref:`rdkit2fps <rdkit2fps>` - use RDKit to generate fingerprints
  * :ref:`cdk2fps <cdk2fps>` - use CDK to generate fingerprints
  * :ref:`sdf2fps <sdf2fps>` - extract fingerprints from an SD file
  * :ref:`simsearch <simsearch>` - search a fingerprint file for similar fingerprints

.. _fpcat:

fpcat command-line options
==========================

The following comes from ``fpcat --help``:

.. code-block:: none 

    usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
                    [--level LEVEL] [--reorder] [--preserve-order] [--alignment N]
                    [--show-progress] [--max-spool-size SIZE] [--tmpdir DIRNAME]
                    [--version] [--license-check]
                    [filename ...]
    
    Combine multiple fingerprint files into a single file.
    
    positional arguments:
      filename              input fingerprint filenames (default: use stdin)
    
    optional arguments:
      -h, --help            show this help message and exit
      --in FORMAT           input fingerprint format. One of fps or fpb (with
                            optional gz or zst compression), or flush. (default
                            guesses from filename or is fps)
      --merge               assume the input fingerprint files are in popcount
                            order and do a merge sort
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output fingerprint format. One of fps, fps.gz,
                            fps.zst, fpb, or flush. (default guesses from output
                            filename, or is 'fps')
      --level LEVEL         compression level. Must be a positive integer or one
                            of 'min', 'default', or 'max'.
      --reorder             reorder the output fingerprints by popcount (default
                            for FPB output)
      --preserve-order      save the output fingerprints in the same order as the
                            input (default for FPS output)
      --alignment N         alignment size when saving a FPB file (default=8)
      --show-progress       show progress
      --max-spool-size SIZE
                            use temporary files for extra storage space for huge
                            FPB files (default uses RAM)
      --tmpdir DIRNAME      directory for the temporary files (default uses the
                            system temp directory)
      --version             show program's version number and exit
      --license-check       Check the license and report results to stdout.
    
    Examples:
    
    fpcat can be used to convert between FPS and FPB formats. This is
    handy if you want to see what's inside of an FPB file:
    
        fpcat fingerprints.fpb
    
    You can also use fpcat to make an FPB file from an FPS file:
    
        fpcat fingerprints.fps -o fingerprints.fpb
    
    You might have generated a set of FPS files which you want to merge
    into a single FPB file, for example, if you used GNU parallel to
    generate an FPS file for each of the PubChem files:
    
        fpcat Compound_*.fps -o pubchem.fpb
    
    By default the FPB format sorts the fingerprints by popcount. (Use
    --preserve-order if you really want to preserve the input order.)  The
    sort overhead for PubChem uses about 10 GB of RAM. If you don't have
    that much memory then ask fpcat to use less memory:
    
        fpcat --max-spool-size 1GB Compound_*.fps -o pubchem.fpb 
    
    This will use about 2 GB of RAM and the --tmpdir for the rest. (Yes,
    it would be nice if I could get those two memory size numbers to
    match.)
    
    The --merge option is experimental. Use it if the input fingerprints
    are in popcount order, because sorted output is a simple merge sort of
    the individual sorted inputs. However, this option opens all input
    files at the same time, which may exceed your resource limit on file
    descriptors. The current implementation also requires a lot of disk
    seeks so is slow for many files.
    
    The flush format is only available if the chemfp_converter package was
    installed.
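
The ``--merge`` description above relies on the fact that inputs which are already in popcount order can be combined in a single merge pass instead of a full sort. The following is a minimal Python sketch of that idea and not chemfp's implementation. The record layout (an id paired with an integer bit pattern) and the names ``popcount`` and ``merge_by_popcount`` are assumptions made for illustration.

.. code-block:: python

    import heapq

    def popcount(fingerprint):
        """Return the number of on-bits in an integer bit pattern."""
        return bin(fingerprint).count("1")

    def merge_by_popcount(*sorted_inputs):
        """Merge record lists that are each already sorted by popcount.

        heapq.merge reads the inputs lazily and only compares the current
        head of each input, so no global sort is needed.
        """
        return list(heapq.merge(*sorted_inputs, key=lambda rec: popcount(rec[1])))

    # Two hypothetical inputs, each in increasing popcount order.
    file_a = [("a1", 0b0001), ("a2", 0b0111)]
    file_b = [("b1", 0b0011), ("b2", 0b1111)]

    merged = merge_by_popcount(file_a, file_b)
    for record_id, fingerprint in merged:
        print(record_id, popcount(fingerprint))

Each input must stay open for the whole merge, which matches the help text's warning about file descriptor limits when many files are merged at once.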


.. _ob2fps:

ob2fps command-line options
===========================

The following comes from ``ob2fps --help``:

.. code-block:: none 
		
  usage: ob2fps [-h]
                [--FP2 | --FP3 | --FP4 | --MACCS | --ECFP0 | --ECFP2 | --ECFP4 | 
                 --ECFP6 | --ECFP8 | --ECFP10 | --substruct | --rdmaccs | --rdmaccs/1]
                [--nBits INT] [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME] [--in FORMAT]
                [-o FILENAME] [--out FORMAT] [--errors {strict,report,ignore}] [--help-formats]
                [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                [--version] [--license-check]
                [filenames ...]
  
  Generate FPS or FPB fingerprints from a structure file using Open Babel
  
  positional arguments:
    filenames             input structure files (default is stdin)
  
  optional arguments:
    -h, --help            show this help message and exit
    --FP2                 linear fragments up to 7 atoms
    --FP3                 SMARTS patterns specified in the file patterns.txt
    --FP4                 SMARTS patterns specified in the file SMARTS_InteLigand.txt
    --MACCS               Open Babel's implementation of the MACCS 166 keys
    --ECFP0               ECFP (circular) fingerprints with diameter 0
    --ECFP2               ECFP (circular) fingerprints with diameter 2
    --ECFP4               ECFP (circular) fingerprints with diameter 4
    --ECFP6               ECFP (circular) fingerprints with diameter 6
    --ECFP8               ECFP (circular) fingerprints with diameter 8
    --ECFP10              ECFP (circular) fingerprints with diameter 10
    --substruct           ChemFP substructure fingerprints
    --rdmaccs, --rdmaccs/2
                          166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a fingerprint file
    --in FORMAT           input structure format (default autodetects from the filename extension)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output filename, or is
                          'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled? (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for '-R
                          delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.
  
  ECFP argument:
    --nBits INT           number of bits in the fingerprint (default=4096)
  
  By default the Open Babel structure reader determines the file format
  and compression type based on the filename extension. Unknown
  filename extensions are treated as uncompressed SMILES files.
  
  If the data comes from stdin, or the guess based on extension name is
  wrong, then use the "--in FORMAT" option to change the default input format.
  For example:
  
     --in smi
     --in sdf.gz
  
  Use `-R` to specify format-specific reader arguments.
  
  Use `--help-formats` for a list of available formats and reader arguments.
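
When ``ob2fps`` is driven from a script, it can help to assemble the arguments as a list so that no shell quoting is needed. The sketch below only builds and prints a command line using options shown in the help text above; it does not run the program. The function name ``build_ob2fps_command`` and the filenames are hypothetical.

.. code-block:: python

    import shlex

    def build_ob2fps_command(input_filename, output_filename, nbits=4096,
                             input_format=None, errors="ignore"):
        """Build an argument list for an ECFP4 run of ob2fps."""
        argv = ["ob2fps", "--ECFP4", "--nBits", str(nbits)]
        if input_format is not None:
            # Needed for stdin, or when the filename extension is misleading.
            argv += ["--in", input_format]
        argv += ["--errors", errors, "-o", output_filename, input_filename]
        return argv

    # Hypothetical filenames, used only to show the resulting command line.
    argv = build_ob2fps_command("compounds.sdf.gz", "compounds.fps",
                                input_format="sdf.gz", errors="report")
    command_line = shlex.join(argv)
    print(command_line)

The list form can be passed directly to ``subprocess.run`` on a machine where chemfp and Open Babel are installed.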

The following comes from ``ob2fps --help-formats``, though I've
removed most of the Open Babel formats from the list.

.. code-block:: none 

    chemfp has special support for the SMILES, InChI, and SDF formats when
    using the Open Babel toolkit.
    
    For these formats, by default, chemfp uses the filename extension to
    determine the format type. If the filename ends with ".gz" or ".zst"
    then it is interpreted as a gzip or Zstandard compressed file, and the
    second-to-last extension is used to determine the format type. Unknown
    or unsupported extensions are then tested against Open Babel format
    names (see below), and if still unknown, interpreted as a SMILES file.
    
    Note: To enable Zstandard compression, please install the "zstandard"
    Python package from https://pypi.org/project/zstandard/ .
    
    You will need to use "-R implementation=chemfp" to enable zst support for
    the SDF format.
    
    You may instead specify the file format by name (see below), which is
    especially important when reading from stdin, which has no associated
    filename extension.
    
    These specially supported filename extensions are:
    
       File Type    Extension(s)
       ==========   =============
         SMILES     can, ism, isosmi, smi, usm
          SDF       sdf
         InChI      inchi
    
    The format can also be specified by name using the '--in' option:
    
       File Type    Format name (append .gz or .zst if compressed)
       ==========   ===========
         SMILES     smi, can, usm
          SDF       sdf
         InChI      inchi
    
    The input format parsers can be configured with the "-R" option. For
    example, the following reader arguments tell the SMILES readers that
    the fields are whitespace delimited and the first line is a header.
    
       -R delimiter=whitespace -R has_header=true
    
    All of the readers support the 'options' reader argument, which is a
    string passed directly to OBConversion(). This is a compact way to
    encode all of the Open Babel parameters used in the conversion. For
    example, 'ab"text"', would set option 'a' to True, and option 'b' to
    the string "text".
    
    The SMILES format parsers use two additional reader arguments:
       * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
         The other values are 'tab', 'whitespace', 'space' and 'native'.
         Use "-R delimiter=native" to match Open Babel's native delimiter
         style, which is 'to-eol'.
       * 'has_header', if true will skip the first line
         of the SMILES file (because it is a header line).
    
    The SDF format parser supports one additional reader argument:
       * 'implementation': if "openbabel" or "native", use Open Babel's
         native SDF parser. If "chemfp", use chemfp's own implementation
         to find SDF records, which are then passed to Open Babel for
         parsing. This gives more fine-grained error reporting and
         supports zst compression, with similar performance.
       (Note: Open Babel supports additional options.)
    
    The InChI format parser supports one additional reader argument:
       * 'delimiter' works the same as it does for the SMILES formats
    
    In addition, you may specify an Open Babel format, either by one of
    the following format names, or by reading a filename ending with one
    of the format names, optionally with a .gz suffix. Zstandard
    compression is not supported by the native Open Babel reader.
    
    
      Format  Description and options
    ========= ==========================
      CONFIG  DL-POLY CONFIG
     CONTCAR  VASP format
                s  Output single bonds only
                b  Disable bonding entirely
      CONTFF  MDFF format
     HISTORY  DL-POLY HISTORY 
               .... many lines removed from the chemfp documentation ...
         xyz  XYZ cartesian coordinates format
                s  Output single bonds only
                b  Disable bonding entirely
         yob  YASARA.org YOB format

    You will need to consult the Open Babel documentation
    (see http://openbabel.org/wiki/List_of_extensions ) and
    implementation for full details about how these options work.
    
.. _oe2fps:

oe2fps command-line options
===========================

The following comes from ``oe2fps --help``:

.. code-block:: none

  usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT] [--minbonds INT]
                [--maxbonds INT] [--minradius INT] [--maxradius INT] [--atype ATYPE] [--btype BTYPE]
                [--maccs166] [--substruct] [--rdmaccs] [--rdmaccs/1] [--aromaticity NAME]
                [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME] [--in FORMAT] [-o FILENAME]
                [--out FORMAT] [--errors {strict,report,ignore}] [--help-formats] [-R NAME=VALUE]
                [--delimiter {tab,whitespace,to-eol,space}] [--has-header] [--version]
                [--license-check]
                [filenames ...]
  
  Generate FPS or FPB fingerprints from a structure file using OEChem
  
  positional arguments:
    filenames             input structure files (default is stdin)
  
  optional arguments:
    -h, --help            show this help message and exit
    --aromaticity NAME    use the named aromaticity model (same as '-R aromaticity=NAME')
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a fingerprint file
    --in FORMAT           input structure format (default guesses from filename)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output filename, or is
                          'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled? (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for '-R
                          delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.
  
  path, circular, and tree fingerprints:
    --path                generate path fingerprints (default)
    --circular            generate circular fingerprints
    --tree                generate tree fingerprints
    --numbits INT         number of bits in the fingerprint (default=4096)
    --minbonds INT        minimum number of bonds in the path or tree fingerprint (default=0)
    --maxbonds INT        maximum number of bonds in the path or tree fingerprint (path default=5,
                          tree default=4)
    --minradius INT       minimum radius for the circular fingerprint (default=0)
    --maxradius INT       maximum radius for the circular fingerprint (default=5)
    --atype ATYPE         atom type flags, described below (default=Default)
    --btype BTYPE         bond type flags, described below (default=Default)
  
  166 bit MACCS substructure keys:
    --maccs166            generate MACCS fingerprints
  
  881 bit ChemFP substructure keys:
    --substruct           generate ChemFP substructure fingerprints
  
  ChemFP version of the 166 bit RDKit/MACCS keys:
    --rdmaccs, --rdmaccs/2
                          generate 166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs
  
  ATYPE is one or more of the following, separated by the '|' character
  
    Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
    Hyb InRing
  
  The following shorthand terms and expansions are also available:
   DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
   DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
   DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb
  and 'Default' selects the correct value for the specified fingerprint.
  
  Examples:
    --atype Default
    --atype "Arom|AtmNum|FCharge|HCount"
    --atype Arom,AtmNum,FCharge,HCount
  
  BTYPE is one or more of the following, separated by the '|' character
  
    Chiral InRing Order
  
  The following shorthand terms and expansions are also available:
   DefaultPathBond = Order|Chiral
   DefaultCircularBond = Order
   DefaultTreeBond = Order
  and 'Default' selects the correct value for the specified fingerprint.
  
  Examples:
     --btype Default
     --btype "Order|InRing"
  
  To simplify command-line use, a comma may be used instead of a '|' to
  separate different fields. Example:
    --atype AtmNum,HvyDegree
  
  By default, chemfp will use the filename extension to determine the
  structure file format type and possible compression. Most of the file
  readers support configuration parameters. Use the '-R' option to
  specify those parameters.
  
  Use '--help-formats' to list available formats and reader parameters.
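
The ATYPE rules above (terms separated by ``|`` or ``,``, plus shorthand names that expand to several flags) can be illustrated with a short parser. This is a sketch based only on the help text and not the code chemfp uses. The function name ``parse_atype`` and its ``fingerprint_kind`` parameter are assumptions.

.. code-block:: python

    import re

    # Atom type flag names listed in the help text.
    ATOM_FLAGS = {
        "Arom", "AtmNum", "Chiral", "EqArom", "EqHBAcc", "EqHBDon",
        "EqHalo", "FCharge", "HCount", "HvyDeg", "Hyb", "InRing",
    }

    # Shorthand terms and their expansions, copied from the help text.
    SHORTHANDS = {
        "DefaultPathAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo",
        "DefaultCircularAtom": "AtmNum|Arom|Chiral|FCharge|HCount|EqHalo",
        "DefaultTreeAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb",
    }

    def parse_atype(text, fingerprint_kind="Path"):
        """Return the set of atom type flags named by an ATYPE string.

        fingerprint_kind is "Path", "Circular", or "Tree" and decides
        what the bare term "Default" expands to.
        """
        flags = set()
        for term in re.split(r"[|,]", text):
            if term == "Default":
                term = "Default" + fingerprint_kind + "Atom"
            if term in SHORTHANDS:
                flags.update(SHORTHANDS[term].split("|"))
            elif term in ATOM_FLAGS:
                flags.add(term)
            else:
                raise ValueError("unknown atom type flag: %r" % (term,))
        return flags

    comma_flags = parse_atype("Arom,AtmNum,FCharge,HCount")
    pipe_flags = parse_atype("Arom|AtmNum|FCharge|HCount")
    circular_default = parse_atype("Default", fingerprint_kind="Circular")
    print(sorted(comma_flags))

The comma form avoids having to quote the ``|`` character, which a shell would otherwise treat as a pipe.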

The following comes from ``oe2fps --help-formats``:

.. code-block:: none

    These are the structure file formats that chemfp can read when using
    the OEChem toolkit.
    
    By default, chemfp uses the filename extension to determine the format
    type. If the filename ends with ".gz" then it is interpreted as a gzip
    compressed file, and the second-to-last extension is used to determine
    the format type. Unknown or unsupported extensions are interpreted as
    a SMILES file.
    
    (The OEChem structure file readers do not support Zstandard
    compression.)
    
    You may instead specify the file format by name (see below), which is
    especially important when reading from stdin, which has no associated
    filename extension.
    
    The supported filename extensions are:
    
       File Type    Extension(s)
       ==========   =============
         SMILES     can, ism, isosmi, smi, usm
          SDF       mdl, rxn, sd, sdf
         InChI      inchi
      Tripos Mol2   mol2, mol2h
          PDB       ent, pdb
          XYZ       xyz
          SKC       skc
       Macromodel   mmd, mmod
      ChemDraw CDX  cdx
       OE binary    oeb
     OEB compressed oez
          CIF       cif
         mmCIF      mmcif
         FASTA      fasta
          CSV       csv
    
    Append a '.gz' to the filename to indicate that the contents are
    gzip-compressed.
    
    The format can also be specified by name using the '--in' option:
    
       File Type    Format name
       ==========   =============
         SMILES     smi, can, usm
          SDF       sdf
         InChI      inchi
      Tripos Mol2   mol2, mol2h
          PDB       pdb
          XYZ       xyz
          SKC       skc
       Macromodel   mmod
      ChemDraw CDX  cdx
       OE binary    oeb
     OEB compressed oez
          CIF       cif
         mmCIF      mmcif
         FASTA      fasta
          CSV       csv
    
    Append a '.gz' to the format name to indicate that the contents are
    gzip-compressed.
    
    The input format parsers can be configured with the "-R" option. For
    example, the following reader arguments tell the SMILES readers that
    the fields are whitespace delimited and the first line is a header.
    
       -R delimiter=whitespace -R has_header=true
    
    All formats handle the following two reader arguments:
    
      aromaticity - one of 'openeye', 'daylight', 'tripos', 'mdl', or 'mmff'
          (this can also be set via the older '--aromaticity' command-line option)
    
      flavor - a '|' or ',' separated list of flavor names, or a numeric value.
           A leading '-' means to remove the given flavor. Examples include:
           
           o  Canon,Strict  -- the bitwise merger of the format's Canon and Strict values
           o  Default,-Kekule -- the format's Default flavor but without the Kekule bits
                          (every flavor has a Default)
           o  42  -- the specific OEChem flavor value 42
    
    The SMILES and InChI formats also handle reader arguments for the
    delimiter style and the presence of an initial header line using the
    following:
    
       delimiter - one of 'to-eol' (Daylight/OEChem style), 'tab',
            'whitespace', 'space', or 'native' (for the native toolkit style)
    
       has_header - '1' if the first line contains a header, else '0'.
    
    The supported formats, default reader arguments, and input flavors are:
    
    Format: can
        aromaticity: openeye
        delimiter: to-eol
        flavor: Default
            default flags: <none>
            available flags: Canon, Strict
        has_header: 0
    
    Format: cdx
        aromaticity: openeye
        flavor: Default
            default flags: SuperAtom
            available flags: SuperAtom
    
    Format: cif
        aromaticity: openeye
        flavor: Default
            default flags: BondHydToClosest, BondOrder, FormalCrg, ImplicitH,
                NormalizeHydPos, OccFilterOneHalf, RemovePBCImages,
                RemoveQuestionMarkInLabel, Rings
            available flags: BondHydToClosest, BondOrder, FormalCrg, ImplicitH,
                NormalizeHydPos, OccFilterOneHalf, RemovePBCImages,
                RemoveQuestionMarkInLabel, Rings
    
    Format: csv
        aromaticity: openeye
        flavor: Default
            default flags: Header
            available flags: Header
    
    Format: fasta
        aromaticity: openeye
        flavor: Default
            default flags: <none>
            available flags: CustomResidues, EmbeddedSMILES
    
    Format: inchi
        aromaticity: <N/A>
        delimiter: to-eol
        flavor: Default
          no flavor flags available
        has_header: 0
    
    Format: mmcif
        aromaticity: openeye
        flavor: Default
            default flags: <none>
            available flags: NoAltLoc
    
    Format: mmod
        aromaticity: openeye
        flavor: Default
            default flags: <none>
            available flags: FormalCrg
    
    Format: mol2
        aromaticity: openeye
        flavor: Default
            default flags: <none>
            available flags: Forcefield, M2H
    
    Format: mol2h
        aromaticity: openeye
        flavor: Default
            default flags: M2H
            available flags: M2H
    
    Format: oeb
        aromaticity: <N/A>
        flavor: Default
          no flavor flags available
    
    Format: oez
        aromaticity: <N/A>
        flavor: Default
          no flavor flags available
    
    Format: pdb
        aromaticity: openeye
        flavor: Default
            default flags: BondOrder, Connect, END, ENDM, FormalCrg, ImplicitH,
                Rings, SecStruct
            available flags: ALL, ALTLOC, BondOrder, CHARGE, Connect, DATA, END,
                ENDM, FORMALCHARGE, FormalCrg, ImplicitH, RADIUS, Rings,
                SecStruct, TER
    
    Format: sdf
        aromaticity: openeye
        flavor: Default
            default flags: <none>
            available flags: FixBondMarks, SuppressEmptyMolSkip,
                SuppressImp2ExpENHSTE
    
    Format: skc
        aromaticity: openeye
        flavor: Default
          no flavor flags available
    
    Format: smi
        aromaticity: openeye
        delimiter: to-eol
        flavor: Default
            default flags: <none>
            available flags: Canon, Strict
        has_header: 0
    
    Format: usm
        aromaticity: openeye
        delimiter: to-eol
        flavor: Default
            default flags: <none>
            available flags: Canon, Strict
        has_header: 0
    
    Format: xyz
        aromaticity: openeye
        flavor: Default
            default flags: BondOrder, Connect, FormalCrg, ImplicitH, Rings
            available flags: BondOrder, Connect, FormalCrg, ImplicitH, Rings
    
    
    See https://docs.eyesopen.com/toolkits/cpp/oechemtk/molreadwrite.html#flavored-input-and-output
    for documentation about the flavors for each format.
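
The ``flavor`` reader argument described above combines named flags bitwise, with a leading ``-`` removing bits. The sketch below shows that arithmetic. The bit values in ``HYPOTHETICAL_BITS`` are invented for illustration and are not OEChem's real flavor values, and ``resolve_flavor`` is a made-up helper rather than a chemfp or OEChem function.

.. code-block:: python

    import re

    # Invented bit values, for illustration only. Real OEChem flavors differ.
    HYPOTHETICAL_BITS = {
        "Default": 0b0110,
        "Canon": 0b0001,
        "Kekule": 0b0100,
        "Strict": 0b1000,
    }

    def resolve_flavor(text, flag_bits):
        """Turn a flavor string into an integer bit mask.

        A purely numeric string is used as-is. Otherwise the terms are
        processed left to right: a plain name sets its bits and a name
        with a leading '-' clears them.
        """
        text = text.strip()
        if text.isdigit():
            return int(text)
        value = 0
        for term in re.split(r"[|,]", text):
            if term.startswith("-"):
                value &= ~flag_bits[term[1:]]
            else:
                value |= flag_bits[term]
        return value

    merged = resolve_flavor("Canon,Strict", HYPOTHETICAL_BITS)
    without_kekule = resolve_flavor("Default,-Kekule", HYPOTHETICAL_BITS)
    numeric = resolve_flavor("42", HYPOTHETICAL_BITS)
    print(bin(merged), bin(without_kekule), numeric)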


.. _rdkit2fps:

rdkit2fps command-line options
==============================

The following comes from ``rdkit2fps --help``:

.. code-block:: none

  usage: rdkit2fps [-h] [--fpSize INT] [--radius INT] [--nBitsPerEntry INT] [--includeChirality 0|1]
                   [--from-atoms INT,INT,...] [--RDK] [--minPath INT] [--maxPath INT]
                   [--nBitsPerHash INT] [--useHs 0|1] [--branchedPaths 0|1] [--useBondOrder 0|1]
                   [--morgan] [--useFeatures 0|1] [--useChirality 0|1] [--useBondTypes 0|1]
                   [--includeRedundantEnvironments 0|1] [--torsions] [--targetSize INT] [--pairs]
                   [--minLength INT] [--maxLength INT] [--use2D 0|1] [--maccs166] [--avalon]
                   [--isQuery 0_or_1] [--bitFlags INT] [--secfp] [--rings 0|1] [--isomeric 0|1]
                   [--kekulize 0|1] [--min_radius INT] [--pattern] [--substruct] [--rdmaccs]
                   [--rdmaccs/1] [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME]
                   [--in FORMAT] [-o FILENAME] [--out FORMAT] [--errors {strict,report,ignore}]
                   [--help-formats] [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}]
                   [--has-header] [--version]
                   [filenames ...]
  
  Generate FPS or FPB fingerprints from a structure file using RDKit
  
  positional arguments:
    filenames             input structure files (default is stdin)
  
  optional arguments:
    -h, --help            show this help message and exit
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a fingerprint file
    --in FORMAT           input structure format (default guesses from filename)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output filename, or is
                          'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled? (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for '-R
                          delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias for '-R has_header=1'
    --version             show program's version number and exit
  
  Common Parameters (used by more than one fingerprint type):
    --fpSize INT          number of bits in the fingerprint. Default of 2048 for RDK, Morgan,
                          topological torsion, atom pair, pattern and SECFP fingerprints, and 512
                          for Avalon fingerprints
    --radius INT          radius for the Morgan or SECFP fingerprints. Default of 2 for Morgan, 3
                          for SECFP
    --nBitsPerEntry INT   number of bits per entry
    --includeChirality 0|1
                          include chirality information in the atom invariants
    --from-atoms INT,INT,...
                          fingerprint generation must use these atom indices (out of range indices
                          are ignored)
  
  RDKit topological fingerprints:
    Branched or linear hash fingerprint.
    Uses --fpSize and --from-atoms plus:
  
    --RDK                 generate RDK fingerprints (default)
    --minPath INT         minimum number of bonds to include in the subgraph (default=1)
    --maxPath INT         maximum number of bonds to include in the subgraph (default=7)
    --nBitsPerHash INT    number of bits to set per path (default=2)
    --useHs 0|1           include information about the number of hydrogens on each atom (default=1)
    --branchedPaths 0|1   if set both branched and unbranched paths will be used in the fingerprint
                          (default=1)
    --useBondOrder 0|1    if set both bond orders will be used in the path hashes (default=1)
  
  RDKit Morgan fingerprints:
    Circular fingerprints similar to ECFP or FCFP fingerprints.
    Uses --fpSize, --radius, and --from-atoms plus:
  
    --morgan              generate Morgan fingerprints
    --useFeatures 0|1     use chemical-feature invariants (default=0)
    --useChirality 0|1    include chirality information (default=0)
    --useBondTypes 0|1    include bond type information (default=1)
    --includeRedundantEnvironments 0|1
                          if set, the check for redundant atom environments will not be done
                          (default=0)
  
  RDKit Topological Torsion fingerprints:
    See Nilakantan et al., JCICS 27, 82-85 (1987).
    Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:
  
    --torsions            generate Topological Torsion fingerprints
    --targetSize INT      number of bonds per torsion (default=4)
  
  RDKit Atom Pair fingerprints:
    See Carhart et al., JCICS 25, 64-73 (1985).
    Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:
  
    --pairs               generate Atom Pair fingerprints
    --minLength INT       minimum bond count for a pair (default=1)
    --maxLength INT       maximum bond count for a pair (default=30)
    --use2D 0|1           use 2D instead of 3D distance matrix (default=1)
  
  166 bit MACCS substructure keys:
    --maccs166            generate MACCS fingerprints
  
  Avalon fingerprints:
    Fingerprints from the Avalon toolkit.
    Uses --fpSize plus:
  
    --avalon              generate Avalon fingerprints
    --isQuery 0_or_1      is the fingerprint for a query structure? (1 if yes, 0 if no) (default=0)
    --bitFlags INT        bit flags, SSSBits are 32767 and similarity bits are 15761407
                          (default=15761407)
  
  SECFP fingerprints:
    A circular fingerprint based on fragment SMILES instead of hashing.
    Uses --fpSize and --radius plus:
  
    --secfp               generate SECFP fingerprints
    --rings 0|1           if 1, add SSSR ring to the fingerprint (default=1)
    --isomeric 0|1        if 1, use isomeric SMILES instead of non-isomeric SMILES (default=0)
    --kekulize 0|1        if 1, use Kekule SMILES instead of aromatic SMILES (default=1)
    --min_radius INT      minimum radius used to extract n-grams (default=1)
  
  RDKit Pattern fingerprints:
    Fingerprints for substructure search screening.
  
    --pattern             generate (substructure) pattern fingerprints
  
  chemfp's version of the 881 bit PubChem substructure keys:
    --substruct           generate ChemFP substructure fingerprints
  
  chemfp's version of the 166 bit RDKit/MACCS keys:
    --rdmaccs, --rdmaccs/2
                          generate 166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs
  
  This program guesses the input structure format and the compression
  based on the filename extension. If the guess fails then it assumes
  the input is an uncompressed SMILES file.
  
  If the data comes from stdin, or the guess based on extension name is
  wrong, then use "--in" to change the default input format.
  
  Use the '-R' reader arguments option to pass in format-specific structure
  reader arguments. The details depend on the specific format.
  
  Use the command-line option `--help-formats` to display a list of
  available formats and reader arguments.

The following comes from ``rdkit2fps --help-formats``:

.. code-block:: none

    These are the structure file formats that chemfp can read when using
    the RDKit toolkit.
    
    By default, chemfp uses the filename extension to determine the format
    type. If the filename ends with ".gz" or ".zst" then it is interpreted
    as a gzip or Zstandard compressed file, and the second-to-last
    extension is used to determine the format type. Unknown or unsupported
    extensions are interpreted as a SMILES file.
    
    You may instead specify the file format by name (see below), which is
    especially important when reading from stdin, which has no associated
    filename extension.
    
    The supported filename extensions are:
    
       File Type    Extension(s)
       ==========   =============
         SMILES     can, ism, isosmi, smi, usm
          SDF       mdl, sd, sdf
         InChI      inchi
      Tripos Mol2   mol2
          PDB       ent, pdb
        Maestro     mae, maegz
         FASTA      faa, fasta
    
    The format can also be specified by name using the '--in' option:
    
       File Type    Format name (append .gz or .zst if compressed)
       ==========   ==============================================
         SMILES     smi, can, usm
          SDF       sdf
         InChI      inchi
      Tripos Mol2   mol2
          PDB       pdb
        Maestro     mae
         FASTA      fasta
    
    The input format parsers can be configured with the "-R" option. For
    example, the following reader arguments tell the SMILES readers that
    the fields are whitespace delimited and the first line is a header.
    
       -R delimiter=whitespace -R has_header=true
    
    All of the input formats implement the 'sanitize' option, which is
    enabled by default. Use "-R sanitize=false" to disable sanitization.
    
    The SMILES format parsers use two additional reader arguments:
       * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
         The other values are 'tab', 'whitespace', 'space' and 'native'.
         Use "-R delimiter=native" to match RDKit's native delimiter
         style, which is 'whitespace'.
       * 'has_header', if true will skip the first line
         of the SMILES file (because it is a header line).
    
    The SDF format parser supports two additional reader arguments:
       * 'strictParsing', if false will disable strict parsing
       * 'removeHs', if false will keep all of the hydrogens
    
    The InChI format parser supports four additional reader arguments:
       * 'delimiter' works the same as it does for the SMILES formats
       * 'removeHs' works the same as it does for the SDF format
       * 'treatWarningAsError', if true treats all warnings as errors
       * 'logLevel' specifies the RDKit/InChI library log level, as an integer
    
    The Tripos Mol2 format parser supports two additional reader arguments:
       * 'removeHs' works the same as it does for the SDF format
       * 'cleanupSubstructures' if false disables standardizing
          some substructures found in Mol2 files
    
    The PDB format parser supports three additional reader arguments:
       * 'removeHs' works the same as it does for the SDF format
       * 'flavor', an input parameter with no documented meaning
       * 'proximityBonding', if false will disable
           automatic proximity bonding
    
    The Maestro format parser supports one additional reader argument:
       * 'removeHs' works the same as it does for the SDF format
    
    The FASTA format parser supports one additional reader argument:
       * 'flavor', an integer from 0 to 9. The values mean:
           0 - the sequence contains L-amino acids 
           1 - allow lowercase for D-amino acids
           2 - RNA with no cap        6 - DNA with no cap
           3 - RNA with 5' cap        7 - DNA with 5' cap
           4 - RNA with 3' cap        8 - DNA with 3' cap
           5 - RNA with both caps     9 - DNA with both caps

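The extension-based guess described in the ``--help-formats`` output can be
pictured with a short Python sketch. It is an illustration only, not chemfp's
implementation: the ``guess_format`` function and its lookup table are
hypothetical, the table is copied from the rdkit2fps extension list above, and
mapping every SMILES extension to ``smi`` is a simplifying assumption.

.. code-block:: python

    import os

    # Extension -> format name, following the rdkit2fps extension table.
    # (Assumption: all SMILES extensions are reported as "smi", and the
    # "maegz" extension is reported as plain "mae".)
    _EXTENSION_TO_FORMAT = {
        "can": "smi", "ism": "smi", "isosmi": "smi", "smi": "smi", "usm": "smi",
        "mdl": "sdf", "sd": "sdf", "sdf": "sdf",
        "inchi": "inchi",
        "mol2": "mol2",
        "ent": "pdb", "pdb": "pdb",
        "mae": "mae", "maegz": "mae",
        "faa": "fasta", "fasta": "fasta",
    }
    _COMPRESSION_EXTENSIONS = {"gz", "zst"}

    def guess_format(filename):
        """Return (format_name, compression); compression is "gz", "zst" or None."""
        extensions = os.path.basename(filename).split(".")[1:]
        compression = None
        # A trailing ".gz" or ".zst" marks compression; the extension
        # before it (the second-to-last one) then decides the format.
        if extensions and extensions[-1] in _COMPRESSION_EXTENSIONS:
            compression = extensions.pop()
        extension = extensions[-1] if extensions else ""
        # Unknown or missing extensions fall back to SMILES.
        return _EXTENSION_TO_FORMAT.get(extension, "smi"), compression

With this sketch, ``guess_format("chembl.sdf.gz")`` gives ``("sdf", "gz")``,
while an unrecognized name such as ``data.txt`` falls back to
``("smi", None)``. This is the case where ``--in`` is needed to override the
guess.
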

.. _cdk2fps:

cdk2fps command-line options
============================

The following comes from ``cdk2fps --help``:

.. code-block:: none

  usage: cdk2fps [-h]
                 [--type TYPE_STRING | --using FILENAME | --Daylight | --GraphOnly |
                  --MACCS | --EState | --Extended | --Hybridization | --KlekotaRoth |
                  --Pubchem | --Substructure | --ShortestPath | --ECFP0 | --ECFP2 | 
                 --ECFP4 | --ECFP6 | --FCFP0 | --FCFP2 | --FCFP4 | --FCFP6 | 
                 --AtomPairs2D]
                 [--substruct] [--rdmaccs] [--rdmaccs/1] [--size INT] [--searchDepth INT]
                 [--pathLimit INT] [--hashPseudoAtoms 0|1] [--perceiveStereochemistry 0|1]
                 [--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
                 [--errors {strict,report,ignore}] [--help-formats] [-R NAME=VALUE]
                 [--delimiter {tab,whitespace,to-eol,space}] [--has-header] [--version]
                 [--license-check]
                 [filenames ...]
  
  Generate FPS or FPB fingerprints from a structure file using CDK via JPype
  
  positional arguments:
    filenames             input structure files (default is stdin)
  
  optional arguments:
    -h, --help            show this help message and exit
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a fingerprint file
    --Daylight            Make Daylight-like fingerprints using cdk.fingerprinter.Fingerprinter
                          (default)
    --GraphOnly           Make Daylight-like fingerprints (ignoring bond types) using
                          GraphOnlyFingerprinter
    --MACCS               Make 166-bit MACCS keys using MACCSFingerprinter
    --EState              Make 79-bit EState fingerprints using EStateFingerprinter
    --Extended            Make Daylight-like fingerprints extended with ring feature bits, using
                          ExtendedFingerprinter
    --Hybridization       Make Daylight-like fingerprints based on SP2 hybridization instead of
                          aromaticity, using HybridizationFingerprinter
    --KlekotaRoth         Make 4860-bit Klekota-Roth fingerprint, using KlekotaRothFingerprinter
    --Pubchem             Make 881-bit PubChem fingerprint, using PubchemFingerprinter
    --Substructure        Make 307-bit substructure fingerprint, using SubstructureFingerprinter
    --ShortestPath        Make fingerprints based on the shortest path between atoms, ring systems,
                          and more, using ShortestPathFingerprinter
    --ECFP0               Make ECFP0-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP0)
    --ECFP2               Make ECFP2-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP2)
    --ECFP4               Make ECFP4-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP4)
    --ECFP6               Make ECFP6-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP6)
    --FCFP0               Make FCFP0-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP0)
    --FCFP2               Make FCFP2-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP2)
    --FCFP4               Make FCFP4-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP4)
    --FCFP6               Make FCFP6-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP6)
    --AtomPairs2D         Make 780-bit atom-pair fingerprints adapted from Yap Chun Wei's PaDEL,
                          using AtomPairs2DFingerprinter
    --size INT            fingerprint size (default=1024)
    --searchDepth INT     search depth (default=7)
    --pathLimit INT       path limit (default=42000)
    --hashPseudoAtoms 0|1
                          include pseudo-atoms in path enumeration (default=0)
    --perceiveStereochemistry 0|1
                          re-perceive stereochemistry from 2D/3D coordinates (default=0)
    --id-tag NAME         tag name containing the record id (SD files only)
    --in FORMAT           input structure format (default autodetects from the filename extension)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output fingerprint format (default guesses from output filename, or is
                          'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled? (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for '-R
                          delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.
  
  By default the CDK structure reader determines the file format
  and compression type based on the filename extension. Unknown
  filename extensions are treated as uncompressed SMILES files.
  
  If the data comes from stdin, or the guess based on extension name is
  wrong, then use the "--in FORMAT" option to change the default input format.
  For example:
  
     --in smi
     --in sdf.gz
  
  Use `-R` to specify format-specific reader arguments.
  
  Use `--help-formats` for a list of available formats and reader arguments.

The following comes from ``cdk2fps --help-formats``:

.. code-block:: none

  These are the structure file formats that chemfp can read when using
  the CDK toolkit.
  
  By default, chemfp uses the filename extension to determine the format
  type. If the filename ends with ".gz" or ".zst" then it is interpreted
  as a gzip or Zstandard compressed file, and the second-to-last
  extension is used to determine the format type. Unknown or unsupported
  extensions are interpreted as a SMILES file.
  
  Note: Zstandard support may depend on the "zstandard" Python package
  and/or the "zstd-jni" Java package. To install the Python package see
  https://pypi.org/project/zstandard/ . To get the Java jar file, see
  https://github.com/luben/zstd-jni and place it in your CLASSPATH. 
  
  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated
  filename extension.
  
  The supported filename extensions are:
  
     File Type    Extension(s)
     ==========   =============
       SMILES     can, ism, isosmi, smi, usm
        SDF       mdl, sd, sdf
       InChI      inchi
  
  The format can also be specified by name using the '--in' option:
  
     File Type    Format name (append .gz or .zst if compressed)
     ==========   ==============================================
       SMILES     smi, can, usm
        SDF       sdf
       InChI      inchi
  
  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.
  
     -R delimiter=whitespace -R has_header=true
  
  The SMILES format parsers use four reader arguments:
     * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
       The other values are 'tab', 'whitespace', 'space' and 'native'.
       Use "-R delimiter=native" to match RDKit's native delimiter
       style, which is 'whitespace'.
     * 'has_header', if true will skip the first line of the SMILES
       file (because it is a header line).
     * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0'
       to skip this step.
     * 'implementation': The default 'cdk' uses CDK's IteratingSMILESReader()
       to parse the SMILES file. The 'chemfp' implementation uses chemfp's
       Python-based SMILES file parser and CDK's SmilesParser() to parse
       each SMILES string. The chemfp implementation is slower
       but may have better error-handling and/or reporting.
  
  The SDF format parser supports five reader arguments:
     * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
       mode supports some records with recoverable errors. The strict
       mode fails to parse those records.
     * 'ForceReadAs3DCoordinates', with the default of '0' interprets
       V2000 records where all z-coordinates == 0.0 as 2D records. The
       value '1' tells CDK to interpret all records as 3D.
     * 'AddStereoElements' with the default of '1' adds 0D stereochemistry
       to V2000 records. The value of '0' skips that step.
     * 'InterpretHydrogenIsotopes' with the default of '1' interprets the
       atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
       '0' to disable this interpretation.
     * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
       to parse the SD file. The 'chemfp' implementation uses chemfp's
       SD file parser to parse records, and CDK's MDLReader(),
       MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
       chemfp implementation is about 50% slower than the cdk parser but
       may have better error-handling and/or reporting.
  
  The InChI format parser supports one reader argument:
     * 'delimiter' works the same as it does for the SMILES formats

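The ``delimiter`` and ``has_header`` reader arguments determine how each line
of a SMILES file is split into a SMILES string and a record id. The Python
sketch below shows one plausible reading of the four named delimiter styles.
It is not chemfp's parser: ``split_smiles_line`` and ``iter_smiles_records``
are hypothetical helpers, and the exact splitting rules are an assumption
based on the option names.

.. code-block:: python

    def split_smiles_line(line, delimiter="to-eol"):
        """Split one line into (smiles, record_id); record_id may be None."""
        line = line.rstrip("\r\n")
        if delimiter == "to-eol":
            # SMILES is the first term; the id is the rest of the line.
            fields = line.split(None, 1)
        elif delimiter == "whitespace":
            # Any run of whitespace separates fields; the id is field 2.
            fields = line.split()
        elif delimiter == "tab":
            fields = line.split("\t")
        elif delimiter == "space":
            fields = line.split(" ")
        else:
            raise ValueError("unsupported delimiter: %r" % (delimiter,))
        smiles = fields[0] if fields else ""
        record_id = fields[1] if len(fields) > 1 else None
        return smiles, record_id

    def iter_smiles_records(lines, delimiter="to-eol", has_header=False):
        """Yield (smiles, record_id) pairs, skipping a header line if asked."""
        for lineno, line in enumerate(lines):
            if has_header and lineno == 0:
                continue
            yield split_smiles_line(line, delimiter)

Under these assumptions the line ``c1ccccc1 benzene ring`` has the id
``benzene ring`` with ``to-eol`` but only ``benzene`` with ``whitespace``,
which is why the choice of delimiter matters for files whose ids contain
spaces.
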

.. _sdf2fps:

sdf2fps command-line options
============================

The following comes from ``sdf2fps --help``:

.. code-block:: none

    usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
                      [--num-bits INT] [--errors {strict,report,ignore}]
                      [-o FILENAME] [--out FORMAT] [--software TEXT] [--type TEXT]
                      [--version] [--license-check] [--binary] [--binary-msb]
                      [--hex] [--hex-lsb] [--hex-msb] [--base64] [--cactvs]
                      [--daylight] [--decoder DECODER] [--pubchem]
                      [filenames ...]
    
    Extract a fingerprint tag from an SD file and generate FPS or FPB fingerprints
    
    positional arguments:
      filenames             input SD files (default is stdin)
    
    optional arguments:
      -h, --help            show this help message and exit
      --id-tag TAG          get the record id from TAG instead of the first line
                            of the record
      --fp-tag TAG          get the fingerprint from tag TAG (required)
      --in FORMAT           Specify the input format (one of "sdf", "sdf.gz", or
                            "sdf.zst")
      --num-bits INT        use the first INT bits of the input. Use only when the
                            last 1-7 bits of the last byte are not part of the
                            fingerprint. Unexpected errors will occur if these
                            bits are not all zero.
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=strict)
      -o FILENAME, --output FILENAME
                            save the fingerprints to FILENAME (default=stdout)
      --out FORMAT          output format, one of 'fps', 'fps.gz', 'fps.zst',
                            'fpb', or 'flush' (default guesses from output
                            filename, or is 'fps')
      --software TEXT       use TEXT as the software description
      --type TEXT           use TEXT as the fingerprint type description
      --version             show program's version number and exit
      --license-check       Check the license and report results to stdout.
    
    Fingerprint decoding options:
      --binary              Encoded with the characters '0' and '1'. Bit #0 comes
                            first. Example: 00100000 encodes the value 4
      --binary-msb          Encoded with the characters '0' and '1'. Bit #0 comes
                            last. Example: 00000100 encodes the value 4
      --hex                 Hex encoded. Bit #0 is the first bit (1<<0) of the
                            first byte. Example: 01f2 encodes the value \x01\xf2 =
                            498
      --hex-lsb             Hex encoded. Bit #0 is the eighth bit (1<<7) of the
                            first byte. Example: 804f encodes the value \x01\xf2 =
                            498
      --hex-msb             Hex encoded. Bit #0 is the first bit (1<<0) of the
                            last byte. Example: f201 encodes the value \x01\xf2 =
                            498
      --base64              Base-64 encoded. Bit #0 is first bit (1<<0) of first
                            byte. Example: AfI= encodes value \x01\xf2 = 498
      --cactvs              CACTVS encoding, based on base64 and includes a
                            version and bit length
      --daylight            Daylight encoding, which is a base64 variant
      --decoder DECODER     import and use the DECODER function to decode the
                            fingerprint
    
    shortcuts:
      --pubchem             decode CACTVS substructure keys used in PubChem. Same as
                            --software=CACTVS/unknown --type 'CACTVS-E_SCREEN/1.0
                            extended=2' --fp-tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs

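The bit-order differences between the decoding options are easiest to see by
running the examples from the help text. The Python sketch below reproduces
the listed examples with the standard library only. The ``decode_*`` function
names are hypothetical and are not chemfp's decoder API, and the multi-byte
behavior of the two binary decoders is an assumption extrapolated from the
one-byte examples.

.. code-block:: python

    import base64

    def decode_binary(text):
        """--binary: character i is bit i, so bit #0 comes first."""
        fp = bytearray((len(text) + 7) // 8)
        for i, char in enumerate(text):
            if char == "1":
                fp[i // 8] |= 1 << (i % 8)
        return bytes(fp)

    def decode_binary_msb(text):
        """--binary-msb: bit #0 comes last, so reverse the string first."""
        return decode_binary(text[::-1])

    def decode_hex(text):
        """--hex: plain hex, bytes in order."""
        return bytes.fromhex(text)

    def decode_hex_lsb(text):
        """--hex-lsb: the bits inside each byte are in reverse order."""
        return bytes(int(format(byte, "08b")[::-1], 2)
                     for byte in bytes.fromhex(text))

    def decode_hex_msb(text):
        """--hex-msb: the bytes are in reverse order."""
        return bytes.fromhex(text)[::-1]

    def decode_base64(text):
        """--base64: standard base64."""
        return base64.b64decode(text)

    # Each example from the help text decodes to the same value.
    assert decode_binary("00100000") == decode_binary_msb("00000100") == b"\x04"
    assert decode_hex("01f2") == b"\x01\xf2"
    assert decode_hex_lsb("804f") == b"\x01\xf2"
    assert decode_hex_msb("f201") == b"\x01\xf2"
    assert decode_base64("AfI=") == b"\x01\xf2"
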
.. _simsearch:

simsearch command-line options
==============================

The following comes from ``simsearch --help``:

.. code-block:: none

    usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--alpha ALPHA]
                        [--beta BETA] [--queries QUERIES] [--NxN] [--query QUERY]
                        [--hex-query HEX_QUERY] [--query-id QUERY_ID]
                        [--query-format FORMAT] [--target-format FORMAT]
                        [--query-type STRING] [--id-tag NAME]
                        [--errors {strict,report,ignore}] [-R NAME=VALUE]
                        [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                        [-o FILENAME] [-c] [-b BATCH_SIZE] [--scan] [--memory]
                        [--no-mmap] [--times] [--version] [--license-check]
                        target_filename
    
    Search an FPS or FPB file for similar fingerprints
    
    positional arguments:
      target_filename       target filename
    
    optional arguments:
      -h, --help            show this help message and exit
      -k K_NEAREST, --k-nearest K_NEAREST
                            select the k nearest neighbors (use 'all' for all
                            neighbors)
      -t THRESHOLD, --threshold THRESHOLD
                            minimum similarity score threshold
      --alpha ALPHA         Tversky alpha parameter (default: 1.0)
      --beta BETA           Tversky beta parameter (default: the value of --alpha)
      --queries QUERIES, -q QUERIES
                            filename containing the query fingerprints
      --NxN                 use the targets as the queries, and exclude the self-
                            similarity term
      --query QUERY         query as a structure record (default format: 'smi')
      --hex-query HEX_QUERY
                            query in hex
      --query-id QUERY_ID   id for the query or hex-query (default: 'Query1')
      --query-format FORMAT, --in FORMAT
                            input query format (default uses the file extension,
                            else 'fps')
      --target-format FORMAT
                            input target format (default uses the file extension,
                            else 'fps')
      --query-type STRING   fingerprint type string if the queries are structures
                            (default: use the target fingerprint type)
      --id-tag NAME         tag containing the record id (if --query-format is an
                            SD file)
      --errors {strict,report,ignore}
                            how should structure parse errors be handled?
                            (default=ignore)
      -R NAME=VALUE         specify a reader argument
      --delimiter {tab,whitespace,to-eol,space}
                            delimiter style for SMILES and InChI files. Alias for
                            '-R delimiter=VALUE'.
      --has-header          Skip the first line of a SMILES or InChI file. Alias
                            for '-R has_header=1'
      -o FILENAME, --output FILENAME
                            output filename (default is stdout)
      -c, --count           report counts
      -b BATCH_SIZE, --batch-size BATCH_SIZE
                            batch size
      --scan                scan the file to find matches (low memory overhead)
      --memory              build and search an in-memory data structure
                            (faster for multiple queries)
      --no-mmap             don't use mmap to read uncompressed FPB
                            files. May give better performance on
                            networked file systems, at the expense
                            of higher memory use.
      --times               report load and execution times to stderr
      --version             show program's version number and exit
      --license-check       Check the license and report results to stdout.
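
The ``--alpha`` and ``--beta`` options select the Tversky index, which reduces
to the Tanimoto similarity when both are 1.0 and to the Dice similarity when
both are 0.5. The pure-Python sketch below shows how such a score combines
with ``--k-nearest`` and ``--threshold``. It is a minimal illustration and not
how simsearch is implemented: the ``tversky`` and ``knearest`` functions are
hypothetical, the assignment of alpha to the query-only bits is an assumption,
and so are the default values for ``k`` and ``threshold`` and the score of 0.0
for a zero denominator.

.. code-block:: python

    def popcount(fp):
        """Count the on-bits in a byte string."""
        return sum(bin(byte).count("1") for byte in fp)

    def tversky(query, target, alpha=1.0, beta=None):
        """Tversky index of two equal-length byte-string fingerprints."""
        if beta is None:
            beta = alpha  # mirrors "default: the value of --alpha"
        if len(query) != len(target):
            raise ValueError("fingerprints must have the same length")
        both = popcount(bytes(q & t for q, t in zip(query, target)))
        only_query = popcount(query) - both
        only_target = popcount(target) - both
        denominator = both + alpha * only_query + beta * only_target
        if denominator == 0:
            return 0.0  # assumption: score an empty comparison as 0.0
        return both / denominator

    def knearest(query, targets, k=3, threshold=0.0, alpha=1.0, beta=None):
        """Return up to k (id, score) pairs with score >= threshold, best first."""
        hits = [(target_id, tversky(query, fp, alpha, beta))
                for target_id, fp in targets]
        hits = [hit for hit in hits if hit[1] >= threshold]
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:k]

For example, the one-byte fingerprints ``b"\x0f"`` and ``b"\x3c"`` share two
bits and each has two bits of its own, so the score is 2/6 with the default
alpha and beta (Tanimoto) and 2/4 with ``alpha=0.5`` (Dice).
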