.. _cdk2fps:

cdk2fps command-line options
====================================

The following comes from ``cdk2fps --help``:

.. code-block:: none

  usage: cdk2fps [-h]
                 [--type TYPE_STRING | --using FILENAME | --Daylight |
                    --GraphOnly | --MACCS | --EState | --Extended |
                    --Hybridization | --KlekotaRoth | --Pubchem |
                    --Substructure | --ShortestPath | --ECFP0 | --ECFP2 |
                    --ECFP4 | --ECFP6 | --FCFP0 | --FCFP2 | --FCFP4 |
                    --FCFP6 | --AtomPairs2D]
                 [--substruct] [--rdmaccs] [--rdmaccs/1] [--size INT]
                 [--searchDepth INT] [--pathLimit INT] [--hashPseudoAtoms 0|1]
                 [--perceiveStereochemistry 0|1] [--id-tag NAME] [--in FORMAT]
                 [-o FILENAME] [--out FORMAT] [--errors {strict,report,ignore}]
                 [--progress | --no-progress] [--help-formats] [-R NAME=VALUE]
                 [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                 [--version] [--license-check]
                 [filenames ...]
  
  Generate FPS or FPB fingerprints from a structure file using CDK via JPype
  
  positional arguments:
    filenames             input structure files (default is stdin)
  
  options:
    -h, --help            show this help message and exit
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a
                          fingerprint file
    --Daylight            Make Daylight-like fingerprints using
                          cdk.fingerprinter.Fingerprinter (default)
    --GraphOnly           Make Daylight-like fingerprints (ignoring bond types)
                          using GraphOnlyFingerprinter
    --MACCS               Make 166-bit MACCS keys using MACCSFingerprinter
    --EState              Make 79-bit EState fingerprints using
                          EStateFingerprinter
    --Extended            Make Daylight-like fingerprints extended with ring
                          feature bits, using ExtendedFingerprinter
    --Hybridization       Make Daylight-like fingerprints based on SP2
                          hybridization instead of aromaticity, using
                          HybridizationFingerprinter
    --KlekotaRoth         Make 4860-bit Klekota-Roth fingerprint, using
                          KlekotaRothFingerprinter
    --Pubchem             Make 881-bit PubChem fingerprint, using
                          PubchemFingerprinter
    --Substructure        Make 307-bit substructure fingerprint, using
                          SubstructureFingerprinter
    --ShortestPath        Make fingerprints based on the shortest path between
                          atoms, ring systems, and more, using
                          ShortestPathFingerprinter
    --ECFP0               Make ECFP0-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP0)
    --ECFP2               Make ECFP2-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP2)
    --ECFP4               Make ECFP4-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP4)
    --ECFP6               Make ECFP6-like circular fingerprints, using
                          CircularFingerprinter(CLASS_ECFP6)
    --FCFP0               Make FCFP0-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP0)
    --FCFP2               Make FCFP2-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP2)
    --FCFP4               Make FCFP4-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP4)
    --FCFP6               Make FCFP6-like circular feature fingerprints, using
                          CircularFingerprinter(CLASS_FCFP6)
    --AtomPairs2D         Make 780-bit atom-pair fingerprints adapted from Yap
                          Chun Wei's PaDEL, using AtomPairs2DFingerprinter
    --size INT            fingerprint size (default=1024)
    --searchDepth INT     search depth (default=7)
    --pathLimit INT       path limit (default=42000)
    --hashPseudoAtoms 0|1
                          include pseudo-atoms in path enumeration (default=0)
    --perceiveStereochemistry 0|1
                          re-perceive stereochemistry from 2D/3D coordinates
                          (default=0)
    --id-tag NAME         tag name containing the record id (SD files only)
    --in FORMAT           input structure format (default autodetects from the
                          filename extension)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output
                          filename, or is 'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled?
                          (default=ignore)
    --progress, --no-progress
                          Show a progress bar (default: show unless the output
                          is a terminal)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for
                          '-R delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias
                          for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.
  
  By default the CDK structure reader determines the file format
  and compression type based on the filename extension. Unknown
  filename extensions are treated as uncompressed SMILES files.
  
  If the data comes from stdin, or the guess based on extension name is
  wrong, then use the "--in FORMAT" option to specify the input format.
  For example:
  
     --in smi
     --in sdf.gz
  
  Use `-R` to specify format-specific reader arguments.
  
  Use `--help-formats` for a list of available formats and reader arguments.
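
As an illustration of how these options combine, here is a minimal Python sketch that assembles a ``cdk2fps`` argument list. The helper name ``build_cdk2fps_command`` and the filenames are hypothetical; the sketch uses only flags listed in the help text above and does not check whether a given combination is meaningful for a particular fingerprint type.

.. code-block:: python

  import shlex

  def build_cdk2fps_command(filenames, fp_flag="--ECFP4", in_format=None,
                            output=None, errors=None, reader_args=None):
      """Return an argv list for cdk2fps (hypothetical helper, not part of chemfp)."""
      argv = ["cdk2fps", fp_flag]
      if in_format is not None:
          argv += ["--in", in_format]       # e.g. "smi" or "sdf.gz"
      if output is not None:
          argv += ["-o", output]            # default is stdout
      if errors is not None:
          argv += ["--errors", errors]      # strict, report, or ignore
      for name, value in (reader_args or {}).items():
          argv += ["-R", f"{name}={value}"] # format-specific reader argument
      argv += list(filenames)               # no filenames means stdin
      return argv

  argv = build_cdk2fps_command(
      ["ligands.smi"],
      fp_flag="--MACCS",
      output="ligands.fps",
      errors="report",
      reader_args={"delimiter": "whitespace", "has_header": "true"},
  )
  # Print the command in a form that can be pasted into a shell.
  print(shlex.join(argv))

Passing the list to ``subprocess.run(argv, check=True)`` would run the command, assuming ``cdk2fps`` is installed and on the ``PATH``.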

Supported cdk2fps formats
----------------------------------------------------

The following comes from ``cdk2fps --help-formats``:

.. code-block:: none

  These are the structure file formats that chemfp can read when using
  the CDK toolkit.
  
  By default, chemfp uses the filename extension to determine the format
  type. If the filename ends with ".gz" or ".zst" then it is interpreted
  as a gzip or Zstandard compressed file, and the second-to-last
  extension is used to determine the format type. Unknown or unsupported
  extensions are interpreted as a SMILES file.
  
  Note: Zstandard support may depend on the "zstandard" Python package
  and/or the "zstd-jni" Java package. To install the Python package see
  https://pypi.org/project/zstandard/ . To get the Java jar file, see
  https://github.com/luben/zstd-jni and place it in your CLASSPATH. 
  
  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated
  filename extension.
  
  The supported filename extensions are:
  
     File Type    Extension(s)
     ==========   =============
       SMILES     can, ism, isosmi, smi, usm
        SDF       mdl, sd, sdf
       InChI      inchi
  
  The format can also be specified by name using the '--in' option:
  
     File Type    Format name (append .gz or .zst if compressed)
     ==========   ==============================================
       SMILES     smi, can, usm
        SDF       sdf
       InChI      inchi
  
  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.
  
     -R delimiter=whitespace -R has_header=true
  
  The SMILES format parsers support four reader arguments:
     * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
       The other values are 'tab', 'whitespace', 'space' and 'native'.
       Use "-R delimiter=native" to match RDKit's native delimiter
       style, which is 'whitespace'.
     * 'has_header', if true, will skip the first line of the SMILES
       file (because it is a header line).
     * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0'
       to skip this step.
     * 'implementation': The default 'cdk' uses CDK's IteratingSMILESReader()
       to parse the SMILES file. The 'chemfp' implementation uses chemfp's
       Python-based SMILES file parser and CDK's SmilesParser() to parse
       each SMILES string. The chemfp implementation is slower
       but may have better error-handling and/or reporting.
  
  The SDF format parser supports five reader arguments:
     * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
       mode supports some records with recoverable errors. The strict
       mode fails to parse those records.
     * 'ForceReadAs3DCoordinates', with the default of '0' interprets
       V2000 records where all z-coordinates == 0.0 as 2D records. The
       value '1' tells CDK to interpret all records as 3D.
     * 'AddStereoElements' with the default of '1' adds 0D stereochemistry
       to V2000 records. The value of '0' skips that step.
     * 'InterpretHydrogenIsotopes', with the default of '1' interprets the
       atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
       '0' to disable this interpretation.
     * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
       to parse the SD file. The 'chemfp' implementation uses chemfp's
       SD file parser to parse records, and CDK's MDLReader(),
       MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
       chemfp implementation is about 50% slower than the cdk parser but
       may have better error-handling and/or reporting.
  
  The InChI format parser supports one reader argument:
     * 'delimiter' works the same as it does for the SMILES formats
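
The extension rules described above can be sketched in Python. This is an illustration of the documented behavior, not chemfp's own code: the function name ``guess_file_type`` and the returned labels are hypothetical, and the sketch assumes a case-sensitive match on the extension.

.. code-block:: python

  import os

  # Extension table copied from the "supported filename extensions" list above.
  EXTENSION_TO_FILE_TYPE = {
      "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
      "smi": "SMILES", "usm": "SMILES",
      "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
      "inchi": "InChI",
  }
  COMPRESSION_EXTENSIONS = {"gz": "gzip", "zst": "zstd"}

  def guess_file_type(filename):
      """Return (file_type, compression) following the documented rules."""
      extensions = os.path.basename(filename).split(".")[1:]
      compression = None
      # A trailing ".gz" or ".zst" marks a compressed file; the
      # second-to-last extension then determines the format.
      if extensions and extensions[-1] in COMPRESSION_EXTENSIONS:
          compression = COMPRESSION_EXTENSIONS[extensions.pop()]
      ext = extensions[-1] if extensions else ""
      # Unknown or unsupported extensions are interpreted as SMILES.
      return EXTENSION_TO_FILE_TYPE.get(ext, "SMILES"), compression

  for name in ["drugs.sdf.gz", "ligands.smi", "keys.inchi.zst", "notes.txt"]:
      print(name, guess_file_type(name))

When the guess would be wrong, or the input comes from stdin, the ``--in`` option overrides this extension-based detection.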