.. _cdk2fps:

cdk2fps command-line options
====================================

The following comes from ``cdk2fps --help``:

.. code-block:: none

  Usage: cdk2fps [OPTIONS] [FILENAMES]...
  
    Generate fingerprints from a structure file using CDK.
  
    If specified, process the filenames, otherwise read from stdin.
  
  Fingerprint types:
    --Daylight              Make Daylight-like fingerprints using
                            cdk.fingerprinter.Fingerprinter (default)
    --GraphOnly             Make Daylight-like fingerprints (ignoring bond
                            types) using GraphOnlyFingerprinter
    --MACCS                 Make 166-bit MACCS keys using MACCSFingerprinter
    --EState                Make 79-bit EState fingerprints using
                            EStateFingerprinter
    --Extended              Make Daylight-like fingerprints extended with ring
                            feature bits, using ExtendedFingerprinter
    --Hybridization         Make Daylight-like fingerprints based on SP2
                            hybridization instead of aromaticity, using
                            HybridizationFingerprinter
    --KlekotaRoth           Make 4860-bit Klekota-Roth fingerprint, using
                            KlekotaRothFingerprinter
    --Pubchem               Make 881-bit PubChem fingerprint, using
                            PubchemFingerprinter
    --Substructure          Make 307-bit substructure fingerprint, using
                            SubstructureFingerprinter
    --ShortestPath          Make fingerprints based on the shortest path between
                            atoms, ring systems, and more, using
                            ShortestPathFingerprinter
    --ECFP0                 Make ECFP0-like circular fingerprints, using
                            CircularFingerprinter(CLASS_ECFP0)
    --ECFP2                 Make ECFP2-like circular fingerprints, using
                            CircularFingerprinter(CLASS_ECFP2)
    --ECFP4                 Make ECFP4-like circular fingerprints, using
                            CircularFingerprinter(CLASS_ECFP4)
    --ECFP6                 Make ECFP6-like circular fingerprints, using
                            CircularFingerprinter(CLASS_ECFP6)
    --FCFP0                 Make FCFP0-like circular fingerprints, using
                            CircularFingerprinter(CLASS_FCFP0)
    --FCFP2                 Make FCFP2-like circular fingerprints, using
                            CircularFingerprinter(CLASS_FCFP2)
    --FCFP4                 Make FCFP4-like circular fingerprints, using
                            CircularFingerprinter(CLASS_FCFP4)
    --FCFP6                 Make FCFP6-like circular fingerprints, using
                            CircularFingerprinter(CLASS_FCFP6)
    --AtomPairs2D           Make 780-bit atom-pair fingerprints adapted from Yap
                            Chun Wei's PaDEL, using AtomPairs2DFingerprinter
    --substruct             Generate ChemFP substructure fingerprints (you
                            likely want to use --Pubchem instead)
    --rdmaccs, --rdmaccs/2  Generate chemfp's MACCS fingerprints, version 2.
    --rdmaccs/1             Generate chemfp's MACCS fingerprints, version 1.
    --type TYPE_STR         Specify a chemfp type string
    --using FILENAME        Get the fingerprint type from the metadata of a
                            fingerprint file
  
  Fingerprint options:
    --size INT                     Fingerprint size (default=1024)
                                   [ShortestPath, Hybridization, Daylight, ECFP,
                                   Extended, FCFP, GraphOnly]
    --searchDepth INT              Search depth (default=7) [GraphOnly,
                                   Daylight, Hybridization, Extended]
    --pathLimit INT                Path limit (default=42000) [GraphOnly,
                                   Daylight, Hybridization, Extended]
    --hashPseudoAtoms 0|1          Include pseudo-atoms in path enumeration
                                   (default=0) [GraphOnly, Daylight,
                                   Hybridization, Extended]
    --perceiveStereochemistry 0|1  Re-perceive stereochemistry from 2D/3D
                                   coordinates (default=0) [FCFP, ECFP]
  
  Options:
    --id-tag TAG                    Get the record id from the tag TAG instead
                                    of the first line of the record.
    --in FORMAT                     Input structure format (default guesses from
                                    filename)
    -o, --output FILENAME           Save the fingerprints to FILENAME
                                    (default=stdout)
    --out FORMAT                    Output fingerprint format (default guesses
                                    from output filename, or is 'fps')
    --include-metadata / --no-metadata
                                    With --no-metadata, do not include the
                                    header metadata for FPS output.
    --no-date                       Do not include the 'date' metadata in the
                                    output header
    --date STR                      An ISO 8601 date (like
                                    '2022-02-07T11:10:15') to use for the 'date'
                                    metadata in the output header
    --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                    Forces '-R delimiter=VALUE'.
    --has-header                    Skip the first line of a SMILES or InChI
                                    file. Forces '-R has_header=1'.
    -R NAME=VALUE                   Specify a reader argument
    --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                    support for CXSMILES extensions. Forces '-R
                                    cxsmiles=1' or '-R cxsmiles=0'.
    --errors [strict|report|ignore]
                                    How should structure parse errors be
                                    handled? (default=ignore)
    --progress / --no-progress      Show a progress bar (default: show unless
                                    the output is a terminal)
    --help-formats                  List the available formats and reader
                                    arguments
    --version                       Show the version and exit.
    --license-check                 Check the license and report results to
                                    stdout.
    --help                          Show this message and exit.
  
    By default the CDK structure reader determines the file format and
    compression type based on the filename extension. Unknown filename
    extensions are treated as uncompressed SMILES files.
  
    If the data comes from stdin, or the guess based on extension name is wrong,
    then use the "--in FORMAT" option to change the default input format. For
    example:
  
       --in smi    --in sdf.gz
  
    Use `-R` to specify format-specific reader arguments.
  
    Use `--help-formats` for a list of available formats and reader arguments.

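As a rough illustration of how the options above combine, the following
Python sketch assembles a ``cdk2fps`` command line. The helper name
``build_cdk2fps_command`` and the filenames ``ligands.sdf.gz`` and
``ligands.fps`` are hypothetical and not part of chemfp; only the
command-line flags are taken from the help text above. The sketch only
builds and prints the command and does not run it.

.. code-block:: python

  import shlex

  def build_cdk2fps_command(fingerprint_flag, filenames=(), in_format=None,
                            output=None, size=None):
      """Return an argument list for cdk2fps (hypothetical helper)."""
      args = ["cdk2fps", fingerprint_flag]
      if size is not None:
          # --size applies only to the hashed types listed in the help text
          args += ["--size", str(size)]
      if in_format is not None:
          # needed when reading stdin or when the extension guess is wrong
          args += ["--in", in_format]
      if output is not None:
          args += ["-o", output]
      # with no filenames, cdk2fps reads from stdin
      args += list(filenames)
      return args

  command = build_cdk2fps_command("--ECFP4", ["ligands.sdf.gz"],
                                  output="ligands.fps", size=2048)
  print(shlex.join(command))

The resulting list can be passed to ``subprocess.run`` on a system where
``cdk2fps`` is installed.
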
Supported cdk2fps formats
----------------------------------------------------

The following comes from ``cdk2fps --help-formats``:

.. code-block:: none
    
  These are the structure file formats that chemfp can read when using the CDK
  toolkit.
  
  By default, chemfp uses the filename extension to determine the format type.
  If the filename ends with ".gz" or ".zst" then it is interpreted as a gzip or
  Zstandard compressed file, and the second-to-last extension is used to
  determine the format type. Unknown or unsupported extensions are interpreted
  as a SMILES file.
  
  Note: Zstandard support may depend on the "zstandard" Python package and/or
  the "zstd-jni" Java package. To install the Python package see
  https://pypi.org/project/zstandard/ . To get the Java jar file, see
  https://github.com/luben/zstd-jni and place it in your CLASSPATH.
  
  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated filename
  extension.
  
  The supported filename extensions are:
  
     File Type    Extension(s)
     ==========   =============
       SMILES     can, ism, isosmi, smi, usm
        SDF       mdl, sd, sdf
       InChI      inchi
  
  The format can also be specified by name using the '--in' option:
  
     File Type    Format name (append .gz or .zst if compressed)
     ==========   ==============================================
       SMILES     smi, can, usm
        SDF       sdf
       InChI      inchi
  
  The input format parsers can be configured with the "-R" option. For example,
  the following reader arguments tell the SMILES readers that the fields are
  whitespace delimited and the first line is a header.
  
     -R delimiter=whitespace -R has_header=true
  
  The SMILES format parsers use five additional reader arguments:
  
     * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
       The other values are 'tab', 'whitespace', 'space' and 'native'.
       Use "-R delimiter=native" to match RDKit's native delimiter
       style, which is 'whitespace'.
     * 'has_header', if true will skip the first line of the SMILES
       file (because it is a header line).
     * 'cxsmiles' describes how to handle CXSMILES extensions. The
       default (true) will have CDK process the extension. If false
       any extension will be treated as part of the identifier.
     * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0'
       to skip this step.
     * 'implementation': The default 'cdk' uses CDK's IteratingSMILESReader()
       to parse the SMILES file. The 'chemfp' implementation uses chemfp's
       Python-based SMILES file parser and CDK's SmilesParser() to
       parse each SMILES string. The chemfp implementation is slower
       but may have better error-handling and/or reporting.
  
  The SDF format parser supports five reader arguments:
  
     * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
       mode supports some records with recoverable errors. The strict
       mode fails to parse those records.
     * 'ForceReadAs3DCoordinates', with the default of '0' interprets
       V2000 records where all z-coordinates == 0.0 as 2D records. The
       value '1' tells CDK to interpret all records as 3D.
     * 'AddStereoElements' with the default of '1' adds 0D stereochemistry
       to V2000 records. The value of '0' skips that step.
     * 'InterpretHydrogenIsotopes' with the default of '1' interprets the
       atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
       '0' to disable this interpretation.
     * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
       to parse the SD file. The 'chemfp' implementation uses chemfp's
       SD file parser to parse records, and CDK's MDLReader(),
       MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
       chemfp implementation is about 50% slower than the cdk parser but
       may have better error-handling and/or reporting.
  
  The InChI format parser supports one reader argument:
  
     * 'delimiter' works the same as it does for the SMILES formats
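
The extension-based format detection described above can be sketched in
Python as follows. This is a minimal illustration of the documented rules,
not chemfp's actual implementation: the function name ``guess_file_type``
and the two lookup tables are assumptions made for this example, and the
sketch compares extensions exactly as written (case handling is not
specified in the help text).

.. code-block:: python

  # Extension table copied from the help text above.
  EXTENSION_TO_FILE_TYPE = {
      "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
      "smi": "SMILES", "usm": "SMILES",
      "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
      "inchi": "InChI",
  }
  COMPRESSION_EXTENSIONS = {"gz": "gzip", "zst": "zstd"}

  def guess_file_type(filename):
      """Return (file_type, compression) for a filename (hypothetical helper)."""
      parts = filename.split("/")[-1].split(".")
      compression = None
      # A trailing ".gz" or ".zst" marks a compressed file; the
      # second-to-last extension then determines the format.
      if len(parts) > 1 and parts[-1] in COMPRESSION_EXTENSIONS:
          compression = COMPRESSION_EXTENSIONS[parts.pop()]
      extension = parts[-1] if len(parts) > 1 else ""
      # Unknown or unsupported extensions are interpreted as SMILES.
      return EXTENSION_TO_FILE_TYPE.get(extension, "SMILES"), compression

  print(guess_file_type("ligands.sdf.gz"))
  print(guess_file_type("notes.txt"))

Because an unrecognized extension silently falls back to SMILES, passing
``--in`` explicitly is the safer choice whenever a file uses a
non-standard extension.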