.. _cdk2fps:

cdk2fps command-line options
====================================

The following comes from ``cdk2fps --help``:

.. code-block:: none

    Usage: cdk2fps [OPTIONS] [FILENAMES]...

      Generate fingerprints from a structure file using CDK.

      If specified, process the filenames, otherwise read from stdin.

    Fingerprint types:
      --Daylight                    Make Daylight-like fingerprints using
                                    cdk.fingerprinter.Fingerprinter (default)
      --GraphOnly                   Make Daylight-like fingerprints (ignoring bond
                                    types) using GraphOnlyFingerprinter
      --MACCS                       Make 166-bit MACCS keys using
                                    MACCSFingerprinter
      --EState                      Make 79-bit EState fingerprints using
                                    EStateFingerprinter
      --Extended                    Make Daylight-like fingerprints extended with
                                    ring feature bits, using ExtendedFingerprinter
      --Hybridization               Make Daylight-like fingerprints based on SP2
                                    hybridization instead of aromaticity, using
                                    HybridizationFingerprinter
      --KlekotaRoth                 Make 4860-bit Klekota-Roth fingerprint, using
                                    KlekotaRothFingerprinter
      --Pubchem                     Make 881-bit PubChem fingerprint, using
                                    PubchemFingerprinter
      --Substructure                Make 307-bit substructure fingerprint, using
                                    SubstructureFingerprinter
      --ShortestPath                Make fingerprints based on the shortest path
                                    between atoms, ring systems, and more, using
                                    ShortestPathFingerprinter
      --ECFP0                       Make ECFP0-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_ECFP0)
      --ECFP2                       Make ECFP2-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_ECFP2)
      --ECFP4                       Make ECFP4-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_ECFP4)
      --ECFP6                       Make ECFP6-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_ECFP6)
      --FCFP0                       Make FCFP0-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_FCFP0)
      --FCFP2                       Make FCFP2-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_FCFP2)
      --FCFP4                       Make FCFP4-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_FCFP4)
      --FCFP6                       Make FCFP6-like circular fingerprints, using
                                    CircularFingerprinter(CLASS_FCFP6)
      --AtomPairs2D                 Make 780-bit atom-pair fingerprints adapted
                                    from Yap Chun Wei's PaDEL, using
                                    AtomPairs2DFingerprinter
      --substruct                   Generate ChemFP substructure fingerprints (you
                                    likely want to use --Pubchem instead)
      --rdmaccs, --rdmaccs/2        Generate chemfp's MACCS fingerprints, version
                                    2.
      --rdmaccs/1                   Generate chemfp's MACCS fingerprints, version
                                    1.
      --type TYPE_STR               Specify a chemfp type string
      --using FILENAME              Get the fingerprint type from the metadata of
                                    a fingerprint file

    Fingerprint options:
      --size INT                    Fingerprint size (default=1024)
                                    [ShortestPath, Hybridization, Daylight, ECFP,
                                    Extended, FCFP, GraphOnly]
      --searchDepth INT             Search depth (default=7) [GraphOnly, Daylight,
                                    Hybridization, Extended]
      --pathLimit INT               Path limit (default=42000) [GraphOnly,
                                    Daylight, Hybridization, Extended]
      --hashPseudoAtoms 0|1         Include pseudo-atoms in path enumeration
                                    (default=0) [GraphOnly, Daylight,
                                    Hybridization, Extended]
      --perceiveStereochemistry 0|1
                                    Re-perceive stereochemistry from 2D/3D
                                    coordinates (default=0) [FCFP, ECFP]

    Options:
      --id-tag TAG                  Get the record id from the tag TAG instead of
                                    the first line of the record.
      --in FORMAT                   Input structure format (default guesses from
                                    filename)
      -o, --output FILENAME         Save the fingerprints to FILENAME
                                    (default=stdout)
      --out FORMAT                  Output structure format (default guesses from
                                    output filename, or is 'fps')
      --include-metadata / --no-metadata
                                    With --no-metadata, do not include the header
                                    metadata for FPS output.
      --no-date                     Do not include the 'date' metadata in the
                                    output header
      --date STR                    An ISO 8601 date (like '2022-02-07T11:10:15')
                                    to use for the 'date' metadata in the output
                                    header
      --delimiter VALUE             Delimiter style for SMILES and InChI files.
                                    Forces '-R delimiter=VALUE'.
      --has-header                  Skip the first line of a SMILES or InChI file.
                                    Forces '-R has_header=1'.
      -R NAME=VALUE                 Specify a reader argument
      --cxsmiles / --no-cxsmiles    Use --no-cxsmiles to disable the default
                                    support for CXSMILES extensions. Forces '-R
                                    cxsmiles=1' or '-R cxsmiles=0'.
      --errors [strict|report|ignore]
                                    How should structure parse errors be handled?
                                    (default=ignore)
      --progress / --no-progress    Show a progress bar (default: show unless the
                                    output is a terminal)
      --help-formats                List the available formats and reader
                                    arguments
      --version                     Show the version and exit.
      --license-check               Check the license and report results to
                                    stdout.
      --help                        Show this message and exit.

      By default the CDK structure reader determines the file format and
      compression type based on the filename extension. Unknown filename
      extensions are treated as uncompressed SMILES files.

      If the data comes from stdin, or the guess based on extension name is
      wrong, then use the "--in FORMAT" option to change the default input
      format. For example:

        --in smi
        --in sdf.gz

      Use `-R` to specify format-specific reader arguments. Use
      `--help-formats` for a list of available formats and reader arguments.

Supported cdk2fps formats
----------------------------------------------------

The following comes from ``cdk2fps --help-formats``:

.. code-block:: none

    These are the structure file formats that chemfp can read when using
    the CDK toolkit.

    By default, chemfp uses the filename extension to determine the format
    type. If the filename ends with ".gz" or ".zst" then it is interpreted
    as a gzip or Zstandard compressed file, and the second-to-last
    extension is used to determine the format type. Unknown or unsupported
    extensions are interpreted as a SMILES file.

    Note: Zstandard support may depend on the "zstandard" Python package
    and/or the "zstd-jni" Java package. To install the Python package see
    https://pypi.org/project/zstandard/ . To get the Java jar file, see
    https://github.com/luben/zstd-jni and place it in your CLASSPATH.

    You may instead specify the file format by name (see below), which is
    especially important when reading from stdin, which has no associated
    filename extension.

    The supported filename extensions are:

      File Type   Extension(s)
      ==========  =============
       SMILES     can, ism, isosmi, smi, usm
       SDF        mdl, sd, sdf
       InChI      inchi

    The format can also be specified by name using the '--in' option:

      File Type   Format name (append .gz or .zst if compressed)
      ==========  ==============================================
       SMILES     smi, can, usm
       SDF        sdf
       InChI      inchi

    The input format parsers can be configured with the "-R" option. For
    example, the following reader arguments tell the SMILES readers that
    the fields are whitespace delimited and the first line is a header.

       -R delimiter=whitespace -R has_header=true

    The SMILES format parsers use five additional reader arguments:

      * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
        The other values are 'tab', 'whitespace', 'space' and 'native'. Use
        "-R delimiter=native" to match RDKit's native delimiter style,
        which is 'whitespace'.
      * 'has_header', if true, will skip the first line of the SMILES file
        (because it is a header line).
      * 'cxsmiles' describes how to handle CXSMILES extensions. The default
        (true) will have CDK process the extension. If false, any extension
        will be treated as part of the identifier.
      * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0' to
        skip this step.
      * 'implementation': The default 'cdk' uses CDK's
        IteratingSMILESReader() to parse the SMILES file. The 'chemfp'
        implementation uses chemfp's Python-based SMILES file parser and
        CDK's SmilesParser() to parse each SMILES string. The chemfp
        implementation is slower but may have better error-handling and/or
        reporting.

    The SDF format parser supports five reader arguments:

      * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
        mode supports some records with recoverable errors. The strict mode
        fails to parse those records.
      * 'ForceReadAs3DCoordinates', with the default of '0', interprets
        V2000 records where all z-coordinates == 0.0 as 2D records. The
        value '1' tells CDK to interpret all records as 3D.
      * 'AddStereoElements', with the default of '1', adds 0D
        stereochemistry to V2000 records. The value of '0' skips that step.
      * 'InterpretHydrogenIsotopes', with the default of '1', interprets
        the atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
        '0' to disable this interpretation.
      * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
        to parse the SD file. The 'chemfp' implementation uses chemfp's SD
        file parser to parse records, and CDK's MDLReader(),
        MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
        chemfp implementation is about 50% slower than the cdk parser but
        may have better error-handling and/or reporting.

    The InChI format parser supports one reader argument:

      * 'delimiter' works the same as it does for the SMILES formats.
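
The extension-based format guess described in the ``--help-formats`` output
can be sketched in a few lines of Python. This is only an illustration of the
documented rule, not chemfp's implementation: the names ``guess_format``,
``FILE_TYPES`` and ``COMPRESSION_SUFFIXES`` are hypothetical, the extension
table is copied from the help text, and the exact-case comparison is an
assumption because the help text does not say how case is handled.

.. code-block:: python

    import os

    # Extension -> file type, copied from the "--help-formats" table.
    # (Hypothetical name; not part of the chemfp API.)
    FILE_TYPES = {
        "can": "SMILES", "ism": "SMILES", "isosmi": "SMILES",
        "smi": "SMILES", "usm": "SMILES",
        "mdl": "SDF", "sd": "SDF", "sdf": "SDF",
        "inchi": "InChI",
    }

    # Final extensions that mark a compressed file.
    COMPRESSION_SUFFIXES = {"gz": "gzip", "zst": "zstd"}


    def guess_format(filename):
        """Return (file_type, compression) guessed from the filename.

        Assumption: extensions are compared exactly as written, since the
        help text does not describe case handling.
        """
        extensions = os.path.basename(filename).split(".")[1:]
        compression = None
        # A trailing ".gz" or ".zst" means the second-to-last extension
        # determines the format type.
        if extensions and extensions[-1] in COMPRESSION_SUFFIXES:
            compression = COMPRESSION_SUFFIXES[extensions.pop()]
        ext = extensions[-1] if extensions else ""
        # Unknown or missing extensions fall back to SMILES.
        return FILE_TYPES.get(ext, "SMILES"), compression


    print(guess_format("compounds.sdf.gz"))  # ('SDF', 'gzip')
    print(guess_format("notes.txt"))         # ('SMILES', None)

When the guess would be wrong, or the input comes from stdin, the help text
above says to name the format explicitly with ``--in``, for example
``--in sdf.gz``.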