.. _rdkit2fps:

rdkit2fps command-line options
====================================

The following comes from ``rdkit2fps --help``:

.. code-block:: none

  Usage: rdkit2fps [OPTIONS] [FILENAMES]...

    Generate fingerprints from a structure file using RDKit.

    If specified, process the filenames, otherwise read from stdin.

  Fingerprint types:
    --RDK                           Generate RDK fingerprints (default).
    --morgan1                       Generate Morgan fingerprints (radius=1).
    --morgan, --morgan2             Generate Morgan fingerprints (radius=2).
    --morgan3                       Generate Morgan fingerprints (radius=3).
    --morgan4                       Generate Morgan fingerprints (radius=4).
    --torsion, --torsions           Generate Topological Torsion fingerprints.
    --pair, --pairs                 Generate Atom Pair fingerprints.
    --maccs166, --maccs             Generate MACCS fingerprints.
    --avalon                        Generate Avalon fingerprints.
    --pattern                       Generate (substructure) pattern fingerprints.
    --secfp                         Generate SECFP fingerprints, a circular
                                    fingerprint based on fragment SMILES instead
                                    of hashing.
    --substruct                     Generate chemfp's PubChem-like substructure
                                    fingerprints.
    --rdmaccs, --rdmaccs/2          Generate chemfp's MACCS fingerprints, version
                                    2.
    --rdmaccs/1                     Generate chemfp's MACCS fingerprints, version
                                    1.
    --type TYPE_STR                 Specify a chemfp type string
    --using FILENAME                Get the fingerprint type from the metadata of
                                    a fingerprint file

  Fingerprint options:
    --minPath INT                   Minimum number of bonds to include in the
                                    subgraph (default=1) [RDKit]
    --maxPath INT                   Maximum number of bonds to include in the
                                    subgraph (default=7) [RDKit]
    --fpSize INT                    number of bits in the fingerprint [Morgan,
                                    RDKit, Pattern, Avalon, SECFP, AtomPair,
                                    Torsion]
    --nBitsPerHash INT              Number of bits to set per path (default=2)
                                    [RDKit]
    --useHs 0|1                     Include information about the number of
                                    hydrogens on each atom (default=1) [RDKit]
    --fromAtoms, --from-atoms INT,INT,...
                                    Specify the atom indices to use (default=None)
                                    [AtomPair, Morgan, RDKit, Torsion]
    --branchedPaths 0|1             If 1, both branched and unbranched paths will
                                    be used in the fingerprint (default=1) [RDKit]
    --useBondOrder 0|1              If 1, both bond orders will be used in the
                                    path hashes (default=1) [RDKit]
    --radius INT                    radius for the Morgan or SECFP fingerprints
                                    [SECFP, Morgan]
    --useFeatures 0|1               Use chemical-feature invariants (default=0)
                                    [Morgan]
    --useChirality 0|1              Include chirality information (default=0)
                                    [Morgan]
    --useBondTypes 0|1              Include bond type information (default=1)
                                    [Morgan]
    --includeRedundantEnvironments 0|1
                                    If 1, do not check for redundant atom
                                    environments (default=0) [Morgan]
    --minLength INT                 Minimum bond count for a pair (default=1)
                                    [AtomPair]
    --maxLength INT                 Maximum bond count for a pair (default=30)
                                    [AtomPair]
    --nBitsPerEntry INT             Number of bits per entry (default=4)
                                    [AtomPair, Torsion]
    --includeChirality 0|1          include chirality information [AtomPair,
                                    Torsion]
    --use2D 0|1                     Use 2D instead of 3D distance matrix
                                    (default=1) [AtomPair]
    --targetSize INT                Number of bonds per torsion (default=4)
                                    [Torsion]
    --isQuery 0|1                   Is the fingerprint for a query structure? (1
                                    if yes, 0 if no) (default=0) [Avalon]
    --bitFlags INT                  Bit flags, SSSBits are 32767 and similarity
                                    bits are 15761407 (default=15761407) [Avalon]
    --rings 0|1                     If 1, add SSSR ring to the fingerprint
                                    (default=1) [SECFP]
    --isomeric 0|1                  If 1, use isomeric SMILES instead of non-
                                    isomeric SMILES (default=0) [SECFP]
    --kekulize 0|1                  If 1, use Kekule SMILES instead of aromatic
                                    SMILES (default=0) [SECFP]
    --min_radius, --min-radius INT  Minimum radius used to extract n-grams
                                    (default=1) [SECFP]

  Options:
    --id-tag TAG                    Tag name containing the record id (SD files
                                    only)
    --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                    Forces '-R delimiter=VALUE'.
    --has-header                    Skip the first line of a SMILES or InChI file.
                                    Forces '-R has_header=1'.
    -R NAME=VALUE                   Specify a reader argument
    --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                    support for CXSMILES extensions. Forces '-R
                                    cxsmiles=1' or '-R cxsmiles=0'.
    --in FORMAT                     Input structure format (default guesses from
                                    filename)
    -o, --output FILENAME           Save the fingerprints to FILENAME
                                    (default=stdout)
    --out FORMAT                    Output structure format (default guesses from
                                    output filename, or is 'fps')
    --include-metadata / --no-metadata
                                    With --no-metadata, do not include the header
                                    metadata for FPS output.
    --no-date                       Do not include the 'date' metadata in the
                                    output header
    --date STR                      An ISO 8601 date (like '2022-02-07T11:10:15')
                                    to use for the 'date' metadata in the output
                                    header
    --errors [strict|report|ignore]
                                    How should structure parse errors be handled?
                                    (default=ignore)
    --progress / --no-progress      Show a progress bar (default: show unless the
                                    output is a terminal)
    --help-formats                  List the available formats and reader
                                    arguments
    --version                       Show the version and exit.
    --license-check                 Check the license and report results to
                                    stdout.
    --help                          Show this message and exit.

    This program guesses the input structure format and the compression based
    on the filename extension. If the guess fails then it assumes the input is
    an uncompressed SMILES file. If the data comes from stdin, or the guess
    based on extension name is wrong, then use "--in" to change the default
    input format.

    Use the '-R' reader arguments option to pass in format-specific structure
    reader arguments. The details depend on the specific format. Use the
    command-line option `--help-formats` to display a list of available
    formats and reader arguments.

Supported rdkit2fps formats
----------------------------------------------------

The following comes from ``rdkit2fps --help-formats``:

.. code-block:: none

  These are the structure file formats that chemfp can read when using
  the RDKit toolkit.

  By default, chemfp uses the filename extension to determine the format
  type. If the filename ends with ".gz" or ".zst" then it is interpreted
  as a gzip or Zstandard compressed file, and the second-to-last
  extension is used to determine the format type. Unknown or unsupported
  extensions are interpreted as a SMILES file. You may instead specify
  the file format by name (see below), which is especially important
  when reading from stdin, which has no associated filename extension.

  The supported filename extensions are:

      File Type      Extension(s)
      ==========     =============
      SMILES         can, ism, isosmi, smi, usm
      SDF            mdl, sd, sdf
      InChI          inchi
      Tripos Mol2    mol2
      PDB            ent, pdb
      Maestro        mae, maegz
      FASTA          faa, fasta

  The format can also be specified by name using the '--in' option:

      File Type      Format name (append .gz or .zst if compressed)
      ==========     ==============================================
      SMILES         smi, can, usm
      SDF            sdf
      InChI          inchi
      Tripos Mol2    mol2
      PDB            pdb
      Maestro        mae
      FASTA          fasta

  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.

      -R delimiter=whitespace -R has_header=true

  All of the input formats implement the 'sanitize' option, which is
  enabled by default. Use "-R sanitize=false" to disable sanitization.

  The SMILES format parsers use three additional reader arguments:
    * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
      The other values are 'tab', 'whitespace', 'space' and 'native'.
      Use "-R delimiter=native" to match RDKit's native delimiter style,
      which is 'whitespace'.
    * 'has_header', if true will skip the first line of the SMILES file
      (because it is a header line).
    * 'cxsmiles' describes how to handle CXSMILES extensions. The default
      (true) will have RDKit process the extension. If false any
      extension will be treated as part of the identifier.

  The SDF format parser supports two additional reader arguments:
    * 'strictParsing', if false will disable strict parsing
    * 'removeHs', if false will keep all of the hydrogens

  The InChI format parser supports four additional reader arguments:
    * 'delimiter' works the same as it does for the SMILES formats
    * 'removeHs' works the same as it does for the SDF format
    * 'treatWarningAsError', if true treats all warnings as errors
    * 'logLevel' specifies the RDKit/InChI library log level, as an integer

  The Tripos Mol2 format parser supports two additional reader arguments:
    * 'removeHs' works the same as it does for the SDF format
    * 'cleanupSubstructures' if false disables standardizing some
      substructures found in Mol2 files

  The PDB format parser supports three additional reader arguments:
    * 'removeHs' works the same as it does for the SDF format
    * 'flavor', an input parameter with no documented meaning
    * 'proximityBonding', if false will disable automatic proximity bonding

  The Maestro format parser supports one additional reader argument:
    * 'removeHs' works the same as it does for the SDF format

  The FASTA format parser supports one additional reader argument:
    * 'flavor', an integer from 0 to 9. The values mean:
        0 - the sequence contains L-amino acids
        1 - allow lowercase for D-amino acids
        2 - RNA with no cap        6 - DNA with no cap
        3 - RNA with 5' cap        7 - DNA with 5' cap
        4 - RNA with 3' cap        8 - DNA with 3' cap
        5 - RNA with both caps     9 - DNA with both caps
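The extension rules listed in the ``--help-formats`` output can be illustrated with a minimal Python sketch. This is not chemfp's implementation. The helper name ``guess_file_type`` and the ``FILE_TYPES`` table are hypothetical, the table is copied from the extension list in the help text, and exact (case-sensitive) extension matching is a simplifying assumption.

```python
import os

# Extension table copied from the `rdkit2fps --help-formats` listing.
# FILE_TYPES and guess_file_type are hypothetical names used only for
# this illustration; they are not part of chemfp's API.
FILE_TYPES = {
    "SMILES": ("can", "ism", "isosmi", "smi", "usm"),
    "SDF": ("mdl", "sd", "sdf"),
    "InChI": ("inchi",),
    "Tripos Mol2": ("mol2",),
    "PDB": ("ent", "pdb"),
    "Maestro": ("mae", "maegz"),
    "FASTA": ("faa", "fasta"),
}
EXT_TO_TYPE = {ext: name for name, exts in FILE_TYPES.items() for ext in exts}


def guess_file_type(filename):
    """Return (file_type, compression) guessed from the filename extension."""
    parts = os.path.basename(filename).split(".")
    compression = ""
    # A trailing ".gz" or ".zst" marks a compressed file; the format then
    # comes from the second-to-last extension.
    if len(parts) > 1 and parts[-1] in ("gz", "zst"):
        compression = parts.pop()
    ext = parts[-1] if len(parts) > 1 else ""
    # Unknown or missing extensions are treated as SMILES.
    return EXT_TO_TYPE.get(ext, "SMILES"), compression
```

For example, ``guess_file_type("data.sdf.gz")`` gives ``("SDF", "gz")``, while a name with an unrecognized extension falls back to SMILES. That fallback is the case where passing ``--in`` explicitly matters, as it does for input read from stdin.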