###############################
Help for the command-line tools
###############################

The chemfp command-line tools are:

* :ref:`fpcat <fpcat>` - merge multiple fingerprint files into one
* :ref:`ob2fps <ob2fps>` - use Open Babel to generate fingerprints
* :ref:`oe2fps <oe2fps>` - use OEChem/OEGraphSim to generate fingerprints
* :ref:`rdkit2fps <rdkit2fps>` - use RDKit to generate fingerprints
* :ref:`cdk2fps <cdk2fps>` - use CDK to generate fingerprints
* :ref:`sdf2fps <sdf2fps>` - extract fingerprints from an SD file
* :ref:`simsearch <simsearch>` - search a fingerprint file for similar fingerprints

.. _fpcat:

fpcat command-line options
==========================

The following comes from ``fpcat --help``:

.. code-block:: none

  usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
               [--level LEVEL] [--reorder] [--preserve-order] [--alignment N]
               [--show-progress] [--max-spool-size SIZE] [--tmpdir DIRNAME]
               [--version] [--license-check]
               [filename ...]

  Combine multiple fingerprint files into a single file.

  positional arguments:
    filename              input fingerprint filenames (default: use stdin)

  optional arguments:
    -h, --help            show this help message and exit
    --in FORMAT           input fingerprint format. One of fps or fpb (with
                          optional gz or zst compression), or flush. (default
                          guesses from filename or is fps)
    --merge               assume the input fingerprint files are in popcount
                          order and do a merge sort
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output fingerprint format. One of fps, fps.gz,
                          fps.zst, fpb, or flush. (default guesses from output
                          filename, or is 'fps')
    --level LEVEL         compression level. Must be a positive integer or one
                          of 'min', 'default', or 'max'.
    --reorder             reorder the output fingerprints by popcount (default
                          for FPB output)
    --preserve-order      save the output fingerprints in the same order as the
                          input (default for FPS output)
    --alignment N         alignment size when saving an FPB file (default=8)
    --show-progress       show progress
    --max-spool-size SIZE
                          use temporary files for extra storage space for huge
                          FPB files (default uses RAM)
    --tmpdir DIRNAME      directory for the temporary files (default uses the
                          system temp directory)
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.

  Examples:

  fpcat can be used to convert between FPS and FPB formats. This is
  handy if you want to see what's inside of an FPB file:

      fpcat fingerprints.fpb

  You can also use fpcat to make an FPB file from an FPS file:

      fpcat fingerprints.fps -o fingerprints.fpb

  You might have generated a set of FPS files which you want to merge
  into a single FPB. (For example, you might have used GNU parallel to
  generate FPS files for each of the PubChem files, which you want to
  merge into a single file.):

      fpcat Compound_*.fps -o pubchem.fpb

  By default the FPB format sorts the fingerprints by popcount. (Use
  --preserve-order if you really want to preserve the input order.) The
  sort overhead for PubChem uses about 10 GB of RAM. If you don't have
  that much memory then ask fpcat to use less memory:

      fpcat --max-spool-size 1GB Compound_*.fps -o pubchem.fpb

  This will use about 2 GB of RAM and the --tmpdir for the rest. (Yes,
  it would be nice if I could get those two memory size numbers to
  match.)

  The --merge option is experimental. Use it if the input fingerprints
  are in popcount order, because sorted output is a simple merge sort of
  the individual sorted inputs. However, this option opens all input
  files at the same time, which may exceed your resource limit on file
  descriptors. The current implementation also requires a lot of disk
  seeks so is slow for many files.

  The flush format is only available if the chemfp_converter package was
  installed.

.. _ob2fps:

ob2fps command-line options
===========================

The following comes from ``ob2fps --help``:

.. code-block:: none

  usage: ob2fps [-h]
                [--FP2 | --FP3 | --FP4 | --MACCS | --ECFP0 | --ECFP2 |
                 --ECFP4 | --ECFP6 | --ECFP8 | --ECFP10 | --substruct |
                 --rdmaccs | --rdmaccs/1]
                [--nBits INT] [--id-tag NAME] [--type TYPE_STRING]
                [--using FILENAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
                [--errors {strict,report,ignore}] [--help-formats]
                [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}]
                [--has-header] [--version] [--license-check]
                [filenames ...]

  Generate FPS or FPB fingerprints from a structure file using Open Babel

  positional arguments:
    filenames             input structure files (default is stdin)

  optional arguments:
    -h, --help            show this help message and exit
    --FP2                 linear fragments up to 7 atoms
    --FP3                 SMARTS patterns specified in the file patterns.txt
    --FP4                 SMARTS patterns specified in the file
                          SMARTS_InteLigand.txt
    --MACCS               Open Babel's implementation of the MACCS 166 keys
    --ECFP0               ECFP (circular) fingerprints with diameter 0
    --ECFP2               ECFP (circular) fingerprints with diameter 2
    --ECFP4               ECFP (circular) fingerprints with diameter 4
    --ECFP6               ECFP (circular) fingerprints with diameter 6
    --ECFP8               ECFP (circular) fingerprints with diameter 8
    --ECFP10              ECFP (circular) fingerprints with diameter 10
    --substruct           ChemFP substructure fingerprints
    --rdmaccs, --rdmaccs/2
                          166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a
                          fingerprint file
    --in FORMAT           input structure format (default autodetects from the
                          filename extension)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output
                          filename, or is 'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled?
                          (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for
                          '-R delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias
                          for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.

  ECFP argument:
    --nBits INT           number of bits in the fingerprint (default=4096)

  By default the Open Babel structure reader determines the file format
  and compression type based on the filename extension. Unknown filename
  extensions are treated as uncompressed SMILES files.

  If the data comes from stdin, or the guess based on extension name is
  wrong, then use the "--in FORMAT" option to change the default input
  format. For example:

      --in smi
      --in sdf.gz

  Use `-R` to specify format-specific reader arguments. Use
  `--help-formats` for a list of available formats and reader arguments.

The following comes from ``ob2fps --help-formats``, though I've removed
most of the Open Babel formats from the list.

.. code-block:: none

  chemfp has special support for the SMILES, InChI, and SDF formats when
  using the Open Babel toolkit. For these formats, by default, chemfp
  uses the filename extension to determine the format type. If the
  filename ends with ".gz" or ".zst" then it is interpreted as a gzip or
  Zstandard compressed file, and the second-to-last extension is used to
  determine the format type. Unknown or unsupported extensions are then
  tested against Open Babel format names (see below), and if still
  unknown, interpreted as a SMILES file.

  Note: To enable Zstandard compression, please install the "zstandard"
  Python package from https://pypi.org/project/zstandard/ . You will
  need to use "-R implementation=chemfp" to enable zst support for the
  SDF format.

  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated
  filename extension.

  These specially supported filename extensions are:

      File Type      Extension(s)
      ==========     =============
      SMILES         can, ism, isosmi, smi, usm
      SDF            sdf
      InChI          inchi

  The format can also be specified by name using the '--in' option:

      File Type      Format name (append .gz or .zst if compressed)
      ==========     ===========
      SMILES         smi, can, usm
      SDF            sdf
      InChI          inchi

  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.

      -R delimiter=whitespace -R has_header=true

  All of the readers support the 'options' reader argument, which is a
  string passed directly to OBConversion(). This is a compact way to
  encode all of the Open Babel parameters used in the conversion. For
  example, 'ab"text"', would set option 'a' to True, and option 'b' to
  the string "text".

  The SMILES format parsers use two additional reader arguments:
    * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
      The other values are 'tab', 'whitespace', 'space' and 'native'.
      Use "-R delimiter=native" to match Open Babel's native delimiter
      style, which is 'to-eol'.
    * 'has_header', if true will skip the first line of the SMILES file
      (because it is a header line).

  The SDF format parser supports one additional reader argument:
    * 'implementation': if "openbabel" or "native", use Open Babel's
      native SDF parser. If "chemfp" use chemfp's own implementation to
      find SDF records, which are then passed to Open Babel for parsing.
      This gives more fine-grained error reporting and supports zst
      compression, with similar performance.

  (Note: Open Babel supports additional options.)

  The InChI format parser supports one additional reader argument:
    * 'delimiter' works the same as it does for the SMILES formats

  In addition, you may specify an Open Babel format, either by one of
  the following format names, or by reading a filename ending with one
  of the format names, optionally with a .gz suffix. Zstandard
  compression is not supported by the native Open Babel reader.

     Format    Description and options
    =========  ==========================
    CONFIG     DL-POLY CONFIG
    CONTCAR    VASP format
                 s  Output single bonds only
                 b  Disable bonding entirely
    CONTFF     MDFF format
    HISTORY    DL-POLY HISTORY
       .... many lines removed from the chemfp documentation ...
    xyz        XYZ cartesian coordinates format
                 s  Output single bonds only
                 b  Disable bonding entirely
    yob        YASARA.org YOB format

  You will need to consult the Open Babel documentation (see
  http://openbabel.org/wiki/List_of_extensions ) and implementation for
  full details about how these options work.

.. _oe2fps:

oe2fps command-line options
===========================

The following comes from ``oe2fps --help``:

.. code-block:: none

  usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT]
                [--minbonds INT] [--maxbonds INT] [--minradius INT]
                [--maxradius INT] [--atype ATYPE] [--btype BTYPE] [--maccs166]
                [--substruct] [--rdmaccs] [--rdmaccs/1] [--aromaticity NAME]
                [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME]
                [--in FORMAT] [-o FILENAME] [--out FORMAT]
                [--errors {strict,report,ignore}] [--help-formats]
                [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}]
                [--has-header] [--version] [--license-check]
                [filenames ...]

  Generate FPS or FPB fingerprints from a structure file using OEChem

  positional arguments:
    filenames             input structure files (default is stdin)

  optional arguments:
    -h, --help            show this help message and exit
    --aromaticity NAME    use the named aromaticity model (same as '-R
                          aromaticity=NAME')
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a
                          fingerprint file
    --in FORMAT           input structure format (default guesses from
                          filename)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output
                          filename, or is 'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled?
                          (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for
                          '-R delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias
                          for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.

  path, circular, and tree fingerprints:
    --path                generate path fingerprints (default)
    --circular            generate circular fingerprints
    --tree                generate tree fingerprints
    --numbits INT         number of bits in the fingerprint (default=4096)
    --minbonds INT        minimum number of bonds in the path or tree
                          fingerprint (default=0)
    --maxbonds INT        maximum number of bonds in the path or tree
                          fingerprint (path default=5, tree default=4)
    --minradius INT       minimum radius for the circular fingerprint
                          (default=0)
    --maxradius INT       maximum radius for the circular fingerprint
                          (default=5)
    --atype ATYPE         atom type flags, described below (default=Default)
    --btype BTYPE         bond type flags, described below (default=Default)

  166 bit MACCS substructure keys:
    --maccs166            generate MACCS fingerprints

  881 bit ChemFP substructure keys:
    --substruct           generate ChemFP substructure fingerprints

  ChemFP version of the 166 bit RDKit/MACCS keys:
    --rdmaccs, --rdmaccs/2
                          generate 166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs

  ATYPE is one or more of the following, separated by the '|' character

    Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
    Hyb InRing

  The following shorthand terms and expansions are also available:

    DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
    DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
    DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb

  and 'Default' selects the correct value for the specified fingerprint.

  Examples:
    --atype Default
    --atype "Arom|AtmNum|FCharge|HCount"
    --atype Arom,AtmNum,FCharge,HCount

  BTYPE is one or more of the following, separated by the '|' character

    Chiral InRing Order

  The following shorthand terms and expansions are also available:

    DefaultPathBond = Order|Chiral
    DefaultCircularBond = Order
    DefaultTreeBond = Order

  and 'Default' selects the correct value for the specified fingerprint.

  Examples:
    --btype Default
    --btype Order|InRing

  To simplify command-line use, a comma may be used instead of a '|' to
  separate different fields. Example:

    --atype AtmNum,HvyDegree

  By default, chemfp will use the filename extension to determine the
  structure file format type and possible compression. Most of the file
  readers support configuration parameters. Use the '-R' option to
  specify those parameters. Use '--help-formats' to list available
  formats and reader parameters.

The following comes from ``oe2fps --help-formats``:

.. code-block:: none

  These are the structure file formats that chemfp can read when using
  the OEChem toolkit. By default, chemfp uses the filename extension to
  determine the format type. If the filename ends with ".gz" then it is
  interpreted as a gzip compressed file, and the second-to-last
  extension is used to determine the format type. Unknown or unsupported
  extensions are interpreted as a SMILES file. (The OEChem structure
  file readers do not support Zstandard compression.)

  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated
  filename extension.

  The supported filename extensions are:

      File Type        Extension(s)
      ==========       =============
      SMILES           can, ism, isosmi, smi, usm
      SDF              mdl, rxn, sd, sdf
      InChI            inchi
      Tripos Mol2      mol2, mol2h
      PDB              ent, pdb
      XYZ              xyz
      SKC              skc
      Macromodel       mmd, mmod
      ChemDraw CDX     cdx
      OE binary        oeb
      OEB compressed   oez
      CIF              cif
      mmCIF            mmcif
      FASTA            fasta
      CSV              csv

  Append a '.gz' to the filename to indicate that the contents are
  gzip-compressed.

  The format can also be specified by name using the '--in' option:

      File Type        Format name
      ==========       =============
      SMILES           smi, can, usm
      SDF              sdf
      InChI            inchi
      Tripos Mol2      mol2, mol2h
      PDB              pdb
      XYZ              xyz
      SKC              skc
      Macromodel       mmod
      ChemDraw CDX     cdx
      OE binary        oeb
      OEB compressed   oez
      CIF              cif
      mmCIF            mmcif
      FASTA            fasta
      CSV              csv

  Append a '.gz' to the format name to indicate that the contents are
  gzip-compressed.

  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.

      -R delimiter=whitespace -R has_header=true

  All formats handle the following two reader arguments:

    aromaticity - one of 'openeye', 'daylight', 'tripos', 'mdl', or
        'mmff' (this can also be set via the older '--aromaticity'
        command-line option)
    flavor - a '|' or ',' separated list of flavor names, or a numeric
        value. A leading '-' means to remove the given flavor. Examples
        include:
          o  Canon,Strict -- the bitwise merger of the format's Canon
               and Strict values
          o  Default,-Kekule -- the format's Default flavor but without
               the Kekule bits (every flavor has a Default)
          o  42 -- the specific OEChem flavor value 42

  The SMILES and InChI formats also handle reader arguments for the
  delimiter style and the presence of an initial header line using the
  following:

    delimiter - one of 'to-eol' (Daylight/OEChem style), 'tab',
        'whitespace', 'space', or 'native' (for the native toolkit style)
    has_header - '1' if the first line contains a header, else '0'.

  The supported formats, default reader arguments, and input flavors are:

  Format: can
    aromaticity: openeye
    delimiter: to-eol
    flavor: Default
      default flags:
      available flags: Canon, Strict
    has_header: 0
  Format: cdx
    aromaticity: openeye
    flavor: Default
      default flags: SuperAtom
      available flags: SuperAtom
  Format: cif
    aromaticity: openeye
    flavor: Default
      default flags: BondHydToClosest, BondOrder, FormalCrg, ImplicitH,
        NormalizeHydPos, OccFilterOneHalf, RemovePBCImages,
        RemoveQuestionMarkInLabel, Rings
      available flags: BondHydToClosest, BondOrder, FormalCrg, ImplicitH,
        NormalizeHydPos, OccFilterOneHalf, RemovePBCImages,
        RemoveQuestionMarkInLabel, Rings
  Format: csv
    aromaticity: openeye
    flavor: Default
      default flags: Header
      available flags: Header
  Format: fasta
    aromaticity: openeye
    flavor: Default
      default flags:
      available flags: CustomResidues, EmbeddedSMILES
  Format: inchi
    aromaticity:
    delimiter: to-eol
    flavor: Default
      no flavor flags available
    has_header: 0
  Format: mmcif
    aromaticity: openeye
    flavor: Default
      default flags:
      available flags: NoAltLoc
  Format: mmod
    aromaticity: openeye
    flavor: Default
      default flags:
      available flags: FormalCrg
  Format: mol2
    aromaticity: openeye
    flavor: Default
      default flags:
      available flags: Forcefield, M2H
  Format: mol2h
    aromaticity: openeye
    flavor: Default
      default flags: M2H
      available flags: M2H
  Format: oeb
    aromaticity:
    flavor: Default
      no flavor flags available
  Format: oez
    aromaticity:
    flavor: Default
      no flavor flags available
  Format: pdb
    aromaticity: openeye
    flavor: Default
      default flags: BondOrder, Connect, END, ENDM, FormalCrg, ImplicitH,
        Rings, SecStruct
      available flags: ALL, ALTLOC, BondOrder, CHARGE, Connect, DATA, END,
        ENDM, FORMALCHARGE, FormalCrg, ImplicitH, RADIUS, Rings,
        SecStruct, TER
  Format: sdf
    aromaticity: openeye
    flavor: Default
      default flags:
      available flags: FixBondMarks, SuppressEmptyMolSkip,
        SuppressImp2ExpENHSTE
  Format: skc
    aromaticity: openeye
    flavor: Default
      no flavor flags available
  Format: smi
    aromaticity: openeye
    delimiter: to-eol
    flavor: Default
      default flags:
      available flags: Canon, Strict
    has_header: 0
  Format: usm
    aromaticity: openeye
    delimiter: to-eol
    flavor: Default
      default flags:
      available flags: Canon, Strict
    has_header: 0
  Format: xyz
    aromaticity: openeye
    flavor: Default
      default flags: BondOrder, Connect, FormalCrg, ImplicitH, Rings
      available flags: BondOrder, Connect, FormalCrg, ImplicitH, Rings

  See https://docs.eyesopen.com/toolkits/cpp/oechemtk/molreadwrite.html#flavored-input-and-output
  for documentation about the flavors for each format.

.. _rdkit2fps:

rdkit2fps command-line options
==============================

The following comes from ``rdkit2fps --help``:

.. code-block:: none

  usage: rdkit2fps [-h] [--fpSize INT] [--radius INT] [--nBitsPerEntry INT]
                   [--includeChirality 0|1] [--from-atoms INT,INT,...] [--RDK]
                   [--minPath INT] [--maxPath INT] [--nBitsPerHash INT]
                   [--useHs 0|1] [--branchedPaths 0|1] [--useBondOrder 0|1]
                   [--morgan] [--useFeatures 0|1] [--useChirality 0|1]
                   [--useBondTypes 0|1] [--includeRedundantEnvironments 0|1]
                   [--torsions] [--targetSize INT] [--pairs] [--minLength INT]
                   [--maxLength INT] [--use2D 0|1] [--maccs166] [--avalon]
                   [--isQuery 0_or_1] [--bitFlags INT] [--secfp] [--rings 0|1]
                   [--isomeric 0|1] [--kekulize 0|1] [--min_radius INT]
                   [--pattern] [--substruct] [--rdmaccs] [--rdmaccs/1]
                   [--id-tag NAME] [--type TYPE_STRING] [--using FILENAME]
                   [--in FORMAT] [-o FILENAME] [--out FORMAT]
                   [--errors {strict,report,ignore}] [--help-formats]
                   [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}]
                   [--has-header] [--version]
                   [filenames ...]

  Generate FPS or FPB fingerprints from a structure file using RDKit

  positional arguments:
    filenames             input structure files (default is stdin)

  optional arguments:
    -h, --help            show this help message and exit
    --id-tag NAME         tag name containing the record id (SD files only)
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a
                          fingerprint file
    --in FORMAT           input structure format (default guesses from
                          filename)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output
                          filename, or is 'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled?
                          (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for
                          '-R delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias
                          for '-R has_header=1'
    --version             show program's version number and exit

  Common Parameters (used by more than one fingerprint type):
    --fpSize INT          number of bits in the fingerprint. Default of 2048
                          for RDK, Morgan, topological torsion, atom pair,
                          pattern and SECFP fingerprints, and 512 for Avalon
                          fingerprints
    --radius INT          radius for the Morgan or SECFP fingerprints. Default
                          of 2 for Morgan, 3 for SECFP
    --nBitsPerEntry INT   number of bits per entry
    --includeChirality 0|1
                          include chirality information in the atom invariants
    --from-atoms INT,INT,...
                          fingerprint generation must use these atom indices
                          (out of range indices are ignored)

  RDKit topological fingerprints:
    Branched or linear hash fingerprint.
    Uses --fpSize and --from-atoms plus:

    --RDK                 generate RDK fingerprints (default)
    --minPath INT         minimum number of bonds to include in the subgraph
                          (default=1)
    --maxPath INT         maximum number of bonds to include in the subgraph
                          (default=7)
    --nBitsPerHash INT    number of bits to set per path (default=2)
    --useHs 0|1           include information about the number of hydrogens on
                          each atom (default=1)
    --branchedPaths 0|1   if set both branched and unbranched paths will be
                          used in the fingerprint (default=1)
    --useBondOrder 0|1    if set both bond orders will be used in the path
                          hashes (default=1)

  RDKit Morgan fingerprints:
    Circular fingerprints similar to ECFP or FCFP fingerprints.
    Uses --fpSize, --radius, and --from-atoms plus:

    --morgan              generate Morgan fingerprints
    --useFeatures 0|1     use chemical-feature invariants (default=0)
    --useChirality 0|1    include chirality information (default=0)
    --useBondTypes 0|1    include bond type information (default=1)
    --includeRedundantEnvironments 0|1
                          if set, the check for redundant atom environments
                          will not be done (default=0)

  RDKit Topological Torsion fingerprints:
    See Nilakantan et al., JCICS 27, 82-85 (1987).
    Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:

    --torsions            generate Topological Torsion fingerprints
    --targetSize INT      number of bonds per torsion (default=4)

  RDKit Atom Pair fingerprints:
    See Carhart et al., JCICS 25, 64-73 (1985).
    Uses --fpSize, --nBitsPerEntry, --includeChirality, and --from-atoms plus:

    --pairs               generate Atom Pair fingerprints
    --minLength INT       minimum bond count for a pair (default=1)
    --maxLength INT       maximum bond count for a pair (default=30)
    --use2D 0|1           use 2D instead of 3D distance matrix (default=1)

  166 bit MACCS substructure keys:
    --maccs166            generate MACCS fingerprints

  Avalon fingerprints:
    Fingerprints from the Avalon toolkit. Uses --fpSize plus:

    --avalon              generate Avalon fingerprints
    --isQuery 0_or_1      is the fingerprint for a query structure?
                          (1 if yes, 0 if no) (default=0)
    --bitFlags INT        bit flags, SSSBits are 32767 and similarity bits are
                          15761407 (default=15761407)

  SECFP fingerprints:
    A circular fingerprint based on fragment SMILES instead of hashing.
    Uses --fpSize and --radius plus:

    --secfp               generate SECFP fingerprints
    --rings 0|1           if 1, add SSSR ring to the fingerprint (default=1)
    --isomeric 0|1        if 1, use isomeric SMILES instead of non-isomeric
                          SMILES (default=0)
    --kekulize 0|1        if 1, use Kekule SMILES instead of aromatic SMILES
                          (default=1)
    --min_radius INT      minimum radius used to extract n-grams (default=1)

  RDKit Pattern fingerprints:
    Fingerprints for substructure search screening.

    --pattern             generate (substructure) pattern fingerprints

  chemfp's version of the 881 bit PubChem substructure keys:
    --substruct           generate ChemFP substructure fingerprints

  chemfp's version of the 166 bit RDKit/MACCS keys:
    --rdmaccs, --rdmaccs/2
                          generate 166 bit RDKit/MACCS fingerprints (version 2)
    --rdmaccs/1           use the version 1 definition for --rdmaccs

  This program guesses the input structure format and the compression
  based on the filename extension. If the guess fails then it assumes
  the input is an uncompressed SMILES file. If the data comes from
  stdin, or the guess based on extension name is wrong, then use "--in"
  to change the default input format.

  Use the '-R' reader arguments option to pass in format-specific
  structure reader arguments. The details depend on the specific format.
  Use the command-line option `--help-formats` to display a list of
  available formats and reader arguments.

The following comes from ``rdkit2fps --help-formats``:

.. code-block:: none

  These are the structure file formats that chemfp can read when using
  the RDKit toolkit. By default, chemfp uses the filename extension to
  determine the format type. If the filename ends with ".gz" or ".zst"
  then it is interpreted as a gzip or Zstandard compressed file, and the
  second-to-last extension is used to determine the format type.
  Unknown or unsupported extensions are interpreted as a SMILES file.

  You may instead specify the file format by name (see below), which is
  especially important when reading from stdin, which has no associated
  filename extension.

  The supported filename extensions are:

      File Type      Extension(s)
      ==========     =============
      SMILES         can, ism, isosmi, smi, usm
      SDF            mdl, sd, sdf
      InChI          inchi
      Tripos Mol2    mol2
      PDB            ent, pdb
      Maestro        mae, maegz
      FASTA          faa, fasta

  The format can also be specified by name using the '--in' option:

      File Type      Format name (append .gz or .zst if compressed)
      ==========     ==============================================
      SMILES         smi, can, usm
      SDF            sdf
      InChI          inchi
      Tripos Mol2    mol2
      PDB            pdb
      Maestro        mae
      FASTA          fasta

  The input format parsers can be configured with the "-R" option. For
  example, the following reader arguments tell the SMILES readers that
  the fields are whitespace delimited and the first line is a header.

      -R delimiter=whitespace -R has_header=true

  All of the input formats implement the 'sanitize' option, which is
  enabled by default. Use "-R sanitize=false" to disable sanitization.

  The SMILES format parsers use two additional reader arguments:
    * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
      The other values are 'tab', 'whitespace', 'space' and 'native'.
      Use "-R delimiter=native" to match RDKit's native delimiter
      style, which is 'whitespace'.
    * 'has_header', if true will skip the first line of the SMILES file
      (because it is a header line).

  The SDF format parser supports two additional reader arguments:
    * 'strictParsing', if false will disable strict parsing
    * 'removeHs', if false will keep all of the hydrogens

  The InChI format parser supports four additional reader arguments:
    * 'delimiter' works the same as it does for the SMILES formats
    * 'removeHs' works the same as it does for the SDF format
    * 'treatWarningAsError', if true treats all warnings as errors
    * 'logLevel' specifies the RDKit/InChI library log level, as an
      integer

  The Tripos Mol2 format parser supports two additional reader arguments:
    * 'removeHs' works the same as it does for the SDF format
    * 'cleanupSubstructures' if false disables standardizing some
      substructures found in Mol2 files

  The PDB format parser supports three additional reader arguments:
    * 'removeHs' works the same as it does for the SDF format
    * 'flavor', an input parameter with no documented meaning
    * 'proximityBonding', if false will disable automatic proximity
      bonding

  The Maestro format parser supports one additional reader argument:
    * 'removeHs' works the same as it does for the SDF format

  The FASTA format parser supports one additional reader argument:
    * 'flavor', an integer from 0 to 9. The values mean:
        0 - the sequence contains L-amino acids
        1 - allow lowercase for D-amino acids
        2 - RNA with no cap       6 - DNA with no cap
        3 - RNA with 5' cap       7 - DNA with 5' cap
        4 - RNA with 3' cap       8 - DNA with 3' cap
        5 - RNA with both caps    9 - DNA with both caps

.. _cdk2fps:

cdk2fps command-line options
============================

The following comes from ``cdk2fps --help``:

.. code-block:: none

  usage: cdk2fps [-h]
                 [--type TYPE_STRING | --using FILENAME | --Daylight |
                  --GraphOnly | --MACCS | --EState | --Extended |
                  --Hybridization | --KlekotaRoth | --Pubchem |
                  --Substructure | --ShortestPath | --ECFP0 | --ECFP2 |
                  --ECFP4 | --ECFP6 | --FCFP0 | --FCFP2 | --FCFP4 | --FCFP6 |
                  --AtomPairs2D]
                 [--substruct] [--rdmaccs] [--rdmaccs/1] [--size INT]
                 [--searchDepth INT] [--pathLimit INT] [--hashPseudoAtoms 0|1]
                 [--perceiveStereochemistry 0|1] [--id-tag NAME] [--in FORMAT]
                 [-o FILENAME] [--out FORMAT]
                 [--errors {strict,report,ignore}] [--help-formats]
                 [-R NAME=VALUE] [--delimiter {tab,whitespace,to-eol,space}]
                 [--has-header] [--version] [--license-check]
                 [filenames ...]

  Generate FPS or FPB fingerprints from a structure file using CDK via JPype

  positional arguments:
    filenames             input structure files (default is stdin)

  optional arguments:
    -h, --help            show this help message and exit
    --type TYPE_STRING    Specify a chemfp type string
    --using FILENAME      Get the fingerprint type from the metadata of a
                          fingerprint file
    --Daylight            Make Daylight-like fingerprints using
                          cdk.fingerprinter.Fingerprinter (default)
    --GraphOnly           Make Daylight-like fingerprints (ignoring bond types)
                          using GraphOnlyFingerprinter
    --MACCS               Make 166-bit MACCS keys using MACCSFingerprinter
    --EState              Make 79-bit EState fingerprints using
                          EStateFingerprinter
    --Extended            Make Daylight-like fingerprints extended with ring
                          feature bits, using ExtendedFingerprinter
    --Hybridization       Make Daylight-like fingerprints based on SP2
                          hybridization instead of aromaticity, using
                          HybridizationFingerprinter
    --KlekotaRoth         Make 4860-bit Klekota-Roth fingerprint, using
                          KlekotaRothFingerprinter
    --Pubchem             Make 881-bit PubChem fingerprint, using
                          PubchemFingerprinter
    --Substructure        Make 307-bit substructure fingerprint, using
                          SubstructureFingerprinter
    --ShortestPath        Make fingerprints based on the shortest path between
                          atoms, ring systems, and more, using
                          ShortestPathFingerprinter
    --ECFP0               Make ECFP0-like circular fingerprints,
                          using CircularFingerprinter(CLASS_ECFP0)
    --ECFP2               Make ECFP2-like circular fingerprints,
                          using CircularFingerprinter(CLASS_ECFP2)
    --ECFP4               Make ECFP4-like circular fingerprints,
                          using CircularFingerprinter(CLASS_ECFP4)
    --ECFP6               Make ECFP6-like circular fingerprints,
                          using CircularFingerprinter(CLASS_ECFP6)
    --FCFP0               Make FCFP0-like circular feature fingerprints,
                          using CircularFingerprinter(CLASS_FCFP0)
    --FCFP2               Make FCFP2-like circular feature fingerprints,
                          using CircularFingerprinter(CLASS_FCFP2)
    --FCFP4               Make FCFP4-like circular feature fingerprints,
                          using CircularFingerprinter(CLASS_FCFP4)
    --FCFP6               Make FCFP6-like circular feature fingerprints,
                          using CircularFingerprinter(CLASS_FCFP6)
    --AtomPairs2D         Make 780-bit atom-pair fingerprints adapted from Yap
                          Chun Wei's PaDEL, using AtomPairs2DFingerprinter
    --size INT            fingerprint size (default=1024)
    --searchDepth INT     search depth (default=7)
    --pathLimit INT       path limit (default=42000)
    --hashPseudoAtoms 0|1
                          include pseudo-atoms in path enumeration (default=0)
    --perceiveStereochemistry 0|1
                          re-perceive stereochemistry from 2D/3D coordinates
                          (default=0)
    --id-tag NAME         tag name containing the record id (SD files only)
    --in FORMAT           input structure format (default autodetects from the
                          filename extension)
    -o FILENAME, --output FILENAME
                          save the fingerprints to FILENAME (default=stdout)
    --out FORMAT          output structure format (default guesses from output
                          filename, or is 'fps')
    --errors {strict,report,ignore}
                          how should structure parse errors be handled?
                          (default=ignore)
    --help-formats        list the available formats and reader arguments
    -R NAME=VALUE         specify a reader argument
    --delimiter {tab,whitespace,to-eol,space}
                          delimiter style for SMILES and InChI files. Alias for
                          '-R delimiter=VALUE'.
    --has-header          Skip the first line of a SMILES or InChI file. Alias
                          for '-R has_header=1'
    --version             show program's version number and exit
    --license-check       Check the license and report results to stdout.

   By default the CDK structure reader determines the file format and
   compression type based on the filename extension. Unknown filename
   extensions are treated as uncompressed SMILES files. If the data
   comes from stdin, or the guess based on the extension is wrong, then
   use the "--in FORMAT" option to change the default input format. For
   example:

     --in smi
     --in sdf.gz

   Use `-R` to specify format-specific reader arguments. Use
   `--help-formats` for a list of available formats and reader arguments.

The following comes from ``cdk2fps --help-formats``:

.. code-block:: none

   These are the structure file formats that chemfp can read when using
   the CDK toolkit. By default, chemfp uses the filename extension to
   determine the format type. If the filename ends with ".gz" or ".zst"
   then it is interpreted as a gzip or Zstandard compressed file, and the
   second-to-last extension is used to determine the format type. Unknown
   or unsupported extensions are interpreted as a SMILES file.

   Note: Zstandard support may depend on the "zstandard" Python package
   and/or the "zstd-jni" Java package. To install the Python package see
   https://pypi.org/project/zstandard/ . To get the Java jar file, see
   https://github.com/luben/zstd-jni and place it in your CLASSPATH.

   You may instead specify the file format by name (see below), which is
   especially important when reading from stdin, which has no associated
   filename extension.

   The supported filename extensions are:

     File Type      Extension(s)
     ==========     =============
     SMILES         can, ism, isosmi, smi, usm
     SDF            mdl, sd, sdf
     InChI          inchi

   The format can also be specified by name using the '--in' option:

     File Type      Format name (append .gz or .zst if compressed)
     ==========     ==============================================
     SMILES         smi, can, usm
     SDF            sdf
     InChI          inchi

   The input format parsers can be configured with the "-R" option. For
   example, the following reader arguments tell the SMILES readers that
   the fields are whitespace delimited and the first line is a header.

     -R delimiter=whitespace -R has_header=true

   The SMILES format parsers use four reader arguments:

     * 'delimiter' specifies the delimiter type. The default is 'to-eol'.
       The other values are 'tab', 'whitespace', 'space' and 'native'.
       Use "-R delimiter=native" to match RDKit's native delimiter style,
       which is 'whitespace'.
     * 'has_header', if true, will skip the first line of the SMILES file
       (because it is a header line).
     * 'kekulise': The default of '1' will Kekulize the SMILES. Use '0'
       to skip this step.
     * 'implementation': The default 'cdk' uses CDK's
       IteratingSMILESReader() to parse the SMILES file. The 'chemfp'
       implementation uses chemfp's Python-based SMILES file parser and
       CDK's SmilesParser() to parse each SMILES string. The chemfp
       implementation is slower but may have better error-handling
       and/or reporting.

   The SDF format parser supports five reader arguments:

     * 'mode' can be one of 'RELAXED' or 'STRICT'. The default relaxed
       mode supports some records with recoverable errors. The strict
       mode fails to parse those records.
     * 'ForceReadAs3DCoordinates', with the default of '0', interprets
       V2000 records where all z-coordinates == 0.0 as 2D records. The
       value '1' tells CDK to interpret all records as 3D.
     * 'AddStereoElements', with the default of '1', adds 0D
       stereochemistry to V2000 records. The value of '0' skips that
       step.
     * 'InterpretHydrogenIsotopes', with the default of '1', interprets
       the atom symbols 'D' and 'T' as [2H] and [3H], respectively. Use
       '0' to disable this interpretation.
     * 'implementation': The default 'cdk' uses CDK's SDFReaderFactory()
       to parse the SD file. The 'chemfp' implementation uses chemfp's
       SD file parser to parse records, and CDK's MDLReader(),
       MDLV2000Reader(), or MDLV3000Reader() to parse each record. The
       chemfp implementation is about 50% slower than the cdk parser but
       may have better error-handling and/or reporting.

   The InChI format parser supports one reader argument:

     * 'delimiter' works the same as it does for the SMILES formats

.. _sdf2fps:

sdf2fps command-line options
============================

The following comes from ``sdf2fps --help``:

.. code-block:: none

   usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
                  [--num-bits INT] [--errors {strict,report,ignore}]
                  [-o FILENAME] [--out FORMAT] [--software TEXT] [--type TEXT]
                  [--version] [--license-check] [--binary] [--binary-msb]
                  [--hex] [--hex-lsb] [--hex-msb] [--base64] [--cactvs]
                  [--daylight] [--decoder DECODER] [--pubchem]
                  [filenames ...]

   Extract a fingerprint tag from an SD file and generate FPS or FPB
   fingerprints

   positional arguments:
     filenames             input SD files (default is stdin)

   optional arguments:
     -h, --help            show this help message and exit
     --id-tag TAG          get the record id from TAG instead of the first line
                           of the record
     --fp-tag TAG          get the fingerprint from tag TAG (required)
     --in FORMAT           Specify the input format (one of "sdf", "sdf.gz", or
                           "sdf.zst")
     --num-bits INT        use the first INT bits of the input. Use only when
                           the last 1-7 bits of the last byte are not part of
                           the fingerprint. Unexpected errors will occur if
                           these bits are not all zero.
     --errors {strict,report,ignore}
                           how should structure parse errors be handled?
                           (default=strict)
     -o FILENAME, --output FILENAME
                           save the fingerprints to FILENAME (default=stdout)
     --out FORMAT          output format, one of 'fps', 'fps.gz', 'fps.zst',
                           'fpb', or 'flush' (default guesses from output
                           filename, or is 'fps')
     --software TEXT       use TEXT as the software description
     --type TEXT           use TEXT as the fingerprint type description
     --version             show program's version number and exit
     --license-check       Check the license and report results to stdout.

   Fingerprint decoding options:
     --binary              Encoded with the characters '0' and '1'. Bit #0 comes
                           first. Example: 00100000 encodes the value 4
     --binary-msb          Encoded with the characters '0' and '1'. Bit #0 comes
                           last.
                           Example: 00000100 encodes the value 4
     --hex                 Hex encoded. Bit #0 is the first bit (1<<0) of the
                           first byte. Example: 01f2 encodes the value \x01\xf2
                           = 498
     --hex-lsb             Hex encoded. Bit #0 is the eighth bit (1<<7) of the
                           first byte. Example: 804f encodes the value \x01\xf2
                           = 498
     --hex-msb             Hex encoded. Bit #0 is the first bit (1<<0) of the
                           last byte. Example: f201 encodes the value \x01\xf2
                           = 498
     --base64              Base-64 encoded. Bit #0 is first bit (1<<0) of first
                           byte. Example: AfI= encodes value \x01\xf2 = 498
     --cactvs              CACTVS encoding, based on base64 and includes a
                           version and bit length
     --daylight            Daylight encoding, which is a base64 variant
     --decoder DECODER     import and use the DECODER function to decode the
                           fingerprint

   shortcuts:
     --pubchem             decode CACTVS substructure keys used in PubChem. Same
                           as --software=CACTVS/unknown --type
                           'CACTVS-E_SCREEN/1.0 extended=2'
                           --fp-tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs

.. _simsearch:

simsearch command-line options
==============================

The following comes from ``simsearch --help``:

.. code-block:: none

   usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--alpha ALPHA]
                    [--beta BETA] [--queries QUERIES] [--NxN] [--query QUERY]
                    [--hex-query HEX_QUERY] [--query-id QUERY_ID]
                    [--query-format FORMAT] [--target-format FORMAT]
                    [--query-type STRING] [--id-tag NAME]
                    [--errors {strict,report,ignore}] [-R NAME=VALUE]
                    [--delimiter {tab,whitespace,to-eol,space}] [--has-header]
                    [-o FILENAME] [-c] [-b BATCH_SIZE] [--scan] [--memory]
                    [--no-mmap] [--times] [--version] [--license-check]
                    target_filename

   Search an FPS or FPB file for similar fingerprints

   positional arguments:
     target_filename       target filename

   optional arguments:
     -h, --help            show this help message and exit
     -k K_NEAREST, --k-nearest K_NEAREST
                           select the k nearest neighbors (use 'all' for all
                           neighbors)
     -t THRESHOLD, --threshold THRESHOLD
                           minimum similarity score threshold
     --alpha ALPHA         Tversky alpha parameter (default: 1.0)
     --beta BETA           Tversky beta parameter (default: the value of
                           --alpha)
     --queries QUERIES, -q QUERIES
                           filename containing the query fingerprints
     --NxN                 use the targets as the queries, and exclude the
                           self-similarity term
     --query QUERY         query as a structure record (default format: 'smi')
     --hex-query HEX_QUERY
                           query in hex
     --query-id QUERY_ID   id for the query or hex-query (default: 'Query1')
     --query-format FORMAT, --in FORMAT
                           input query format (default uses the file extension,
                           else 'fps')
     --target-format FORMAT
                           input target format (default uses the file
                           extension, else 'fps')
     --query-type STRING   fingerprint type string if the queries are
                           structures (default: use the target fingerprint
                           type)
     --id-tag NAME         tag containing the record id (if --query-format is
                           an SD file)
     --errors {strict,report,ignore}
                           how should structure parse errors be handled?
                           (default=ignore)
     -R NAME=VALUE         specify a reader argument
     --delimiter {tab,whitespace,to-eol,space}
                           delimiter style for SMILES and InChI files. Alias for
                           '-R delimiter=VALUE'.
     --has-header          Skip the first line of a SMILES or InChI file
                           Alias for '-R has_header=1'
     -o FILENAME, --output FILENAME
                           output filename (default is stdout)
     -c, --count           report counts
     -b BATCH_SIZE, --batch-size BATCH_SIZE
                           batch size
     --scan                scan the file to find matches (low memory overhead)
     --memory              build and search an in-memory data structure (faster
                           for multiple queries)
     --no-mmap             don't use mmap to read uncompressed FPB files. May
                           give better performance on networked file systems,
                           at the expense of higher memory use.
     --times               report load and execution times to stderr
     --version             show program's version number and exit
     --license-check       Check the license and report results to stdout.
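
The ``--alpha``, ``--beta``, ``-k`` and ``-t`` options interact in a way that is easier to see with a small example. The following Python sketch is not chemfp's implementation. It assumes the standard Tversky index, with ``alpha`` weighting the bits found only in the query and ``beta`` weighting the bits found only in the target, and it reuses the help text's rule that ``beta`` defaults to ``alpha``. The function names ``popcount``, ``tversky`` and ``search`` are hypothetical, and returning 0.0 when no bits are set is a choice made for this sketch. The hex strings play the role of a ``--hex-query`` value and of FPS fingerprints.

```python
from typing import List, Optional, Tuple


def popcount(fp: bytes) -> int:
    """Count the on-bits in a fingerprint stored as bytes."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query: bytes, target: bytes,
            alpha: float = 1.0, beta: Optional[float] = None) -> float:
    """Standard Tversky index (assumed definition, see the lead-in)."""
    if len(query) != len(target):
        raise ValueError("fingerprints must have the same length")
    if beta is None:
        beta = alpha  # mirrors "default: the value of --alpha"
    common = sum(bin(q & t).count("1") for q, t in zip(query, target))
    query_only = popcount(query) - common
    target_only = popcount(target) - common
    denominator = alpha * query_only + beta * target_only + common
    # Sketch-level choice: score 0.0 when the denominator is zero.
    return common / denominator if denominator else 0.0


def search(query: bytes, targets: List[Tuple[str, bytes]], k: int,
           threshold: float, alpha: float = 1.0,
           beta: Optional[float] = None) -> List[Tuple[str, float]]:
    """Keep targets scoring at least `threshold`, return the best `k`."""
    hits = []
    for target_id, fp in targets:
        score = tversky(query, fp, alpha, beta)
        if score >= threshold:
            hits.append((target_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:k]


# Hex-encoded fingerprints, decoded to bytes before searching.
query = bytes.fromhex("01f2")
targets = [
    ("id1", bytes.fromhex("01f2")),
    ("id2", bytes.fromhex("01f0")),
    ("id3", bytes.fromhex("0002")),
]

for target_id, score in search(query, targets, k=2, threshold=0.5):
    print(target_id, round(score, 3))
```

With ``alpha`` and ``beta`` both equal to 1.0 the score reduces to the Tanimoto similarity. Setting ``alpha=0.0`` and ``beta=1.0`` ignores bits that are set only in the query, so in this sketch a target whose bits are a subset of the query's bits scores 1.0.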