.. _chemfp_csv2fps:

chemfp csv2fps command-line options
==========================================================

The following comes from ``chemfp csv2fps --help``:

.. code-block:: none

    Usage: chemfp csv2fps [OPTIONS] [FILENAMES]...

      Generate fingerprints from fields of a CSV file

    Options:
      --id-column, --id-col COL       Column containing the record identifier, as
                                      column number or title (default: 1)
      --molecule-column, --mol-column, --mol-col COL
                                      Column containing the molecular structure,
                                      as column number or title (default: 2)
      --fingerprint-column, --fp-column, --fp-col COL
                                      Column containing the fingerprint, as column
                                      number or title (default: 2)
      --id-from-molecule, --id-from-mol
                                      If specified, get the identifier from the
                                      parsed molecule column instead of --id-col
      --type TYPE_STR                 The chemfp type string used to generate
                                      fingerprints from a molecule (default:
                                      'RDKit-Morgan')
      --using FILENAME                Get the fingerprint type from the metadata
                                      of a fingerprint file
      -d, --dialect STR               'auto' to use the filename (or default to
                                      'csv'), 'csv' for comma-separated, 'tsv' for
                                      tab-separated, or one of the dialects from
                                      Python's csv module (default: 'auto')
      --has-header / --no-header      With --has-header (the default), the first
                                      line contains column titles. Use --no-header
                                      if there is no title line.
      --errors [strict|report|ignore]
                                      Describe how to handle errors when parsing a
                                      molecule or fingerprint. If 'strict', write
                                      an error message (to stderr) and stop
                                      processing. If 'report', print an error
                                      message and continue processing. If
                                      'ignore', skip and continue processing.
                                      Default is 'report' for molecules and
                                      'strict' for fingerprints.
      --encoding CODEC                Specify the character encoding type. Default
                                      is 'utf8'. Other common options include
                                      'latin1', 'utf16', and 'cp1252'.
      --encoding-errors [strict|ignore|replace|backslashreplace]
                                      Specify how to handle character encoding
                                      errors.
                                      Use 'strict' to stop, 'ignore' to ignore,
                                      'replace' to substitute with '�', and
                                      'backslashreplace' for backslashed escape
                                      sequences. (default: strict)
      --in-compression [auto|none|gz|zst]
                                      Input compression format. The default of
                                      'auto' uses the filename extension. Specify
                                      'none' for uncompressed, 'gz' for gzip, and
                                      'zst' for ZStandard.
      --csv-errors [strict|report|ignore]
                                      If a required column is missing from a row,
                                      the default 'strict' treats the failure as
                                      an error. Use 'ignore' to silently skip all
                                      errors, and 'report' to print information to
                                      stderr and continue processing.
      --describe                      Show details about the files to process
                                      (dialect, titles, first row) but do not
                                      process the files.
      --progress / --no-progress      Show a progress bar (default: show unless
                                      the output is a terminal)
      --version                       Show the version and exit.
      --help                          Show this message and exit.

    Low-level dialect configuration:
      --separator, --sep CHAR         The character used to separate CSV fields
      --doublequote / --no-doublequote
                                      If true, unescape doubled --quotechar to a
                                      single quotechar
      --escapechar CHAR               If specified, this character removes any
                                      special meaning from the following character
      --quotechar CHAR                The character used to quote fields
                                      containing special characters, including
                                      newline
      --quoting [minimal|none]        If 'none' then do not process quote
                                      characters.
      --skipinitialspace / --no-skipinitialspace
                                      If True, ignore spaces immediately following
                                      the separator

    Structure parsing options:
      --format NAME                   Molecule column format name (default: 'smi')
      --id-tag TAG                    Tag name containing the record id (SDF
                                      columns only). Using this also enables
                                      --id-from-molecule.
      --delimiter VALUE               Delimiter style for SMILES and InChI
                                      records. Forces '-R delimiter=VALUE'.
      -R NAME=VALUE                   Specify a reader argument used to configure
                                      the structure parser
      --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                      support for CXSMILES extensions. Forces '-R
                                      cxsmiles=1' or '-R cxsmiles=0'.

    Fingerprint decoders:
      --binary                        Encoded with the characters '0' and '1'. Bit
                                      #0 comes first. Example: 00100000 encodes
                                      the value 4
      --binary-msb                    Encoded with the characters '0' and '1'. Bit
                                      #0 comes last. Example: 00000100 encodes the
                                      value 4
      --hex                           Hex encoded. Bit #0 is the first bit (1<<0)
                                      of the first byte. Example: 01f2 encodes the
                                      value \x01\xf2 = 498
      --hex-lsb                       Hex encoded. Bit #0 is the eighth bit (1<<7)
                                      of the first byte. Example: 804f encodes the
                                      value \x01\xf2 = 498
      --hex-msb                       Hex encoded. Bit #0 is the first bit (1<<0)
                                      of the last byte. Example: f201 encodes the
                                      value \x01\xf2 = 498
      --base64                        Base-64 encoded. Bit #0 is first bit (1<<0)
                                      of first byte. Example: AfI= encodes value
                                      \x01\xf2 = 498
      --cactvs                        CACTVS encoding, based on base64 and
                                      includes a version and bit length
      --daylight                      Daylight encoding, which is a base64 variant
      --decoder DECODER               Import and use the DECODER function to
                                      decode the fingerprint

    Output:
      -o, --output FILENAME           Save the fingerprints to FILENAME
                                      (default=stdout)
      --out FORMAT                    Output structure format (default guesses
                                      from output filename, or is 'fps')
      --include-metadata / --no-metadata
                                      With --no-metadata, do not include the
                                      header metadata for FPS output.
      --no-date                       Do not include the 'date' metadata in the
                                      output header
      --date STR                      An ISO 8601 date (like
                                      '2022-02-07T11:10:15') to use for the 'date'
                                      metadata in the output header
      --type-str STR                  When extracting fingerprints, the string to
                                      use for the output '#type' header

      This program processes a CSV file to create a fingerprint file. The
      fingerprints may come from a column containing a structure record like
      SMILES or InChI, or a column containing pre-computed fingerprints in one
      of several encodings.

      If it is a molecule, this program will use the appropriate toolkit to
      generate fingerprints from the specified fingerprint type. If it is a
      pre-computed fingerprint, this program will decode it as specified.

      # Dialects

      There are many dialects of the CSV format.
      Use `--dialect` to specify one of the registered dialects. These are
      "csv" (or "excel") for an Excel-style comma-separated file, and "tsv"
      (or "excel-tab") for an Excel-style tab-separated file. These are
      equivalent to the "excel" and "excel-tab" dialects from Python's "csv"
      module, at https://docs.python.org/3/library/csv.html . The registered
      Python dialect 'unix' is also supported.

      Alternatively, the low-level CSV options can be changed using
      `--separator`, `--doublequote` / `--no-doublequote`, `--escapechar`,
      `--quotechar`, `--quoting`, and `--skipinitialspace` /
      `--no-skipinitialspace`. These options start with the specified
      `--dialect` then modify the appropriate settings. See the Python csv
      module documentation for details.

      Note: the options which take a CHAR expect either a single character or
      one of the special names "tab", "backslash", "space", "quote",
      "doublequote", "singlequote", or "bang".

      # Molecule processing

      By default the program expects the identifier in the first column and
      the molecule in the second, and it expects the first line to contain
      column titles. The default is to process the molecules as SMILES using
      RDKit to generate "RDKit-Morgan" fingerprints, and write the
      fingerprints to stdout in FPS format.

      Use `--type` to specify the fingerprint type as a chemfp fingerprint
      string, or use `--using` to get the fingerprint type from the metadata
      of an existing fingerprint file. There is no need to specify which
      toolkit to use as chemfp can determine that from the fingerprint type.

      Use `--no-header` if the first line does not contain column titles. (The
      default is `--has-header`.)

      If the input structures are not in "smi" format use `--format` to
      specify the correct one. For most cases this will be "smi", "smistring",
      "inchi", or "inchistring", though "molfile", "sdf", and other formats
      are also possible, depending on the toolkit.

      The default `--cxsmiles` will also parse (or ignore) CXSMILES extensions.
      Use `--no-cxsmiles` to disable that option.

      If the record id is stored in the structure record, rather than as one
      of the columns in the input file, then use `--id-from-molecule` to have
      csv2fps extract the id from the structure record. The `--delimiter`
      option affects how to parse the title from a SMILES file, and the
      `--id-tag` option specifies which SDF record tag contains the id.

      If the id and molecule are not in the first and second columns,
      respectively, then use `--id-column` and `--molecule-column` to specify
      a different location. If the value is an integer, or "#" followed by an
      integer, then the integer is treated as a column number; the first
      column is column #1. If the value starts with '@' followed by a string,
      or the value is any other string, then the string is treated as a column
      title. Column titles cannot be specified with `--no-header`.

      For example:

        --id-column 3   -- id comes from the third column
        --mol-column 4  -- molecule comes from the fourth column
        --id-col name   -- id comes from the column with title 'name'
        --mol-col @9    -- molecule comes from the column with title '9'
        --mol-col #9    -- molecule comes from the ninth column

      # Fingerprint processing

      If `--fingerprint-column` is specified (in which case
      `--molecule-column` must not be specified) then it is the column
      containing pre-computed fingerprints. By default csv2fps will parse them
      as hex-encoded fingerprints. See the "Fingerprint decoders" section for
      alternative decoders.

      # Processing errors

      The `--errors` option describes how to handle structure processing
      errors. The default of "report" prints an error message to stderr and
      skips to the next record. "ignore" does not print an error message, and
      "strict" prints an error message and exits.

      The `--csv-errors` option describes how to handle CSV processing errors,
      like when the specified column does not exist for a given row. The
      options are the same as `--errors` but the default is "strict".
      If each of the first 100 records contains errors then csv2fps will give
      an error message and stop processing, even with "ignore".

      # Encodings

      If the input file is not UTF-8 encoded then use `--encoding` to specify
      the encoding type, like "utf16" or "cp1252". See the full list at
      https://docs.python.org/3/library/codecs.html#standard-encodings

      Use `--encoding-errors` to describe how to handle input which could not
      be decoded. For a description of the different options see
      https://docs.python.org/3/library/codecs.html#error-handlers

      # "sniff" dialect

      The special "sniff" dialect inspects the start of the input file to
      attempt to guess the format. It is not accurate enough to trust for all
      of the input, but it may be useful as an initial attempt, especially
      when combined with the `--describe` option.

      # --describe

      The `--describe` option prints an overview of each file instead of
      generating fingerprints. It prints the dialect details, the column
      titles (unless `--no-header` is used), and the contents of the first
      data line, if present. When combined with `--dialect sniff` this gives
      insight into how to process a previously unseen CSV file.

      # Examples:

      1) See the description of a MolPort file:

        % chemfp csv2fps --dialect sniff --describe \
              fulldb_smiles-000-000-000--000-499-999.txt.gz

      2) Process the MolPort file to generate OECircular fingerprints. Use the
      'MOLPORTID' column for the identifiers and the 'SMILES' column for the
      structures. Save the result to 'molport.fps':

        % chemfp csv2fps --dialect tsv --id-col MOLPORTID --mol-col SMILES \
              --type OpenEye-Circular -o molport.fps \
              fulldb_smiles-000-000-000--000-499-999.txt.gz

      3) Process the MolPort file to generate RDKit Morgan fingerprints from
      the InChI column, use column #3 (MOLPORTID) for the ids, and send the
      results to stdout:

        % chemfp csv2fps --dialect tsv --id-col '#3' --mol-col STANDARD_INCHI \
              --format inchistring fulldb_smiles-000-000-000--000-499-999.txt.gz
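
The bit orderings described under "Fingerprint decoders" in the help text can
be checked with a short Python sketch that uses only the standard library. The
function names below are hypothetical and are not part of chemfp; the sketch
assumes that bit #0 is the lowest bit of the first byte of the decoded
fingerprint, as the ``--hex`` description states, and it reproduces only the
worked examples from the help text.

.. code-block:: python

    import base64


    def reverse_bits(byte):
        """Reverse the bit order of one 8-bit value (hypothetical helper)."""
        return int(f"{byte:08b}"[::-1], 2)


    def decode_binary(text):
        """'0'/'1' characters with bit #0 first, as described for --binary."""
        nbytes = (len(text) + 7) // 8
        value = sum(1 << i for i, c in enumerate(text) if c == "1")
        # Little-endian packing puts bits 0-7 in the first byte.
        return value.to_bytes(nbytes, "little")


    def decode_binary_msb(text):
        """'0'/'1' characters with bit #0 last, as described for --binary-msb."""
        return decode_binary(text[::-1])


    def decode_hex(text):
        """Plain hex, as described for --hex."""
        return bytes.fromhex(text)


    def decode_hex_lsb(text):
        """Hex with the bits of each byte reversed, as described for --hex-lsb."""
        return bytes(reverse_bits(b) for b in bytes.fromhex(text))


    def decode_hex_msb(text):
        """Hex with the byte order reversed, as described for --hex-msb."""
        return bytes.fromhex(text)[::-1]


    def decode_base64(text):
        """Standard base64, as described for --base64."""
        return base64.b64decode(text)


    # The worked examples from the help text all decode to the same bytes.
    print(decode_binary("00100000"), decode_binary_msb("00000100"))
    for decoder, text in [(decode_hex, "01f2"), (decode_hex_lsb, "804f"),
                          (decode_hex_msb, "f201"), (decode_base64, "AfI=")]:
        print(decoder.__name__, text, decoder(text))

Decoding a few values from the fingerprint column this way, before running
csv2fps on the whole file, is one way to confirm which decoder option matches
the data.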