.. _chemfp_csv2fps:

chemfp csv2fps command-line options
==========================================================

The following comes from ``chemfp csv2fps --help``:

.. code-block:: none

  Usage: chemfp csv2fps [OPTIONS] [FILENAMES]...
  
    Generate fingerprints from fields of a CSV file
  
  Options:
    --id-column, --id-col COL       Column containing the record identifier, as
                                    column number or title (default: 1)
    --molecule-column, --mol-column, --mol-col COL
                                    Column containing the molecular structure,
                                    as column number or title (default: 2)
    --fingerprint-column, --fp-column, --fp-col COL
                                    Column containing the fingerprint, as column
                                    number or title (default: 2)
    --id-from-molecule, --id-from-mol
                                    If specified, get the identifier from the
                                    parsed molecule column instead of --id-col
    --type TYPE_STR                 The chemfp type string used to generate
                                    fingerprints from a molecule (default:
                                    'RDKit-Morgan')
    --using FILENAME                Get the fingerprint type from the metadata
                                    of a fingerprint file
    -d, --dialect STR               'auto' to use the filename (or default to
                                    'csv'), 'csv' for comma-separated, 'tsv' for
                                    tab-separated, or one of the dialects from
                                    Python's csv module (default: 'auto')
    --has-header / --no-header      With --has-header (the default), the first
                                    line contains column titles. Use --no-header
                                    if there is no title line.
    --errors [strict|report|ignore]
                                    Describe how to handle errors when parsing a
                                    molecule or fingerprint. If 'strict', write
                                    an error message (to stderr) and stop
                                    processing. If 'report', print an error
                                    message and continue processing. If
                                    'ignore', skip and continue processing.
                                    Default is 'report' for molecules and
                                    'strict' for fingerprints.
    --encoding CODEC                Specify the character encoding type. Default
                                    is 'utf8'. Other common options include
                                    'latin1', 'utf16', and 'cp1252'.
    --encoding-errors [strict|ignore|replace|backslashreplace]
                                    Specify how to handle character encoding
                                    errors. Use 'strict' to stop, 'ignore' to
                                    ignore, 'replace' to substitute with '�',
                                    and 'backslashreplace' for backslashed
                                    escape sequences. (default: strict)
    --in-compression [auto|none|gz|zst]
                                    Input compression format. The default of
                                    'auto' uses the filename extension. Specify
                                    'none' for uncompressed, 'gz' for gzip, and
                                    'zst' for ZStandard.
    --csv-errors [strict|report|ignore]
                                    If a required column is missing from a row,
                                    the default 'strict' treats the failure as
                                    an error. Use 'ignore' to silently skip all
                                    errors, and 'report' to print information to
                                    stderr and continue processing.
    --describe                      Show details about the files to process
                                    (dialect, titles, first row) but do not
                                    process the files.
    --progress / --no-progress      Show a progress bar (default: show unless
                                    the output is a terminal)
    --version                       Show the version and exit.
    --help                          Show this message and exit.
  
  Low-level dialect configuration:
    --separator, --sep CHAR         The character used to separate CSV fields
    --doublequote / --no-doublequote
                                    If true, unescape doubled --quotechar to a
                                    single quotechar
    --escapechar CHAR               If specified, this character removes any
                                    special meaning from the following character
    --quotechar CHAR                The character used to quote fields
                                    containing special characters, including
                                    newline
    --quoting [minimal|none]        If 'none' then do not process quote
                                    characters.
    --skipinitialspace / --no-skipinitialspace
                                    If True, ignore spaces immediately following
                                    the separator
  
  Structure parsing options:
    --format NAME               Molecule column format name (default: 'smi')
    --id-tag TAG                Tag name containing the record id (SDF columns
                                only). Using this also enables
                                --id-from-molecule.
    --delimiter VALUE           Delimiter style for SMILES and InChI records.
                                Forces '-R delimiter=VALUE'.
    -R NAME=VALUE               Specify a reader argument used to configure the
                                structure parser
    --cxsmiles / --no-cxsmiles  Use --no-cxsmiles to disable the default support
                                for CXSMILES extensions. Forces '-R cxsmiles=1'
                                or '-R cxsmiles=0'.
  
  Fingerprint decoders:
    --binary           Encoded with the characters '0' and '1'. Bit #0 comes
                       first. Example: 00100000 encodes the value 4
    --binary-msb       Encoded with the characters '0' and '1'. Bit #0 comes
                       last. Example: 00000100 encodes the value 4
    --hex              Hex encoded. Bit #0 is the first bit (1<<0) of the first
                       byte. Example: 01f2 encodes the value \x01\xf2 = 498
    --hex-lsb          Hex encoded. Bit #0 is the eighth bit (1<<7) of the first
                       byte. Example: 804f encodes the value \x01\xf2 = 498
    --hex-msb          Hex encoded. Bit #0 is the first bit (1<<0) of the last
                       byte. Example: f201 encodes the value \x01\xf2 = 498
    --base64           Base-64 encoded. Bit #0 is first bit (1<<0) of first
                       byte. Example: AfI= encodes value \x01\xf2 = 498
    --cactvs           CACTVS encoding, based on base64 and includes a version
                       and bit length
    --daylight         Daylight encoding, which is a base64 variant
    --decoder DECODER  Import and use the DECODER function to decode the
                       fingerprint
  
  Output:
    -o, --output FILENAME           Save the fingerprints to FILENAME
                                    (default=stdout)
    --out FORMAT                    Output structure format (default guesses
                                    from output filename, or is 'fps')
    --include-metadata / --no-metadata
                                    With --no-metadata, do not include the
                                    header metadata for FPS output.
    --no-date                       Do not include the 'date' metadata in the
                                    output header
    --date STR                      An ISO 8601 date (like
                                    '2022-02-07T11:10:15') to use for the 'date'
                                    metadata in the output header
    --type-str STR                  When extracting fingerprints, the string to
                                    use for the output '#type' header
  
    This program processes a CSV file to create a fingerprint file. The
    fingerprints may come from a column containing a structure record like
    SMILES or InChI, or a column containing pre-computed fingerprints in one of
    several encodings. If it is a molecule, this program will use the
    appropriate toolkit to generate fingerprints from the specified fingerprint
    type. If it is a pre-computed fingerprint, this program will decode it as
    specified.
  
    # Dialects
  
    There are many dialects of the CSV format. Use `--dialect` to specify one of
    the registered dialects. These are "csv" (or "excel") for an Excel-style
    comma-separated file, and "tsv" (or "excel-tab") for an Excel-style
    tab-separated file. These are equivalent to the "excel" and "excel-tab"
    dialects from Python's "csv" module, at
    https://docs.python.org/3/library/csv.html . The registered Python dialect
    'unix' is also supported.
  
    Alternatively, the low-level CSV options can be changed using `--separator`,
    `--doublequote` / `--no-doublequote`, `--escapechar`, `--quotechar`,
    `--quoting`, and `--skipinitialspace` / `--no-skipinitialspace`. These
    options start with the specified `--dialect` then modify the appropriate
    settings. See the Python csv module documentation for details. Note: the
    options which take a CHAR expect either a single character or one of the
    special names "tab", "backslash", "space", "quote", "doublequote",
    "singlequote", or "bang".
  
    # Molecule processing
  
    By default the program expects the identifier in the first column and the
    molecule in the second, and it expects the first line to contain column
    titles. The default is to process the molecules as SMILES using RDKit to
    generate "RDKit-Morgan" fingerprints, and write the fingerprints to stdout
    in FPS format.
  
    Use `--type` to specify the fingerprint type as a chemfp fingerprint string,
    or use `--using` to get the fingerprint type from the metadata of an
    existing fingerprint file. There is no need to specify which toolkit to use
    as chemfp can determine that from the fingerprint type.
  
    Use `--no-header` if the first line does not contain column titles. (The
    default is `--has-header`.)
  
    If the input structures are not in "smi" format use `--format` to specify
    the correct one. For most cases this will be "smi", "smistring", "inchi", or
    "inchistring", though "molfile", "sdf", and other formats are also possible,
    depending on the toolkit. The default `--cxsmiles` will also parse (or
    ignore) CXSMILES extensions. Use `--no-cxsmiles` to disable that option.
  
    If the record id is stored in the structure record, rather than as one of
    the columns in the input file, then use `--id-from-molecule` to have csv2fps
    extract the id from the structure record. The `--delimiter` option affects
    how to parse the title from a SMILES file, and the `--id-tag` specifies
    which SDF record tag contains the id.
  
    If the id and molecule are not in the first and second columns,
    respectively, then use `--id-column` and `--molecule-column` to specify a
    different location. If the value is an integer, or "#" followed by an
    integer, then the integer is treated as a column number; the first column
    is column #1. If the value starts with '@' followed by a string, or the
    value is any other string, then the string is treated as a column
    title. Column titles cannot be specified with `--no-header`.
  
    For example:
  
      --id-column 3  -- id comes from the third column
  
      --mol-column 4  -- molecule comes from the fourth column
  
      --id-col name  -- id comes from the column with title 'name'
  
      --mol-col @9  -- molecule comes from the column with title '9'
  
      --mol-col #9  -- molecule comes from the ninth column
  
    # Fingerprint processing
  
    If `--fingerprint-column` is specified (in which case `--molecule-column`
    must not be specified) then it is the column containing pre-computed
    fingerprints. By default csv2fps will parse them as hex-encoded
    fingerprints. See the "Fingerprint decoders" section for alternative
    decoders.
  
    # Processing errors
  
    The `--errors` option describes how to handle structure processing errors.
    The default of "report" prints an error message to stderr and skips to the
    next record. "ignore" does not print an error message, and "strict" prints
    an error message and exits.
  
    The `--csv-errors` option describes how to handle CSV processing errors,
    like when the specified column does not exist for a given row. The options
    are the same as `--errors` but the default is "strict".
  
    If each of the first 100 records contains an error then csv2fps will give
    an error message and stop processing, even with "ignore".
  
    # Encodings
  
    If the input file is not UTF-8 encoded then use `--encoding` to specify the
    encoding type, like "utf16" or "cp1252". See the full list at
    https://docs.python.org/3/library/codecs.html#standard-encodings
  
    Use `--encoding-errors` to describe how to handle input which could not be
    decoded. For a description of the different options see
    https://docs.python.org/3/library/codecs.html#error-handlers
  
    # "sniff" dialect
  
    The special "sniff" dialect inspects the start of the input file to attempt
    to guess the format. It is not accurate enough to trust for all of the
    input, but it may be useful as an initial attempt, especially when combined
    with the `--describe` option.
  
    # --describe
  
    The `--describe` option prints an overview of each file instead of
    generating fingerprints. It prints the dialect details, the column titles
    (unless `--no-header` is used), and the contents of the first data line, if
    present. When combined with `--dialect sniff` this gives insight into how to
    process a previously unseen CSV file.
  
    # Examples:
  
    1) See the description of a MolPort file:
  
      % chemfp csv2fps --dialect sniff --describe \
      fulldb_smiles-000-000-000--000-499-999.txt.gz
  
    2) Process the MolPort file to generate OECircular fingerprints.  Use the
    'MOLPORTID' column for the identifiers and the 'SMILES' column for the
    structures. Save the result to 'molport.fps':
  
      % chemfp csv2fps --dialect tsv --id-col MOLPORTID --mol-col SMILES \
      --type OpenEye-Circular -o molport.fps \
      fulldb_smiles-000-000-000--000-499-999.txt.gz
  
    3) Process the MolPort file to generate RDKit Morgan fingerprints from the
    InChI column, use column #3 (MOLPORTID) for the ids, and send the results to
    stdout:
  
      % chemfp csv2fps --dialect tsv --id-col '#3' --mol-col STANDARD_INCHI \
      --format inchistring fulldb_smiles-000-000-000--000-499-999.txt.gz
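
The "Fingerprint decoders" examples in the help text above can be checked
with a short sketch in plain Python. This is not chemfp's own
implementation, and the function names are hypothetical, chosen only for
this illustration. It assumes nothing beyond the bit-ordering rules stated
in the help text, plus one labeled assumption about padding binary strings
whose length is not a multiple of 8.

.. code-block:: python

  import base64


  def decode_hex(text):
      """--hex: bit #0 is the low bit (1<<0) of the first byte."""
      return bytes.fromhex(text)


  def decode_hex_msb(text):
      """--hex-msb: bit #0 is the low bit (1<<0) of the last byte."""
      # Same bit order within each byte, but the byte order is reversed.
      return bytes.fromhex(text)[::-1]


  def decode_hex_lsb(text):
      """--hex-lsb: bit #0 is the high bit (1<<7) of the first byte."""
      # Same byte order, but the bits within each byte are reversed.
      return bytes(int(format(b, "08b")[::-1], 2) for b in bytes.fromhex(text))


  def decode_binary(text):
      """--binary: a string of '0' and '1' characters with bit #0 first."""
      # Assumption for this sketch: pad with zero bits to a whole byte.
      padded = text + "0" * (-len(text) % 8)
      # The first character of each 8-character chunk is the low bit.
      return bytes(int(padded[i:i + 8][::-1], 2)
                   for i in range(0, len(padded), 8))


  def decode_binary_msb(text):
      """--binary-msb: like --binary, but bit #0 comes last."""
      return decode_binary(text[::-1])


  # The hex and base64 examples from the help text give the same two bytes.
  expected = b"\x01\xf2"
  assert decode_hex("01f2") == expected
  assert decode_hex_lsb("804f") == expected
  assert decode_hex_msb("f201") == expected
  assert base64.b64decode("AfI=") == expected
  assert int.from_bytes(expected, "big") == 498

  # Both binary examples encode the single byte with value 4.
  assert decode_binary("00100000") == b"\x04"
  assert decode_binary_msb("00000100") == b"\x04"

The three hex variants differ only in byte order and in bit order within
each byte, which is why one fingerprint can appear as ``01f2``, ``804f``, or
``f201`` depending on the program that wrote the CSV file.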