chemfp.text_toolkit module

Methods to work with SD and SMILES files as records rather than molecules.

The text_toolkit implements the chemfp toolkit API but where the “molecules” are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations.

The TextRecord is a base class. The actual records depend on the format, and will be one of:

The text toolkit will let you “convert” between the different SMILES formats, but it doesn’t actually change the SMILES string. The SMILES records have the attributes id, record and smiles.

The toolkit also knows a bit about the SD format. The SDF records have the attributes id, id_bytes and record, and there are methods to get SD tag values and add a tag to the end of the tag data block.

The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord.

The record types also have the attributes encoding and encoding_errors which affect how the record bytes are parsed.
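As a rough illustration of what encoding and encoding_errors control, the sketch below uses only Python's built-in bytes.decode() rather than chemfp itself. The sample bytes are made up for the example, and the assumption is that the two attributes behave like the corresponding arguments to decode().

```python
# A record title stored as Latin-1 bytes. 0xE9 is "é" in Latin-1 but is
# not a valid UTF-8 sequence on its own.
raw_title = b"caf\xe9"

# Decoding with the matching encoding recovers the text.
as_latin1 = raw_title.decode("latin1", "strict")

# With the wrong encoding, 'strict' raises an exception ...
try:
    raw_title.decode("utf8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ... while 'replace' substitutes U+FFFD and 'ignore' drops the bad byte.
replaced = raw_title.decode("utf8", "replace")
ignored = raw_title.decode("utf8", "ignore")
```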

chemfp.text_toolkit.is_licensed()

Return True - chemfp’s text toolkit is always licensed

Returns:True
chemfp.text_toolkit.get_formats(include_unavailable=False)

Get the list of structure formats that chemfp’s text toolkit supports

The text toolkit always supports all of its structure formats, so include_unavailable has no effect here. (It may affect other toolkits.)

Parameters:include_unavailable (True or False) – include unavailable formats?
Returns:a list of chemfp.base_toolkit.Format objects
chemfp.text_toolkit.get_input_formats()

Get the list of supported chemfp text toolkit input formats

Returns:a list of chemfp.base_toolkit.Format objects
chemfp.text_toolkit.get_output_formats()

Get the list of supported chemfp text toolkit output formats

Returns:a list of chemfp.base_toolkit.Format objects
chemfp.text_toolkit.get_format(format_name)

Get the named format, or raise a ValueError

This will raise a ValueError for unknown format names.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object
chemfp.text_toolkit.get_input_format(format_name)

Get the named input format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an input format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object
chemfp.text_toolkit.get_output_format(format_name)

Get the named output format, or raise a ValueError

This will raise a ValueError for unknown format names or if that format is not an output format.

Parameters:format_name (a string) – the format name
Returns:a chemfp.base_toolkit.Format object
chemfp.text_toolkit.get_input_format_from_source(source=None, format=None)

Get the most appropriate format given the available source and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the source to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • source (A filename (as a string), a file object, or None to read from stdin) – The structure data source.
  • format (A Format(-like) object, string, or None) – Format information, if known.
Returns:

a chemfp.base_toolkit.Format object
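The fallback rule described above can be sketched in plain Python. This is not chemfp's implementation: guess_format(), the list of known extensions, and the (name, compression) tuple standing in for a Format object are all assumptions made for the illustration.

```python
import os

# Hypothetical subset of format names and compression suffixes.
KNOWN_FORMATS = {"smi", "can", "usm", "sdf"}
KNOWN_COMPRESSIONS = {"gz"}

def guess_format(source, format=None):
    """Return a (name, compression) pair, a stand-in for a Format object."""
    if format is not None:
        # An explicit format string like "sdf.gz" always wins.
        name, _, compression = format.partition(".")
        return name, compression
    if isinstance(source, str):
        parts = os.path.basename(source).lower().split(".")
        compression = ""
        if len(parts) > 1 and parts[-1] in KNOWN_COMPRESSIONS:
            compression = parts.pop()
        if len(parts) > 1 and parts[-1] in KNOWN_FORMATS:
            return parts[-1], compression
    # No filename, or an unrecognized extension: uncompressed SMILES.
    return "smi", ""
```

A file object or None carries no filename to inspect in this sketch, so both fall through to the uncompressed SMILES default.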

chemfp.text_toolkit.get_output_format_from_destination(destination=None, format=None)

Get the most appropriate format given the available destination and format information

If format is a chemfp.base_toolkit.Format then return it. If it’s a Format-like object with “name” and “compression” attributes use it to make a real Format object with the same attributes. If it’s a string then use it to create a Format object.

If format is None, use the destination to auto-detect the format. If auto-detection is not possible, assume it’s an uncompressed SMILES file.

Parameters:
  • destination (A filename (as a string), a file object, or None to write to stdout) – The structure data destination.
  • format (A Format(-like) object, string, or None) – format information, if known.
Returns:

A chemfp.base_toolkit.Format object

chemfp.text_toolkit.read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Return an iterator that reads TextRecord instances from a structure file

Iterate through the structure records in source, which are in the given format. If format is None then auto-detect the format based on the source. For SD files, use id_tag to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the id_tag. It exists to make it easier to switch between reader functions.)

Only the SMILES formats use the reader_args dictionary. The supported parameters are:

  • delimiter - one of “tab”, “space”, “to-eol”, the space or tab characters, or None
  • has_header - True or False
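To make the two reader_args concrete, here is a small stand-alone sketch of how a SMILES line might be split under each delimiter setting. The exact semantics (in particular that "to-eol" treats the rest of the line as the id, and that None splits on any whitespace) are assumptions for this illustration, and split_smiles_line() and iter_smiles_fields() are hypothetical helpers, not chemfp functions.

```python
def split_smiles_line(line, delimiter=None):
    """Split one SMILES line into (smiles, id) under an assumed delimiter rule."""
    line = line.rstrip("\r\n")
    if delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    elif delimiter == "to-eol":
        # First run of whitespace ends the SMILES; the rest is the id.
        fields = line.split(None, 1)
    elif delimiter is None:
        # Any whitespace separates fields; the id is the second field.
        fields = line.split()
    else:
        raise ValueError("unsupported delimiter: %r" % (delimiter,))
    smiles = fields[0] if fields else ""
    record_id = fields[1] if len(fields) > 1 else None
    return smiles, record_id

def iter_smiles_fields(lines, delimiter=None, has_header=False):
    """Yield (smiles, id) pairs, skipping the first line if has_header is true."""
    lines = iter(lines)
    if has_header:
        next(lines, None)
    for line in lines:
        yield split_smiles_line(line, delimiter)
```

The practical difference shows up with ids that contain spaces: "to-eol" keeps "ethyl alcohol" whole, while the default keeps only "ethyl".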

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.
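The three error modes can be pictured with a generic filter over records. This is only a sketch of the policy described above, with a hypothetical iter_checked() helper and a caller-supplied validity test. It does not reproduce chemfp's actual error messages.

```python
import sys

def iter_checked(records, is_valid, errors="strict"):
    """Yield valid records, handling invalid ones according to `errors`."""
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))

    def generate():
        for recno, record in enumerate(records, 1):
            if is_valid(record):
                yield record
                continue
            message = "record %d is not valid: %r" % (recno, record)
            if errors == "strict":
                raise ValueError(message)
            if errors == "report":
                print(message, file=sys.stderr)
            # Both "report" and "ignore" continue with the next record.

    return generate()
```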

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

See read_ids_and_molecules() if you want (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader parameters passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules

chemfp.text_toolkit.read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)

Return an iterator that reads TextRecord instances from a string containing structure records

content is a string containing 0 or more records in the format given by format. See read_molecules() for details about the other parameters. See read_ids_and_molecules_from_string() if you want to read (id, TextRecord) pairs instead of just molecules.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.MoleculeReader iterating TextRecord molecules

chemfp.text_toolkit.read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict')

Return an iterator that reads (id, TextRecord) pairs from a structure file

See chemfp.text_toolkit.read_molecules() for full parameter details. The major difference is that this returns an iterator of (id, TextRecord) pairs instead of just the molecules.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the structure source
  • format (a format name string, or Format object, or None to auto-detect) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

chemfp.text_toolkit.read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors='strict', location=None)

Return an iterator that reads (id, TextRecord) pairs from a string containing structure records

content is a string containing 0 or more records in the format given by format. See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.read_molecules_from_string() if you just want to read the TextRecord molecules instead of (id, TextRecord) pairs.

Parameters:
  • content (a string) – the string containing structure records
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a chemfp.base_toolkit.IdAndMoleculeReader iterating (id, TextRecord) pairs

chemfp.text_toolkit.make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors='strict')

Create a specialized function which takes a record and returns an (id, TextRecord) pair

The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven’t really noticed much of a performance difference between this and chemfp.text_toolkit.parse_id_and_molecule() so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.)

See chemfp.text_toolkit.read_molecules() for details about the other parameters. The specific TextRecord subclass returned depends on the format.

Parameters:
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a function of the form parser(record string) -> (id, text_record)
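The "validate once, then reuse" idea behind a specialized parser can be sketched with a closure. make_sketch_parser() below is a hypothetical stand-in that handles only whitespace-delimited SMILES lines and returns (id, SMILES string) pairs rather than real TextRecord instances.

```python
def make_sketch_parser(format, errors="strict"):
    """Validate the arguments once and return a parser(record) function."""
    # Validation happens here, a single time, not on every record.
    if format not in ("smi", "can", "usm"):
        raise ValueError("unsupported format in this sketch: %r" % (format,))
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))

    def parser(record):
        fields = record.split(None, 1)
        if not fields:
            if errors == "strict":
                raise ValueError("empty record")
            return None, None
        smiles = fields[0]
        record_id = fields[1].rstrip() if len(fields) > 1 else None
        return record_id, smiles

    return parser
```

Calling the returned function in a loop skips the per-call argument checks, which is the only saving this pattern offers.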

chemfp.text_toolkit.parse_molecule(content, format, id_tag=None, reader_args=None, errors='strict')

Parse the first structure record from the content string and return a TextRecord.

content is a string containing a single structure record in the format given by format. (Additional records are ignored.) See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_id_and_molecule() if you want the (id, TextRecord) pair instead of just the text record.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

a TextRecord

chemfp.text_toolkit.parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors='strict')

Parse the first structure record from content and return the (id, TextRecord) pair.

content is a string containing a single structure record in the format given by format. (Additional records are ignored.) See chemfp.text_toolkit.read_molecules() for details about the other parameters. See chemfp.text_toolkit.parse_molecule() if you just want the TextRecord and not the (id, TextRecord) pair.

Parameters:
  • content (a string) – the string containing a structure record
  • format (a format name string, or Format object) – the input structure format
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (a dictionary) – reader arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle decoding failure
Returns:

an (id, TextRecord molecule) pair

chemfp.text_toolkit.create_string(mol, format, id=None, writer_args=None, errors='strict')

Convert a TextRecord into a structure record in the given format as a Unicode string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
Returns:

a Unicode string

chemfp.text_toolkit.create_bytes(mol, format, id=None, writer_args=None, errors='strict', level=None)

Convert a TextRecord into a structure record in the given format as a byte string

If id is not None then use it instead of the molecule’s own id.

Parameters:
  • mol (a TextRecord) – the molecule to use for the output
  • format (a format name string, or Format object) – the output structure format
  • id (a string, or None to use the molecule's own id) – an alternate record id
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
Returns:

a byte string

chemfp.text_toolkit.open_molecule_writer(destination=None, format=None, writer_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', level=None)

Return a MoleculeWriter which can write TextRecord instances to a destination.

A chemfp.base_toolkit.MoleculeWriter has the methods write_molecule, write_molecules, and write_ids_and_molecules, which are ways to write a TextRecord, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file.

TextRecords are written to destination. The output format can be a string like “sdf.gz” or “smi”, a chemfp.base_toolkit.Format, or Format-like object with “name” and “compression” attributes, or None to auto-detect based on the destination. If auto-detection is not possible, the output will be written as uncompressed SMILES.

That said, the text toolkit doesn’t know how to convert between SMILES and SDF formats, and will raise an exception if you try.

The writer_args is only used for the “smi”, “can”, and “usm” output formats. The only supported parameter is:

* delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

Parameters:
  • destination (a filename, file object, or None to write to stdout) – the structure destination
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
  • encoding (string (typically 'utf8' or 'latin1')) – the byte encoding
  • encoding_errors (string (typically 'strict', 'ignore', or 'replace')) – how to handle encoding failure
  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
Returns:

a chemfp.base_toolkit.MoleculeWriter expecting TextRecord instances

chemfp.text_toolkit.open_molecule_writer_to_string(format, writer_args=None, errors='strict', location=None)

Return a MoleculeStringWriter which can write TextRecord instances to a string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a Unicode string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

chemfp.text_toolkit.open_molecule_writer_to_bytes(format, writer_args=None, errors='strict', location=None, level=None)

Return a MoleculeStringWriter which can write TextRecord instances to a byte string.

See chemfp.text_toolkit.open_molecule_writer() for full parameter details.

Use the writer’s chemfp.base_toolkit.MoleculeStringWriter.getvalue() to get the output as a byte string.

Parameters:
  • format (a format name string, or Format(-like) object, or None to auto-detect) – the output structure format
  • writer_args (a dictionary) – writer arguments passed to the underlying toolkit
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track writer state information
  • level (None, a positive integer, or one of the strings 'min', 'default', or 'max') – compression level to use for compressed formats
Returns:

a chemfp.base_toolkit.MoleculeStringWriter expecting TextRecord instances

chemfp.text_toolkit.copy_molecule(mol)

Return a new TextRecord which is a copy of the given TextRecord

Parameters:mol (a TextRecord) – the text record
Returns:a new TextRecord
chemfp.text_toolkit.add_tag(mol, tag, value)

Add an SD tag value to the TextRecord

If the mol is in “sdf” format then this will modify mol.record to append the new tag and value to the end of the tag block. The other tags will not be modified, including tags with the same tag name.

Parameters:
  • mol (a TextRecord) – the text record
  • tag (string) – the SD tag name
  • value (string) – the text for the tag
Returns:

None

chemfp.text_toolkit.get_tag(mol, tag)

Get the named SD tag value, or None if it doesn’t exist

If the mol is in “sdf” format then this will return the corresponding tag value from mol.record, or None if the tag does not exist.

If the record is in any other format then it will return None.

Parameters:
  • mol (a TextRecord) – the molecule
  • tag (string) – the SD tag name
Returns:

a string, or None

chemfp.text_toolkit.get_tag_pairs(mol)

Get a list of all SD tag (name, value) pairs for the TextRecord

If the mol is in “sdf” format then this will return the list of (tag, value) pairs in mol.record, where the tag and value are strings.

If the record is in any other format then it will return an empty list.

Parameters:mol (a TextRecord) – the molecule
Returns:a list of (tag name, tag value) pairs
chemfp.text_toolkit.get_id(mol)

Get the molecule’s id from the TextRecord’s id field

This is the toolkit-portable way to get mol.id.

Parameters:mol (a TextRecord) – the molecule
Returns:a string
chemfp.text_toolkit.set_id(mol, id)

Set the TextRecord’s id to the new id

This is the toolkit-portable way to write mol.id = id.

Note: this does not modify mol.record. Use chemfp.text_toolkit.create_string() or similar text_toolkit functions to get the record text with a new identifier.

Parameters:
  • mol (a TextRecord) – the molecule
  • id (string) – the new id
Returns:

None
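The note about mol.record can be illustrated with a toy record class. SketchSmiRecord and sketch_create_string() are hypothetical stand-ins written for this example, and the space delimiter in the output is an assumption. They only mimic the documented behavior that the stored record text is left alone while newly created output uses the current or alternate id.

```python
from dataclasses import dataclass

@dataclass
class SketchSmiRecord:
    """Toy stand-in for a SMILES text record."""
    id: str
    smiles: str
    record: str

def sketch_create_string(rec, id=None):
    """Build a new SMILES line using rec.id, or the alternate id if given."""
    use_id = rec.id if id is None else id
    return "%s %s\n" % (rec.smiles, use_id)

rec = SketchSmiRecord(id="ethanol", smiles="CCO", record="CCO ethanol\n")
rec.id = "EtOH"                      # like set_id(): only the id changes
unchanged_record = rec.record        # still the original text
new_text = sketch_create_string(rec) # the new id appears in the output
```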

chemfp.text_toolkit.read_sdf_records(source=None, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)

Return an iterator that reads each record from an SD file as a string.

Iterate through the records in source, which must be in SD format. If compression is None or “auto” then auto-detect the compression type based on source, and default to uncompressed when it can’t be determined. Use “gz” when the input is gzip compressed, and “none” or “” if uncompressed.

The reader_args parameter is currently unused. It exists for future compatibility.

The errors parameter specifies how to handle errors. “strict” raises an exception, “report” sends a message to stderr and goes to the next record, and “ignore” goes to the next record.

The location parameter takes a chemfp.io.Location instance. If None then a default Location will be created.

The block_size parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn’t need to change this parameter, but if you do, please let me know.

Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
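The block-reading strategy and the size guard can be sketched as follows. This is a simplified illustration, not chemfp's parser: iter_sdf_records() is a hypothetical helper that works on text, treats any line consisting of "$$$$" as the end of a record (so it would mishandle tag data containing that value), and treats leftover text at end of file as an error.

```python
import io

def iter_sdf_records(infile, block_size=327680):
    """Yield SD records as strings, reading `infile` one block at a time."""
    terminator = "\n$$$$\n"
    # A complete record must turn up before this much text has accumulated.
    max_first = max(327680, 5 * block_size)
    pending = ""
    found_any = False
    while True:
        block = infile.read(block_size)
        if not block:
            break
        # Prepend the unfinished tail of the previous block.
        pending += block
        start = 0
        while True:
            end = pending.find(terminator, start)
            if end == -1:
                break
            end += len(terminator)
            yield pending[start:end]
            found_any = True
            start = end
        pending = pending[start:]
        if not found_any and len(pending) > max_first:
            raise ValueError("no complete SD record found; wrong format?")
    if pending.strip():
        raise ValueError("incomplete final record")

sample = "a\nM  END\n$$$$\nb\nM  END\n$$$$\n"
records = list(iter_sdf_records(io.StringIO(sample), block_size=5))
```

The tiny block_size in the usage line is only there to force records to straddle block boundaries.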

The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value ‘$$$$’. It is not a full validator and it does not know chemistry.

See read_sdf_ids_and_records() if you want (id, record) pairs, and read_sdf_ids_and_values() if you want (id, tag data) pairs. See read_sdf_records_from_string() to read from a string instead of a file or file-like object.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

chemfp.text_toolkit.read_sdf_ids_and_records(source=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, record string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use id_tag to get the record id from the given SD tag instead.

See read_sdf_ids_and_values() if you want to read an identifier and tag value, or two tag values, instead of returning the full record.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating (id, record string) pairs

chemfp.text_toolkit.read_sdf_ids_and_values(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, tag value string) pairs from an SD file

See read_sdf_records() for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs.

By default this uses the title line for both the id and tag value strings. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

Parameters:
  • source (a filename, file object, or None to read from stdin) – the SDF source
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • value_tag (string, or None to use the record title) – SD tag containing the value
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating (id, value string) pairs

chemfp.text_toolkit.read_sdf_records_from_string(content, reader_args=None, compression=None, errors='strict', location=None, block_size=327680)

Return an iterator that reads each record from a string containing SD records

See read_sdf_records() for the parameter details. The main difference is that this function reads from content, which is a string containing 0 or more SDF records.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported.
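The string-versus-bytes rule can be sketched as a small normalization step. normalize_sdf_content() is a hypothetical helper. Detecting gzip from the two magic bytes when compression is None or "auto", and accepting None, "auto", "none" and "" for Unicode input, are assumptions made for this illustration, not documented chemfp behavior.

```python
import gzip

def normalize_sdf_content(content, compression=None):
    """Return uncompressed content of the same type as the input."""
    if isinstance(content, str):
        # Text input: compression is not supported, and only ASCII is allowed.
        if compression not in (None, "auto", "none", ""):
            raise ValueError("compression is not supported for text content")
        content.encode("ascii")  # raises UnicodeEncodeError (a ValueError)
        return content
    # Byte input: compression is supported.
    if compression in (None, "auto"):
        compression = "gz" if content[:2] == b"\x1f\x8b" else "none"
    if compression == "gz":
        return gzip.decompress(content)
    if compression in ("none", ""):
        return content
    raise ValueError("unsupported compression: %r" % (compression,))
```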

See read_sdf_ids_and_records_from_string() to read (id, record) pairs and read_sdf_ids_and_values_from_string() to read (id, tag value) pairs.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.RecordReader iterating over each record as a string

chemfp.text_toolkit.read_sdf_ids_and_records_from_string(content=None, id_tag=None, reader_args=None, compression=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, record) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use id_tag to use a given tag value instead. See read_sdf_records() for details about the other parameters.

If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, record string) pairs

chemfp.text_toolkit.read_sdf_ids_and_values_from_string(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors='strict', location=None, encoding='utf8', encoding_errors='strict', block_size=327680)

Return an iterator that reads the (id, value) pairs from a string containing SD records

This function reads the records from content, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use id_tag and value_tag, respectively, to use a given tag value instead. If a tag doesn’t exist then None will be used.

If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored.

If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value.

See read_sdf_records() for details about the other parameters.

Parameters:
  • content (string or bytes) – a string containing zero or more SD records
  • id_tag (string, or None to use the record title) – SD tag containing the record id
  • value_tag (string, or None to use the record title) – SD tag containing the value
  • reader_args (currently ignored) – currently ignored
  • compression (one of "auto", "none", "", or "gz") – the data content compression method
  • errors (one of "strict", "report", or "ignore") – specify how to handle errors
  • location (a chemfp.io.Location object, or None) – object used to track parser state information
Returns:

a chemfp.base_toolkit.IdAndRecordReader iterating over the (id, value) pairs

chemfp.text_toolkit.get_sdf_tag(sdf_record, tag)

Return the value for a named tag in an SDF record string

Get the value for the tag named tag from the string sdf_record containing an SD record.

Parameters:
  • sdf_record (string) – an SD record
  • tag (string) – a tag name
Returns:

the corresponding tag value as a string, or None
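For readers unfamiliar with the SD tag block, the following stand-alone sketch shows one simple way to pull tag values out of a record string. It is an approximation written for this example, not chemfp's parser: sketch_tag_pairs() and sketch_get_tag() are hypothetical, a value is assumed to end at the first blank line, and returning the first match for a repeated tag name is an assumption.

```python
import re

# A data header line starts with ">" and has the tag name inside <...>.
_TAG_LINE = re.compile(r"^>.*<([^>]*)>")

def sketch_tag_pairs(sdf_record):
    """Return the (tag, value) pairs found after the 'M  END' line."""
    lines = sdf_record.splitlines()
    try:
        i = lines.index("M  END") + 1
    except ValueError:
        i = 0
    pairs = []
    while i < len(lines):
        match = _TAG_LINE.match(lines[i])
        i += 1
        if match is None:
            continue
        value_lines = []
        # The value runs up to the blank line, so a value of '$$$$' is kept.
        while i < len(lines) and lines[i] != "":
            value_lines.append(lines[i])
            i += 1
        pairs.append((match.group(1), "\n".join(value_lines)))
    return pairs

def sketch_get_tag(sdf_record, tag):
    """Return the first value for `tag`, or None if the tag is absent."""
    for name, value in sketch_tag_pairs(sdf_record):
        if name == tag:
            return value
    return None

sample_record = (
    "ethanol\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <MW>\n46.07\n\n"
    "> <NOTE>\n$$$$\n\n"
    "$$$$\n"
)
```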

chemfp.text_toolkit.add_sdf_tag(sdf_record, tag, value)

Add an SD tag value to an SD record string

This will append the new tag and value to the end of the tag data block in the sdf_record string.

Parameters:
  • sdf_record (string) – an SD record
  • tag (string) – a tag name
  • value (string) – the new tag value
Returns:

a new SD record string with the new tag and value
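Appending to the end of the tag data block amounts to inserting a new data item just before the final "$$$$" line. The sketch below shows that idea for text records only. sketch_add_sdf_tag() is a hypothetical helper and does no validation of the tag name or value.

```python
def sketch_add_sdf_tag(sdf_record, tag, value):
    """Return a new record with '> <tag>' and its value before the final $$$$."""
    # rpartition finds the last "$$$$", which is the record terminator.
    head, terminator, tail = sdf_record.rpartition("$$$$")
    if not terminator:
        raise ValueError("SD record has no $$$$ terminator")
    # A data item is a header line, the value, and a blank line.
    return "%s> <%s>\n%s\n\n%s%s" % (head, tag, value, terminator, tail)
```

Existing data items are left untouched, so adding a tag name that is already present produces two entries with that name.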

chemfp.text_toolkit.get_sdf_tag_pairs(sdf_record)

Return the (tag, value) entries in the SDF record string

Parse the sdf_record and return the tag data as a list of (tag, value) pairs. The type of the returned strings will be the same as the type of the input sdf_record string.

Parameters:sdf_record (string) – an SDF record
Returns:a list of (tag, value) pairs
chemfp.text_toolkit.get_sdf_id(sdf_record)

Return the id for the SDF record string

The id is the first line of the sdf_record. A future version of this function may support an id_tag parameter. Let me know if that would be useful.

The returned id string will have the same type as the input sdf_record.

Parameters:sdf_record (string) – an SD record
Returns:the first line of the SD record
chemfp.text_toolkit.set_sdf_id(sdf_record, id)

Set the id of the SDF record string to a new value

Set the first line of sdf_record to the new id, which must not contain a newline.

The sdf_record and the id must have the same string type.

Parameters:
  • sdf_record (string) – an SDF record
  • id (string) – the new id
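Because the id is simply the first line, getting and setting it reduces to splitting at the first newline. The following sketch mirrors the documented constraints (same string type, no newline in the id) using hypothetical helper names. It assumes "\n" line endings and returns a new record rather than modifying anything in place.

```python
def _newline_for(text):
    """Return the newline literal matching the type of `text`."""
    return b"\n" if isinstance(text, bytes) else "\n"

def sketch_get_sdf_id(sdf_record):
    """Return the first line of the record, with the same type as the input."""
    return sdf_record.split(_newline_for(sdf_record), 1)[0]

def sketch_set_sdf_id(sdf_record, id):
    """Return a new record whose first line is `id`."""
    if type(sdf_record) is not type(id):
        raise TypeError("sdf_record and id must have the same string type")
    newline = _newline_for(sdf_record)
    if newline in id:
        raise ValueError("the id must not contain a newline")
    _, sep, rest = sdf_record.partition(newline)
    return id + sep + rest
```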