Text toolkit examples

The text toolkit separates record parsing from chemical parsing. It understands the basic text structure of SDF and SMILES-based files and records, but not chemistry. It’s designed with the following use cases in mind:

  • add tag data to an input SDF record but keep everything else unchanged. This preserves data which might be lost by converting to then from a chemical toolkit molecule.
  • synchronize operations between multiple toolkits; For example, consider a hybrid fingerprint using both OEChem and RDKit. The individual RDKit and OEChem SDF readers may get out of synch when on toolkit can’t parse a record which the other can. In that case, use the text toolkit to read the records then pass the record to the chemistry toolkit.
  • extract tags from an SD file. Chemfp’s sdf2fps uses the text toolkit to get the id and the tag value which contains the fingerprint.

The text toolkit implements the chemistry toolkit API, except that instead of real molecule objects it uses a thin wrapper around the text for each wrapper. This chapter uses many of the concepts developed in the chapter on Toolkit API examples.

Toolkits may modify the molecular structure

In this section you’ll learn that a chemistry toolkit might change details of a structure record so the input record and output record have some differences, even though the molecular “essence” is preserved. This is meant as an example for why you might not want to work through a chemistry toolkit molecule for everything.

The section Add fingerprints to an SD file using a toolkit gave an example of using a toolkit to read an SD file, compute a MACCS fingerprint, add the fingerprint as a new SD tag, and save the result to a new SD file. This is a very common task.

A problem is that toolkits can apply various normalizations, like aromaticity perception, which change atom and bond aromaticity assignments. RDKit by default will also convert explicit hydrogens into implicit hydrogens. In that section, the input record had 26 atoms and bonds while RDKit generated an output record with 15 atoms and bonds. RDKit may also ‘sanitize’ the structures further (for example, convert ‘neutral 5 coordinate Ns with double bonds to Os to the zwitterionic form’).

While it’s possible to configure RDKit to keep implicit hydrogens, the RDKit MACCS fingerprinter assumes there are no explicit hydrogens. You would need to make a copy of the molecule, remove the explicit hydrogens yourself, generate the fingerprint, and then add the fingerprint to the molecule which still has the explicit hydrogens.

Bear in mind that the number of explicit atoms and bonds is based on the molecular graph model, which is only one possible representation for the actual chemical molecule. While I said there was a semantic change, the 26 atom structure and the 15 atom structure are really the same structure, just at different levels of conceptualization.

Toolkits may modify SDF syntax

In this section you’ll see that passing a structure file through a chemistry toolkit and back to the same format will likely make syntax changes to the record. While not as significant as the previous section, it may help persuade you that there are cases where you want to work with the original record as text rather than as a molecule.

You will need Compound_014550001_014575000.sdf.gz from PubChem.

I’ll read an SD file to get the first record as a toolkit molecule, save the molecule to SDF format, and compare the original record with the new one. This is called a round-trip test. Will there be differences?

import chemfp

# Select your toolkit of choice
T = chemfp.get_toolkit("openeye")
#T = chemfp.get_toolkit("rdkit")
#T = chemfp.get_toolkit("openbabel")

reader_args = {"rdkit.*.removeHs": False}
with T.read_molecules("Compound_014550001_014575000.sdf.gz",
                      reader_args=reader_args) as reader:
    with T.open_molecule_writer("example.sdf") as writer:
        for mol in reader:
            break  # only process the first molecule

If I use the “openeye” toolkit and compare its output to the first record of the input then the difference is trivial:

<   -OEChem-03291708342D
>   -OEChem-05011700162D

This difference is shown in the diff utility’s default format. The “2c2” means there was a change in line 2, and the changed line is also on line 2. The “<” indicates the line in the first file (in this case the original PubChem file) and the “>” indicates the line in the second file (in this case “example.sdf”). The “---” is to make it easier for humans to see break between the two files.

But what does that line mean? The “CTfile” (“connection table file”) spec from MDL, err, I mean Accelry, err, I mean Symyx, err, I mean BIOVIA, gives the full details. The first two characters (both blank here) are the user’s initials, the next 8 characters (OpenEye uses “-” to pad out “OEChem”) are the program name.

The next six character are the date, followed by 4 characters for the time. The PubChem record was created on 29 March 2017 at 08:34 while I did the transformation on 1 May 2017 at 00:16. The last two characters are the dimensionality; in this case the structure contains 2D coordinates.

PubChem used OEChem to make the file in the first place, so it’s not too suprising that there weren’t any differences. What about Open Babel? I changed the toolkit to “openbabel” and re-did the comparison. The first few lines of the diff were:

<   -OEChem-03291708342D
>  OpenBabel05011700222D
<  26 26  0     0  0  0  0  0  0999 V2000
>  26 26  0  0  0  0  0  0  0  0999 V2000

The 2c2 change you know already, and you can see it was a few minutes hour beween when I ran the OEChem code and the Open Babel code.

The change to line 4 is meaningless. If you look closely you’ll see that OEChem has a blank in column 12 where Open Babel has a “0”. The specification say that this field is obsolete, so I think you can do whatever you want there.

The rest of the differences are trivial and semantically meaningless: Open Babel uses two spaces between the “>” and “<” of a data header line, while OEChem uses one space:


Finally, I’ll use RDKit for the conversion. By default RDKit removes hydrogens, which would leave the result with 15 atoms. Unlike Open Babel, that action is configurable. I told RDKit to never remove hydrogens for any of its supported formats, via the reader_args:

reader_args = {"rdkit.*.removeHs": False}

(I didn’t actually need the “rdkit.*.” namespace prefix, but the “rdkit” helps as a reminder that this is an RDKit-specific option.)

There are the familiar changes in the second and fouth lines:

<   -OEChem-03291708342D
>      RDKit          2D
<  26 26  0     0  0  0  0  0  0999 V2000
>  26 26  0  0  0  0  0  0  0  0999 V2000

RDKit doesn’t include the timestamp so leaves that fields blank. (Then again, just how useful is the timestamp? On the third hand, the chemfp fingerprint formats include a timestamp as part of the metadata, so it’s odd that I question having it in another format.)

It’s a bit interesting to see that RDKit changes the charge field of some of the atoms:

<     6.0010   -3.6550    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
>     6.0010   -3.6550    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
<     6.0010   -2.6550    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
>     6.0010   -2.6550    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0

RDKit uses a 0 where the original used 5 (charge = -1) and 3 (charge = +1). My 2002 copy of the specification says that field is “[r]etained for compatibility with older Ctabs, M CHG and M RAD lines take precedence.” I can see that the original record and the new one both contain:

M  CHG  2   4  -1   6   1

Thus, RDKit isn’t as fully backwards compatible as the other two toolkits. Still, this is mostly a theoretical issue as I’ve not heard of someone running into a problem with how RDKit works.

Lastly, RDKit sorts the output SD tags alphabetically. I did not expect that.

While I love knowing these sorts of details, none of these (except for the explicit hydrogens) affect the semantic interpretation. Still, I can think of cases where you want to preserve the original syntax, like if you have fragile code which expects a “0” at a certain field and will crash if there’s a blank.

The text toolkit “molecules”

In this section you’ll learn about the molecule-like object used by the text_toolkit.

The text_toolkit implements the standard toolkit API, which means it reads and writes “molecules”. Remember that it isn’t really a chemical molecule but more like a thin layer around a molecule record. Internally these are subclasses of a TextRecord, though I’ll often refer to them as “text molecules” to distinguish them from the the actual record as a text string.

Every text molecule has an id attribute, which may be None if there is no identifier, and a record attribute containing the actual record as a string:

>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
>>> mol
SmiRecord(id=u'benzene', record='c1ccccc1O benzene', smiles=u'c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> mol.id   # a Unicode string
>>> mol.record  # a byte string
'c1ccccc1O benzene'
>>> text_toolkit.create_string(mol, "smistring")
>>> text_toolkit.create_string(mol, "smi")
u'c1ccccc1O benzene\n'
>>> text_toolkit.create_bytes(mol, "smistring")
>>> text_toolkit.create_bytes(mol, "smistring.zlib")
>>> sdf_record = (
...   'methane\n' +
...   '\n' +
...   '\n' +
...   '  1  0  0  0  0  0  0  0  0  0999 V2000\n' +
...   '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n' +
...   'M  END\n' +
...   '$$$$\n')
>>> sdf_mol = text_toolkit.parse_molecule(sdf_record, "sdf")
>>> sdf_mol
SDFRecord(id_bytes='methane'(id=u'methane'), record='methane\n\n\n  1  0  0  0  0  0  0  0  0  0999 V2000\n    0.0 ...',
encoding='utf8', encoding_errors='strict')
>>> sdf_mol.id
>>> sdf_mol.record[-20:]
'0  0  0\nM  END\n$$$$\n'

The record is always uncompressed.

Each of the SMILES-based records has its own unique class:

>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
SmiRecord(id=u'benzene', record='c1ccccc1O benzene', smiles=u'c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "can")
CanRecord(id=u'benzene', record='c1ccccc1O benzene', smiles=u'c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usm")
UsmRecord(id=u'benzene', record='c1ccccc1O benzene', smiles=u'c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smistring")
SmiStringRecord(id=None, record='c1ccccc1O', smiles=u'c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "canstring")
CanStringRecord(id=None, record='c1ccccc1O', smiles=u'c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usmstring")
UsmStringRecord(id=None, record='c1ccccc1O', smiles=u'c1ccccc1O')

and for SMILES records you can access the SMILES directly through the smiles attribute:

>>> text_mol = text_toolkit.parse_molecule("C methane", "smistring")
>>> text_mol.smiles

Each text molecule also has a record_format attribute, which is the format name for the record.

>>> text_mol.record_format
>>> sdf_mol.record_format

The record_format values are “smi”, “can”, ..., “usmstring” for the SMILES formats or “sdf” for a file in SDF format. The record_format will never have a compression suffix.

Unlike the chemistry-backed toolkits, the text_toolkit has no real understanding of chemistry, only a limited knowledge of the format structure. It will parse an generate garbage:

>>> text_mol = text_toolkit.parse_molecule("garbage", "smi")
>>> text_toolkit.create_string(text_mol, "smi", id="and trash",
...               writer_args={"delimiter": "tab"})
u'garbage\tand trash\n'

The encoding and encoding_errors parameters describe the character encoding of the record bytes, and how to handle errors in converting to or from that encoding. For details see the section Unicode and other character encoding.

The text toolkit implements the toolkit API

In this section you’ll learn that the text toolkit is a pretty complete implementation of chemfp’s toolkit API.

The text toolkit implements all of the standard toolkit API, except that it doesn’t know how to convert between SMILES and SDF format. Here are some examples:

>>> from __future__ import print_function
>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("C", "smistring")
>>> text_toolkit.get_id(mol) is None
>>> text_toolkit.set_id(mol, u"methane")
>>> text_toolkit.get_id(mol)
>>> text_toolkit.create_string(mol, "smi")
u'C methane\n'
>>> content = "C methane\nO=O molecular oxygen\n"
>>> with text_toolkit.read_ids_and_molecules_from_string(
...          content, "smi") as reader:
...     for id, mol in reader:
...         print("#%d %r" % (reader.location.recno, id))
#1 u'methane'
#2 u'molecular oxygen'
>>> writer = text_toolkit.open_molecule_writer("light.sdf")
>>> for mol in text_toolkit.read_molecules("Compound_014550001_014575000.sdf.gz"):
...   mass = text_toolkit.get_tag(mol, "PUBCHEM_EXACT_MASS")
...   mass = float(mass)
...   if mass > 100.0:
...     continue
...   cid = text_toolkit.get_tag(mol, "PUBCHEM_COMPOUND_CID")
...   print("Found", cid, mass)
...   writer.write_molecule(mol)
Found 14550416 77.977
Found 14550474 63.948
Found 14550599 84.021
Found 14550603 81.058
Found 14551989 85.053
Found 14556720 86.084
Found 14557343 97.039
Found 14567810 41.02
Found 14567812 57.034
Found 14567813 57.034
Found 14569188 89.097
Found 14571348 91.027
Found 14572168 86.037
Found 14572871 97.064
Found 14574249 83.032
Found 14574551 98.948
Found 14574635 50.967
Found 14574637 51.048
Found 14574638 65.064
>>> writer.close()
>>> for lineno, line in enumerate(open("light.sdf"), 1):
...     print(repr(line))
...     if lineno == 4:
...         break
'  -OEChem-03291708342D\n'
'  6  5  0     0  0  0  0  0  0999 V2000\n'

What you can’t do with the text_toolkit is convert from a SMILES-based format to SDF, or vice-versa. If you try you’ll either get an exception or a meaningless molecule representation.

While you can convert between the SMILES formats, the text toolkit doesn’t actually modify the SMILES term, so an input of “[238U]” will have a “canstring” (non-isomeric SMILES) of “[238U]”:

>>> U = text_toolkit.parse_molecule("[235U]", "smistring")
>>> text_toolkit.create_string(U, "canstring")

I don’t know if I should make this more strict in the future, and prohibit conversion between “smi”, “can”, and “usm” formats.

Reading and adding SD tags with the text_toolkit

In this section you’ll learn how to get and set the title line and get and add tag values to an SDF record when you have the record as a block of text.

There are two ways to get or modify SD tags for an SDFRecord, which is the TextRecord subclass for files in SDF format. The first is through the standard toolkit API functions chemfp.toolkit.get_tag(), chemfp.toolkit.get_tag_pairs(), and chemfp.toolkit.add_tag():

>>> from __future__ import print_function
>>> from chemfp import text_toolkit
>>> content = (
... "methane\n" +
... "     RDKit          \n" +
... "\n" +
... "  1  0  0  0  0  0  0  0  0  0999 V2000\n" +
... "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n" +
... "M  END\n" +
... "$$$$\n")
>>> mol = text_toolkit.parse_molecule(content, "sdf")
>>> text_toolkit.add_tag(mol, "MW", "16.04246")
>>> new_record = text_toolkit.create_string(mol, "sdf")
>>> print(new_record)

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
> <MW>


>>> new_mol = text_toolkit.parse_molecule(new_record, "sdf")
>>> text_toolkit.get_tag(new_mol, "MW")
>>> text_toolkit.get_tag_pairs(new_mol)
[(u'MW', u'16.04246')]

and the second is to use the corresponding methods of the text molecule: TextRecord.get_tag(), TextRecord.get_tag_pairs(), and TextRecord.add_tag():

>>> new_mol.get_tag_pairs()
[(u'MW', u'16.04246')]
>>> new_mol.get_tag("MW")
>>> text_toolkit.get_tag_pairs(new_mol)
[(u'MW', u'16.04246')]
>>> new_mol.get_tag_pairs()
[(u'MW', u'16.04246')]
>>> new_mol.get_tag("MW")
>>> new_mol.add_tag("NUM_ATOMS", "5")
>>> print(text_toolkit.create_string(new_mol, "sdf")[-39:])
> <MW>



Bear in mind that there is no way to delete a tag. This may be added in the future.

Synchronizing readers from different toolkits through the text toolkit

In this section you’ll learn how to keep two different toolkit parsers synchronized by using the text toolkit to parse the records, then pass the record over to each toolkit to convert it to a molecule.

A structure file may have a couple of records which cannot be parsed by a toolkit, usually due to odd chemistry definitions. It’s usually fine to skip those records, which is the purpose of the errors="ignore" setting. (See Handling errors when reading molecules from a string for more information about the errors parameter.)

Consider the following SMILES file with three lines:

% cat strange.smi
C methane
C--C ethane not for RDKit
CC ethane for everyone

The first and last are valid SMILES, but “C--C” is invalid. However, Open Babel will accept it, and OEChem will accept it because the default flavor does not add the “Strict” flavor flag. (See OpenEye-specific SMILES reader_args and writer_args for more information about OEChem flavors). As a result:

>>> from __future__ import print_function
>>> from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit
>>> for id, mol in openeye_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("openeye found", repr(id))
openeye found u'methane'
openeye found u'ethane not for RDKit'
openeye found u'ethane for everyone'
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("rdkit found", repr(id))
rdkit found u'methane'
[02:10:28] SMILES Parse Error: syntax error for input: 'C--C'
rdkit found u'ethane for everyone'
>>> for id, mol in openbabel_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("openbabel found", repr(id))
openbabel found u'methane'
openbabel found u'ethane not for RDKit'
openbabel found u'ethane for everyone'

Sometime you want to work with multiple toolkits using the same input molecule. For example, you might want to compute a hybrid fingerprint, or make a model prediction where the descriptors come from different toolkits.

To do that, use the text_toolkit.read_ids_and_molecules() to read each record as a text molecule, and pass the actual record to the toolkit.parse_molecule() for each toolkit to get a molecule. Because I specifed the “ignore” error handler, the molecule will be None if the record could not be parsed. (See Specify alternate error behavior for more details.):

from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit
from chemfp import text_toolkit

for id, text_mol in text_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
    if openeye_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
        print("openeye parsed", repr(id))
        print("openeye could not parse", repr(id))

    if rdkit_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
        print("rdkit parsed", repr(id))
        print("rdkit could not parse", repr(id))

    if openbabel_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors="ignore"):
        print("openbabel parsed", repr(id))
        print("openbabel could not parse", repr(id))

The output from running the above is:

openeye parsed 'methane'
rdkit parsed 'methane'
openbabel parsed 'methane'
openeye parsed 'ethane not for RDKit'
[03:23:38] SMILES Parse Error: syntax error for input: C--C
rdkit could not parse 'ethane not for RDKit'
openbabel parsed 'ethane not for RDKit'
openeye parsed 'ethane for everyone'
rdkit parsed 'ethane for everyone'
openbabel parsed 'ethane for everyone'

The above works, but there’s a lot of duplicate code, I don’t like the layout for the output, and there’s bit of extra overhead to re-interpret the parse_molecule() for each call. I’ll make a space-delimited file as output, and use toolkit.make_id_and_molecule_parser() to create a specialized parser for each available toolkit:

from __future__ import print_function
import chemfp
from chemfp import text_toolkit

reader = text_toolkit.read_ids_and_molecules("strange.smi")
format = reader.metadata.record_format

column_headers = []
parsers = []
for toolkit_name in ("openeye", "rdkit", "openbabel"):
    toolkit = chemfp.get_toolkit(toolkit_name)
  except ValueError:
    parser = toolkit.make_id_and_molecule_parser(format, errors="ignore")


print(*column_headers, sep="\t")  # print the header

for id, text_mol in reader:
  columns = []
  for parser in parsers:
    if parser is None:
      id, mol = parser(text_mol.record)
      if mol is not None:
  print(*columns, sep="\t")

This writes a tab-delimited file to stdout, ready for import into any spreadsheet program:

openeye       rdkit   openbabel       ID
Yes   Yes     Yes     methane
Yes   No      Yes     ethane not for RDKit
Yes   Yes     Yes     ethane for everyone

(There will also be an error message from RDKit sent to stderr.)

Add multiple toolkit fingerprints to an SD file

In this section you’ll learn how to use multiple toolkits to generate fingerprints for each molecule in an SD file, and add the fingerprints results back to the record as new SD tags.

In Add fingerprints to an SD file using a toolkit you learned how to use a toolkit to read a file as molecules, compute a fingerprint for each molecule, and add the fingerprint to the molecule as an SD tag, and save the result to a new SD file. The processing pipeline converted the input to a toolkit molecule and out again, and in doing so changed other parts of the record besides the new SD tag.

Sometimes you want to preserve the input as much as you can. For that case you can use the text reader to get text molecules, pass each text molecule’s record that to the toolkit to compute the fingerprint, add the new fingerprint as a tag for the text molecule, and save the result to a file.

I’ll do that one better; I’ll generate fingerprints using multiple toolkit and add all of them to the output file. Here’s an example of what the end of a new record will look like. Note: although the fingerprints are actually on one line, I’ve folded the long fingerprints across multiple lines so it doesn’t overflow this page.

18  20  8
20  21  8
6  18  8
6  8  8
8  21  8

> <rdkit512>

> <rdkit1024>

> <obfp3>


I’ll break it down into stages. The first is some preamble code to import the modules and configure the input and output files:

import chemfp

# I'll use chemfp's text-based SD parser, so the output SD records
# will be identical to the input records, except to append the new
# tags at the end of each record.

from chemfp import text_toolkit

input_filename = "/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz"
output_filename = "output.sdf.gz"

Next is to get the right SDF parser (a function which converts an SDF record into a identifier and a native toolkit molecule) and fingerprinter (a function which converts a toolkit molecule into a fingerprint) for each fingerprint type.

# The list of tag names and the corresponding fingerprint types.
wanted_fingerprint_types = (
    ("rdkit512", "RDKit-Fingerprint fpSize=512"),
    ("rdkit1024", "RDKit-Fingerprint fpSize=1024"),
    ("obfp3", "OpenBabel-FP3"),

build_data = []   # I'll use this to build the fingerprint data.
toolkit_sdf_record_parsers = {}  # I'll use this to convert an SD record into a molecule.

for output_tag, fingerprint_type_string in wanted_fingerprint_types:
    # First, get the corresponding fingerprint type.
    fingerprint_type = chemfp.get_fingerprint_type(fingerprint_type_string)

    # Figure out which toolkit to use to parse the SD records.
    toolkit = fingerprint_type.toolkit

    # For each unique toolkit, get a function that turns an SD record into a molecule.
    # (If multiple fingerprints use the same toolkit then I only
    # need to parse it once.)
    if toolkit.name not in toolkit_sdf_record_parsers:
        # The "ignore" means to return None on error, rather than raise an exception.
        toolkit_sdf_record_parsers[toolkit.name] = toolkit.make_id_and_molecule_parser("sdf", errors="ignore")

    # Get a function which turns a molecule into a fingerprint.
    fingerprinter = fingerprint_type.make_fingerprinter()

    # Store this information for record processing.
    build_data.append( (output_tag, toolkit.name, fingerprinter) )

Finally, use the text toolkit to read text molecules for each record, then use the SDF parser to get the id and molecule from the record text, then the fingerprinter to get the fingerprint from the molecule:

# Use the text toolkit to read and write SDF records.
with text_toolkit.open_molecule_writer(output_filename) as writer:
    for text_mol in text_toolkit.read_molecules(input_filename):

        # The text "molecule" .record is the actual text.
        record = text_mol.record

        # Make the fingerprints for each record and append the tag.

        # For extra performance, cache parsed molecules for future use.
        toolkit_mols = {}

        for output_tag, toolkit_name, fingerprinter in build_data:
            # There's no need to reparse the record if I've seen it before.
            if toolkit_name in toolkit_mols:
                toolkit_mol = toolkit_mols[toolkit_name]
                # Parse the record and save the molecule for later.
                toolkit_id, toolkit_mol = toolkit_sdf_record_parsers[toolkit_name](record)
                toolkit_mols[toolkit_name] = toolkit_mol

            if toolkit_mol is None:
                # There's no molecule, so no fingerprint. Save the empty string.
                text_mol.add_tag(output_tag, "")
                # Make a fingeprint and save it to the tag as a hex-encoded string.
                fp = fingerprinter(toolkit_mol)
                text_mol.add_tag(output_tag, fp.encode("hex"))

        # Write the text molecule to the output stream.

Text toolkit and SDF files

In this section you’ll learn about the specialized SDF reader API to read SDF records and tag values directly instead of through a text record.

The text toolkit support for the toolkit API lets you use the same code for SDF and SMILES, and switch between text-based and molecule-based parsers. Genericness comes at a cost. The TextRecord class is a wrapper around the actual record, so at the least there is some overhead for creating a wrapper for each record.

The text toolkit has special support for reading SDF records as raw byte strings, which are not wrapped in any object. There several SDF reader variations depending on if you want to read from a file or a string, and if you want to read the record, the (id, record) pair, or an (id, tag value) pair. These functions are:

(Note: while I write this as (id, value), those are just labels. By default it returns (SD title, SD title) pairs, or you can specify an alternate id_tag and value_tag to get the pairs you want.)

There are also special functions to work with the tag data and title of an SDF record, which take the record string as input:

The next few sections will cover some examples of how to use these specialized functions.

Read id and tag value pairs from an SD file

In this section you’ll learn how read the (id, tag value) for each record in an SD file using a specialized SDF reader. You will need Compound_014550001_014575000.sdf.gz from PubChem.

The specialized SDF readers are faster than the more generic text_toolkit support for the toolkit API. As an example, I’ll extract the identifer and molecular weight field from a PubChem file using the (slower) chemfp toolkit API:

from __future__ import print_function
from chemfp import text_toolkit

filename = "Compound_014550001_014575000.sdf.gz"
with text_toolkit.read_ids_and_molecules(filename) as reader:
  for id, text_mol in reader:
    mw = text_mol.get_tag("PUBCHEM_EXACT_MASS")
    print(id, mw)

Next I’ll extract it using the (faster) read_sdf_ids_and_values() function, which returns an iterator of the (id, tag value) pairs. Just like with toolkit.read_ids_and_molecules(), by default the id is the title line of the SD record, or I can use the id_tag parameter to get it from one of the SD tags. The value_tag has the same meaning; by default the value is the record’s title, or I can specify an alternate tag name containing the value to use:

from __future__ import print_function
from chemfp import text_toolkit

filename = "Compound_014550001_014575000.sdf.gz"
with text_toolkit.read_sdf_ids_and_values(filename, value_tag="PUBCHEM_EXACT_MASS") as reader:
  for id, mw in reader:
    print(id, mw)

Both of these generate output starting with:

14550001 229.041
14550002 199.067
14550003 169.056
14550004 124.1
14550005 291.954
14550010 259.061
14550011 224.05

My timings show that the first, generic implementation takes 0.26 seconds while the second, specialized implementation takes 0.17 seconds, which is about 33% faster, or enough to save about an hour when parsing PubChem. (The difference is even larger without the gzip overhead.) That’s why the sdf2fps command-line tool uses this function to extract the ids and fingerprint values from PubChem files.

Extract the id and atom and bond counts from an SD file

In this section you’ll use a specialized SDF reader iterate over the records of an SD file. You will need Compound_014550001_014575000.sdf.gz from PubChem.

The “records” returned by read_sdf_records(), read_sdf_records_from_string(), read_sdf_ids_and_records(), and read_sdf_ids_and_records_from_string() are the actual record content as a string, and not wrapped in a TextRecord or other class.

For example, the following will read each record from an SD file and use a regular expression to extract the title line, the number of atoms from the first 3 characters of line 4, and the number of bonds as the second 3 characters of line 4:

from __future__ import print_function
from chemfp import text_toolkit
import re

pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")

filename = "Compound_014550001_014575000.sdf.gz"
for record in text_toolkit.read_sdf_records(filename):
  m = pat.match(record)
  id = m.group(1).decode("utf8")
  num_atoms = int(m.group(2))
  num_bonds = int(m.group(3))
  print(id, num_atoms, num_bonds)

The output starts:

14550001 26 26
14550002 26 26
14550003 22 22
14550004 21 21
14550005 24 23
14550010 31 31
14550011 27 28
14550044 166 162

(Bear in mind that there may also be implicit hydrogens, so unless you know that all hydrogens are explicit or implicit, these numbers may only be roughly useful.)

Records are byte strings

The example code, while short, is still a bit tricky. The reader returns the SD records as byte strings, not Unicode strings. Why? First and foremost, using Python to read bytes from a file is 2-3x faster than reading Unicode. If all you care about is reading a couple of fields from the record then it’s faster to work with bytes and convert only those fields.

Second, this is a low-level API meant to give the actual byte representation of the data. Among other things, you should be able to know exactly where the record is located in the file. You can even do things like handle mixed encodings, where one tag value is UTF-8 encoded and another is Latin-1 encoded and cannot be read as a value UTF-8.

Python 3 makes a strong distinction between a byte string and a Unicode string. For Python 3, because the record a byte string, you’ll have to use a byte-based regular expression to parse it, as in:

pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")

You’ll also have to convert the title bytes to Unicode if you want to print the result, as in:

id = m.group(1).decode("utf8")

Thankfully, int() knows how to read the ASCII digits from a byte string, so I didn’t have to do extra work there.

SDF-specific parser parameters

In this section you’ll learn that the specialized SDF readers support the standard errors and location , and have a few special parameters of their own. You will need Compound_014550001_014575000.sdf.gz from PubChem.

All six of the read_sdf_* functions support the same errors and location parameters as the standard toolkit API, with the same meaning. For example, the following shows where each record is located in the uncompressed file:

from __future__ import print_function
from chemfp import text_toolkit

filename = "Compound_014550001_014575000.sdf.gz"
with text_toolkit.read_sdf_ids_and_records(filename) as reader:
  loc = reader.location
  for id, record in reader:
    start_byte, end_byte = loc.offsets
    print("%s at line %d (bytes %d-%d)" % (id, loc.lineno, start_byte, end_byte))

The output starts:

14550001 at line 1 (bytes 0-4227)
14550002 at line 166 (bytes 4227-8399)
14550003 at line 330 (bytes 8399-12140)
14550004 at line 486 (bytes 12140-15744)
14550005 at line 639 (bytes 15744-19838)

See Handling errors when reading molecules from a string for more information about the errors parameter, and Location information: record position and content for a description of the how to use a Location to the record’s first line number and start/end offsets in the file.

The six functions do not have a format option, because the format must be “sdf” or “sdf.gz”. Instead, there is a compression parameter. The default of None selects the compression type based on the filename, if the filename is available, or assumes the input is uncompressed. Use “gz” if the input is gzip’ed, and “none” or “” if the input is uncompressed.

The block_size is a tunable parameter, with a default value of 320 KB. The underlying reader reads a block of text then tries to extract records. When it gets to the end of a block, it reads a new block, and prepends the remaining part of the old block to the new one before looking for new records.

For performance reasons, the block_size should be several times larger than the largest record. During error recovery, the reader will read up to 320 KB or 5*block_size, whichever is larger, in order to find the next “$$$$” line and resynchronize.

Working with SD records as strings

In this section you’ll learn about the helper functions to work with SD record id and tag data when the SD record is a string. You will need Compound_014550001_014575000.sdf.gz from PubChem.

I’ll use one of the specialized SD file readers, read_sdf_records(), to get the first record from an SD file:

>>> from __future__ import print_function
>>> from chemfp import text_toolkit
>>> record = next(text_toolkit.read_sdf_records("Compound_014550001_014575000.sdf.gz"))
>>> print(record[:100])

 26 26  0     0  0  0  0  0  0999 V2000
    5.1350    0.8450    0.0

I can use get_sdf_tag() and get_sdf_tag_pairs() to get information about the tags in the record:

>>> for tag_name, tag_value in text_toolkit.get_sdf_tag_pairs(record):
...   print(tag_name, "=", repr(tag_value[:40]))
PUBCHEM_IUPAC_OPENEYE_NAME = '2-(2-hydroxyethylsulfanylmethyl)-4-nitro'
>>> text_toolkit.get_sdf_tag(record, "PUBCHEM_IUPAC_OPENEYE_NAME")

or use add_sdf_tag() to create a new record with a given tag and value added to the end of the tag block:

>>> print(record[-90:])
10  13  8
11  14  8
13  14  8
7  10  8
7  9  8
9  11  8


>>> new_record = text_toolkit.add_sdf_tag(record, b"VOLUME", b"123.45")
>>> print(new_record[-109:])
10  13  8
11  14  8
13  14  8
7  10  8
7  9  8
9  11  8



I can also get the title line of the SD record using get_sdf_id():

>>> text_toolkit.get_sdf_id(record)

or create a new string which is the old string with the title line replaced by a new value:

>>> new_record = text_toolkit.set_sdf_id(record, b"987ZYX")
>>> text_toolkit.get_sdf_id(new_record)

Note that I used byte strings, like b"VOLUME" and b"987ZYX". In general the values must be of the same string type as the record. On the flip side, if you have a Unicode record then you must pass in Unicode strings as values:

>>> unicode_record = record.decode("utf8")
>>> new_record = text_toolkit.set_sdf_id(unicode_record, u"Hello")
>>> new_record[:6]

Unicode and other character encoding

In this section you’ll learn a bit about how the text toolkit deals with different character encodings. This is a hard topic and I won’t cover it in full details. If you have a problem with Unicode encodings (and hopefuly a support contract) then contact me and I’ll help that way.

The SDF format is 8-bit clean. The specification itself uses ASCII but fields like the title, the tag name, and the tag value can contain nearly any byte value. (Some values like newline and ‘<’ and ‘>’ in the tag name, have special meaning and must not be used.)

Unfortunately, different software handle those non-ASCII values differently. An older Unix system might use the Latin-1 character set, which is able to handle many European and some non-European languages, but doesn’t have the Euro currency symbol. Microsoft Windows code page 1252 is effectively a superset of Latin-1, with the Euro symbol and a several other additional symbols.

There are of course many other symbols. The consensus for new systems is to use UTF-8 encoded Unicode, which is compatible with 8-bit clean ASCII and can handle most of the world’s languages, plus a large number of symbols. This encoding may use one, two, or more bytes to represent each symbol.

The Python3 bindings of OpenEye, RDKit, and Open Babel’s have all decided to interpret SD files as UTF-8 encoded. This consensus is great ... so long as your files are also compatible with UTF-8. But what if they aren’t? What if you have to read Latin-1 encoded file, or worse, a file where different fields have multiple encodings?

To demonstrate the problem, I’ll construct a problematic file for β-methylphenethylamine, with an experimental melting point of 140-142°C, stored in a Latin-1 encoded SD file. For now I’ll use use ‘Beta’ for the name, and ‘DEGREE’ for the temperature, as placeholders for the two non-ASCII characters.

>>> from __future__ import print_function
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("NCC(c1ccccc1)C", "smi")
>>> T.set_id(mol, "Beta-methylphenethylamine")
>>> T.add_tag(mol, "MP", "140-142DEGREEC")
>>> unicode_record = T.create_string(mol, "sdf")
>>> print(unicode_record)

 10 10  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  4  1  0
  4  5  2  0
  5  6  1  0
  6  7  2  0
  7  8  1  0
  8  9  2  0
  3 10  1  0
  9  4  1  0
> <MP>


Next, I’ll replace the ‘DEGREE’ with the corresponding Unicode characters. (I’ll use the long Unicode name to be explicit.)

>>> unicode_record = unicode_record.replace(u"DEGREE", u"\N{DEGREE SIGN}")
>>> print(unicode_record)

 10 10  0  0  0  0  0  0  0  0999 V2000
> <MP>


Finally, I’ll save it to the file “latin1.sdf”, using the Latin-1 encoding:

>>> open("latin1.sdf", "wb").write(unicode_record.encode("latin1"))

This is not valid UTF-8. In my terminal, the MP tag value looks like:

> <MP>

where the “�” is the special symbol for REPLACEMENT CHARACTER, meaning that the actual character cannot be shown.

What happens if I read the file using each of the native toolkit APIs? First, OEChem under both Python 2.7 and Python 3.6:

>>> from openeye.oechem import *
>>> ifs = oemolistream("latin1.sdf")
>>> mol = OEGraphMol()
>>> OEReadMolecule(ifs, mol)
>>> OEGetSDData(mol, "MP")  # OEChem on Python 2.7

>>> OEGetSDData(mol, "MP")  # OEChem on Python 3.6

Remember, OEGetSDData() on Python 2.7 returns byte strings, and you’ll need to decode that string manually to get the degree symbol. While OEGetSDData() on Python 3.5 returns Unicode strings, but the byte “\xb0” is not a valid UTF-8 encoding. Instead, OEChem uses the Unicode codepoint “\udcb0”. This is a surrogate for the actual character, and something I don’t fully understand. Various sources say this is a UTF-16 behavior which isn’t correct UTF-8. Python doesn’t like it:

>>> print('140-142\udcb0C')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb0' in position 7: surrogates not allowed

Next, Open Babel under both Python 2.7 and Python 3.6:

>>> import openbabel as ob
>>> conv = ob.OBConversion()
>>> mol = ob.OBMol()
>>> conv.ReadFile(mol, "latin1.sdf")
>>> mol.GetData("MP").GetValue()  # Open Babel on Python 2.7

>>> mol.GetData("MP").GetValue()  # Open Babel on Python 3.6

Open Babel gives exactly the same results as OEChem.

Finally, RDKit:

>>> from rdkit import Chem
>>> supplier = Chem.ForwardSDMolSupplier("latin1.sdf")
>>> mol = next(supplier)
>>> mol.GetProp("MP")      # RDKit on Python 2.7

>>> mol.GetProp("MP")      # RDKit on Python 3.6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte

RDKit doesn’t give a surrogate value for the illegal UTF-8 character. Instead, it complains. Which also means there is no way to get that data from Python.

What do you do if you have to read a Latin-1 encoded SD file? One solution is to use an external tool like iconv to translate the file to UTF8.

% iconv -f latin1 -t utf-8 < latin1.sdf > utf8.sdf

Another is to use Python to convert the entire file from Latin-1 to UTF8 then pass the transcoded contents to the toolkit:

>>> content = open("latin1.sdf", "rb").read()
>>> content = content.decode("latin1").encode("utf8")
>>> from __future__ import print_function
>>> import chemfp
>>> for tk in ("openbabel", "openeye", "rdkit"):
...   T = chemfp.get_toolkit(tk)
...   for mol in T.read_molecules_from_string(content, "sdf"):
...     print(tk, T.get_tag(mol, "MP"))
openbabel 140-142°C
openeye 140-142°C
rdkit 140-142°C

But if all you want is some of the tag data values, and not the molecule, then you can ask the text_toolkit to read the record as a “latin1” encoded file:

>>> from chemfp import text_toolkit
>>> for mol in text_toolkit.read_molecules("latin1.sdf", encoding="latin1"):
...   print(mol.get_tag("MP"))

The content is converted on-demand, that is, only when get_id() or get_tag() are called. The text_toolkit’s “molecule” stores the encoding so it knows how to decode the fields:

>>> mol.encoding

By the way, if you omit the ‘encoding=”latin1”’ parameter then you’ll get an exception:

>>> for mol in text_toolkit.read_molecules("latin1.sdf"):
...   print(mol.get_tag("MP"))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/_text_toolkit.py", line 258, in get_tag
    return get_sdf_record_tag(self.record, tag, self.encoding, self.encoding_errors)
  File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/_text_toolkit.py", line 1474, in get_sdf_record_tag
    return value.decode(encoding, encoding_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte

Mixed encodings and raw bytes

In this section you’ll learn how to get access to the id and tag data as byte strings rather than Unicode strings. This might be used if you have a perverse file which uses multiple encodings. If you run into that case, let me know - I’ll give you a sympathy prize for having to deal with it.

In the previous section you learned a few ways to read an Latin-1 encoded SD file. What happens if the title line contains an id which is UTF-8 encoded while the tag data contains a Latin-1 encoded value? (Or if you have to deal with a ‘clever’ programmer who put in semi-binary data into a data field. Because that’s the sort of thing we clever programmers sometimes do.)

The techniques I mentioned in that previous section won’t work because they assume the entire file has the same encoding.

Instead, use the text_toolkit to read the file, but access it through the byte API rather than the string API.

I need an example file. I’ll start with the “latin1.sdf” file I created for the previous section, which uses a Latin-1 encoded degree symbol in the “MP” tag data. I’ll modify it so the “Beta” in the title line is replaced by the UTF-8 encoded “β” character.

>>> content = open("latin1.sdf", "rb").read()
>>> mixed_content = content.replace(b"Beta", u"\N{GREEK SMALL LETTER BETA}".encode("utf8"))
>>> open("mixed.sdf", "wb").write(mixed_content)

(On Python 3, that last line will return “946”, to indicate that it wrote 946 bytes to the file.)

On a UTF-8 terminal the title line and the MP tag data line are respectively:

On a Latin-1 terminal they are:

How do I get their “real” values? I’ll use the text_toolkit to read the first record from the file:

>>> from chemfp import text_toolkit
>>> mol = next(text_toolkit.read_molecules("mixed.sdf"))
>>> mol
record='\xce\xb2-methylphenethylamine\n     RDKit          \n\n 10 10  0   ...',
encoding='utf8', encoding_errors='strict')

The title line is in utf8 so that’s not a problem

>>> print(mol.id)

But I won’t be able to read the “MP” field because it’s not UTF-8 encoded:

>>> mol.get_tag("MP")
Traceback (most recent call last):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 7: invalid start byte

Instead, I’ll use get_tag_as_bytes() to get the underlying bytes for the named tag, rather than as converted to a Unicode string:

>>> mol.get_tag_as_bytes(b"MP")

Once I have the bytes, I can decode them as Latin-1:

>>> print(mol.get_tag_as_bytes(b"MP").decode("latin1"))

Note that this function requires the tag name be the byte string which is found in the file. A Unicode name (which is the default string type under Python 3) will raise an exception:

>>> mol.get_tag_as_bytes(u"MP")
Traceback (most recent call last):
ValueError: tag must be a byte string or None

Use method get_tag_pairs_as_bytes() to get the list of all (tag, data) pairs, where both the tag and data are return as byte strings.

>>> mol.get_tag_pairs_as_bytes()
[('MP', '140-142\xb0C')]

Finally, use id_bytes to get the raw bytes for the identifier:

>>> mol.id_bytes

For example, if I read the file as Latin-1 then the Unicode id “MP” tag be what I expected, the id won’t be correct. Instead, I can get the id_bytes and decode it manually as UTF-8:

>>> mol2 = next(text_toolkit.read_molecules("mixed.sdf", encoding="latin1"))
>>> print(mol2.get_tag("MP"))
>>> mol2.id
>>> print(mol2.id)
>>> print(mol2.id_bytes.decode("utf8"))