Text toolkit examples¶

The text toolkit separates record parsing from chemical parsing. It understands the basic text structure of SDF and SMILES-based files and records, but not chemistry. It’s designed with the following use cases in mind:

• add tag data to an input SDF record but keep everything else unchanged. This preserves data which might be lost by converting to then from a chemical toolkit molecule.
• synchronize operations between multiple toolkits; For example, consider a hybrid fingerprint using both OEChem and RDKit. The individual RDKit and OEChem SDF readers may get out of synch when on toolkit can’t parse a record which the other can. In that case, use the text toolkit to read the records then pass the record to the chemistry toolkit.
• extract tags from an SD file. Chemfp’s sdf2fps uses the text toolkit to get the id and the tag value which contains the fingerprint.

The text toolkit implements the chemistry toolkit API, except that instead of real molecule objects it uses a thin wrapper around the text for each wrapper. This chapter uses many of the concepts developed in the chapter on Toolkit API examples.

Toolkits may modify the molecular structure¶

In this section you’ll learn that a chemistry toolkit might change details of a structure record so the input record and output record have some differences, even though the molecular “essence” is preserved. This is meant as an example for why you might not want to work through a chemistry toolkit molecule for everything.

The section Add fingerprints to an SD file using a toolkit gave an example of using a toolkit to read an SD file, compute a MACCS fingerprint, add the fingerprint as a new SD tag, and save the result to a new SD file. This is a very common task.

A problem is that toolkits can apply various normalizations, like aromaticity perception, which change atom and bond aromaticity assignments. RDKit by default will also convert explicit hydrogens into implicit hydrogens. In that section, the input record had 46 atoms while RDKit generated an output record with 28 atoms. RDKit may also ‘sanitize’ the structures further (for example, convert ‘neutral 5 coordinate Ns with double bonds to Os to the zwitterionic form’).

While it’s possible to configure RDKit to keep implicit hydrogens, the RDKit MACCS fingerprinter assumes there are no explicit hydrogens. You would need to make a copy of the molecule, remove the explicit hydrogens yourself, generate the fingerprint, and then add the fingerprint to the molecule which still has the explicit hydrogens.

Bear in mind that the number of explicit atoms and bonds is based on the molecular graph model, which is only one possible representation for the actual chemical molecule. While I said there was a semantic change, the 46 atom structure and the 28 atom structure are really the same structure, just at different levels of conceptualization.

Toolkits may modify SDF syntax¶

In this section you’ll see that passing a structure file through a chemistry toolkit and back to the same format will likely make syntax changes to the record. While not as significant as the previous section, it may help persuade you that there are cases where you want to work with the original record as text rather than as a molecule.

You will need Compound_099000001_099500000.sdf.gz from PubChem.

I’ll read an SD file to get the first record as a toolkit molecule, save the molecule to SDF format, and compare the original record with the new one. This is called a round-trip test. Will there be differences?

import chemfp

# Select your toolkit of choice
T = chemfp.get_toolkit("openeye")
#T = chemfp.get_toolkit("rdkit")
#T = chemfp.get_toolkit("openbabel")

with T.open_molecule_writer("example.sdf") as writer:
writer.write_molecule(mol)
break  # only process the first molecule


If I use the “openeye” toolkit and compare its output to the first record of the input then the difference is trivial:

2c2
<   -OEChem-04292009532D
---
>   -OEChem-06182012582D


This difference is shown in the diff utility’s default format. The “2c2” means there was a change in line 2, and the changed line is also on line 2. The “<” indicates the line in the first file (in this case the original PubChem file) and the “>” indicates the line in the second file (in this case “example.sdf”). The “---” is to make it easier for humans to see break between the two files.

But what does that line mean? The “CTfile” (“connection table file”) spec from MDL, err, I mean Accelry, err, I mean Symyx, err, I mean BIOVIA, gives the full details. The first two characters (both blank here) are the user’s initials, the next 8 characters (OpenEye uses “-” to pad out “OEChem”) are the program name.

The next six character are the date, followed by 4 characters for the time. The PubChem record was created on 29 April 2020 at 09:33 while I did the transformation on 18 June 2020 at 12:58. The last two characters are the dimensionality; in this case the structure contains 2D coordinates.

PubChem used OEChem to make the file in the first place, so it’s not too suprising that there weren’t any differences. What about Open Babel? I changed the toolkit to “openbabel” and re-did the comparison. The first few lines of the diff are:

2c2
<   -OEChem-04292009532D
---
>  OpenBabel06182013012D
4c4
<  46 49  0     1  0  0  0  0  0999 V2000
---
>  46 49  0  0  1  0  0  0  0  0999 V2000


The 2c2 change you know already, and you can see it was a few minutes between when I ran the OEChem code and the Open Babel code.

The change to line 4 is meaningless. If you look closely you’ll see that OEChem has a blank in column 12 where Open Babel has a “0”. The specification say that this field is obsolete, so I think you can do whatever you want there.

The next few lines are:

65c65
<   8  9  1  6  0  0  0
---
>   8  9  1  0  0  0  0
67c67
<   8 29  1  0  0  0  0
---
>   8 29  1  1  0  0  0


This says that OEChem interprets the bond between atoms 8 and 9 as “6” = “down” stereochemistry, while Open Babel says it’s not stereo. On the other hand, OEChem interprets the bond bond atoms 8 and 29 as having no stereochemistry while Open Babel says it has “1” = “up” stereochemistry. Looks to me like two valid interpretations of the same thing.

The rest of the differences are trivial and semantically meaningless: Open Babel uses two spaces between the “>” and “<” of a data header line, while OEChem uses one space:

101c101
< > <PUBCHEM_COMPOUND_CID>
---
> >  <PUBCHEM_COMPOUND_CID>
104c104


Finally, I’ll use RDKit for the conversion. By default RDKit removes hydrogens, which would leave the result with 15 atoms. Unlike Open Babel, that action is configurable. I told RDKit to never remove hydrogens for any of its supported formats, via the reader_args:

reader_args = {"rdkit.*.removeHs": False}


(I didn’t actually need the “rdkit.*.” namespace prefix, but the “rdkit” helps as a reminder that this is an RDKit-specific option.)

There are the familiar changes in the second and fouth lines:

2c2
<   -OEChem-04292009532D
---
>      RDKit          2D
4c4
<  46 49  0     1  0  0  0  0  0999 V2000
---
>  46 49  0  0  1  0  0  0  0  0999 V2000


RDKit doesn’t include the timestamp so leaves that fields blank. (Then again, just how useful is the timestamp? On the third hand, the chemfp fingerprint formats include a timestamp as part of the metadata, so it’s odd that I question having it in another format. On the fourth hand, OEChem supports the SuppressTimestamps option to disable including the timestamp.)

While I love knowing these sorts of details, none of these (except for the explicit hydrogens) affect the semantic interpretation. Still, I can think of cases where you want to preserve the original syntax, like if you have fragile code which expects a “0” at a certain field and will crash if there’s a blank.

The text toolkit “molecules”¶

In this section you’ll learn about the molecule-like object used by the text_toolkit.

The text_toolkit implements the standard toolkit API, which means it reads and writes “molecules”. Remember that it isn’t really a chemical molecule but more like a thin layer around a molecule record. Internally these are subclasses of a TextRecord, though I’ll often refer to them as “text molecules” to distinguish them from the the actual record as a text string.

Every text molecule has an id attribute, which may be None if there is no identifier, and a record attribute containing the actual record as a string:

>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
>>> mol
SmiRecord(id='benzene', record=b'c1ccccc1O benzene', smiles='c1ccccc1O',
encoding='utf8', encoding_errors='strict')
>>> mol.id   # a Unicode string
'benzene'
>>> mol.record  # a byte string
b'c1ccccc1O benzene'
>>> text_toolkit.create_string(mol, "smistring")
'c1ccccc1O'
>>> text_toolkit.create_string(mol, "smi")
'c1ccccc1O benzene\n'
>>> text_toolkit.create_bytes(mol, "smistring")
b'c1ccccc1O'
>>> text_toolkit.create_bytes(mol, "smistring.zlib")
b'x\x9cK6L\x06\x01C\x7f\x00\x0fh\x03\x04'
>>>
>>> sdf_record = (
...   'methane\n' +
...   '\n' +
...   '\n' +
...   '  1  0  0  0  0  0  0  0  0  0999 V2000\n' +
...   '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n' +
...   'M  END\n' +
...   '\n')
>>>
>>> sdf_mol = text_toolkit.parse_molecule(sdf_record, "sdf")
>>> sdf_mol
SDFRecord(id_bytes=b'methane'(id='methane'),
record=b'methane\n\n\n  1  0  0  0  0  0  0  0  0  0999 V2000\n    0.0 ...',
encoding='utf8', encoding_errors='strict')
>>> sdf_mol.id
'methane'
>>> sdf_mol.record[-20:]
b'0  0  0\nM  END\n\n'


The record is always uncompressed.

Each of the SMILES-based records has its own unique class:

>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smi")
SmiRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "can")
CanRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usm")
UsmRecord(id='benzene', record=b'c1ccccc1O benzene',
smiles='c1ccccc1O', encoding='utf8', encoding_errors='strict')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "smistring")
SmiStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "canstring")
CanStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')
>>> text_toolkit.parse_molecule("c1ccccc1O benzene", "usmstring")
UsmStringRecord(id=None, record=b'c1ccccc1O', smiles='c1ccccc1O')


and for SMILES records you can access the SMILES directly through the smiles attribute:

>>> text_mol = text_toolkit.parse_molecule("C methane", "smistring")
>>> text_mol.smiles
'C'


Each text molecule also has a record_format attribute, which is the format name for the record.

>>> text_mol.record_format
'smistring'
>>> sdf_mol.record_format
'sdf'


The record_format values are “smi”, “can”, …, “usmstring” for the SMILES formats or “sdf” for a file in SDF format. The record_format will never have a compression suffix.

Unlike the chemistry-backed toolkits, the text_toolkit has no real understanding of chemistry, only a limited knowledge of the format structure. It will parse an generate garbage:

>>> text_mol = text_toolkit.parse_molecule("garbage", "smi")
>>> text_toolkit.create_string(text_mol, "smi", id="and trash",
...               writer_args={"delimiter": "tab"})
'garbage\tand trash\n'


The encoding and encoding_errors parameters describe the character encoding of the record bytes, and how to handle errors in converting to or from that encoding. For details see the section Unicode and other character encoding.

The text toolkit implements the toolkit API¶

In this section you’ll learn that the text toolkit is a pretty complete implementation of chemfp’s toolkit API.

The text toolkit implements all of the standard toolkit API, except that it doesn’t know how to convert between SMILES and SDF format. Here are some examples:

>>> from chemfp import text_toolkit
>>> mol = text_toolkit.parse_molecule("C", "smistring")
>>> text_toolkit.get_id(mol) is None
True
>>> text_toolkit.set_id(mol, u"methane")
>>> text_toolkit.get_id(mol)
'methane'
>>> text_toolkit.create_string(mol, "smi")
'C methane\n'
>>> content = "C methane\nO=O molecular oxygen\n"
...     for id, mol in reader:
...         print("#%d %r" % (reader.location.recno, id))
...
#1 'methane'
#2 'molecular oxygen'
>>>
>>> writer = text_toolkit.open_molecule_writer("light.sdf")
...   mass = text_toolkit.get_tag(mol, "PUBCHEM_EXACT_MASS")
...   mass = float(mass)
...   if mass > 100.0:
...     continue
...   cid = text_toolkit.get_tag(mol, "PUBCHEM_COMPOUND_CID")
...   print("Found", cid, mass)
...   writer.write_molecule(mol)
...
Found 99109812 99.068414
Found 99109899 97.052764
Found 99118867 98.073165
Found 99118868 98.073165
Found 99123566 97.089149
Found 99151119 84.057515
Found 99151121 84.057515
Found 99162605 98.073165
Found 99162607 98.073165
>>> writer.close()
>>> for lineno, line in enumerate(open("light.sdf"), 1):
...     print(repr(line))
...     if lineno == 4:
...         break
...
'99109812\n'
'  -OEChem-04292009532D\n'
'\n'
' 16 17  0     1  0  0  0  0  0999 V2000\n'


What you can’t do with the text_toolkit is convert from a SMILES-based format to SDF, or vice-versa. If you try you’ll either get an exception or a meaningless molecule representation.

While you can seemingly convert between the SMILES formats, the text toolkit doesn’t actually modify the SMILES term, so an input of “[238U]” will have a “canstring” (non-isomeric SMILES) of “[238U]”:

>>> U = text_toolkit.parse_molecule("[235U]", "smistring")
>>> text_toolkit.create_string(U, "canstring")
'[235U]'


I don’t know if I should make this more strict in the future, and prohibit conversion between “smi”, “can”, and “usm” formats.

In this section you’ll learn how to get and set the title line and get and add tag values to an SDF record when you have the record as a block of text.

There are two ways to get or modify SD tags for an SDFRecord, which is the TextRecord subclass for files in SDF format. The first is through the standard toolkit API functions chemfp.toolkit.get_tag(), chemfp.toolkit.get_tag_pairs(), and chemfp.toolkit.add_tag():

>>> from chemfp import text_toolkit
>>> content = (
... "methane\n" +
... "     RDKit          \n" +
... "\n" +
... "  1  0  0  0  0  0  0  0  0  0999 V2000\n" +
... "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n" +
... "M  END\n" +
... "\n")
>>> mol = text_toolkit.parse_molecule(content, "sdf")
>>> new_record = text_toolkit.create_string(mol, "sdf")
>>> print(new_record)
methane
RDKit

1  0  0  0  0  0  0  0  0  0999 V2000
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <MW>
16.04246



>>> new_mol = text_toolkit.parse_molecule(new_record, "sdf")
>>> text_toolkit.get_tag(new_mol, "MW")
'16.04246'
>>> text_toolkit.get_tag_pairs(new_mol)
[('MW', u'16.04246')]


and the second is to use the corresponding methods of the text molecule: TextRecord.get_tag(), TextRecord.get_tag_pairs(), and TextRecord.add_tag():

>>> new_mol.get_tag_pairs()
[('MW', u'16.04246')]
>>> new_mol.get_tag("MW")
'16.04246'
>>>
>>> text_toolkit.get_tag_pairs(new_mol)
[('MW', u'16.04246')]
>>> new_mol.get_tag_pairs()
[('MW', u'16.04246')]
>>> new_mol.get_tag("MW")
'16.04246'
>>> print(text_toolkit.create_string(new_mol, "sdf")[-39:])
> <MW>
16.04246

> <NUM_ATOMS>
5




Bear in mind that there is no way to delete a tag. This may be added in the future.

Synchronizing readers from different toolkits through the text toolkit¶

In this section you’ll learn how to keep two different toolkit parsers synchronized by using the text toolkit to parse the records, then pass the record over to each toolkit to convert it to a molecule.

A structure file may have a couple of records which cannot be parsed by a toolkit, usually due to odd chemistry definitions. It’s usually fine to skip those records, which is the purpose of the errors="ignore" setting. (See Handling errors when reading molecules from a string for more information about the errors parameter.)

Consider the following SMILES file with three lines:

% cat strange.smi
C methane
C--C ethane not for RDKit
CC ethane for everyone


The first and last are valid SMILES, but “C--C” is invalid. However, Open Babel will accept it, and OEChem will accept it because the default flavor does not add the “Strict” flavor flag. (See OpenEye-specific SMILES reader_args and writer_args for more information about OEChem flavors). As a result:

>>> from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit
>>> for id, mol in openeye_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("openeye found", repr(id))
...
openeye found 'methane'
openeye found 'ethane not for RDKit'
openeye found 'ethane for everyone'
>>>
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("rdkit found", repr(id))
...
rdkit found 'methane'
[15:12:30] SMILES Parse Error: syntax error while parsing: C--C
[15:12:30] SMILES Parse Error: Failed parsing SMILES 'C--C' for input: 'C--C'
rdkit found 'ethane for everyone'
>>>
>>> for id, mol in openbabel_toolkit.read_ids_and_molecules("strange.smi", errors="ignore"):
...     print("openbabel found", repr(id))
...
openbabel found 'methane'
openbabel found 'ethane not for RDKit'
openbabel found 'ethane for everyone'


Sometime you want to work with multiple toolkits using the same input molecule. For example, you might want to compute a hybrid fingerprint, or make a model prediction where the descriptors come from different toolkits.

To do that, use the text_toolkit.read_ids_and_molecules() to read each record as a text molecule, and pass the actual record to the toolkit.parse_molecule() for each toolkit to get a molecule. Because I specifed the “ignore” error handler, the molecule will be None if the record could not be parsed. (See Specify alternate error behavior for more details.):

.. code-block:: python


from chemfp import openeye_toolkit, rdkit_toolkit, openbabel_toolkit from chemfp import text_toolkit

for id, text_mol in text_toolkit.read_ids_and_molecules(“strange.smi”, errors=”ignore”):
if openeye_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors=”ignore”):
print(“openeye parsed”, repr(id))
else:
print(“openeye could not parse”, repr(id))
if rdkit_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors=”ignore”):
print(“rdkit parsed”, repr(id))
else:
print(“rdkit could not parse”, repr(id))
if openbabel_toolkit.parse_molecule(text_mol.record, text_mol.record_format , errors=”ignore”):
print(“openbabel parsed”, repr(id))
else:
print(“openbabel could not parse”, repr(id))

The output from running the above is:

.. code-block:: none

openeye parsed ‘methane’ rdkit parsed ‘methane’ openbabel parsed ‘methane’ openeye parsed ‘ethane not for RDKit’ [15:13:45] SMILES Parse Error: syntax error while parsing: C–C [15:13:45] SMILES Parse Error: Failed parsing SMILES ‘C–C’ for input: ‘C–C’ rdkit could not parse ‘ethane not for RDKit’ openbabel parsed ‘ethane not for RDKit’ openeye parsed ‘ethane for everyone’ rdkit parsed ‘ethane for everyone’ openbabel parsed ‘ethane for everyone’

The above works, but there’s a lot of duplicate code, I don’t like the layout for the output, and there’s bit of extra overhead to re-interpret the parse_molecule() for each call. I’ll make a space-delimited file as output, and use toolkit.make_id_and_molecule_parser() to create a specialized parser for each available toolkit:

import chemfp
from chemfp import text_toolkit

parsers = []
for toolkit_name in ("openeye", "rdkit", "openbabel"):
try:
toolkit = chemfp.get_toolkit(toolkit_name)
except ValueError:
parsers.append(None)
else:
parser = toolkit.make_id_and_molecule_parser(format, errors="ignore")
parsers.append(parser)

columns = []
for parser in parsers:
if parser is None:
columns.append("N/A")
else:
id, mol = parser(text_mol.record)
if mol is not None:
columns.append("Yes")
else:
columns.append("No")
columns.append(id)
print(*columns, sep="\t")


This writes a tab-delimited file to stdout, ready for import into any spreadsheet program:

.. code-block:: none

openeye rdkit openbabel ID Yes Yes Yes methane [15:15:21] SMILES Parse Error: syntax error while parsing: C–C [15:15:21] SMILES Parse Error: Failed parsing SMILES ‘C–C’ for input: ‘C–C’ Yes No Yes ethane not for RDKit Yes Yes Yes ethane for everyone

(There will also be error messages from RDKit sent to stderr.)

Add multiple toolkit fingerprints to an SD file¶

In this section you’ll learn how to use multiple toolkits to generate fingerprints for each molecule in an SD file, and add the fingerprints results back to the record as new SD tags.

In Add fingerprints to an SD file using a toolkit you learned how to use a toolkit to read a file as molecules, compute a fingerprint for each molecule, and add the fingerprint to the molecule as an SD tag, and save the result to a new SD file. The processing pipeline converted the input to a toolkit molecule and out again, and in doing so changed other parts of the record besides the new SD tag.

Sometimes you want to preserve the input as much as you can. For that case you can use the text reader to get text molecules, pass each text molecule’s record that to the toolkit to compute the fingerprint, add the new fingerprint as a tag for the text molecule, and save the result to a file.

I’ll do that one better; I’ll generate fingerprints using multiple toolkit and add all of them to the output file. Here’s an example of what the end of a new record will look like. Note: although the fingerprints are actually on one line, I’ve folded the long fingerprints across multiple lines so it doesn’t overflow this page.

> <rdkit512>
3ffef7cefffffffefbfedffbbdffefffbfffffffffbbbffffffffffdfddffdfffffdf7fffeffffe7
feeffffffffffbffef9ffffffd7fffeff6deffff3feffdff

> <rdkit1024>
7eeefbdff7f78bfbcf97fb37996eef67e25ef37f1bcd7db50b76f242efff93baf9d6533b91ebefef
f6ca4efc3eeafdff

> <oecircular4>
00000000000000080000000001080000000008000000080800000000000000000000000000000000
00000000100000000000000000000000000400000000000080000000000000000000000008002000
00000300000000000040000200100000000000000000000000000000008000010002000000054800
00000000000800000000040000000000200000002800000000000000000000000000000000000000
00000000400000000000000080000000000000000000000000000010001000000000400000000000
00000000008080000000000080000000000010004000000100000020010002200000000001000000
00000000000002021000002000000000000000100000004000000000000000000000000000200080
00000000000000000000000000000000000000000000000000000000000000001080000000100000
10000000000080000000000000000000000000000400000080000000000000000000000000000000
00200000000000000000040000100000000000001000010000000000000200000080000000000001
00000804000008000000000000000000000000000004004000000000000000000000008000000001
20000010000000000100000000200000010000040000000000000010000000800000000000000000
0000000000000000004002000000000000000000000000000000000400000000

> <obecfp2>
00000000001000000000000000000000000000000000000000000000000000000000000000000000
00001000002000000000000000000000000000000000000000000000000000000000000000000000
00020000000000000000000000000000000000400000000000000000000000000000000000000000
00000000800000000000000001000000000000000000000000000000000200000000000000000000
00004000000000000800000000000000000000000000000000000000000000000004000020000000
00000000000000000000000000000000000000000000028000000000000000000000000000000000
00000000000000000000000000000000000000200000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000040000000000000000040
00000080000000000000000000000004000800000000000000000000000000000000000000000000
00000000000002000000000000000000000000800000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000004000000000020000000000000080000
00000002000000000000000000000080000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000800001000000000




I’ll break it down into stages. The first is some preamble code to import the modules and configure the input and output files:

import chemfp

# I'll use chemfp's text-based SD parser, so the output SD records
# will be identical to the input records, except to append the new
# tags at the end of each record.

from chemfp import text_toolkit, bitops

input_filename = "Compound_099000001_099500000.sdf.gz"
output_filename = "output.sdf.gz"


Next is to get the right SDF parser (a function which converts an SDF record into a identifier and a native toolkit molecule) and fingerprinter (a function which converts a toolkit molecule into a fingerprint) for each fingerprint type.

# The list of tag names and the corresponding fingerprint types.
wanted_fingerprint_types = (
("rdkit512", "RDKit-Fingerprint fpSize=512"),
("rdkit1024", "RDKit-Fingerprint fpSize=1024"),
("obecfp2", "OpenBabel-ECFP2"),
)

build_data = []   # I'll use this to build the fingerprint data.
toolkit_sdf_record_parsers = {}  # I'll use this to convert an SD record into a molecule.

for output_tag, fingerprint_type_string in wanted_fingerprint_types:
# First, get the corresponding fingerprint type.
fingerprint_type = chemfp.get_fingerprint_type(fingerprint_type_string)

# Figure out which toolkit to use to parse the SD records.
toolkit = fingerprint_type.toolkit

# For each unique toolkit, get a function that turns an SD record into a molecule.
# (If multiple fingerprints use the same toolkit then I only
# need to parse it once.)
if toolkit.name not in toolkit_sdf_record_parsers:
# The "ignore" means to return None on error, rather than raise an exception.
toolkit_sdf_record_parsers[toolkit.name] = toolkit.make_id_and_molecule_parser("sdf", errors="ignore")

# Get a function which turns a molecule into a fingerprint.
fingerprinter = fingerprint_type.make_fingerprinter()

# Store this information for record processing.
build_data.append( (output_tag, toolkit.name, fingerprinter) )


Finally, use the text toolkit to read text molecules for each record, then use the SDF parser to get the id and molecule from the record text, then the fingerprinter to get the fingerprint from the molecule:

# Use the text toolkit to read and write SDF records.
with text_toolkit.open_molecule_writer(output_filename) as writer:

# The text "molecule" .record is the actual text.
record = text_mol.record

# Make the fingerprints for each record and append the tag.

# For extra performance, cache parsed molecules for future use.
toolkit_mols = {}

for output_tag, toolkit_name, fingerprinter in build_data:
# There's no need to reparse the record if I've seen it before.
if toolkit_name in toolkit_mols:
toolkit_mol = toolkit_mols[toolkit_name]
else:
# Parse the record and save the molecule for later.
toolkit_id, toolkit_mol = toolkit_sdf_record_parsers[toolkit_name](record)
toolkit_mols[toolkit_name] = toolkit_mol

if toolkit_mol is None:
# There's no molecule, so no fingerprint. Save the empty string.
else:
# Make a fingeprint and save it to the tag as a hex-encoded string.
fp = fingerprinter(toolkit_mol)

# Write the text molecule to the output stream.
writer.write_molecule(text_mol)


Text toolkit and SDF files¶

In this section you’ll learn about the specialized SDF reader API to read SDF records and tag values directly instead of through a text record.

The text toolkit support for the toolkit API lets you use the same code for SDF and SMILES, and switch between text-based and molecule-based parsers. Genericness comes at a cost. The TextRecord class is a wrapper around the actual record, so at the least there is some overhead for creating a wrapper for each record.

The text toolkit has special support for reading SDF records as raw byte strings, which are not wrapped in any object. There several SDF reader variations depending on if you want to read from a file or a string, and if you want to read the record, the (id, record) pair, or an (id, tag value) pair. These functions are:

(Note: while I write this as (id, value), those are just labels. By default it returns (SD title, SD title) pairs, or you can specify an alternate id_tag and value_tag to get the pairs you want.)

There are also special functions to work with the tag data and title of an SDF record, which take the record string as input:

The next few sections will cover some examples of how to use these specialized functions.

Read id and tag value pairs from an SD file¶

In this section you’ll learn how read the (id, tag value) for each record in an SD file using a specialized SDF reader. You will need Compound_099000001_099500000.sdf.gz from PubChem.

The specialized SDF readers are faster than the more generic text_toolkit support for the toolkit API. As an example, I’ll extract the identifer and molecular weight field from a PubChem file using the (slower) chemfp toolkit API:

from chemfp import text_toolkit

filename = "Compound_099000001_099500000.sdf.gz"
mw = text_mol.get_tag("PUBCHEM_EXACT_MASS")
print(id, mw)


Next I’ll extract it using the (faster) read_sdf_ids_and_values() function, which returns an iterator of the (id, tag value) pairs. Just like with toolkit.read_ids_and_molecules(), by default the id is the title line of the SD record, or I can use the id_tag parameter to get it from one of the SD tags. The value_tag has the same meaning; by default the value is the record’s title, or I can specify an alternate tag name containing the value to use:

from chemfp import text_toolkit

filename = "Compound_099000001_099500000.sdf.gz"
print(id, mw)


Both of these generate output starting with:

99000039 374.13789
99000230 449.162057
99002251 335.126991
99003537 374.210661
99003538 374.210661
99005028 315.183444
99005031 315.183444


My timings using the larger file Compound_145500001_146000000.sdf.gz show that the first, generic implementation takes 14.0 seconds while the second, specialized implementation takes 10.3 seconds, which is about 25% faster, which would save a lot of time when parsing all of PubChem. (The difference is even larger - nearly 50% faster! - without the gzip overhead.) That’s why the sdf2fps command-line tool uses this function to extract the ids and fingerprint values from PubChem files.

Extract the id and atom and bond counts from an SD file¶

In this section you’ll use a specialized SDF reader iterate over the records of an SD file. You will need Compound_099000001_099500000.sdf.gz from PubChem.

The “records” returned by read_sdf_records(), read_sdf_records_from_string(), read_sdf_ids_and_records(), and read_sdf_ids_and_records_from_string() are the actual record content as a string, and not wrapped in a TextRecord or other class.

For example, the following will read each record from an SD file and use a regular expression to extract the title line, the number of atoms from the first 3 characters of line 4, and the number of bonds as the second 3 characters of line 4:

from chemfp import text_toolkit
import re

pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")

filename = "Compound_099000001_099500000.sdf.gz"
m = pat.match(record)
id = m.group(1).decode("utf8")
num_atoms = int(m.group(2))
num_bonds = int(m.group(3))
print(id, num_atoms, num_bonds)


The output starts:

99000039 46 49
99000230 58 60
99002251 42 43
99003537 54 57
99003538 54 57
99005028 48 49
99005031 48 49
99006292 56 58


(Bear in mind that there may also be implicit hydrogens, so unless you know that all hydrogens are explicit or implicit, these numbers may only be roughly useful.)

Records are byte strings¶

The example code, while short, is still a bit tricky. The reader returns the SD records as byte strings, not Unicode strings. Why? First and foremost, using Python to read bytes from a file is significantly faster than reading Unicode. If all you care about is reading a couple of fields from the record then it’s faster to work with bytes and convert only those fields.

Second, this is a low-level API meant to give the actual byte representation of the data. Among other things, you should be able to know exactly where the record is located in the file. You can even do things like handle mixed encodings, where one tag value is UTF-8 encoded and another is Latin-1 encoded and cannot be read as a value UTF-8.

Python 3 makes a strong distinction between a byte string and a Unicode string. For Python 3, because the record a byte string, you’ll have to use a byte-based regular expression to parse it, as in:

pat = re.compile(br"(.*)\n.*\n.*\n(...)(...)")


You’ll also have to convert the title bytes to Unicode if you want to print the result, as in:

id = m.group(1).decode("utf8")


Thankfully, int() knows how to read the ASCII digits from a byte string, so I didn’t have to do extra work there.

SDF-specific parser parameters¶

In this section you’ll learn that the specialized SDF readers support the standard errors and location , and have a few special parameters of their own. You will need Compound_099000001_099500000.sdf.gz from PubChem.

All six of the read_sdf_* functions support the same errors and location parameters as the standard toolkit API, with the same meaning. For example, the following shows where each record is located in the uncompressed file:

from chemfp import text_toolkit

filename = "Compound_099000001_099500000.sdf.gz"
start_byte, end_byte = loc.offsets
print("%s at line %d (bytes %d-%d)" % (id, loc.lineno, start_byte, end_byte))


The output starts:

.. code-block:: none

99000039 at line 1 (bytes 0-6709) 99000230 at line 223 (bytes 6709-14560) 99002251 at line 462 (bytes 14560-20689) 99003537 at line 668 (bytes 20689-28115) 99003538 at line 909 (bytes 28115-35540) 99005028 at line 1150 (bytes 35540-42315)

See Handling errors when reading molecules from a string for more information about the errors parameter, and Location information: record position and content for a description of the how to use a Location to the record’s first line number and start/end offsets in the file.

The six functions do not have a format option, because the format must be “sdf” or “sdf.gz”. Instead, there is a compression parameter. The default of None selects the compression type based on the filename, if the filename is available, or assumes the input is uncompressed. Use “gz” if the input is gzip’ed, “zst” if the input use Zstandard compression (and the zstandard Python package is available) and “none” or “” if the input is uncompressed.

The block_size is a tunable parameter, with a default value of 320 KB. The underlying reader reads a block of text then tries to extract records. When it gets to the end of a block, it reads a new block, and prepends the remaining part of the old block to the new one before looking for new records.

For performance reasons, the block_size should be several times larger than the largest record. During error recovery, the reader will read up to 320 KB or 5*block_size, whichever is larger, in order to find the next “” line and resynchronize.

Working with SD records as strings¶

In this section you’ll learn about the helper functions to work with SD record id and tag data when the SD record is a string. You will need Compound_099000001_099500000.sdf.gz from PubChem.

I’ll use one of the specialized SD file readers, read_sdf_records(), to get the first record from an SD file:

.. code-block:: pycon

>>> from chemfp import text_toolkit
>>> print(record[:73])
b'99000039\n  -OEChem-04292009532D\n\n 46 49  0     1  0  0  0  0  0999 V2000\n'
>>> print(record.decode("utf8")[:110])
99000039
-OEChem-04292009532D

46 49 0 1 0 0 0 0 0999 V2000
7.8451 3.0179 0.0000 O 0

I can use get_sdf_tag() and get_sdf_tag_pairs() to get information about the tags in the record:

.. code-block:: pycon

>>> for tag_name, tag_value in text_toolkit.get_sdf_tag_pairs(record):
...   print(tag_name, "=", repr(tag_value[:40]))
...
b'PUBCHEM_COMPOUND_CID' = b'99000039'
b'PUBCHEM_COMPOUND_CANONICALIZED' = b'1'
b'PUBCHEM_CACTVS_COMPLEXITY' = b'611'
b'PUBCHEM_CACTVS_HBOND_ACCEPTOR' = b'4'
...
b'PUBCHEM_CACTVS_TAUTO_COUNT' = b'-1'
b'PUBCHEM_COORDINATE_TYPE' = b'1\n5\n255'
b'PUBCHEM_BONDANNOTATIONS' = b'12  13  8\n12  17  8\n13  18  8\n16  19  8\n'
>>> text_toolkit.get_sdf_tag(record, "PUBCHEM_IUPAC_OPENEYE_NAME")
u'2-(2-hydroxyethylsulfanylmethyl)-4-nitro-phenol'


or use add_sdf_tag() to create a new record with a given tag and value added to the end of the tag block:

>>> print(record[-210:].decode("utf8"))
> <PUBCHEM_BONDANNOTATIONS>
12  13  8
12  17  8
13  18  8
16  19  8
16  23  8
17  20  8
18  21  8
19  22  8
19  24  8
20  21  8
22  25  8
23  26  8
24  27  8
25  26  8
27  28  8
7  22  8
7  28  8
8  9  6



>>> new_record = text_toolkit.add_sdf_tag(record, b"VOLUME", b"123.45")
>>> print(new_record[-229:].decode("utf8"))
> <PUBCHEM_BONDANNOTATIONS>
12  13  8
12  17  8
13  18  8
16  19  8
16  23  8
17  20  8
18  21  8
19  22  8
19  24  8
20  21  8
22  25  8
23  26  8
24  27  8
25  26  8
27  28  8
7  22  8
7  28  8
8  9  6

> <VOLUME>
123.45




I can also get the title line of the SD record using get_sdf_id():

>>> text_toolkit.get_sdf_id(record)
b'99000039'


or create a new string which is the old string with the title line replaced by a new value:

>>> new_record = text_toolkit.set_sdf_id(record, b"987ZYX")
>>> text_toolkit.get_sdf_id(new_record)
b'987ZYX'


Note that I used byte strings, like b"VOLUME" and b"987ZYX". In general the values must be of the same string type as the record. On the flip side, if you have a Unicode record then you must pass in Unicode strings as values:

>>> unicode_record = record.decode("utf8")
>>> new_record = text_toolkit.set_sdf_id(unicode_record, u"Hello")
>>> new_record[:6]
'Hello\n'


Unicode and other character encoding¶

In this section you’ll learn a bit about how the text toolkit deals with different character encodings. This is a hard topic and I won’t cover it in full details. If you have a problem with Unicode encodings (and hopefuly a support contract) then contact me and I’ll help that way.

The SDF format is 8-bit clean. The specification itself uses ASCII but fields like the title, the tag name, and the tag value can contain nearly any byte value. (Some values like newline and ‘<’ and ‘>’ in the tag name, have special meaning and must not be used.)

Unfortunately, different software handle those non-ASCII values differently. An older Unix system might use the Latin-1 character set, which is able to handle many European and some non-European languages, but doesn’t have the Euro currency symbol. Microsoft Windows code page 1252 is effectively a superset of Latin-1, with the Euro symbol and a several other additional symbols.

There are of course many other symbols. The consensus for new systems is to use UTF-8 encoded Unicode, which is compatible with 8-bit clean ASCII and can handle most of the world’s languages, plus a large number of symbols. This encoding may use one, two, or more bytes to represent each symbol.

The Python3 bindings of OpenEye, RDKit, and Open Babel’s have all decided to interpret SD files as UTF-8 encoded. This consensus is great … so long as your files are also compatible with UTF-8. But what if they aren’t? What if you have to read Latin-1 encoded file, or worse, a file where different fields have multiple encodings?

To demonstrate the problem, I’ll construct a problematic file for β-methylphenethylamine, with an experimental melting point of 140-142°C, stored in a Latin-1 encoded SD file. For now I’ll use use ‘Beta’ for the name, and ‘DEGREE’ for the temperature, as placeholders for the two non-ASCII characters.

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("NCC(c1ccccc1)C", "smi")
>>> T.set_id(mol, "Beta-methylphenethylamine")
>>> unicode_record = T.create_string(mol, "sdf")
>>> print(unicode_record)
Beta-methylphenethylamine
RDKit

10 10  0  0  0  0  0  0  0  0999 V2000
0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
1  2  1  0
2  3  1  0
3  4  1  0
4  5  2  0
5  6  1  0
6  7  2  0
7  8  1  0
8  9  2  0
3 10  1  0
9  4  1  0
M  END
> <MP>
140-142DEGREEC




Next, I’ll replace the ‘DEGREE’ with the corresponding Unicode characters. (I’ll use the long Unicode name to be explicit.)

>>> unicode_record = unicode_record.replace(u"DEGREE", u"\N{DEGREE SIGN}")
>>> print(unicode_record)
Beta-methylphenethylamine
RDKit

10 10  0  0  0  0  0  0  0  0999 V2000
....
M  END
> <MP>
140-142°C




Finally, I’ll save it to the file “latin1.sdf”, using the Latin-1 encoding:

>>> open("latin1.sdf", "wb").write(unicode_record.encode("latin1"))
948


(The “948” indicates that 948 bytes were written to the file.)

This is not valid UTF-8. In my terminal, the MP tag value looks like:

> <MP>
140-142�C


where the “�” is the special symbol for REPLACEMENT CHARACTER, meaning that the actual character cannot be shown.

What happens if I read the file using each of the native toolkit APIs? First, OEChem under both Python 2.7 and Python 3.8:

>>> from openeye.oechem import *
>>> ifs = oemolistream("latin1.sdf")
>>> mol = OEGraphMol()
True
>>> OEGetSDData(mol, "MP")  # OEChem on Python 2.7
'140-142\xb0C'

>>> OEGetSDData(mol, "MP")  # OEChem on Python 3.8
'140-142\udcb0C'


Remember, OEGetSDData() on Python 2.7 returns byte strings, and you’ll need to decode that string manually to get the degree symbol. While OEGetSDData() on Python 3.8 returns Unicode strings, but the byte “\xb0” is not a valid UTF-8 encoding. Instead, OEChem uses the Unicode codepoint “\udcb0”. This is a surrogate for the actual character, and something I don’t fully understand. Various sources say this is a UTF-16 behavior which isn’t correct UTF-8. Python doesn’t like it:

>>> print('140-142\udcb0C')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcb0' in position 7: surrogates not allowed


Next, Open Babel under both Python 2.7 and Python 3.6:

>>> from openbabel import openbabel as ob
>>> conv = ob.OBConversion()
>>> mol = ob.OBMol()
True
>>> mol.GetData("MP").GetValue()  # Open Babel on Python 2.7
'140-142\xb0C'

>>> mol.GetData("MP").GetValue()  # Open Babel on Python 3.8
'140-142\udcb0C'


Open Babel gives exactly the same results as OEChem.

Finally, RDKit:

>>> from rdkit import Chem
>>> supplier = Chem.ForwardSDMolSupplier("latin1.sdf")
>>> mol = next(supplier)
>>> mol.GetProp("MP")      # RDKit on Python 2.7
'140-142\xb0C'

>>> mol.GetProp("MP")      # RDKit on Python 3.8
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte


RDKit doesn’t give a surrogate value for the illegal UTF-8 character. Instead, it complains. Which also means there is no way to get that data from Python.

What do you do if you have to read a Latin-1 encoded SD file? One solution is to use an external tool like iconv to translate the file to UTF8.

% iconv -f latin1 -t utf-8 < latin1.sdf > utf8.sdf


Another is to use Python to convert the entire file from Latin-1 to UTF8 then pass the transcoded contents to the toolkit:

>>> content = open("latin1.sdf", "rb").read()
>>> content = content.decode("latin1").encode("utf8")
>>>
>>> import chemfp
>>> for tk in ("openbabel", "openeye", "rdkit"):
...   T = chemfp.get_toolkit(tk)
...   for mol in T.read_molecules_from_string(content, "sdf"):
...     print(tk, T.get_tag(mol, "MP"))
openbabel 140-142°C
openeye 140-142°C
rdkit 140-142°C


But if all you want is some of the tag data values, and not the molecule, then you can ask the text_toolkit to read the record as a “latin1” encoded file:

>>> from chemfp import text_toolkit
>>> for mol in text_toolkit.read_molecules("latin1.sdf", encoding="latin1"):
...   print(mol.get_tag("MP"))
...
140-142°C


The content is converted on-demand, that is, only when get_id() or get_tag() are called. The text_toolkit’s “molecule” stores the encoding so it knows how to decode the fields:

>>> mol.encoding
'latin1'


By the way, if you omit the ‘encoding=”latin1”’ parameter then you’ll get an exception:

>>> for mol in text_toolkit.read_molecules("latin1.sdf"):
...   print(mol.get_tag("MP"))
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "chemfp/_text_toolkit.py", line 209, in get_tag
return get_sdf_record_tag(self.record, tag, self.encoding, self.encoding_errors)
File "chemfp/_text_toolkit.py", line 1445, in get_sdf_record_tag
return value.decode(encoding, encoding_errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte


Mixed encodings and raw bytes¶

In this section you’ll learn how to get access to the id and tag data as byte strings rather than Unicode strings. This might be used if you have a perverse file which uses multiple encodings. If you run into that case, let me know - I’ll give you a sympathy prize for having to deal with it.

In the previous section you learned a few ways to read an Latin-1 encoded SD file. What happens if the title line contains an id which is UTF-8 encoded while the tag data contains a Latin-1 encoded value? (Or if you have to deal with a ‘clever’ programmer who put in semi-binary data into a data field. Because that’s the sort of thing we clever programmers sometimes do.)

The techniques I mentioned in that previous section won’t work because they assume the entire file has the same encoding.

Instead, use the text_toolkit to read the file, but access it through the byte API rather than the string API.

I need an example file. I’ll start with the “latin1.sdf” file I created for the previous section, which uses a Latin-1 encoded degree symbol in the “MP” tag data. I’ll modify it so the “Beta” in the title line is replaced by the UTF-8 encoded “β” character.

>>> content = open("latin1.sdf", "rb").read()
>>> mixed_content = content.replace(b"Beta", u"\N{GREEK SMALL LETTER BETA}".encode("utf8"))
>>> open("mixed.sdf", "wb").write(mixed_content)
946


On a UTF-8 terminal the title line and the MP tag data line are respectively:

On a Latin-1 terminal they are:

How do I get their “real” values? I’ll use the text_toolkit to read the first record from the file:

>>> from chemfp import text_toolkit
>>> mol
SDFRecord(id_bytes=b'\xce\xb2-methylphenethylamine'(id='β-methylphenethylamine'),
record=b'\xce\xb2-methylphenethylamine\n     RDKit          \n\n 10 10  0   ...',
encoding='utf8', encoding_errors='strict')


The title line is in utf8 so that’s not a problem

>>> print(mol.id)
β-methylphenethylamine


But I won’t be able to read the “MP” field because it’s not UTF-8 encoded:

>>> mol.get_tag("MP")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 7: invalid start byte


Instead, I’ll use get_tag_as_bytes() to get the underlying bytes for the named tag, rather than as converted to a Unicode string:

>>> mol.get_tag_as_bytes(b"MP")
'140-142\xb0C'


Once I have the bytes, I can decode them as Latin-1:

>>> print(mol.get_tag_as_bytes(b"MP").decode("latin1"))
140-142°C


Note that this function requires the tag name be the byte string which is found in the file. A Unicode name (which is the default string type under Python 3) will raise an exception:

>>> mol.get_tag_as_bytes(u"MP")
Traceback (most recent call last):
...
ValueError: tag must be a byte string or None


Use method get_tag_pairs_as_bytes() to get the list of all (tag, data) pairs, where both the tag and data are return as byte strings.

>>> mol.get_tag_pairs_as_bytes()
[(b'MP', b'140-142\xb0C')]


Finally, use id_bytes to get the raw bytes for the identifier:

>>> mol.id_bytes
b'\xce\xb2-methylphenethylamine'


For example, if I read the file as Latin-1 then the Unicode id “MP” tag be what I expected, the id won’t be correct. Instead, I can get the id_bytes and decode it manually as UTF-8:

>>> mol2 = next(text_toolkit.read_molecules("mixed.sdf", encoding="latin1"))
>>> print(mol2.get_tag("MP"))
140-142°C
>>> mol2.id
'Î²-methylphenethylamine'
>>>
>>> print(mol2.id_bytes.decode("utf8"))
β-methylphenethylamine