Toolkit API examples

This chapter gives many examples of how to use the toolkit API. For an overview of the toolkit API functions, see chemfp.toolkit. For details about actual toolkit implementations, see chemfp.openeye_toolkit, chemfp.openbabel_toolkit, chemfp.rdkit_toolkit, and chemfp.text_toolkit.

Fingerprint search usually starts with a structure record, and not a fingerprint. The functions chemfp.read_molecule_fingerprints() and chemfp.read_molecule_fingerprints_from_string() give a quick way to read a file or string containing structure records as the corresponding fingerprints.

Sometimes you want more control over the process. You might want to generate multiple fingerprints for the same structure and not want to reparse the structure record multiple times. Or you might want to return the search results as extra fields to the query SDF record instead of a simple list of values.

Chemfp uses a third-party chemistry toolkit to parse the records into a molecule, or compute the fingerprint for a given molecule. It’s not hard to write your own Open Babel, OEChem/OEGraphSim, or RDKit code to handle any of these or similar tasks. The problem comes in when you want to mix multiple fingerprint types, like to compare the default RDKit fingerprint to Open Babel’s FP2 fingerprint. You end up writing very different code for essentially the same fingerprint task.

There’s an old saying in computer science; all problems can be solved with another layer of indirection. The chemfp toolkit API follows that tradition. It’s a common API for molecule I/O which works across the three supported toolkits. It’s also my best effort at implementing a next generation API.

Bear in mind that it is only an I/O API. Chemfp is a fingerprint toolkit and it will not gain a common molecule API. For that, look toward Cinfony.

Get a chemfp toolkit

In this section you’ll learn how to load a “toolkit” – a portable API layer above the actual chemistry toolkit – and how to check if a toolkit is available and has a valid license.

Each toolkit I/O adapter is implemented as a chemfp submodule. If you know the underlying chemistry toolkit is installed you can import the adapter directly:

>>> from chemfp import openbabel_toolkit
>>> from chemfp import openeye_toolkit
>>> from chemfp import rdkit_toolkit
>>> from chemfp import cdk_toolkit

Use chemfp.get_toolkit_names() to get the available toolkit names:

>>> chemfp.get_toolkit_names()
set(['cdk', 'openbabel', 'openeye', 'rdkit'])

This will try to import each module, which means it may take a second or more depending on the shared library load time for your system. (This overhead only occurs once.) The function returns a list of the modules that could be loaded and have a valid license.

You can use chemfp.get_toolkit() to get the correct toolkit module given a name; it raises a ValueError if the underlying toolkit isn’t installed or the toolkit name is unknown:

>>> chemfp.get_toolkit("rdkit")
<module 'chemfp.rdkit_toolkit' from 'chemfp/rdkit_toolkit.py'>
>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
>>> chemfp.get_toolkit("openbabel")
<module 'chemfp.openbabel_toolkit' from 'chemfp/openbabel_toolkit.py'>
>>> chemfp.get_toolkit("cdk")
<module 'chemfp.cdk_toolkit' from 'chemfp/cdk_toolkit.py'>

Existence isn’t enough to know if you can use a toolkit. OEChem requires a license. Each I/O adapter implements chemfp.toolkit.is_licensed(). It returns True for Open Babel and RDKit and the value of OEChemIsLicensed() for OEChem:

>>> for name in chemfp.get_toolkit_names():
...   T = chemfp.get_toolkit(name)
...   print(f"Toolkit {T.name!r} ({T.software}) is licensed? {T.is_licensed()}")
...
Toolkit 'cdk' (CDK/2.8) is licensed? True
Toolkit 'openbabel' (OpenBabel/3.1.0) is licensed? True
Toolkit 'openeye' (OEChem/20220607) is licensed? True
Toolkit 'rdkit' (RDKit/2021.09.4) is licensed? True

(Thanks OpenEye for an no-cost developer license to their toolkit! Thanks to the Open Babel, RDKit, and CDK developers for an open source license to their toolkits!)

There is currently no way to check if OEGraphSim is licensed; you’ll need to use native OpenEye code instead.

For fun I also showed the software attribute, which gives more detailed information about the toolkit version in a standardized format.

There is still another way to get access to the toolkit modules. The special objects chemfp.rdkit, chemfp.cdk, chemfp.openeye, chemfp.openbabel will automatically import and redirect to the underlying module, without need for an explicit import:

>>> chemfp.rdkit.morgan2.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> chemfp.cdk.pubchem.from_smiles("CCO")[:8]
b'\x00\x02\x04\x00\x00\x00\x00\x00'
>>> chemfp.openeye.tree.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'
>>> chemfp.openbabel.fp2.from_smiles("CCO")[:8]
b'\x00\x00\x00\x00\x00\x00\x00\x00'

These objects should not be imported into your module because it’s all too easy to confuse them with the actual toolki module name. Instead, either use chemfp.<toolkit> or import the actual chemfp wrapper module.

Finally, use chemfp.has_toolkit() to check if a toolkit is available. In the following, I used one of my local testing environments which has OEChem installed but not the other toolkits. (I use venv to create and manage these virtual environments; it’s a very useful tool.):

>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False
>>> chemfp.has_toolkit("rdkit")
False
>>> chemfp.has_toolkit("cdk")
False

The other option is to catch the ValueError raised when trying to get an unavailable toolkit:

>>> chemfp.get_toolkit("openeye")
<module 'chemfp.openeye_toolkit' from 'chemfp/openeye_toolkit.py'>
  >>> chemfp.get_toolkit("cdk")
Traceback (most recent call last):
  File "chemfp/__init__.py", line 2188, in get_toolkit
    from . import cdk_toolkit as tk
  File "chemfp/cdk_toolkit.py", line 37, in <module>
    import jpype # Must install JPype so chemfp can interface to the CDK jar
    ^^^^^^^^^^^^
ModuleNotFoundError: No module named 'jpype'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/__init__.py", line 2191, in get_toolkit
    raise ValueError("Unable to get toolkit %r: %s" % (toolkit_name, err))
ValueError: Unable to get toolkit 'cdk': No module named 'jpype'
>>>
>>> chemfp.get_toolkit("schrodinger")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dalke/cvses/cfp-4x/chemfp/__init__.py", line 2199, in get_toolkit
    raise ValueError("Toolkit %r is not supported" % (toolkit_name,))
ValueError: Toolkit 'schrodinger' is not supported

This is a bit more complicated to do, but does have the advantage of giving access to an error message.

Parse and create SMILES

In this section you’ll learn how to parse a SMILES into a molecule, convert a molecule into a SMILES, and the difference between a SMILES record and a SMILES string. You will need a chemistry toolkit for this and most of the examples in this chapter.

The chemfp toolkit I/O API is the same across the different toolkits, and examples with one will generally work with the other, except for essential differences like the specific formats available, the chemistry differences in how to interpret a record, the error messages, and reader and writer arguments.

For most examples I’ll use T as the toolkit module name, rather than specify a specific toolkit. My examples will be based on RDKit, but you can use any of the following, if available on your system:

# Choose one of these
from chemfp import openeye_toolkit as T
from chemfp import openbabel_toolkit as T
from chemfp import rdkit_toolkit as T
from chemfp import cdk_toolkit as T

I’ll parse the SMILES string for phenol as a toolkit molecule, then convert the toolkit molecule into its canonical isomeric SMILES string.

>>> from chemfp import rdkit_toolkit as T
>>> T.translate_record("c1ccccc1O", in_format="smistring", out_format="smistring")
'Oc1ccccc1'

The “smistring” format name means that the input is a SMILES string and there is no identifier.

The default input and output format is “smi”, which processes a “record” of a SMILES file, which contains both a SMILES and an optional identifier:

>>> T.translate_record("c1ccccc1O phenol")
'Oc1ccccc1 phenol\n'

The translate_record function is a helper based on lower-level components to parse a record into a molecule object, and to create a record from a molecule object. Here’s what those look like, using chemfp.toolkit.parse_molecule`and :func:`chemfp.toolkit.create_string():

>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x103559980>
>>> T.create_string(mol, "smistring")
'Oc1ccccc1'

Chemfp follows the rule from the original SMILES paper that the SMILES string ends at the first whitespace. The following is valid across the chemfp toolkits API even if the underlying toolkit doesn’t accept the “junk” as part of a SMILES:

>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")

On the other hand, if you have a SMILES record, which is a SMILES string followed by an id and possibly other fields, then use the “smi” format name. That will parse the first characters as a SMILES string and parse the rest of the input, up to the end of the line, as the record id:

>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")
>>> T.get_id(mol) is None
True
>>> mol = T.parse_molecule("c1ccccc1O junk", "smi")
>>> T.get_id(mol)
'junk'
>>> mol = T.parse_molecule("c1ccccc1O flotsam and jetsam\nand more\n", "smi")
>>> T.get_id(mol)
'flotsam and jetsam'

I used the chemfp.toolkit.get_id() helper function. While chemfp doesn’t have a common molecule object, I found I do need a few standard functions to manipulate toolkit molecules. Also, toolkit.parse_molecule() will only read the first record and ignore trailing data, which is why the “and more” didn’t affect anything.

Now that the molecule has an id, it’s easy to see the difference between the “smistring” and “smi” in the output string:

>>> T.create_string(mol, "smistring")
'Oc1ccccc1'
>>> T.create_string(mol, "smi")
'Oc1ccccc1 flotsam and jetsam\n'

SMILES, SDF and InChI are so common that there are special toolkit functions to parse and create records in those formats, and to read and write files in those formats. Here is another way to parse and create the phenol record as “smi” format:

>>> mol = T.parse_smi("c1ccccc1O phenol")
>>> T.create_smi(mol)
'Oc1ccccc1 phenol\n'

Finally, you can pass an alternate id to the toolkit.create_string() function. One example of when this is useful is when your identifier comes from one field of a database and the SMILES string from another, and you want to combine the results to get an SDF record:

>>> T.create_string(mol, "smi", id="nothing to see here")
'Oc1ccccc1 nothing to see here\n'

WARNING: Chemfp’s toolkit wrapper implementation may temporarily change then restore the toolkit molecule’s own identifier in order to get the correct output. This is not thread-safe.

Canonical, non-isomeric, and arbitrary SMILES

In this section you’ll learn the difference between the “smistring”, “canstring”, and “usmstring” SMILES string formats and the “smi”, “can”, and “usm” SMILES record formats. As with all examples which use the generic T toolkit name, you’ll need one of the supported chemistry toolkits, and I’ll use RDKit as my underlying toolkit.

The SMILES format supports many different ways to represent the same molecule. “CO”, “OC”, “[OH][CH3]”, and “C3.O3” are four different SMILES strings for methanol. A canonicalization algorithm uses additional rules to create a unique SMILES representation for a given molecular graph. The different chemistry toolkit have different canonicalization algorithms, so each toolkit will likely generate a different canonical SMILES string for the same molecular graph.

There are multiple classes of canonical SMILES strings even in the same toolkit. The original SMILES format did not handle isotopes, chirality, or stereochemistry. The later extension to support these was called “isomeric SMILES”, to distinguish it from the original SMILES.

Because of the history, when people asked a toolkit for “SMILES” output they got non-isomeric non-canonical SMILES, while “canonical SMILES” gave them “non-isomeric canonical”. This caused subtle usability errors. Many people, including people like me who should have the experience to know better, expect canonical isomeric SMILES by default. But for over 20 years nearly all of the toolkits followed Daylight’s lead in how they did things.

I learned about the problem when OEChem 2.0 broke with tradition and fixed the mistake. It defined the default SMILES as canonical isomeric SMILES. If you make the effort to ask for a canonical SMILES you get canonical non-isomeric SMILES, and if you really want non-canonical, non-isomeric SMILES you can ask for the “usm” format.

Year later I learned that that Open Babel did the right thing well before OpenEye. Open Babel’s “canonical” is isomeric SMILES, you must specify the “i” option to not include isotopic or chiral markings, and they don’t even refer to “isomeric SMILES”.

Chemfp follows OpenEye’s naming convention. The “smistring” format generates a canonical isomeric SMILES string, the “canstring” format generates a canonical non-isomeric SMILES string, and the “usmstring” format generates a non-canonical non-isomeric SMILES string:

>>> from chemfp import openeye_toolkit as T  # use your toolkit of choice
>>>
>>> mol = T.parse_smistring("[235P].[238U]")
>>> T.create_string(mol, "smistring")
'[235P].[238U]'
>>> T.create_string(mol, "canstring")
'[P].[U]'
>>> T.create_string(mol, "usmstring")
'[P].[U]'

Here’s evidence that the “usmstring” format is non-canonical:

>>> mol = T.parse_smistring("[238U].[235P]")
>>> T.create_string(mol, "usmstring")
'[U].[P]'
>>> T.create_string(mol, "smistring")
'[235P].[238U]'

These conventions also apply when creating “smi”, “can”, and “usm” strings:

>>> T.set_id(mol, "radioactive")
>>> T.create_string(mol, "smi")
'[235P].[238U] radioactive\n'
>>> T.create_string(mol, "can")
'[P].[U] radioactive\n'
>>> T.create_string(mol, "usm")
'[U].[P] radioactive\n'

By the way, chemfp.toolkit.parse_molecule() doesn’t distinguish between “smi”, “can” and “usm” as input SMILES records, nor between “smistring”, “canstring” and “usmstring”. The format only makes a difference for output. Later on you’ll see how to specify writer_args to have more fine-grained control over the output SMILES format. (See RDKit-specific SMILES reader_args and writer_args, OpenEye-specific SMILES reader_args and writer_args, Open Babel-specific SMILES reader_args and writer_args, and CDK-specific SMILES reader_args and writer_args for toolkit-specific examples.)

Use format to create a record in SDF format

In this section you’ll learn how to convert a toolkit molecule into an SDF record. This example will use the RDKit toolkit but the results will be substantially the same for any of the three supported chemistry toolkits.

To create an SDF record as a Unicode string, pass “sdf” as the format to chemfp.toolkit.create_string() or use the create_sdf function:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print(T.create_string(mol, "sdf"))

     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
$$$$
>>> print(T.create_sdf(mol))

     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
$$$$

If you want it in V3000 format, specify the “sdf3k” format or use the create_sdf3k function:

>>> print(T.create_sdf3k(mol))

     RDKit

  0  0  0  0  0  0  0  0  0  0999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 2 1 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0.000000 0.000000 0.000000 0
M  V30 2 O 0.000000 0.000000 0.000000 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 1 1 2
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

There is also a chemfp.toolkit.create_bytes() function which returns the record as UTF-8 encoded Unicode, which all of the supported cheminformatics toolkits use internally.

I’ll set the molecule’s name to the lower-case Greek letter ‘alpha’, and show you the interactive output from Python:

>>> T.set_id(mol, u"\N{GREEK SMALL LETTER ALPHA}")
>>> T.create_string(mol, "sdf")[:13]
'α\n     RDKit '
>>> T.create_bytes(mol, "sdf")[:13]
b'\xce\xb1\n     RDKit'
>>> print(T.create_sdf(mol)[:13])
α
     RDKit

Use zlib record compression

In this section you’ll learn about the “zlib” compression option for single record parsers and byte string creation.

A record in SDF format can be large, but most of the content is repetetive. Often it’s better to store a zlib compressed record in a database instead of the full record. When I use zlib to compress each record of Compound_099000001_099500000.sdf.gz I get a 4.5-fold compression. That is, the uncompressed records take 73,024,092 bytes, the individually compressed records take 16,262,567 bytes, and the gzip compressed file takes 6,847,342 bytes. (Gzip is twice as good as individually compressed records because it can collect compression statistics across multiple records and build a better prediction model.)

Chemfp supports a zlib compression option for the record-oriented functions, though not the file-oriented functions. To enable it, add “.zlib” to the format string for chemfp.toolkit.create_bytes(). Here you can see how adding that suffix reduces the record size:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print("uncompressed:", len(T.create_bytes(mol, "sdf")))
uncompressed: 228
>>> print("compressed:", len(T.create_bytes(mol, "sdf.zlib")))
compressed: 77

I’ll complete a round-trip conversion by parsing the compressed SD record to a molecule then converting it to a SMILES string:

>>> compressed = T.create_bytes(mol, "sdf.zlib")
>>> new_mol = T.parse_molecule(compressed, "sdf.zlib")
>>> T.create_string(new_mol, "smistring")
'CO'

The zlib option only works with create_bytes; it does not work with create_string because the latter only returns Unicode strings, and it’s possible for zlib to return something which isn’t valid Unicode. Here’s what happens if you try to use it anyway:

>>> T.create_string(mol, "sdf.zlib")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 419, in create_string
    return _toolkit.create_string(mol, format, id, writer_args, errors)
  File "chemfp/base_toolkit.py", line 1382, in create_string
    raise ValueError("create_string() does not support compression. Use create_bytes()")
ValueError: create_string() does not support compression. Use create_bytes()

On the other hand, chemfp.toolkit.parse_molecule() takes both Unicode strings and byte strings as input. It treats byte strings as being UTF-8 encoded.

Use zst record compression

In this section you’ll learn about the “zst” compression option for single record parsers and byte string creation.

Chemfp 3.4 added support for ZStandard compression in most places, including in the record-oriented functions, via the suffix “.zst” in the format name or filename. The following compares zlib and zst compression to the uncompressed size:

import chemfp
for toolkit_name in ("text", "rdkit", "openbabel", "openeye"):
  T = chemfp.get_toolkit(toolkit_name)
  with T.read_molecules("Compound_099000001_099500000.sdf.gz",
        reader_args={"rdkit.sdf.removeHs": False}) as reader:
    uncompressed_size = zlib_size = zst_size = 0
    for mol in reader:
      uncompressed_size += len(T.create_bytes(mol, "sdf"))
      zlib_size += len(T.create_bytes(mol, "sdf.zlib"))
      zst_size += len(T.create_bytes(mol, "sdf.zst"))
    print("%r toolkit: uncompressed: %d zlib: %d (%.2f) zstd: %d (%.2f)" % (
       toolkit_name, uncompressed_size, zlib_size, uncompressed_size/zlib_size,
        zst_size, uncompressed_size/zst_size))

The output of the above is:

'text' toolkit: uncompressed: 73024092 zlib: 16262567 (4.49) zstd: 16976598 (4.30)
'rdkit' toolkit: uncompressed: 68180103 zlib: 15843096 (4.30) zstd: 16714426 (4.08)
'openbabel' toolkit: uncompressed: 73392094 zlib: 16295123 (4.50) zstd: 16985140 (4.32)
'openeye' toolkit: uncompressed: 73024092 zlib: 16269883 (4.49) zstd: 16977089 (4.30)

By default OEChem and Open Babel will keep hydrogens while RDKit removes them, which makes the output SD files considerably smaller. The reader_args specifies rdkit.sdf.removeHs so RDKit will keep the hydrogens, which makes the size comparisons more direct. The total RDKit size is still smaller than the other toolkits because RDKit only writes 4 columns for each bond, while the others use 7 columns.

Remember, compression effectiveness is a balance between compression time, compressed size, and decompression time. The zlib, gzip, and zst compression methods all support different compression levels. For zlib and gzip, 1 results in faster compression time but generally larger compressed sizes, and 9 gives the best compression at the cost of decreased performance. Zstandard also uses 1 for faster compression but uses 19 to get the maximum compression.

The compression level can be specified using the level argument of the chemfp functions which support compressed output, like chemfp.toolkit.create_bytes(), chemfp.toolkit.open_molecule_writer(), and save(). It can be the numeric compression level, or the words “min” for minimum compression, “default” for default (for zlib and gzip, 3 for zstd), and “max” for maximum compression at the expense of time.

Get a list of available formats and distinguish between input and output formats

In this section you’ll learn how to get the list of available formats for each object, and determine if a format can be used to get a toolkit molecule from a string record, or convert a toolkit molecule into a string record.

The toolkit’s chemfp.toolkit.get_formats() function returns a list of the available formats. On my computer RDKit supports 23 formats, OEChem 34, Open Babel (showing off its heritage) supports a whopping 200, and CDK also has 23:

>>> from chemfp import rdkit_toolkit
>>> len(rdkit_toolkit.get_formats())
20
>>> rdkit_toolkit.get_formats()
[Format('rdkit/smi'), Format('rdkit/can'), Format('rdkit/usm'),
Format('rdkit/cxsmi'), Format('rdkit/sdf'), Format('rdkit/sdf3k'),
Format('rdkit/smistring'), Format('rdkit/canstring'),
Format('rdkit/usmstring'), Format('rdkit/cxsmistring'),
Format('rdkit/molfile'), Format('rdkit/rdbinmol'),
Format('rdkit/fasta'), Format('rdkit/sequence'),
Format('rdkit/helm'), Format('rdkit/mol2'), Format('rdkit/pdb'),
Format('rdkit/xyz'), Format('rdkit/mae'), Format('rdkit/inchi'),
Format('rdkit/inchikey'), Format('rdkit/inchistring'),
Format('rdkit/inchikeystring')]
>>>
>>> from chemfp import openeye_toolkit
>>> len(openeye_toolkit.get_formats())
34
>>> openeye_toolkit.get_formats()
[Format('openeye/smi'), Format('openeye/usm'),
Format('openeye/can'), Format('openeye/cxsmi'),
Format('openeye/sdf'), Format('openeye/sdf3k'),
Format('openeye/molfile'), Format('openeye/skc'),
Format('openeye/mol2'), Format('openeye/mol2h'),
Format('openeye/sln'), Format('openeye/mmod'),
Format('openeye/pdb'), Format('openeye/xyz'), Format('openeye/cdx'),
Format('openeye/mopac'), Format('openeye/mf'),
Format('openeye/oeb'), Format('openeye/inchi'),
Format('openeye/inchikey'), Format('openeye/oez'),
Format('openeye/cif'), Format('openeye/mmcif'),
Format('openeye/fasta'), Format('openeye/sequence'),
Format('openeye/csv'), Format('openeye/json'),
Format('openeye/smistring'), Format('openeye/canstring'),
Format('openeye/usmstring'), Format('openeye/cxsmistring'),
Format('openeye/slnstring'), Format('openeye/inchistring'),
Format('openeye/inchikeystring')]
>>>
>>> from chemfp import openbabel_toolkit
>>> len(openbabel_toolkit.get_formats())
200
>>> openbabel_toolkit.get_formats()
[Format('openbabel/smi'), Format('openbabel/can'),
 Format('openbabel/usm'), Format('openbabel/cxsmi'),
 Format('openbabel/smistring'), Format('openbabel/canstring'),
 Format('openbabel/usmstring'), Format('openbabel/cxsmistring'),
 Format('openbabel/sdf'), Format('openbabel/sdf3k'),
 Format('openbabel/inchi'), Format('openbabel/inchikey'),
 Format('openbabel/inchistring'),
 Format('openbabel/inchikeystring'), Format('openbabel/CONTFF'),
   ... many formats omitted ...
Format('openbabel/molreport'), Format('openbabel/orca')]
>>>
>>> from chemfp import cdk_toolki
>>> len(cdk_toolkit.get_formats())
23
>>> cdk_toolkit.get_formats()
[Format('cdk/smi'), Format('cdk/can'), Format('cdk/usm'),
Format('cdk/cxsmi'), Format('cdk/sdf'), Format('cdk/sdf3k'),
Format('cdk/smistring'), Format('cdk/canstring'),
Format('cdk/usmstring'), Format('cdk/cxsmistring'),
Format('cdk/molfile'), Format('cdk/inchi'), Format('cdk/inchikey'),
Format('cdk/inchistring'), Format('cdk/inchikeystring'),
Format('cdk/pdb'), Format('cdk/cdkjava'), Format('cdk/cml'),
Format('cdk/ctx'), Format('cdk/crystclust'), Format('cdk/gamess'),
Format('cdk/mol2'), Format('cdk/xyz')]

Some of these formats, like “can”, “inchistring”, and “sdf3k” are chemfp names for special implementations or non-standard format configurations, usually meant to have a portable way to describe a given configuration.

I’ll use chemfp.toolkit.get_format(), which returns a chemfp.base_toolkit.Format, to get the “sdf” format for OpenEye (if you don’t have access to OEChem, use one of the other toolkits instead):

>>> import chemfp
>>> sdf_format = chemfp.openeye.get_format("sdf")
>>> sdf_format.name
'sdf'
>>> sdf_format.toolkit_name
'openeye'

The “sdf” format can be used for both input and output in all toolkits:

>>> sdf_format.is_input_format, sdf_format.is_output_format
(True, True)

However, some formats are output only, like the InChIKey format (assuming it’s available for your toolkit):

>>> inchi_fmt = chemfp.openeye.get_format("inchikey")
>>> inchi_fmt.is_input_format, inchi_fmt.is_output_format
(False, True)

On the other hand, some formats are input only, like Open Babel’s support for MOPAC’s output format:

>>> mopout_fmt = chemfp.openbabel.get_format("mopout")
>>> mopout_fmt.is_input_format, mopout_fmt.is_output_format
(True, False)

Instead of asking for all available formats, you can ask for only the input formats, or only the output formats, using chemfp.toolkit.get_input_formats or chemfp.toolkit.get_output_formats:

>>> import chemfp
>>> for toolkit_name in ("openbabel", "openeye", "rdkit", "cdk"):
...   T = chemfp.get_toolkit(toolkit_name)
...   print(f"{toolkit_name} has {len(T.get_input_formats())} input formats")
...   print(f"{toolkit_name} has {len(T.get_output_formats())} output formats")
...
openbabel has 157 input formats
openbabel has 145 output formats
openeye has 28 input formats
openeye has 33 output formats
rdkit has 20 input formats
rdkit has 21 output formats
cdk has 20 input formats
cdk has 21 output formats

Determine the format for a given filename

It’s sometimes useful to know what format will be used for a given filename. A filename can be used as a source for a reader or destination for a writer, and a toolkit might understand a given format when used as input but not as ouput, or vice-versa.

The function chemfp.toolkit.get_input_format_from_source() returns a chemfp.base_toolkit.Format for the given filename:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> T.get_input_format_from_source("abc.smi.gz")
Format('rdkit/smi.gz')

This is the same Format object you saw in the previous section. I didn’t mention the compression attribute in that discussion. It’s “gz” for gzip-ed files, “zst” for zstandard compressed files, and the empty string “” for uncompressed files.

>>> fmt = T.get_input_format_from_source("abc.smi.gz")
>>> fmt.name
'smi'
>>> fmt.compression
'gz'
>>>
>>> fmt = T.get_input_format_from_source("abc.smi")
>>> fmt.name
'smi'
>>> fmt.compression
''

Asking for a supported format which isn’t an input format raises a ValueError exception:

>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/openbabel_toolkit.py", line 165, in get_input_format_from_source
  return _format_registry.get_input_format_from_source(source, format)
File "chemfp/base_toolkit.py", line 805, in get_input_format_from_source
  format_config = self.get_input_format_config(register_name)
File "chemfp/base_toolkit.py", line 728, in get_input_format_config
  raise ValueError("%s does not support %r as an input format"
ValueError: Open Babel does not support 'inchikey' as an input format

even though “inchikey” is supported as an output format:

>>> openbabel_toolkit.get_output_format_from_destination("example.inchikey")
Format('openbabel/inchikey')

Yes, there’s a different function to get the format name for a source filename than for a destination filename. Maybe in the future I’ll support a generic get_format_from_filename(); let me know if that would be useful.

If you ask for a format which doesn’t exist then the functions raise a different ValueError exception:

>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
     .....
  File "chemfp/base_toolkit.py", line 718, in get_format_config
    raise ValueError("%s does not support the %r format"
ValueError: Open Babel does not support the 'does-not-exist' format

I’ve found it useful to have a way to override the default guess. It’s amazing how many people use “.dat” for SMILES or SDF files, and “.txt” files for SMILES. The format lookup functions support a second, optional parameter, which is the format name to use.

>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist", "smi.gz")
Format('openbabel/smi.gz')

This exists so that code like:

if format is not None:
    fmt = T.get_format(format)
else:
    fmt = T.get_format_from_source(filename)

can be replaced with:

fmt = T.get_format_from_source(filename, format)

Working with a format object is useful when combined with format’s reader_args and writer_arg functions discussed in Specify a SMILES delimiter through reader_args

>>> fmt = openbabel_toolkit.get_input_format_from_source("input.smi")
>>> fmt.get_default_writer_args()
{'options': None, 'isomeric': True, 'canonicalization': 'default',
'explicit_hydrogens': False, 'delimiter': None}
>>> fmt.get_writer_args_from_text_settings({
...    "explicit_hydrogens": "true",
...    "isomeric": "false",
...    "delimiter": "tab"})
{'isomeric': False, 'explicit_hydrogens': True, 'delimiter': 'tab'}

Parse the id and the molecule at the same time

In this section you’ll learn how to parse a structure record, as a string, to extract both the identifier and the native molecule object.

Usually you want both the molecule and its id. You could parse the molecule then use T.get_id(mol) to get the id, but that’s extra work, it leads to awkward looking code, and is slower than having chemfp do the work for you when it parses the molecule.

Instead, use chemfp.toolkit.parse_id_and_molecule():

>>> from chemfp import rdkit_toolkit as T  # use your toolkit of choice
>>>
>>> T.parse_id_and_molecule("C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a", "smi")
('vitamin a', <rdkit.Chem.rdchem.Mol object at 0x1035f14b0>)

If there is no id/title field then the id will either be None or the empty string, depending on the toolkit and format:

>>> T.parse_id_and_molecule("C", "smi")
(None, <rdkit.Chem.rdchem.Mol object at 0x1035f14b0>)

Instead of testing for the empty string or None, your code you should use “if not id:” to test for a missing id:

>>> id, mol = T.parse_id_and_molecule("C", "smi")
>>> if not id:
...   print("Missing id!")
...
Missing id!

Specify alternate error behavior

In this section you’ll learn how to use the errors parameter to have chemfp.toolkit.parse_molecule() return None rather than raise an exception, and to have it print a report about the failing molecule.

The string “Q” is not a valid SMILES string. All of the toolkits will fail to parse it, and the chemfp toolkit I/O adapter by default raises an exception when that happens:

>>> import chemfp
>>> chemfp.openbabel.parse_smistring("Q")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       ...
  File "chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q'
>>>
>>> chemfp.rdkit.parse_molecule("Q", "smistring")
[16:02:55] SMILES Parse Error: syntax error while parsing: Q
[16:02:55] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
     ...
File "/Users/dalke/cvses/cfp-4x/chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: RDKit cannot parse the SMILES string 'Q'
>>>
>>> chemfp.openeye.parse_smistring("Q", "smistring")
Warning: Problem parsing SMILES:
Warning: Q
Warning: ^

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
      ...
  File "chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'Q'
>>>
>>> chemfp.cdk.parse_molecule("Q", "smistring")
Traceback (most recent call last):
  File "SmilesParser.java", line 251, in org.openscience.cdk.smiles.SmilesParser.parseSmiles
Exception: Java Exception
   ...
  File "/Users/dalke/cvses/cfp-4x/chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: CDK cannot parse the SMILES 'Q': unexpected character:
Q
^

On the other hand, “[NH8]” is a valid SMILES, but RDKit by default will reject it as chemically unreasonable, while OEChem and Open Babel are less strict and treat it as a molecular graph rather than a chemical molecule.

I’ll write a program which checks which toolkits will parse “[NH8]”

# I call this "check_NH8.py"
import chemfp
allowed = []; rejected = []
for name in chemfp.get_toolkit_names():
  T = chemfp.get_toolkit(name)
  try:
    T.parse_molecule("[NH8]", "smistring")
  except ValueError:
    rejected.append(name)
  else:
    allowed.append(name)
print("Allowed:", allowed, "Rejected:", rejected)
% python check_NH8.py
[22:01:05] Explicit valence for atom # 0 N, 8, is greater than permitted
Allowed: ['openeye', 'cdk', 'openbabel'] Rejected: ['rdkit']

I think the try/except/else is sometimes harder to understand than returning an error value, because it’s harder to see the control flow. I can ask chemfp.toolkit.parse_molecule() to ignore errors, which causes it to return a None object rather than raise an exception. turns the above loop into the following:

for name in chemfp.get_toolkit_names():
  T = chemfp.get_toolkit(name)
  mol = T.parse_molecule("[NH8]", "smistring", errors="ignore")
  if mol is None:
    rejected.append(name)
  else:
    allowed.append(name)
print("Allowed:", allowed, "Rejected:", rejected)

The errors option is more useful in later sections, when parsing multiple records.

The errors parameter can also take the value report. Like ignore, this will return a None when there is an error rather than raise an exception. It will also write a consistent, cross-toolkit error message to stderr, including the SMILES string that failed if the input is a SMILES:

>>> for name in chemfp.get_toolkit_names():
...   print(f"Using toolkit {name!r}")
...   T = chemfp.get_toolkit(name)
...   mol = T.parse_molecule("Q", "smistring", errors="report")
...   mol = T.parse_molecule("[NH8]", "smistring", errors="report")
...

The chemfp.toolkit.parse_id_and_molecule() function also takes the errors parameter. If the structure could not be parsed then the second component of the tuple (the molecule) will be None. The first component (the id) may or or may not be None, depending on the underlying implementation:

>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
[13:03:10] SMILES Parse Error: syntax error while parsing: Q
[13:03:10] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
(None, None)
>>>
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
Warning: Problem parsing SMILES:
Warning: Q q-ane
Warning: ^

(None, None)
>>>
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
==============================
*** Open Babel Error  in ParseSimple
  SMILES string contains a character 'Q' which is invalid
('q-ane', None)
>>>
>>> from chemfp import cdk_toolkit
>>> cdk_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
(None, None)

Future versions of chemfp may work to normalize this behavior, or let the caller choose a specific behavior.

Specify a SMILES delimiter through reader_args

In this section you’ll learn how to parse a SMILES record as a set of delimited fields instead of the default of a SMILES string followed by a title, and some of the limitations of chemfp’s attempt at a consistent cross-toolkit SMILES record parser.

You might think that the SMILES file format is well defined, but it sadly isn’t. Different toolkits have slightly different interpretations for a SMILES record format. Consider the SMILES record:

C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a

The original Daylight definition is that a SMILES record is single line, which starts with the SMILES string. The SMILES string ends with the first whitespace character or the end of the line, and if there was a whitespace character than the rest of the line is the title. OpenEye follows this definition, as does chemfp. That’s why the previous example extracted “vitamin A” as the record id.

However, RDKit treats a SMILES file record as a space or tab separated set of fields, where the first field is the SMILES, the second field is the id/title and additional columns may store other properties. RDKit would use “vitamin” as the record id for this record. (RDKit can also be configured to interpret the first line as column names. Chemfp does not currently support this option, though I plan to have a cross-platform implementation in a future release.)

Chemfp normalizes the SMILES record parser API so that all toolkits by default expect the Daylight format.

Warning

Future versions of chemfp may change the default to “tab” instead of “to-eol” because CXSMILES is becoming more common.

Use the optional reader_args dictionary to specify an alternate interpretation:

>>> from chemfp import rdkit_toolkit as T  # use your toolkit of choice
>>>
>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
>>> T.parse_id_and_molecule(smiles, "smi", reader_args={"delimiter": "whitespace"})
('vitamin', <rdkit.Chem.rdchem.Mol object at 0x10f5ccfa0>)

In this case I asked it to parse the record as a set of whitespace delimited fields. If you have tab-separated fields, where a space inside of a field is not part of the delimiter, then use the “tab” delimiter:

>>> T.parse_id_and_molecule("O=O\tmolecular oxygen\t31.9988\n", "smi",
...                          reader_args={"delimiter": "tab"})
('molecular oxygen', <rdkit.Chem.rdchem.Mol object at 0x10fbe9590>)

The supported delimiters are:

  • to-eol - (default) everything past the first whitespace is interpreted as the id/title;
  • tab or “\t” - the fields are tab-separated; the first field is the SMILES and the second the id;
  • space or ” ” - the fields are space-separated;
  • whitespace - the fields are whitespace-separated;
  • native - use the native interpretation for the given toolkit;

While chemfp strives for cross-toolkit portability, it is not perfect. Leading and trailing whitespace might not be supported, so the first character of the SMILES record must also be the first character of the SMILES string. Also, the toolkit is free to interpret the first whitespace as the delimiter despite the reader_args setting. In practice, as of early 2020, Open Babel, RDKit, and OEChem will stop at the first whitespace, though I suspect they will increasingly support the CXSMILES extensions.

Neither the SMILES parser nor the other parsers validate the full contents of the reader_args dictionary. Extra items are ignored. This is deliberate because it lets you combine, say, SMILES and SDF parameters in the same dictionary without needing to check the specific format first.

To a lesser extent, it also makes it easier to specify parameters which work across multiple toolkit versions. For example, the most recent version of OEChem’s SMILES parsers added a quiet option, which chemfp will support in the future. Your code can have a {“quiet”: True} without first checking to see if this version of chemfp is new enough to support the parameter.

WARNING: As a result, it’s very easy to specify a key with a typo, which is ignored, and not notice that it nothing happens.

WARNING #2: Really, I’ve been bitten by this a few times. Be extra cautious to check that you are using the right keys.

Specify an output SMILES delimiter through writer_args

In this section you’ll learn how to create a SMILES record with a tab character separating the SMILES from the title using the writer_args parameter of chemfp.toolkit.create_string().

By default create_string uses a space character to separate the SMILES from the rest of the id:

>>> from chemfp import rdkit_toolkit as T  # use your toolkit of choice
>>>
>>> mol = T.parse_molecule("O=O molecular oxygen\n", "smi")
>>> T.create_string(mol, "smi")
'O=O molecular oxygen\n'

To use a tab character instead, pass in a writer_args dictionary with a “delimiter” of “tab”:

>>> T.create_string(mol, "smi", writer_args={"delimiter": "tab"})
'O=O\tmolecular oxygen\n'

The writer_args delimiter also accepts “whitespace”, “space”, “to-eol” and the other values from reader_args. Only “tab” and “\t” will use a tab character as the delimiter; all of the the others will use a space character.

Warning

Future versions of chemfp may change the default to “tab” to better support the use of CXSMILES extensions.

Neither the SMILES writer nor the other writers validate the full contents of the writer_args dictionary. Extra items are ignored. This is deliberate because it lets you combine, say, SMILES and SDF parameters in the same dictionary without needing to check the specific format first. It also makes it easier to specify parameters which work across multiple toolkit versions.

WARNING: As a result, it’s very easy to specify a key with a typo, which is ignored, and not notice that it nothing happens.

WARNING #2: Really, I’ve been bitten by this a few times. Be extra cautious to check that you are using the right keys.

RDKit-specific SMILES reader_args and writer_args

In this section you’ll learn how to pass toolkit-specific parameters to the RDKit toolkit functions to parse and create a SMILES string. You will need the RDKit toolkit.

Earlier I showed that RDKit by default does a sanitization check to verify that the input is correct.

>>> from chemfp import rdkit_toolkit
>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", errors="ignore")
[16:31:55] Explicit valence for atom # 0 N, 8, is greater than permitted
>>> mol is None
True

The underlying RDKit code to parse a SMILES string, MolFromSmiles, takes a sanitize parameter. The default, True, tells it to do the sanitization step, while False disables it.

Use the reader_args dictionary to pass the sanitize parameter to the underlying toolkit function:

>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", reader_args={"sanitize": False})
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x107590a60>
>>> from rdkit import Chem
>>> Chem.MolToSmiles(mol)
'[NH8]'

Use the writer_args dictionary to pass toolkit-specific parameters to RDKit’s MolToSmiles:

>>> mol = rdkit_toolkit.parse_molecule("c1ccccc1[16OH]", "smistring")
>>> rdkit_toolkit.create_smistring(mol, "smistring")
'[16OH]c1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
...                  writer_args={"isomericSmiles": False})
'Oc1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
...                  writer_args={"kekuleSmiles": True, "allBondsExplicit": True})
'[16OH]-C1:C:C:C:C:C:1'

See Get the default reader_args or writer_args for a format for a description of how to get the default reader and writer arguments for a given format, and use help(rdkit_toolkit.read_molecules) and help(rdkit_toolkit.open_molecule_writer) to get a more human-readable description.

The helper functions like parse_smistring and parse_sdf take keyword arguments instead of reader_args, which is easier if know you will target that exact toolkit and format:

>>> rdkit_toolkit.parse_smistring("[NH8]", sanitize=False)
<rdkit.Chem.rdchem.Mol object at 0x15b2b1d20>

OpenEye-specific SMILES reader_args and writer_args

In this section you’ll learn how to pass toolkit-specific parameters to the OEChem toolkit functions to parse and create a SMILES string. You will need the OEChem toolkit. See the next section for specific details about aromaticity.

By default the OEChem SMILES parser is tolerant of bad SMILES. I believe it’s too tolerant, because will gladly parse what I think are invalid SMILES, like “C-=C”:

>>> from chemfp import openeye_toolkit as T
>>> mol = T.parse_smistring("C-=C")
>>> T.create_smistring(mol)
'C=C'

The developers at OpenEye recognize that pedantic folk like me exist. The OEChem SMILES parser has a “strict” mode, which I can enable in chemfp through the “flavor” parameter of the reader_args dictionary:

>>> mol = T.parse_molecule("C-=C", "smistring",
...   reader_args={"flavor": "Strict"})
Warning: Unknown flavor (16385) for format (27) in set iflavor.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
      ...
  File "chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'C-=C'

The “Warning: Unknown flavor” is an unfortunate difference between how OEChem parses a SMILES vs. a CXSMILES. Starting with chemfp 4.1, chemfp will interpret a SMILES string as CXSMILES unless the “cxsmiles” reader_arg is False. Doing that here gives a more informative error message:

>>> mol = T.parse_molecule("C-=C", "smistring",
...   reader_args={"flavor": "Strict", "cxsmiles": False})
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning:   ^

Warning: Error reading molecule "" in Canonical stereo SMILES format.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
     ....
  File "chemfp/io.py", line 139, in error
    raise ParseError(msg, location)
chemfp.ParseError: OEChem cannot parse the smistring record: 'C-=C'

Going back to error handling, the underlying OEParseSmiles() function takes the optional strict and canon parameters. Why does chemfp use the term “flavor”? Why the capitalization for “Strict”?

Historically the low-level OEChem functions took individual parameters, like the positional arguments canon and strict:

>>> mol = OEGraphMol()
>>> OEParseSmiles(mol, "C-=C", False, True)
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning:   ^

False

(I wrote “historically” because more recent versions have format-specific options classes, like OEParseSmilesOptions for SMILES. These collect all of the configuration options into a single parameter, which is easier to pass around.)

On the other hand, the high-level molecule parsers take a single “flavor” integer value to specify the options for a given format. This flavor is usually expressed as the union of a set of bitmasks. I’ll show how OEChem’s Python API uses the flavor parameter.

The following OEChem code reads a SMILES file in the default non-strict mode (with no specified flavor):

% cat example.smi
C=-C bad
CCC good
% python
   ...
>>> from openeye.oechem import *
>>> ifs = oemolistream("example.smi")
>>> for mol in ifs.GetOEGraphMols():
...   print(mol.GetTitle(), mol.NumAtoms())
...
bad 2
good 3

while the following sets the SMILES flavor to use “strict” mode:

>>> ifs = oemolistream("example.smi")
>>> ifs.SetFlavor(OEFormat_SMI, OEIFlavor_SMI_Strict)
True
>>> for mol in ifs.GetOEGraphMols():
...   print(mol.GetTitle(), mol.NumAtoms())
...
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C=-C bad
Warning:   ^

Warning: Error reading molecule "" in Canonical stereo SMILES format.
good 3

(You can see some terminology differences between me and OpenEye in the warning message. The “Canonical” and “stereo” are only meaningful as a description of the output format, not the input format, and I use the traditional term “isomeric” while they highlight the more important stereochemistry aspect. I also got confused because I thought at first the “Canonical” had something to do with OEIFlavor_SMI_Canon.)

I decided to base the chemfp openeye_toolkit API on the high-level “flavor” API of OEChem, which is better documented and requires less work on my part to implement than low-level functions. But I also decided to extend it to support a string value, and not just an integer.

To explain how that works, I’ll switch from describing reader_args to writer_args, because raising an exception with the “Strict” option gets boring, fast.

The OEChem SMILES output flavors are: OEOFlavor_SMI_AtomMaps, OEOFlavor_SMI_AtomStereo, …. and you know what? The OEOFlavor_SMI_ prefix is part of what makes the flavors hard to use in Python, so I’ll omit the prefix in chemfp. The OEChem SMILES output flavors are: AllBonds, AtomMaps, AtomStero, BondStereo, Canonical, ExtBonds, Hydrogens, ImpHCount, Isotopes, Kekule, RGroups, SmiMask, and SuperAtoms. There are also Default and DEFAULT which are the bitwise union RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical.

In chemfp you can specify the fields as a “|” or “,” separated list of flavor flags, without the prefix. Here are several different ways to specify the default settings for isomeric canonical SMILES string output:

>>> mol = openeye_toolkit.parse_molecule("[16O][*:1]", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"flavor": ""})
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"flavor": "Default"})
'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
...    writer_args={"flavor": "RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical"})
'[R1][16O]'

These settings override any options which might be implied by the format name. Thus, even though “smistring” is supposed to generate an isomeric canonical SMILES, I can use the writer_args to remove the isomeric component from the flavor:

>>> openeye_toolkit.create_string(mol, "smistring",
...    writer_args={"flavor": "RGroups|AtomStereo|BondStereo|AtomMaps|Canonical"})
'[R1][O]'

While I used “|” as the separator, I can equally use “,”, as in:

>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"flavor": "Isotopes,Canonical"})
'*[16O]'

OEChem uses the bar as a bitwise-or operator which merges the different flags. I added the comma as an alternative to the vertical bar because chemfp has additional syntax for removing options. The following removes the “RGroups” option from the isomeric and non-isomerical formats defaults, but otherwise leaves the defaults alone:

>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": "Default,-RGroups"})
'[*:1][16O]'
>>>
>>> openeye_toolkit.create_string(mol, "canstring",
...         writer_args={"flavor": "Default,-RGroups"})
'[*:1][O]'

(The terms are evaluated from left to right, so you can delete a term then add it back if you want.)

I added a comma because writing this as Default|-RGroups caused the C programmer mind in me to gasp in bewilderment. (“The bitwise-or with the negative of the RGroups bitflags?!!”)

You don’t need to specify the OEChem flavor using a flavor string. You can also specify it as an integer:

>>> from openeye.oechem import *
>>> (OEOFlavor_SMI_Isotopes|OEOFlavor_SMI_AtomStereo|OEOFlavor_SMI_BondStereo|
...  OEOFlavor_SMI_AtomMaps|OEOFlavor_SMI_Canonical)
121
>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": 121})
'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": 0})
'[O]*'

or (and this might be a bit excessive) as a string-encoded integer:

>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": "121"})
'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": "0"})
'[O]*'

Chemfp tries to be helpful. It will include the list of available flavor names in the exception if it doesn’t understand what you gave it:

>>> openeye_toolkit.create_string(mol, "smistring",
...         writer_args={"flavor": "chocolate"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/openeye_toolkit.py", line 446, in create_string
    return _toolkit.create_string(mol, format, id, writer_args, errors)
            ... lines removed ...,
  File "chemfp/_openeye_toolkit.py", line 1174, in parse_flavor
    raise err
ValueError: OEChem smi format does not support the 'chocolate'
flavor option. Available flavors are: AllBonds, AtomMaps,
AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens,
ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms

See Get the default reader_args or writer_args for a format for a description of how to get the default reader and writer arguments for a given format, and use help(openeye_toolkit.read_molecules) and help(openeye_toolkit.open_molecule_writer) to get a more human-readable description.

OpenEye-specific aromaticity

In this section you’ll learn how chemfp handles OpenEye’s aromaticity parameter. You will need the OEChem toolkit, and you should read the previous section to understand some of the terminology.

Note: the OEGraphSim fingerprints are not affected by the aromaticity of the reader because the fingerprint generators ensure that the molecules are always perceived using “openeye” aromaticity before generating the fingerprint.

The OpenEye toolkit supports the “openeye”, “daylight”, “tripos”, “mdl”, and “mmff” aromaticity models. In the high-level API, which is meant for reading and writing files or file-like objects, the aromaticity is an aspect of the flavor integer. If unspecified, OEChem uses the appropriate default aromaticity model for that format. As a result, aromaticity perception is required for both reading and writing files.

The low-level API handles file processing and aromaticity perception as distinct steps. This API can also process a single record directly, while the high-level API requires wrapping the record in a file-like object and then reading the first molecule from it.

The chemfp toolkit API is a high-level API for both files and records, which means I had to implement record conversion routines on top of OEChem’s low-level API. Consequently, some of the details are different between the file I/O and record I/O APIs; the most significant being that the record I/O routines also support a “none” aromaticity.

The following shows the default aromaticity proceessing in action:

>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring")
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[True, True, True, True, True, True]

Automatic aromaticity perception is normally the right thing to do, because different toolkits and even different versions of the same toolkit may have different ideas of what is aromatic, and it’s best to ensure that they are consistently interpreted.

Aromaticity perception isn’t needed when you know that the input aromaticity is correct and unambiguous. My timings show that aromaticity perception takes about half of the time needed to parse a SMILES string. If the string comes from a good data source, like a database record where OEChem created the SMILES, then you can nearly double the performance by omitting the perception step.

What does “ambiguous” mean? Consider azulene, which can be described by the SMILES “c1ccc2cccc2cc1”. The fusion bond is not aromatic, while the peripheral bonds form a 10 pi electron system. In SMILES, an unspecified bond means “single or aromatic”. If one of the terminal atoms is aliphatic then the bond must be a single bond. But as the fusion bond in azulene shows, it’s possible for an unspecified bond with terminal aromatic atoms to still be non-aromatic. The above SMILES is ambiguous, and OEChem needs to do a full aromaticity analysis to determine that the fusion bond is not aromatic.

An unambiguous SMILES for azulene is “c1ccc-2cccc2cc1”, where the fusion bond is marked explicitly as a single bond. The SMILES parser can use the simpler rule that an unspecified ring bond is aromatic whenever both terminal atoms are aromatic, and not require the lengthy aromatic perception step to determine that. OEChem generates unambiguous SMILES, so if you know OEChem generated the SMILES then you can recover the original aromaticity directly.

(As a side note, Daylight first introduced this in 4.71, and used fluorene (“C1c2ccccc2-c3ccccc13”) as the prototypical case. Daylight’s rule is to include the “-” for a single bond between two aromatic atoms, while OEChem’s rule is to include the “-” for a single bond between two aromatic atoms and which is in a ring. Ring identification is much easier than aromaticity perception.)

So where was I … ah, right, specifing the aromaticity model. I decided to separate aromaticity from the rest of the flavor flags, and specify it with its own reader_args and writer_args field. It’s easiest to see using beneze in Kekule form:

>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring",
...     reader_args={"aromaticity": "none"})
>>>
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[False, False, False, False, False, False]
>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"aromaticity": "none"})
'C1=CC=CC=C1'

NOTE: the aromaticity flags are volatile. If you don’t specify the “none” aromaticity model then chemfp.toolkit.create_string() will reperceive aromaticity using the “openeye” aromaticity model and possibly reassign the aromaticity flags.

>>> openeye_toolkit.create_string(mol, "smistring")
'c1ccccc1'
>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"aromaticity": "none"})
'c1ccccc1'

This is consistent with how OEChem’s high-level operations also modify the input molecule when creating output. I’m not fully happy with it. OEChem also has a “ConstMolecule” version, so this detail may change in the future.

Open Babel-specific SMILES reader_args and writer_args

In this section you’ll learn how to pass toolkit-specific parameters to the Open Babel toolkit functions to create a SMILES string. You will need the Open Babel toolkit.

As far as I can tell, Open Babel does not have configuration options to change the default SMILES parser, so chemfp has no toolkit-specific reader_args for that toolkit. Open Babel does have configuration options to change the default SMILES output routines. These can be set in chemfp with the writer_args dictionary.

Open Babel uses an options string to change the configuration. The string “i U smilesonly” generates non-isomeric SMILES output, where the atom ordering is determined by the InChI’s canonicalization algorithm (“Universal SMILES”), and where the identifier is excluded from the SMILES output.

Did you know all of that? I didn’t. Some of these options are only documented in the code. It’s also difficult for chemfp to handle since some of the options conflict with how chemfp thinks of things. For example, chemfp is in charge of including the identifier, so it will always enable “smilesonly”, and it’s difficult for the “cansmiles” output, which is non-isomeric, to know if an options string wants to override the default”i” option that it requires.

I ended up making my own writer_args API to have more explicit control over the individual parameters:

  • explicit_hydrogens - boolean
  • isomeric - boolean
  • canonicalization - a string like “default”, “none”, “universal”, “anticanonical”, or “inchified”
  • options - the Open Babel options string (if you must use it; using it may break things if you are not very careful.)

Here’s an example of how to disable isomeric support for the “smistring” output, which would normally generate an isomeric SMILES:

>>> from chemfp import openbabel_toolkit
>>> mol = openbabel_toolkit.parse_smistring("[16O]=O")
>>> openbabel_toolkit.create_string(mol, "smistring")
'[16O]=O'
>>> openbabel_toolkit.create_string(mol, "smistring",
...     writer_args={"isomeric": False})
'O=O'

I can also enable isomeric SMILES for the “canstring” format, which is normally non-isomeric:

>>> openbabel_toolkit.create_string(mol, "canstring")
'O=O'
>>> openbabel_toolkit.create_string(mol, "canstring",
...     writer_args={"isomeric": True})
'[16O]=O'

Open Babel supports several different canonicalization algorithms. Perhaps the most unusual one is “anticanonical”, which uses random numbers for the atom ordering algorithm. The same molecule can generate different SMILES strings across multiple calls, so it’s the antithesis of “canonical”:

>>> for i in range(5):
...   print(openbabel_toolkit.create_string(mol, "smistring",
...         writer_args={"canonicalization": "anticanonical"}))
...
[16O]=O
[16O]=O
O=[16O]
[16O]=O
[16O]=O

As with the other toolkits, Open Babel’s format-specific helper functions like parse_sdf and create_smistring use keyword arguments instead of reader_args and writer_args. Here’s another way to do the same anti-canonicalization example:

>>> for i in range(5):
...   print(openbabel_toolkit.create_smistring(mol, canonicalization="anticanonical"))
...
O=[16O]
O=[16O]
[16O]=O
[16O]=O
O=[16O]

See Get the default reader_args or writer_args for a format for a description of how to get the default reader and writer arguments for a given format, and use help(openbabel_toolkit.read_molecules) and help(openbabel_toolkit.open_molecule_writer) to get a more human-readable description.

CDK-specific SMILES reader_args and writer_args

In this section you’ll learn how to pass toolkit-specific parameters to the CDK toolkit functions to parse and create a SMILES string. You will need the CDK JAR file on your CLASSPATH and the JPype Java/Python adapter installed. (See the [installation guide for help]).

By default CDK will find a Kekule assignment of the input SMILES:

>>> from chemfp import cdk_toolkit
>>> cdk_toolkit.parse_molecule("oC", "smistring")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
       ...
  File "<string>", line 1, in raise_tb
chemfp.ParseError: CDK cannot parse the SMILES 'oC': a valid kekulé structure could not be assigned

This can be disabled with the kekulise=False reader argument:

>>> mol = cdk_toolkit.parse_molecule("oC", "smistring", reader_args={"kekulise": False})
>>> [a.isAromatic() for a in mol.atoms()]
[True, False]

I’m not sure why you would need it, but it’s there. As with the other toolkits, the format helper functions like parse_smistring use keyword arguments instead of taking a reader_args dictionary, so the following also works:

>>> mol = cdk_toolkit.parse_smistring("oC", kekulise=False)

The CDK writer_args are more interesting. The default SMILES writer generates the SMILES in Kekule form:

>>> mol = cdk_toolkit.parse_smistring("c1ccccc1[16OH]")
>>> cdk_toolkit.create_string(mol, "smistring")
'C1=CC=C(C=C1)[16OH]'

CDK uses a flavor parameter similar to how OpenEye’s flavor system works. I’ll use first specify the “Default” flavor, using both the generic create_string and format-specific create_smistring:

>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "Default"})
'C1=CC=C(C=C1)[16OH]'
>>> cdk_toolkit.create_smistring(mol, flavor="Default")
'C1=CC=C(C=C1)[16OH]'

then use “???” to cause the flavor parser to fail and show a list of possible options:

>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "???"})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
      ...
ValueError: CDK smistring format does not support the '???' flavor
option. Available flavors are: Absolute, AtomAtomMap,
AtomAtomMapRenumber, AtomicMass, AtomicMassStrict, Canonical,
Cx2dCoordinates, Cx3dCoordinates, CxAtomLabel, CxAtomValue,
CxCoordinates, CxFragmentGroup, CxMulticenter, CxPolymer, CxRadical,
CxSmiles, CxSmilesWithCoords, Default, InChILabelling, Isomeric,
Stereo, StereoCisTrans, StereoExCisTrans, StereoExTetrahedral,
StereoTetrahedral, Unique, UniversalSmiles, UseAromaticSymbols

CDK has an a lot of options! Let’s try a few. First, use aromatic symbols instead of Kekule form:

>>> cdk_toolkit.create_string(mol, "smistring", writer_args={"flavor": "Default,UseAromaticSymbols"})
'c1ccc(cc1)[16OH]'
>>> cdk_toolkit.create_smistring(mol, flavor="Default,UseAromaticSymbols")
'c1ccc(cc1)[16OH]'

Now, also disable isomeric SMILES:

>>> cdk_toolkit.create_string(mol, "smistring",
...    writer_args={"flavor": "Default,UseAromaticSymbols,-Isomeric"})
'Oc1ccccc1'
>>> cdk_toolkit.create_smistring(mol, flavor="Default,UseAromaticSymbols,-Isomeric")
'Oc1ccccc1'

If you know the CDK toolkit then you should be able to figure out how to use these flavor flags.

Get the default reader_args or writer_args for a format

In this section you’ll learn how to get the default reader_args and writer_args for a given format.

As you’ve seen, each toolkit format can have its own reader_args and writer_args parameters, and chemfp layers its own format types (like “smistring”) on top of the native formats. It’s easy to forget the specific parameters for a given format, much less the default values.

The get_default_reader_args() and get_default_writer_args() methods of the Format object return the respective default arguments:

>>> from chemfp import rdkit_toolkit
>>> fmt = rdkit_toolkit.get_format("smi")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'has_header': False, 'delimiter': None}
>>> fmt.get_default_writer_args()
{'isomericSmiles': True, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False, 'cxsmiles': False,
'delimiter': None}

You can sometimes use this information to see how chemfp maps its format types to the toolkit parameters. In RDKit, the difference between chemfp’s “smi” and “can” formats is that isomericSmiles is True for the first and False for the second:

>>> rdkit_toolkit.get_format("can").get_default_writer_args()
{'isomericSmiles': False, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False, 'cxsmiles': False,
'delimiter': None}

While writing this documentation I realized that the OEChem toolkit shows neither the default flavor nor the default aromaticity for a given format type. I will likely improve that in a future version of chemfp.

Convert text settings into reader and writer arguments

In this section you’ll learn how to convert text-based configuration settings into the appropriate reader_args or writer_args dictionary.

The reader_args and writer_args take native Python values, including integers and booleans. In practice these will often be defined in a configuration file, through command-line options, or as CGI parameters. The Format methods get_reader_args_from_text_settings() and get_writer_args_from_text_settings() convert a text-based settings dictionary into the appropriate arguments dictionary with native Python objects as values. (These are methods of the Format object, because the parameter details are format-specific.)

The following shows an example using the RDKit toolkit’s “sdf” format to get reader_args from a dictionary of text settings:

>>> from chemfp import rdkit_toolkit
>>>
>>> sdf_format = rdkit_toolkit.get_format("sdf")
>>> sdf_format.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}
>>>
>>> sdf_format.get_reader_args_from_text_settings({
...    "strictParsing": "true",
...    "removeHs": "False",
...    "sanitize": "0"})
{'sanitize': False, 'removeHs': False, 'strictParsing': True}

The boolean setting parser converts “true”, “True”, and “1” to Python’s True, and “false”, “False”, and “0” to Python’s False. Otherwise it raises a ValueError.

The following shows an equivalent example for RDKit’s SDF writer_args:

>>> sdf_format.get_default_writer_args()
{'includeStereo': False, 'kekulize': True, 'v3k': False}
>>> sdf_format.get_writer_args_from_text_settings({
...     "kekulize": "false", "v3k": "true",
...     "includeStereo": "True"})
{'includeStereo': True, 'kekulize': False, 'v3k': True}

WARNING: these functions will ignore unknown keys. This was done to allow the text settings dictionary to contain settings for other toolkits and formats. As a result, typos are harder to detect, because they will be ignored.

See argparse text settings to reader and writer args for an example of converting text settings from the command-line into reader and writer arguments.

Multi-toolkit reader_args and writer_args

In this section you’ll learn how to configure reader_args and writer_args so the same dictionary can be used to configure multiple toolkits and formats.

Sometimes you don’t know which toolkit will be used for parsing, but you do know that you want Open Babel, OEChem, and RDKit to act in non-standard ways. For example, the choice of toolkit may depend on the user-defined fingerprint type, or simply (as in the following example) depend on user input.

The reader_args and writer_args will ignore unknown parameters, which lets you combine arguments for different toolkits into a single dictionary. As the toolkits use completely different parameter names (except a couple, like “delimiter”, which are supposed to act the same for all toolkits), there’s no conflict in the names for a given format.

The following defines a reader_args dictionary and a writer_args dictionary with parameters for each supported toolkit, then enters a loop. The loop asks the user for a SMILES string, or the name of the toolkit to use, or “q” to quit the loop. It will parse each SMILES into a molecule, then generate a SMILES output, although with decidedly strange parameters:

import chemfp

from chemfp import rdkit_toolkit as T  # use your default toolkit of choice
#from chemfp import openeye_toolkit as T
#from chemfp import openbabel_toolkit as T
#from chemfp import cdk_toolkit as T

try:
  raw_input  # Python 2 name
except NameError:
  raw_input = input # Python 3

reader_args = {
   "sanitize": False,  # RDKit,
   "openeye.*.flavor": "Default,Strict",   # OEChem
   "aromaticity": "none",  # OEChem
}

writer_args = {
   "kekuleSmiles": True,  # RDKit
   "canonicalization": "anticanonical",  # Open Babel
   "aromaticity": "daylight",  # OEChem
   "cdk.*.flavor": "Default,UseAromaticSymbols",   # CDK
}

print("Using", T.name, "toolkit")
while 1:
  query = raw_input("SMILES, toolkit name, or 'q' to quit? ")
  if not query or query == "q":
    break

  if query in ("rdkit", "openeye" ,"openbabel", "cdk"):
    try:
        T = chemfp.get_toolkit(query)
    except ValueError:
        print("Toolkit %r not available" % (query,))
    print("Using", T.name, "toolkit")
    continue

  mol = T.parse_molecule(query, "smistring", reader_args=reader_args, errors="ignore")
  if mol is None:
    print("Toolkit", T.name, "could not parse query as SMILES")
    continue

  smiles = T.create_string(mol, "smistring", writer_args=writer_args, errors="ignore")
  if not smiles:
    print("Toolkit", T.name, "could not convert the molecule to SMILES")
    continue
  print(" -->", smiles)

I saved the above to a script and then ran it. It starts using RDKit, where I’ve set the reader’s “sanitize” to False so RDKit won’t perceive aromaticity on input, and set the writer’s “kekuleSmiles” to show explicit aromatic bond types:

Using rdkit toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> OC1=CC=CC=C1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
 --> OC1:C:C:C:C:C:1

I then switch to the OpenEye toolkit, show that it is operating with “strict” added to the default reader flavor, and convert a couple of SMILES to canonical SMILES to show the output uses the Daylight aromaticity model instead of the default:

SMILES, toolkit name, or 'q' to quit? openeye
SMILES, toolkit name, or 'q' to quit? C==C
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C==C
Warning:   ^

Toolkit openeye could not parse query as SMILES
Using openeye toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
 --> c1ccc(cc1)O

I then switch to the Open Babel toolkit and show that it generates “anti-canonical” SMILES, where the spanning tree priority order for SMILES output is randomly assigned:

SMILES, toolkit name, or 'q' to quit? openbabel
Using openbabel toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
 --> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
 --> c1c(O)cccc1
SMILES, toolkit name, or 'q' to quit? q

Finally I switch to CDK and show that it generate aromatic SMILES instead of Kekule:

SMILES, toolkit name, or 'q' to quit? cdk
Using cdk toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
 --> C1=CC=C(C=C1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
 --> c1ccc(cc1)O

See argparse text settings to reader and writer args for an example of using multi-toolkit reader_args and writer_args.

Qualified reader and writer parameters names

In this section you’ll learn how to use qualified parameter names. These give fine-grained control over the configuration options for each toolkit and format.

The previous section pointed out that the three toolkits use different parameter names, so for a given format you can combine the toolkit-specific reader_args into one unified dictionary and writer_args into another unified dictionary. However, within a toolkit the same parameter name can be reused for different formats, with different meanings.

This best example is for the chemfp.openeye_toolkit, where the reader_args and writer_args for all formats support the “flavor” and “aromaticity” parameters. The following shows examples where I might use a different flavor for the SMILES and InChI outputs, to get something other than the default representation:

>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>>
>>> openeye_toolkit.create_string(mol, "smistring")
'CC(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "smistring",
...     writer_args={"flavor": "Default|ImpHCount"})
'[CH3]C(=O)[O-]'
>>>
>>> openeye_toolkit.create_string(mol, "inchistring")
'InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1'
>>> openeye_toolkit.create_string(mol, "inchistring",
...     writer_args={"flavor": "Default|FixedHLayer"})
'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'

Chemfp uses “qualified” parameter names to handle this situation. For example, the qualified name “smistring.flavor” is the flavor parameter for the smistring format:

>>> writer_args = {
...    "smistring.flavor": "Default|ImpHCount",
...    "inchistring.flavor": "Default|FixedHLayer",
... }
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring", writer_args=writer_args)
'[CH3]C(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "inchistring", writer_args=writer_args)
'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'

WARNING: there are six SMILES-related formats (“smi”, “can”, “usm”, “smistring”, “canstring”, and “usmstring”) so to be complete you’ll need to specify values for all of them. There are also two InChI-related formats (“inchi” and “inchistring”).

A “fully qualified” name looks like “openeye.smistring.flavor”. The first term is the toolkit, the second the format name, and the last the parameter name. At present there little need for fully qualified names because most parameter names are either unique to a toolkit and format type, or (like ‘delimiter’) supposed to be identical across all toolkits. The major exception is ‘flavor’, used by all of the OpenEye formats as well as the RDKit “fasta”, “sequence”, and “pdb” formats.

The following demonstration, which is more a parlor trick than something useful, shows how to have each toolkit use a different SMILES delimiter:

>>> import chemfp
>>>
>>> reader_args = {
...   "rdkit.smi.delimiter": "tab",
...   "openbabel.smi.delimiter": "whitespace",
...   "openeye.smi.delimiter": "to-eol",
...   "cdk.smi.delimiter": "space",
... }
>>>
>>> for toolkit_name in ("rdkit", "openbabel", "openeye", "cdk"):
...   T = chemfp.get_toolkit(toolkit_name)
...   id, mol = T.parse_id_and_molecule("C\tabc def\tghi", "smi",
...                                     reader_args=reader_args)
...   print(toolkit_name, "sees the id", repr(id))
...
rdkit sees the id 'abc def'
openbabel sees the id 'abc'
openeye sees the id 'abc def\tghi'
cdk sees the id 'def\tghi'

(As a reminder, the ‘delimiter’ implementation is not perfect. A toolkit may accept the first whitespace after the SMILES term as a valid delimiter even if it doesn’t match the actual parameter, and a toolkit may decide to stop parsing the SMILES term at the first whitespace.)

The final type of qualified parameter looks like “openeye.*.aromaticity”, where the first term is the toolkit name, the second term is “*”, and the third term is the parameter name. This is most useful if you want OEChem to enforce the same aromaticity across all formats, or have the RDKit parsers ignore sanitization, with configuration entries like:

{"openeye.*.aromaticity": "daylight",
 "rdkit.*.sanitize": False}

However, as only OEChem supports “aromaticity” and only RDKit supports “sanitize”, you could also write this as simply:

{"aromaticity": "daylight",
 "sanitize": False}

Qualified parameter priorities

In this section you’ll learn the priority order when multiple terms try to specify the same parameter.

In the previous section you learned how “delimiter”, “smi.delimiter”, “rdkit.*.delimiter” and “rdkit.smi.delimiter” can all be used to set the delimiter style for RDKit’s “smi” format. If more then one term is specified, which one wins?

Chemfp checks for the parameters in the following order:

  1. rdkit.smi.delimiter
  2. rdkit.*.delimiter
  3. smi.delimiter
  4. delimiter

The parameter with the highest ranking determines the setting, as the following shows:

>>> from chemfp import rdkit_toolkit as T
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
...     reader_args={"delimiter": "to-eol",
...                  "smi.delimiter": "whitespace"})
>>> id
'methane'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
...     reader_args={"rdkit.*.delimiter": "to-eol",
...                  "smi.delimiter": "whitespace"})
>>> id
'methane 16.04246'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
...     reader_args={"rdkit.*.delimiter": "to-eol",
...                  "rdkit.smi.delimiter": "whitespace"})
>>> id
'methane'

One way to remember it is the longest name has priority.

It can be confusing to have a large dictionary with multiple format and toolkit qualifiers. The get_unqualified_reader_args() and get_unqualified_writer_args() methods of Format object will return the fully unqualified reader_args and writer_args for that format:

>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_reader_args({
...       "delimiter": "to-eol",
...       "smi.delimiter": "whitespace",
...     })
{'sanitize': True, 'has_header': False, 'delimiter': 'whitespace'}
>>> fmt.get_unqualified_writer_args({
...       "delimiter": "space",
...       "smi.delimiter": "tab",
...     })
{'isomericSmiles': True, 'kekuleSmiles': False, 'canonical': True,
'allBondsExplicit': False, 'allHsExplicit': False,
'cxsmiles': False, 'delimiter': 'tab'}

This can also be helpful if you think you made a typo; get the unqualified reader_args and see if the result has the arguments you think it should have.

Qualified names and text settings

In this section you’ll learn how the qualified names also apply to text settings.

Earlier you learned that text settings are string-based keys and values, which might come from the command-line, a configuration file, or some other text-based source. These need to be converted into Python values before they can be used as reader_args or writer_args.

A Format object can convert a dictionary of text settings into the correct argument dictionary. To get a Format object, ask the toolkit for the format of the given name:

>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("sdf")
>>> fmt.get_default_reader_args()
{'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True}

The section Convert text settings into reader and writer arguments showed how to convert the text settings with unqualified names into a reader_args dictionary:

>>> fmt.get_reader_args_from_text_settings({
...     "strictParsing": "false",
...     "removeHs": "false",
...     })
{'removeHs': False, 'strictParsing': False}

The text settings dictionary also supports qualified parameter names, including handling the priority resolution described in Qualified parameter priorities:

>>> fmt.get_reader_args_from_text_settings({
...     "strictParsing": "false",
...     "sdf.strictParsing": "true",
...     "removeHs": "false",
...     "rdkit.*.removeHs": "true",
...     "rdkit.sdf.sanitize": "false",
...     })
{'sanitize': False, 'removeHs': True, 'strictParsing': True}

If you stare at it for a bit you’ll see that “sdf.strictParsing” has a higher priority than “strictParsing” and “rdkit.*.removeHs” is higher than “removeHs”, which is how it’s supposed to work.

Read molecules from an SD file or stdin

In this section you’ll learn how to read an SD file and iterate through its records as toolkit molecules. You will need Compound_099000001_099500000.sdf.gz from PubChem.

Time to get back to molecules! The chemfp.toolkit.read_molecules() function reads molecules from a structure file:

from chemfp import rdkit_toolkit as T  # use your toolkit of choice
for mol in T.read_molecules("Compound_099000001_099500000.sdf.gz"):
    print(T.create_string(mol, "smistring"))

By default it uses the filename extension to figure out the format and compression type. You can specify it yourself, if you wish, using the format option:

from chemfp import rdkit_toolkit as T  # use your toolkit of choice
for mol in T.read_molecules("Compound_099000001_099500000.sdf.gz",
                            format="sdf.gz"):
  print(T.create_string(mol, "smistring"))

Examples of valid format values are “smi”, “can”, and “usm” (but not the *string variants like “smistring”, because those aren’t record-based formats), and “sdf”, as well as gzip-compressed versions like “smi.gz” and “sdf.gz”.

(For Open Babel the “.gz” extension does nothing as Open Babel will auto-detect and handle gzip compressed input. Chemfp’s RDKit interface also support zstandard-compressed files with the extension “.zst” if the Python package “zstandard” is installed.)

If the first parameter (the source parameter) is the Python None value then the toolkit will read from stdin. As there’s no filename, chemfp can’t look at the extension to figure out the format, so it assumes the input is in “smi” format, that is, an uncompressed SMILES file.

Therefore, to read an SD file from stdin you must specify the format. The following program reads a gzip compressed SD file from stdin, convert it to SMILES, and find the 10 most common characters used in the SMILES strings:

# This file is named 'count_smiles_characters.py'
from collections import Counter
from chemfp import rdkit_toolkit as T # use your toolkit of choice

symbol_counts = Counter()
for mol in T.read_molecules(None, "sdf.gz"):
  smiles = T.create_string(mol, "smistring")
  symbol_counts.update(smiles)
for symbol, count in symbol_counts.most_common(10):
  print("%7d: %r" % (count, symbol))

Now to try it on a data set:

% python count_smiles_characters.py < Compound_099000001_099500000.sdf.gz
 114190: 'c'
  96119: 'C'
  50541: '('
  50541: ')'
  33054: '1'
  29000: 'O'
  22227: '='
  19716: '2'
  19276: '@'
  18420: 'N'

Read ids and molecules from an SD file at the same time

In this section you’ll learn how to read an SD file and iterate through its records as the two-element tuple of (id, molecule). You will need the Compound_099000001_099500000.sdf.gz from PubChem, which was used in the previous section.

In an earlier section, Parse the id and the molecule at the same time, you learned how to parse a structure record to get both the identifier and the molecule at the same time. The toolkit function chemfp.toolkit.read_ids_and_molecules() is the equivalent for reading from a structure file.

In the following example I’ll use the RDKit toolkit to create a tab-separated file with the id in the first column, the number of carbon atoms in the second, and the SMILES in the third. For brevity, I’ll display only the first 10 records, which also gives a nice example of when to use itertools.islice:

from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_099000001_099500000.sdf.gz"

with rdkit_toolkit.read_ids_and_molecules(filename) as reader:
  for id, mol in islice(reader, 0, 10):
      num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
      smiles = rdkit_toolkit.create_string(mol, "smistring")
      print(id, num_carbons, smiles, sep="\t")

(See the next section for a description of how the line with the sum() works.)

Here’s the output, and a spot check shows the carbon counts are correct:

99000039      21      O=C(CC[C@H]1NC(=O)c2ccccc2NC1=O)Nc1cccc2ncccc12
99000230      21      COc1ccc(S(=O)(=O)N2CCC(C(=O)N[C@H](C)C(=O)NCc3ccco3)CC2)cc1
99002251      19      Cc1ccc(N/C=C(/C#N)C(=O)NC(=O)Cc2ccccc2)c(O)c1
99003537      23      CC(C)C[C@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
99003538      23      CC(C)C[C@@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1
99005028      19      C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@@H]1CCCC[C@@H]1C
99005031      19      C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@H]1CCCC[C@@H]1C
99006292      20      Cc1ccc(C)c(S(=O)(=O)N2CCC[C@H](C(=O)NC3CCCCC3)C2)c1
99006293      20      Cc1ccc(C)c(S(=O)(=O)N2CCC[C@@H](C(=O)NC3CCCCC3)C2)c1
99006597      25      CS/C(N=CN(C)C)=C(\C#N)[P+](c1ccccc1)(c1ccccc1)c1ccccc1

What’s fun is that RDKit and OEChem both implement mol.GetAtoms() and atom.GetAtomicNum() so it’s trivial to port the above from RDKit to OEChem; replace rdkit_toolkit with openeye_toolkit!

The Open Babel port isn’t quite as easy because Open Babel has a different way to get the atoms in a molecule. To make it easy to copy and paste, here’s the equivalent code for Open Babel:

from itertools import islice
from chemfp import openbabel_toolkit
filename = "Compound_099000001_099500000.sdf.gz"

with openbabel_toolkit.read_ids_and_molecules(filename) as reader:
  for id, mol in islice(reader, 0, 10):
    num_carbons = sum(1 for atom_idx in range(mol.NumAtoms())
                            if mol.GetAtom(atom_idx+1).GetAtomicNum() == 6)
    smiles = openbabel_toolkit.create_string(mol, "smistring")
    print(id, num_carbons, smiles, sep="\t")

Finally, here’s the implementation for CDK. (This only works with Python 3 because the jpype Java/Python adapater doesn’t support Python 2):

from itertools import islice
from chemfp import cdk_toolkit
filename = "Compound_099000001_099500000.sdf.gz"

with cdk_toolkit.read_ids_and_molecules(filename) as reader:
  for id, mol in islice(reader, 0, 10):
    num_carbons = sum(a.getAtomicNumber() == 6 for a in mol.atoms())
    smiles = cdk_toolkit.create_string(mol, "smistring")
    print(id, num_carbons, smiles, sep="\t")

(If you run this you’ll notice that CDK’s SDF reader keeps the hydrogens as explicit atoms because the above generates SMILES strings like C1([H])=C([H])C2=C(C([H])=C1.....)

Read ids and molecules using an SD tag for the id

In this section you’ll learn how to use the id_tag to get the id from one of the SD tags, rather than from the record’s title. You will need the Compound_099000001_099500000.sdf.gz from PubChem, which was used in the previous section. I’ll also explain an idiom for how to count the number of records in an iterator.

Sometimes you would rather use a tag value as the id rather than the title line of the SDF record. This is critical for ChEBI data set and older ChEMBL data sets, which leave the title line (mostly) blank. In this case, use the id_tag to specify the tag to use.

The following example modifies the RDKit code from previous code to use PUBCHEM_IUPAC_SYSTEMATIC_NAME as the id, rather than the title line:

from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_099000001_099500000.sdf.gz"
reader = rdkit_toolkit.read_ids_and_molecules(filename, id_tag="PUBCHEM_IUPAC_SYSTEMATIC_NAME")

for id, mol in islice(reader, 0, 10):
  num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
  smiles = rdkit_toolkit.create_string(mol, "smistring")
  print(id, num_carbons, smiles, sep="\t")

The output is:


3-[(3R)-2,5-bis(oxidanylidene)-3,4-dihydro-1H-1,4-benzodiazepin-3-yl]-N-quinolin-5-yl-propanamide 21 O=C(CC[C@H]1NC(=O)c2ccccc2NC1=O)Nc1cccc2ncccc12 N-[(2R)-1-(furan-2-ylmethylamino)-1-oxidanylidene-propan-2-yl]-1-(4-methoxyphenyl)sulfonyl-piperidine-4-carboxamide 21 COc1ccc(S(=O)(=O)N2CCC(C(=O)N[C@H](C)C(=O)NCc3ccco3)CC2)cc1 (Z)-2-cyano-3-[(4-methyl-2-oxidanyl-phenyl)amino]-N-(2-phenylethanoyl)prop-2-enamide 19 Cc1ccc(N/C=C(/C#N)C(=O)NC(=O)Cc2ccccc2)c(O)c1 N-[(1S)-1-(1H-benzimidazol-2-yl)-3-methyl-butyl]-2-(1-methylindol-3-yl)ethanamide 23 CC(C)C[C@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1 N-[(1R)-1-(1H-benzimidazol-2-yl)-3-methyl-butyl]-2-(1-methylindol-3-yl)ethanamide 23 CC(C)C[C@@H](NC(=O)Cc1cn(C)c2ccccc12)c1nc2ccccc2[nH]1 [(2S)-1-[[(1R,2S)-2-methylcyclohexyl]amino]-1-oxidanylidene-propan-2-yl] (E)-3-phenylprop-2-enoate 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@@H]1CCCC[C@@H]1C [(2S)-1-[[(1S,2S)-2-methylcyclohexyl]amino]-1-oxidanylidene-propan-2-yl] (E)-3-phenylprop-2-enoate 19 C[C@H](OC(=O)/C=C/c1ccccc1)C(=O)N[C@H]1CCCC[C@@H]1C (3S)-N-cyclohexyl-1-(2,5-dimethylphenyl)sulfonyl-piperidine-3-carboxamide 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@H](C(=O)NC3CCCCC3)C2)c1 (3R)-N-cyclohexyl-1-(2,5-dimethylphenyl)sulfonyl-piperidine-3-carboxamide 20 Cc1ccc(C)c(S(=O)(=O)N2CCC[C@@H](C(=O)NC3CCCCC3)C2)c1 [(E)-1-cyano-2-(dimethylaminomethylideneamino)-2-methylsulfanyl-ethenyl]-triphenyl-phosphanium 25 CS/C(N=CN(C)C)=C(C#N)[P+](c1ccccc1)(c1ccccc1)c1ccccc1

You might have found the “sum(1 for atom in ....)” a bit odd. I agree with you. It is, however, the standard way in Python to count the number of elements in the iterator which match a given condition. I’ll break it down so you can understand how it works.

A list comprehension iterates through each element in an iterator (in the following it iterates over the characters in a string) and returns a list:

>>> [c for c in "Hello"]
['H', 'e', 'l', 'l', 'o']

Add an “if” to it to operate on only a subset of the characters:

>>> [c for c in "Hello" if c != "l"]
['H', 'e', 'o']

I could use len() of this to get the number of non-“l” characters, but that would require making a list only to throw it away. There’s another route to the same answer. To get there, use the value 1 for each character rather than the character itself:

>>> [1 for c in "Hello" if c != "l"]
[1, 1, 1]

Then use sum() to sum the values, which in this case is also the number of elements in the list:

>>> sum([1 for c in "Hello" if c != "l"])
3

Unlike len(), sum() only needs an iterator, not a list. I can replace the list comprehension with a generator comprehension, to get:

>>> sum(1 for c in "Hello" if c != "l")
3

Going back to the RDKit/OEChem expression:

num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)

I hope you can see how this counts the number of atoms in the molecule whose atomic number is 6. Or, if you want another way to think of it, the expression is the same as:

num_carbons = 0
for atom in mol.GetAtoms():
    if atom.GetAtomicNum() == 6:
        num_carbons += 1

Read from a string instead of a file

In this section you’ll learn how to read molecules from a string containing multiple SMILES records.

In the section Read molecules from an SD file or stdin you learned how to read molecules from a structure file or stdin. Sometimes the input structures come from a string. For example, if a web page has a form with a text box, where users can paste in a set of SMILES or SDF records and submit the form, then the web application on the server will likely receive those records as a single string.

When the records are in a string instead of a file, use chemfp.toolkit.read_molecules_from_string(). It’s very similar to chemfp.toolkit.read_molecules(), except that the first parameter, content, is the string instead of the source filename, and the second parameter, format, is required. (chemfp doesn’t try to auto-detect the format from the content.)

The following reads the records from a string containing two simple SMILES records and prints the number of non-implicit atoms for each one. I’ve included implementations for all three toolkits; use the one(s) that are available to you:

content = ("C methane 16.04246\n"
           "O=O water 31.9988\n")

from chemfp import rdkit_toolkit
for mol in rdkit_toolkit.read_molecules_from_string(content, "smi"):
  print("RDKit:", mol.GetNumAtoms())

from chemfp import openeye_toolkit
for mol in openeye_toolkit.read_molecules_from_string(content, "smi"):
  print("OEChem:", mol.NumAtoms())

from chemfp import openbabel_toolkit
for mol in openbabel_toolkit.read_molecules_from_string(content, "smi"):
  print("Open Babel:", mol.NumAtoms())

from chemfp import cdk_toolkit
for mol in cdk_toolkit.read_molecules_from_string(content, "smi"):
  print("CDK:", mol.getAtomCount())

When I run the above (on a computer where all four supported toolkits are installed), the above reports:

RDKit: 1
RDKit: 2
OEChem: 1
OEChem: 2
Open Babel: 1
Open Babel: 2
CDK: 1
CDK: 2

I would like to improve the output a bit to also include the record id in the output. The toolkit function chemfp.toolkit.read_ids_and_molecules_from_string() is similar to chemfp.toolkit.read_molecules_from_string() except that it iterates through the (id, toolkit molecule) tuple rather than just the molecule:

>>> from chemfp import rdkit_toolkit
>>> content = ("C methane 16.04246\n"
...            "O=O water 31.9988\n")
>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi"):
...     print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane 16.04246' 1
RDKit: 'water 31.9988' 2

You can see that the default SMILES reader assumes the rest of the line is the id. The file and string record readers take a reader_args parameter just like chemfp.toolkit.parse_id_and_molecule(). I’ll specify the “whitespace” delimiter so the parser uses only the second word as the id:

>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi",
...                     reader_args={"delimiter": "whitespace"}):
...     print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane' 1
RDKit: 'water' 2

See Specify a SMILES delimiter through reader_args for more details about setting the “delimiter” reader_args.

The string readers, like the file readers, also support the id_tag option to get the id from an SD tag instead of the title line. See Read ids and molecules using an SD tag for the id for more details about using the id_tag.

The reader may reuse molecule objects!

In this section you’ll learn that the OEChem and Open Babel toolkits reuse the same molecule object, which means you can’t save a molecule for later.

Suppose you want to read all of the molecules from a file into a list. It’s very tempting to write it as:

>>> import chemfp
>>> from chemfp import openeye_toolkit as T
>>> mols = list(T.read_molecules_from_string("C methane\nO water\n", "smi"))

This does not work for the openeye_toolkit or the openbabel_toolkit:

>>> mols
[<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10326ba40> >,
 <openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10326ba40> >]
>>> T.create_string(mols[0], "smistring")
''
>>> [T.create_string(mol, "smistring") for mol in mols]
['', '']

This is because the underlying reader for those two toolkits reuse the same molecule object. You can see that in the above, which returns the same OEGraphMol object (with id 0x10326ba40) for each record. The reason why OpenEye decided to reuse the object is to get better performance. Clearing the molecule object is faster than deleting it and reallocating a new one.

In addition, the OEChem reader code does a “clear molecule” followed by “read next record or stop”. At the end of the file there is no record, so the reader ends with a clear molecule. That explains why the OEGraphMol produces an empty SMILES string for the last couple of lines in the above code.

The only portable way to load a list of molecules is to use chemfp.toolkit.copy_molecule(), as in:

>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in T.read_molecules_from_string("C methane\nO water\n", "smi")]
>>> mols
[<openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10328a810> >,
 <openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x100c78320> >]
>>> T.create_string(mols[0], "smistring")
'C'
>>> T.create_string(mols[1], "smistring")
'O'

I don’t really like this solution because the RDKit reader doesn’t need a copy, so the extra copy is pure overhead.

Future versions of chemfp will likely have a reader_arg to specify if it’s okay to reuse a molecule object or if a new one must be used each time.

Write molecules to a SMILES file

In this section you will learn how to write toolkit molecules into a structure file. You will need Compound_099000001_099500000.sdf.gz from PubChem.

Chemfp can write toolkit molecules to a file in a given format. I’ll start by making an RDKit molecule, though the same API works with Open Babel and OEChem:

>>> from chemfp import rdkit_toolkit as T  # use your toolkit of choice
>>> mol = T.parse_molecule("c1ccccc1O phenol", "smi")

Use chemfp.toolkit.open_molecule_writer() to create a writer object. By default it will look at the output filename extension to figure out the format and compression type, and if that doesn’t work it defaults to SMILES output:

>>> writer = T.open_molecule_writer("example.smi")

The fingerprint writer has several methods to write a molecule to the file. If you write a molecule by itself it will use the molecule’s own id (in this case, “phenol”):

>>> writer.write_molecule(mol)

Or, use write_id_and_molecule() if you want to specify an alternate id:

>>> writer.write_id_and_molecule("something else", mol)

WARNING: The toolkit implementation may temporarily change the toolkit molecule’s own identifier in order to get the correct output. You should not alter the molecule’s id in another thread while calling this function.

Let’s see if it worked, by closing the writer (otherwise some of the output may be in an internal buffer) and reading the file:

>>> writer.close()
>>> print(open("example.smi").read())
Oc1ccccc1 phenol
Oc1ccccc1 something else

The write_molecules() method is optimized for passing in a list or iterator of molecule objects, and write_ids_and_molecules() is the equivalent if you have (id, molecule) pairs. For example, the following converts an SD file into a compressed SMILES file:

from chemfp import rdkit_toolkit as T # use your toolkit of choice
reader = T.read_molecules("Compound_099000001_099500000.sdf.gz")
writer = T.open_molecule_writer("example.smi.gz")
writer.write_molecules(reader)

# These are optional, but recommended. Even better would be
# to use the context manager described in the next section.
writer.close()
reader.close()

If you have a list (or iterator) of molecules, then use the write_molecules() method.

The open function also supports the format parameter, so you can specify “smi” or “sdf.gz” some other combination of structure format and compression type:

writer = T.open_molecule("wrong_extension.smi", format="sdf.gz")

If the zstandard package is available then use the .zst suffiz for ZStandard compression.

Reader and writer context managers

In this section you’ll learn how to use chemfp’s readers and writers to close the file, rather than depend on Python’s garbage collector or manual “close()”. You will need Compound_099000001_099500000.sdf.gz from PubChem.

In the previous section, Write molecules to a SMILES file, you learned how to convert an SD file into a SMILES file. At the end was a small program with optional “close()” statements. These are optional because Python’s garbage collector and chemfp work together. When a chemfp reader or writer is no longer needed, the garbage collector asks chemfp to clean up, and chemfp closes the native toolkit’s file object.

This is fine for a simple script or function, but sometimes you want more control over when the file is closed. You can call the writer’s close() method yourself, but it’s really easy to forget to do that.

Python supports “context managers”, which carry out certain actions when a block of code finishes. See PEP 343 if you want the full details. For chemfp you only need to know that the reader and writer context managers will always close the file at the end of the block.

A normal Python file context manager works like this:

>>> with open("example.txt", "w") as outfile:
...     outfile.write("I am here.\n")
...
>>> print(repr(open("example.txt").read()))
'I am here.\n'

If instead I use one file object to write the data and another to read the file, without a flush() or close() by the writer, then there’s a syncronization problem:

>>> outfile = open("example.txt", "w")
>>> outfile.write("I am here.\n")
>>> print(repr(open("example.txt").read()))
''

Why does this print the empty string? The output text is still in an internal buffer, which isn’t written to the disk until the close call:

>>> outfile.close()
>>> print(repr(open("example.txt").read())
'I am here.\n'

The same problem occurs with molecule output:

>>> from chemfp import rdkit_toolkit as T # can also use openbabel_toolkit
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> writer = T.open_molecule_writer("example.smi")
>>> writer.write_molecule(mol)
>>> open("example.smi").read()
''
>>> writer.close()
>>> open("example.smi").read()
'C=O carbon monoxide\n'

Note: this problem does not occur with the openeye_toolkit. Most likely that toolkit always flushes its output buffers after each molecule.

The chemfp readers and writers support a context manager, so you can use the same solution you would for regular files:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> with T.open_molecule_writer("example.smi") as writer:
...     writer.write_molecule(mol)
...
>>> open("example.smi").read()
'C=O carbon monoxide\n'

With the context manager concept firmly in mind, the following is the way I prefer to write the conversion script from the previous section:

from chemfp import rdkit_toolkit as T # use your toolkit of choice

with T.read_molecules("Compound_099000001_099500000.sdf.gz") as reader:
  with T.open_molecule_writer("example.smi.gz") as writer:
    writer.write_molecules(reader)

That said, if you really want to depend on the garbage collector, you can also write it with one (or two) fewer lines:

from chemfp import rdkit_toolkit as T # use your toolkit of choice
T.open_molecule_writer("example.smi.gz").write_molecules(
    T.read_molecules("Compound_099000001_099500000.sdf.gz"))

Write molecules to stdout in a specified format

In this section you’ll learn how to specify the structure writer’s output format, and to write to stdout instead of to a file.

The function chemfp.toolkit.open_molecule_writer() supports a format parameter, in case you don’t want chemfp to determine the output format and compression based on the filename extension.

For example, if the destination is None (instead of a filename) then chemfp will write the output to stdout. Since Python’s None object doesn’t have an extension, it will write the molecules as uncompressed SMILES. If you want to write to stdout in SDF format you will have to specify the output format, like the following:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("O=O molecular oxygen", "smi")
>>> with T.open_molecule_writer(None, "sdf") as writer:
...     writer.write_molecule(mol)
...
molecular oxygen
     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
M  END
$$$$
>>> with T.open_molecule_writer(None, "inchikey") as writer:
...   writer.write_molecule(mol)
...
MYMOFIZGZYHOMD-UHFFFAOYSA-N molecular oxygen

Write molecules to a string (and a bit of InChI)

In this section you’ll learn how to write toolkit molecules into memory, and when finished to get the result as a string.

The previous sections showed examples of writing molecules to a file or to stdout. Sometimes you want to save the records as a string; perhaps to send a response for a web request or display the contents in a text pane of a GUI. The function chemfp.toolkit.open_molecule_writer_to_string() creates a MoleculeStringWriter which stores the output records into memory. Once the writer is closed, the memory contents can be retrieved as a string with MoleculeStringWriter.getvalue().

For a bit of variation, the following example uses the “inchi” output format, and the openbabel_toolkit:

>>> from chemfp import openbabel_toolkit as T # use your toolkit of choice
>>> alanine = T.parse_molecule("O=C(O)[C@@H](N)C alanine", "smi")
>>> glycine = T.parse_molecule("C(C(=O)O)N glycine", "smi")
>>> writer = T.open_molecule_writer_to_string("inchi")
>>> writer.write_molecules([alanine, glycine])
>>> writer.close()
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 alanine
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5) glycine

You should know that there’s no well-defined “inchi” file format, only an InChI string. I decided to follow Open Babel’s lead and say that the “inchi” format has one record per line, where each line contains the InChI string followed by a delimiter, followed by the id (if available) on the rest of the line.

The InChI output writer_args supports an “include_id” parameter. The default, True, includes the id, while the following example sets it to False to have only the InChI string on the line:

>>> with T.open_molecule_writer_to_string("inchi",
...             writer_args={"include_id": False}) as writer:
...   writer.write_molecule(alanine)
...   writer.write_molecule(glycine)
...
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)

I also used the context manager so the code would be a bit shorter and, I think, clearer. It’s up to you to decide if write_molecules() with a 2-element list is clear than two write_molecule() lines.

Handling errors when reading molecules from a string

In this section you’ll learn how to ignore errors and improve error reporting when reading from a string, rather then accept the default of raising an exception and stopping. The examples will use a string containing SMILES records, but the same principles apply to any format.

If you’ve used the chemfp readers on real-world data sets you might have noticed that the RDKit and Open Babel ones sometimes raise an exception, saying that a given record could not be parsed. I’ll demonstrate with a string containing four SMILES records:

>>> content = ("C methane\n" +
...            "CN(C)(C)(C)C pentavalent nitrogen\n" +
...            "Q Q-ane\n" +
...            "[U] uranium\n")
>>>

RDKit doesn’t like the pentavalent nitrogen, and chemfp’s rdkit_toolkit stops processing at that record:

>>> from chemfp import rdkit_toolkit
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
[16:11:12] Explicit valence for atom # 1 N, 5, is greater than permitted
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "chemfp/_rdkit_toolkit.py", line 342, in _iter_read_smiles_ids_and_molecules
    error_handler.error("RDKit cannot parse the SMILES %s" % (_compat.myrepr(smiles),), location)
  File "chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: RDKit cannot parse the SMILES 'CN(C)(C)(C)C',
file '<string>', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'

Open Babel doesn’t care about the too-high valence on the nitrogen, but doesn’t like the non-SMILES in the third record:

>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
pentavalent nitrogen
==============================
*** Open Babel Error  in ParseSimple
  SMILES string contains a character 'Q' which is invalid
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "chemfp/_openbabel_toolkit.py", line 927, in _iter_column_records
    error_handler.error("Open Babel cannot parse the %s %s"
  File "chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q',
file '<string>', line 3, record #3: first line is 'Q Q-ane'

Neither does CDK:

>>> from chemfp import cdk_toolkit
>>> with cdk_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
...   for id, mol in reader:
...     print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Error while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException: could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/dalke/cvses/cfp-3x/docs/chemfp/_cdk_toolkit.py", line 498, in _iter_read_smiles_ids_and_molecules_cdk
    error_handler.error("CDK cannot parse a SMILES record", location)
  File "/Users/dalke/cvses/cfp-3x/docs/chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: CDK cannot parse a SMILES record, file '<string>', line 3, record #3: first line is 'Q Q-ane'

To round things out, OEChem accepts pentavalent nitrogen and skips the bad SMILES at a lower level than what chemfp uses, so there’s no exception:

>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
pentavalent nitrogen
Warning: Problem parsing SMILES:
Warning: Q Q-ane
Warning: ^

Warning: Error reading molecule "" in Canonical stereo SMILES format.
uranium

I’ll emphasize that point. The openeye_toolkit uses OEChem’s high-level reader, which provides no information about if OEChem skipped a record with a failure. Chemfp therefore cannot provide more information about the failures, whether as an exception or an improved error message.

I’m certain that nearly everyone wants the reader to ignore the few records that can’t be parsed by the underlying toolkit. The readers and writers support the errors option. The default value of “strict” tells chemfp to raise an exception when it detects a parse failure, and “ignore” tells it to ignore the error and go on to the next record:

>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
...               content, "smi", errors="ignore") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted
[16:13:45] SMILES Parse Error: syntax error for input: 'Q'
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
...               content, "smi", errors="ignore") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
pentavalent nitrogen
uranium
>>> with cdk_toolkit.read_ids_and_molecules_from_string
...               content, "smi", errors="ignore") as reader:
...   for id, mol in reader:
...     print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Error
while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException:
could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
uranium

The “strict” default comes from my long-held belief that it’s better to be strict first, and detect problems early, than to let them intrude. My resolve is weakening, because it’s been rare to find that I can make use of that information. The biggest counter-example is when I specify one format but the file is actually in another format, in which case the reader skips a lot of garbage. For example, a SMILES reader, pointed to a SD file or a compressed SMILES file, will try hard to make sense of the data and end up ignoring almost everything. I haven’t decided if I will change the default policy.

I’ve also found that the toolkits aren’t that helpful at identifying which record failed. Take a look at the RDKit warning:

[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted

It says that I did this in the late afternoon, and the reason for the failure, but says very little about the record with the problem.

To help improve this, and to send still more garbage, err, I mean helpful messages to stderr, chemfp supports a “report” errors value. It’s the same as “ignore” except that it also displays more details about the failure location:

>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
...               content, "smi", errors="report") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
[16:14:52] Explicit valence for atom # 1 N, 5, is greater than permitted
ERROR: RDKit cannot parse the SMILES 'CN(C)(C)(C)C', file '<string>', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'. Skipping.
[16:14:52] SMILES Parse Error: syntax error for input: 'Q'
ERROR: RDKit cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
...               content, "smi", errors="report") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
pentavalent nitrogen
ERROR: Open Babel cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
>>> with cdk_toolkit.read_ids_and_molecules_from_string(
...               content, "smi", errors="report") as reader:
...     for id, mol in reader:
...         print(id)
...
methane
pentavalent nitrogen
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR:
Error while reading the SMILES from: Q Q-ane, org.openscience.cdk.exception.InvalidSmilesException:
could not parse 'Q Q-ane', unexpected character:
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: Q Q-ane
org.openscience.cdk.io.iterator.IteratingSMILESReader ERROR: ^
ERROR: CDK cannot parse a SMILES record, file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium

The quality of the error message depends on the toolkit and the format. The best messages are for the Open Babel and RDKit SMILES readers and InChI readers, because I decided to have chemfp identify the records for those formats itself, instead of using the underlying toolkits to read the file. Chemfp still uses the underlying toolkit to convert the individual record into a native toolkit molecule.

I did this because I found the the SMILES and InChI reader performance was the same, and by writing my own parsers I had the ability to report line numbers and improve the error messages.

The examples so far used the read_ids_and_molecules_from_string function. The read_molecules_from_string function also supports the errors option, with the same meaning.

>>> sizes = []
>>> with openbabel_toolkit.read_molecules_from_string(
...     content, "smi", errors="report") as reader:
...   for mol in reader:
...     sizes.append(mol.NumAtoms())
...
ERROR: Open Babel cannot parse the SMILES 'Q', file '<string>', line 3, record #3: first line is 'Q Q-ane'. Skipping.
>>> sizes
[1, 6, 1]

Handling errors when reading molecules from a file

In this section you’ll learn how to how to ignore errors and improve error reporting when reading from SD file, rather then accept the default of raising an exception and stopping. The examples will use an SD file, but the same principles apply to any format.

In the previous section you learned that when the readers encounter a error, the default behavior is to raise a Python exception and how to use the error parameter to ignore those errors or to provide a more detailed error report.

The file-based readers, chemfp.toolkit.read_molecules() and chemfp.toolkit.read_ids_and_molecules(), can be configured the same way, that is:

# When there is an error, raise an exception and stop (this is the default)
T.read_molecules(filename)
T.read_molecules(filename, errors="strict")
T.read_ids_and_molecules(filename)
T.read_ids_and_molecules(filename, errors="strict")

# When there is an error, go on to the next record
T.read_molecules(filename, errors="ignore")
T.read_ids_and_molecules(filename, errors="ignore")

# When there is an error, print an error message to stderr then
# go on to the next record
T.read_molecules(filename, errors="report")
T.read_ids_and_molecules(filename, errors="report")

To show it in action, I’ll construct an SD file with three records. The first will contain a trivalent oxygen, the second a corrupt record, and the third will be atomic nitrogen. I’ll use OEChem to help me make the file.

from chemfp import openeye_toolkit as T
mol1 = T.parse_molecule("O#C trivalent", "smi") # RDKit won't like this
mol2 = T.parse_molecule("[U] Q-record", "smi")  # I'll corrupt this record
mol3 = T.parse_molecule("[N] nitrogen", "smi")  # This one is fine
with T.open_molecule_writer_to_string("sdf") as writer:
  writer.write_molecules([mol1, mol2, mol3])
content = writer.getvalue()
# replace the "U" with the nonsense "Qq"
content = content.replace("U ", "Qq")
# Save
open("bad_data.sdf", "w").write(content)

Here’s what the output file bad_data.sdf it looks like, so you can copy&paste if you wish:

trivalent
  -OEChem-04251716112D

  2  1  0     0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  3  0  0  0  0
M  END
$$$$
Q-record
  -OEChem-04251716112D

  1  0  0     0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 Qq  0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$
nitrogen
  -OEChem-04251716112D

  1  0  0     0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 N   0  0  0  0  0 15  0  0  0  0  0  0
M  END
$$$$

I’ll try to read that file using the native RDKit reader, which skips records it can’t parse:

>>> from rdkit import Chem
>>> reader = Chem.ForwardSDMolSupplier("bad_data.sdf")
>>> ids = [mol.GetProp("_Name") for mol in reader if mol is not None]
[16:40:57] Explicit valence for atom # 0 O, 3, is greater than permitted
[16:40:57] ERROR: Could not sanitize molecule ending on line 8
[16:40:57] ERROR: Explicit valence for atom # 0 O, 3, is greater than permitted
[16:40:57]

****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2020_03_1/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****

[16:40:57] ERROR: Element 'Qq' not found
[16:40:57] ERROR: moving to the beginning of the next molecule
>>> ids
['nitrogen']

As expected, RDKit could only extract one record of the three. It helpfully points out the line number of the records it couldn’t parse (lines 8 and 14)

Now I’ll do the same using chemfp’s rdkit_toolkit interface and the default error handler, which is strict:

>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("bad_data.sdf"):
...   ids.append(id)
...
[16:41:47] Explicit valence for atom # 0 O, 3, is greater than permitted
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dalke/cvses/cfp-3x/docs/chemfp/_rdkit_toolkit.py", line 1340, in _iter_read_sdf_structures
    error_handler.error("Could not parse molecule block", location)
  File "/Users/dalke/cvses/cfp-3x/docs/chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'

It stops at the first error and raise an exception. The exception contains some information about the error location, including the filename, line number, record number, and the contents of the first line of the file.

How does chemfp get that information? Under the covers chemfp uses its own parser, from the text_toolkit to read each record, then passes that record to RDKit to turn the record into a molecule. This gives chemfp a bit more control over error reporting. Originally this was also faster than using RDKit’s own ForwardSDMolSupplier, but now chemfp is about 10% slower. A future implementation may offer a run-time choice of which implementation to use, in case you want better performance at the expense of less detailed error information.

Pass in either “ignore” or “report” as the errors option if you want chemfp to skip records with an error keep on processing. I’ll use “report” to show what the error reporting looks like:

>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules(
...     "bad_data.sdf", errors="report"):
...   ids.append(id)
...
[16:42:23] Explicit valence for atom # 0 O, 3, is greater than permitted
ERROR: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'. Skipping.
[16:42:23]

****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 91 in file /Users/dalke/ftps/rdkit-Release_2020_03_1/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****

[16:42:23] Element 'Qq' not found
ERROR: Could not parse molecule block, file 'bad_data.sdf', line 10,
record #2: first line is 'Q-record'. Skipping.

RDKit’s own error messages from ForwardSDMolSupplier, like “Unexpected error hit on line 14” / “moving to the begining of the next molecule”, have disappeared, because chemfp handles record extraction. The sanitization error message about explicit valence remains because RDKit still does that work.

Note also that under Python 2.7 chemfp returns a Unicode string for the id, rather than the byte string that the native RDKit API returns.

That was RDKit. What about Open Babel?

>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules(
...     "bad_data.sdf", "sdf", errors="strict") as reader:
...   for id, mol in reader:
...     print("Read", repr(id), "first atom:", mol.GetAtom(1).GetAtomicNum())
...
Read 'trivalent' first atom: 8
Read 'Q-record' first atom: 0
Read 'nitrogen' first atom: 7

Open Babel reads all three records even in strict mode. Interestingly, Open Babel turns the ‘Qq’ atom into a “*” atom, with atomic number 0. To double check, I’ll read the list of molecules, then write them all out as SMILES:

>>> with openbabel_toolkit.read_molecules("bad_data.sdf") as reader:
...   mols = [openbabel_toolkit.copy_molecule(mol) for mol in reader]
...
>>> len(mols)
3
>>> with openbabel_toolkit.open_molecule_writer(None, "smi") as writer:
...   writer.write_molecules(mols)
...
C#[O] trivalent
* Q-record
[N] nitrogen

OEChem also parses that “Qq” record as an atom with atomic number of 0, and it also doesn’t give me a warning message:

>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules(
...     "bad_data.sdf", errors="strict") as reader:
...   for id, mol in reader:
...     print("Read", repr(id), [a.GetAtomicNum() for a in mol.GetAtoms()])
...
Read 'trivalent' [8, 6]
Read 'Q-record' [0]
Read 'nitrogen' [7]

I totally didn’t expect the toolkits to parse an unknown atom type like “Qq”!

In any case, OEChem will skip records which it could not parse, and there’s no easy way for chemfp to get that information, so in practice the “strict” and “report” options are meaningless.

Finally, what about CDK?

>>> from chemfp import cdk_toolkit
>>> with cdk_toolkit.read_ids_and_molecules(
...   "bad_data.sdf", "sdf", errors="strict") as reader:
...   for id, mol in reader:
...     print("Read", repr(id), "first atom:", list(mol.atoms())[0].getAtomicNumber())
...
Read 'trivalent' first atom: 8
Read 'Q-record' first atom: 0
Read 'nitrogen' first atom: 7

It accepts the “Qq” atom type! That’s because CDK default to a RELAXED parser mode. I’ll change that reader_args to STRICT:

>>> with cdk_toolkit.read_ids_and_molecules(
...   "bad_data.sdf", "sdf", errors="strict", reader_args={"mode": "STRICT"}) as reader:
...   for id, mol in reader:
...     print("Read", repr(id), "first atom:", list(mol.atoms())[0].getAtomicNumber())
...
Read 'trivalent' first atom: 8
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing
line 5:     0.0000    0.0000    0.0000 Qq  0  0  0  0  0  0  0  0  0  0  0  0 -> invalid symbol: Qq
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading next molecule: invalid symbol: Qq

Ignore errors in create_string() and create_bytes()

In this section you’ll learn how to ignore errors when converting a molecule into a string or byte record.

Some molecules cannot be represented in some formats. The easiest example is the molecule from the SMILES “*”, which contains a single atom with the atomic number 0 and cannot be represented in InChI:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> T.create_string(mol, "smistring")
'[*]'
>>> T.create_string(mol, "inchistring")
[16:47:59] ERROR: Unknown element '*'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/rdkit_toolkit.py", line 418, in create_string
    return _toolkit.create_string(mol, format, id, writer_args, errors)
  File "chemfp/base_toolkit.py", line 1389, in create_string
    return self._create_string_impl(format_config, mol, id, writer_args, error_handler)
  File "chemfp/base_toolkit.py", line 1392, in _create_string_impl
    return format_config.create_string(mol, id, writer_args, error_handler)
  File "chemfp/_rdkit_toolkit.py", line 1709, in create_string
    error_handler.error("RDKit cannot create the InChI string")
  File "chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: RDKit cannot create the InChI string

By default the chemfp.toolkit.create_string() and chemfp.toolkit.create_bytes() functions will raise an exception if the molecule cannot be converted into the given record format. Use the errors parameter to specify that behavior. Just like with file reading, the default value is “strict”, “ignore” will return None if there was an error, and “report” will return None and also print some information about the failure to stderr.

The following uses “ignore”:

>>> import chemfp
>>> for toolkit in ("openbabel", "rdkit", "openeye", "cdk"):
...   T = chemfp.get_toolkit(toolkit)
...   mol = T.parse_molecule("*", "smistring")
...   result = T.create_string(mol, "inchistring", errors="ignore")
...   print(toolkit, "returned", repr(result))
...
==============================
*** Open Babel Warning  in InChI code
  #0 :Unknown element(s): *
==============================
*** Open Babel Error  in InChI code
  InChI generation failed
openbabel returned None
[16:49:04] ERROR: Unknown element '*'
rdkit returned None
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
openeye returned None
cdk returned None

The following uses “report”. You can see the only addition is the new line ‘ERROR: Open Babel cannot create the InChI string. Skipping.’ For a bit of variation, I also changed things to use create_bytes instead of create_string:

>>> from chemfp import openbabel_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> result = T.create_bytes(mol, "inchistring", errors="report")
==============================
*** Open Babel Warning  in InChI code
  #0 :Unknown element(s): *
==============================
*** Open Babel Error  in InChI code
  InChI generation failed
ERROR: Open Babel cannot create the InChI string. Skipping.
>>> result is None
True

I’ll do the same using the CDK toolkit:

>>> from chemfp import cdk_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> result = T.create_bytes(mol, "inchistring", errors="report")
ERROR: CDK cannot create the InChI string. Skipping.

Ignore errors when writing molecules

In this section you’ll learn how to ignore errors and improve error reporting when writing a file, rather than accept the default of raising an exception and stopping. You will need a copy of ChEBI_lite.sdf.gz.

It’s not unusal for there to be a few input records which cannot be parsed into a molecule. It’s much less common to come across a molecule which cannot be turned into a record. The SMILES and SD file formats are able to handle a wide range of chemistry. Even R-groups, which can’t directly be expressed as SMILES, can be represented in one of several conventions, like [*:1] for R1.

There are no such conventions for InChI. As you saw in the previous section, it’s easy to make a molecule to InChI converter fail if the structure contains a “*” atom.

The functions chemp.toolkit.open_molecule_writer(), chemp.toolkit.open_molecule_writer_to_string(), and chemp.toolkit.open_molecule_writer_to_bytes() return a molecule writer. This can be used to write a single molecule at a time, or to write molecule multiples from an iterator.

What happens if I try to convert the ChEBI file into an InChI file using OEChem?

>>> from chemfp import openeye_toolkit as T
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> writer = T.open_molecule_writer("chebi.inchi")
>>> writer.write_molecules(reader)
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/base_toolkit.py", line 283, in write_molecules
    _compat.raise_tb(err[0], err[1])
  File "<string>", line 1, in raise_tb
  File "chemfp/_openeye_toolkit.py", line 685, in _gen_write_inchi_structures
    error_handler.error(errmsg, location)
  File "chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: OEChem cannot create the InChI string, file 'chebi.inchi', record #3

The third record could not be converted to an InChI string, and the warning message that OEChem printed to the termal shows that the molecule contained a wildcard atom, that is, the “*” atom. But, did it really?

>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(3):
...   mol = next(reader)
...   print(T.create_string(mol, "smistring"))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[R1]

That shows “R1”, not an R-group. What’s going on? “R1” isn’t even a valid SMILES.

This is an OEChem extension to SMILES. The default output SMILES flavor includes the flag “RGroups”, which

[c]ontrols whether atoms with atomic number zero (as determined by the OEAtomBase::GetAtomicNum method), and a non-zero map index (as determined by the OEAtomBase::GetMapIdx method) should be displayed using the [R1] notation. In this notation, the integer value following the R corresponds to the atom’s map index. When this flag isn’t set, such atoms are written in the Daylight convention [*:1]. – OEChem documenation

I’ll redo the loop but this time disable the RGroup using the writer_args option to set the flavor to “Default,-RGroups”, that is, the default value but without RGroups being set:

>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(5):
...   mol = next(reader)
...   print(T.create_string(mol, "smistring",
...      writer_args={"flavor": "Default,-RGroups"}))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[*:1]
C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1C[C@H](C2=O)O)O
c1cc(c(c(c1)Cl)C#N)Cl

That indeed gives [*:1] which is the wildcard atom that InChI complains about.

The molecule writers support the same errors option as the molecule readers. The default value is “strict”, which means to raise an exception. To ignore errors, use “ignore”, and to ignore errors except to report a message to standard out, use “report”.

>>> from chemfp import openeye_toolkit as T # use your toolkit of choice
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID", errors="ignore")
>>> writer = T.open_molecule_writer("chebi.inchi", errors="report")
>>> writer.write_ids_and_molecules(reader)

The first few and last few lines of output are:

Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #3. Skipping.
Warning: Unsupported Sgroup information ignored
Warning: Unsupported Sgroup information ignored
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #13. Skipping.
Warning: Stereochemistry corrected on atom number 2 of
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #133. Skipping.
    ...
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #94443. Skipping.
Warning: Stereochemistry corrected on atom number 8 of
Warning: Stereochemistry corrected on atom number 13 of
Warning: Stereochemistry corrected on atom number 36 of

where the lines starting “ERROR: OEChem” come from chemfp, and the others come from OEChem at a lower-level. (Alas, the “report” isn’t as helpful as it should be. I would like it to include the output id in the error message, but all it gives is the record number. Perhaps it will be in the next release?)

All told, there were 107205 of which 98631 could be written out. I got these numbers from the writer’s location property (see Location information: filename, record_format, recno and output_recno, below). Its recno is the number of records sent to the writer, and output_recno is the number of records actually written:

>>> writer.location.recno
107205
>>> writer.location.output_recno
98631

Reader and writer format metadata

In this section you’ll learn about the format metadata attribute of the readers and writers. You will need Compound_099000001_099500000.sdf.gz from PubChem if you want to reproduce this for yourself.

Each reader and writer has a metadata attribute, which stores some information about the parameters used to open it:

>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_molecules("Compound_099000001_099500000.sdf.gz")
>>> reader.metadata
FormatMetadata(filename='Compound_099000001_099500000.sdf.gz',
record_format='sdf', args={'sanitize': True, 'removeHs': True, 'strictParsing': True, 'includeTags': True})
>>> writer = T.open_molecule_writer(None, "sdf")
>>> writer.metadata
FormatMetadata(filename='<stdout>', record_format='sdf',
args={'includeStereo': False, 'kekulize': True, 'v3k': False})

The metadata for a structure reader and writer is a chemfp.base_toolkit.FormatMetadata instances, and not the chemfp.Metadata for a fingerprint reader and writer.

The filename attribute is best effort at a string representation of the source or destination. It can either be the original filename (if there is one), the strings “<stdin>” or “<stdout>” for stdin/stout, the string “<string>” if reading or writing to memory, the source or destination’s “name” attribute if a file object, or None if all else fails.

The record_format attribute is the format name for the record, which is the same as the input file format except without any compression. As you can see in the above example, the “sdf.gz” reader has a record_format of “sdf”. This parameter is useful when you want use the text_toolkit to extract records because you pass the text reader’s record format as the format for the chemistry toolkit’s toolkit.parse_molecule().

The args attribute is the processed reader_args or writer_args, without any namespacing. For now it’s mostly available for debugging purposes, so you can see how the toolkit layer actually processed your arguments. In the future there will be a way to turn this into a text settings dictionary.

Location information: filename, record_format, recno and output_recno

In this section you’ll learn the basics of the chemfp.io.Location API, you’ll learn how to get the location object for each reader and writer, and you’ll learn about the recno and output_recno location attributes.

(See the next section for details about the lineno, offsets, record, and other location properties which are not available for every toolkit format.)

The reader and writers track information about the current state of the reader and writer. Some of this information is more generally useful, and available through the location attribute of each reader and writer:

>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_molecules_from_string(content, "smi")
>>> reader.location
Location('<string>')
>>> loc = reader.location
>>> loc.filename
'<string>'
>>> loc.record_format
'smi'

If there is no actual filename then filename is “<string>” for string-based I/O, “<stdin>” when reading from stdin, and “<stdout>” when writing to stdout. (The latter two occur when the source or destination parameter, respectively, are None.) The record_format is the record format name, without any compression suffix:

>>> writer = T.open_molecule_writer("example.sdf.gz")
>>> writer.location.filename
'example.sdf.gz'
>>> writer.location.record_format
'sdf'
>>> writer.close()

All of the toolkit readers and writers support the recno location property, which is the number of records which have been read or written. A recno of 0 means that no records have been read:

>>> reader = T.read_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> loc.recno
0
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x10fb06e50>
>>> loc.recno
1
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x10fb06ec0>
>>> loc.recno
2

While you could use the recno property for simple enumeration, as in the folllowing:

>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
...    loc = reader.location
...    for id, mol in reader:
...       print("record number:", loc.recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen

I would prefer that you write it with the “enumerate()” function, as in:

>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
...   for recno, (id, mol) in enumerate(reader, 1):
...     print("record number:", recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen

The enumerate() function is both faster and more expected for this sort of code. The recno property exists more to help with error reporting, and to report summary information, like:

>>> print("Read", reader.location.recno, "records")
Read 2 records

The output writers distinguish between recno, which is the number of molecules that chemfp tried to save, and output_recno, which is the number of molecules that could actually be saved. This occurs because some molecules cannot be written to a given format, like the SMILES “*” which has no InChI representation:

>>> from chemfp import openbabel_toolkit
>>> writer = openbabel_toolkit.open_molecule_writer("example.inchi")
>>> parse_molecule = openbabel_toolkit.parse_molecule
>>> writer.write_molecule(parse_molecule("c1ccccc1O", "smistring"))
>>> writer.location.recno
1
>>> writer.location.output_recno
1
>>> writer.write_molecule(parse_molecule("*", "smistring"))
==============================
*** Open Babel Warning  in InChI code
  #0 :Unknown element(s): *
==============================
*** Open Babel Error  in InChI code
  InChI generation failed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/base_toolkit.py", line 271, in write_molecule
    _compat.raise_tb(err[0], err[1])
  File "<string>", line 1, in raise_tb
  File "chemfp/_openbabel_toolkit.py", line 1348, in _gen_write_delimited_structures
    error_handler.error("Open Babel cannot create the %s string"
  File "chemfp/io.py", line 112, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Open Babel cannot create the InChI string, file 'example.inchi', record #2
>>> writer.location.recno
2
>>> writer.location.output_recno
1

Location information: record position and content

In this section you’ll learn how to get position information for each record and information about the content of each record. You will need the RDKit toolkit or Open Babel toolkit. (Unfortunately for me, OEChem doesn’t have a way to get this information, and my hybrid parser with improved error reporting proved to be much slower than OEChem’s native performance.) You will also need Compound_099000001_099500000.sdf.gz from PubChem.

(See the previous section for details about the filename, record_format, recno and output_recno location properties, which are available for every toolkit format.)

Sometimes you want to know where a record is located in a file. You might want to report that the unusable record started on line 12345 of a given file, or you might want to index a file to implement random access lookup.

The underlying toolkits do not implement this functionality. Instead, chemfp includes its own SMILES and SDF file readers. These know enough about the formats to extract a single record, then pass the record to the toolkit to turn into a molecule. This lets chemfp track the line number of the start of the record, its byte range, the text of the current record, and other details.

Timings show that the hybrid parser for the SMILES formats are no slower than the native RDKit and Open Babel readers, and that the hybrid SDF parser a bit slower than RDKit’s native parser (about 10%) and slower than Open Babel’s native parser. In all cases, OEChem’s native parsers leave chemfp in the dust.

As a consequence, the rdkit_toolkit and openbabel_toolkit SMILES readers track more detailed record information, but the openeye_toolkit one does not. (The text_toolkit of course always tracks that information.) Here is an example which works for rdkit_toolkit and openbabel_toolkit:

>>> from chemfp import openbabel_toolkit as T # or rdkit_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_ids_and_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> for id, mol in reader:
...     print("id:", repr(id), "lineno:", loc.lineno, "byte range:", loc.offsets)
...     print("   record content:", repr(loc.record))
...     print("   first line:", repr(loc.first_line))
...
id: 'methane' lineno: 1 byte range: (0, 10)
   record content: b'C methane\n'
   first line: 'C methane'
id: 'oxygen' lineno: 2 byte range: (10, 21)
   record content: b'O=O oxygen\n'
   first line: 'O=O oxygen'
>>> content[0:10]
'C methane\n'
>>> content[10:21]
'O=O oxygen\n'

(Note: if the input record is a Unicode string then it will be converted into a UTF-8 encoded byte string. The start and end positions are coordinates in the encoded byte string, not the text string.)

The location instance of the rdkit_toolkit SDF reader gives access to many details about the current parser state:

>>> from chemfp import rdkit_toolkit
>>> reader = rdkit_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz")
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x1104c9830>
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 6709)
>>> reader.location.first_line
'99000039'
>>> next(reader)
<rdkit.Chem.rdchem.Mol object at 0x1104c97c0>
>>> reader.location.lineno
223
>>> reader.location.offsets
(6709, 14560)
>>> reader.location.first_line
'99000230'

The openbabel_toolkit and openeye_toolkit implementations by default don’t track this level of detail, because their native readers are faster than when I can manage in a hybrid reader. Consequently, those values are None:

>>> from chemfp import openbabel_toolkit
>>> reader = openbabel_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz")ne
>>> next(reader)
<openbabel.openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x106eff4b0> >
>>> print(reader.location.lineno)
None
>>> print(reader.location.offsets)
None
>>> print(reader.location.first_line)
None

There is experimental support to use Open Babel in hybrid mode. The reader_args supports an “implementation” option. The default of None, or “openbabel”, tells chemfp to use Open Babel’s native parser, while specifying “chemfp” tells it to use chemfp’s own SDF record parser:

>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
>>> reader = openbabel_toolkit.read_molecules("Compound_099000001_099500000.sdf.gz",
...             reader_args={"implementation": "chemfp"})
>>> next(reader)
<openbabel.openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x106effd20> >
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 6709)
>>> reader.location.first_line
'99000039'

If user-defined selection of the back-end implementation works well, I may add similar support for the openeye_toolkit, for those who want the increased level of location detail despite the large performance impact.

The RDKit “sdf” reader always uses the hybrid. This is for historical reasons. The hybrid solution was once always faster than the native ForwardSDMolSupplier. That has since changed, and ForwardSDMolSupplier is about 10% faster. At some point I will add an ‘implementation’ option so you can switch between performance and improved error reporting.

Writing your own error handler (Experimental)

In this section you’ll learn how to write your own error handler. This is an advanced topic. Bear in mind that this is highly experimental and very likely to change. I hope you can provide feedback about how to improve it.

In earlier sections you learned that when the errors parameter is “strict”, the parser will raise an exception if there’s a problem with a record. When it’s “ignore”, the record parsers return None as the molecule, while the file and string readers skip the failing record. When it’s “report”, the result is the same as “ignore” except that extra information about the failure is written to stderr.

The errors parameter can also take an object which implements the “errors()” method as in the following:

import sys
class OopsHandler(object):
    def error(self, msg, location=None):
        if location is None:
            sys.stderr.write("Oops! %s. Skipping.\n" % (msg,))
        else:
            sys.stderr.write("Oops! %s, %s. Skipping.\n" % (msg, location.where()))

The msg is a string describing the error, and location contains the chemfp.io.Location for the given record. Here’s what it looks like in action:

>>> import sys
>>> from chemfp import rdkit_toolkit as T
>>> T.parse_molecule("Q", "smistring", errors=OopsHandler())
>>> T.parse_molecule("Q", "smistring", errors=OopsHandler())
[10:54:57] SMILES Parse Error: syntax error while parsing: Q
[10:54:57] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Oops! RDKit cannot parse the SMILES string 'Q'. Skipping.
>>> for mol in T.read_molecules_from_string("Q Q-ane\nC methane\n", "smi",
...               errors=OopsHandler()):
...   print("Processed", mol)
...
[10:55:21] SMILES Parse Error: syntax error while parsing: Q
[10:55:21] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
Oops! RDKit cannot parse the SMILES 'Q', file '<string>', line 1, record #1: first line is 'Q Q-ane'. Skipping.
Processed <rdkit.Chem.rdchem.Mol object at 0x109486670>

The location’s where() method tries to give useful information based on the location’s filename, line number, record number, and the first line of the record (up to the first 40 characters).

It’s easy to see how to modify this to send the errors to a logger, or save them up to display in a GUI.

For the hybrid parsers, which give access to the raw record, you can do more advanced processing, like extract the title lines of any SDF record which RDKit can’t handle. The following will make an SDF-formatted string containing three records, where the second record is a 5-valent nitrogren that RDKit can’t parse. It will then try to parse the string, and store the ids for records which couldn’t be parsed.

from rdkit import Chem
from chemfp import rdkit_toolkit

# Use RDKit to make an SD file which RDKit cannot parse.
methane = rdkit_toolkit.parse_molecule("C methane", "smi")
# Bypass normal sanitization so RDKit will read 5-valent nitrogens
pentavalent_n = rdkit_toolkit.parse_molecule("CN(C)(C)(C)C pentavalent N",
        "smi", reader_args={"sanitize": False})

Chem.SanitizeMol(pentavalent_n, Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION)
oxygen = rdkit_toolkit.parse_molecule("O=O oxygen", "smi")

# Use the three molecules to make an SD file as a string
with rdkit_toolkit.open_molecule_writer_to_string("sdf") as writer:
  writer.write_molecules([methane, pentavalent_n, oxygen])

sdf_content = writer.getvalue()

# User-defined error handler
class CaptureIds(object):
  def __init__(self):
    self.ids = []
  def error(self, msg, location):
    self.ids.append(location.first_line)

capture_ids = CaptureIds()

for mol in rdkit_toolkit.read_molecules_from_string(sdf_content, "sdf",
           errors=capture_ids):
    pass

print("Could not parse:", capture_ids.ids)

The content of sdf_content is:

methane
     RDKit

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$
pentavalent N
     RDKit

  6  5  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  2  4  1  0
  2  5  1  0
  2  6  1  0
M  END
$$$$
oxygen
     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
M  END
$$$$

and the output from the above is:

[10:59:59] Explicit valence for atom # 1 N, 5, is greater than permitted
Could not parse: ['pentavalent N']

The fingerprint type documentation includes another example of how to write an error handler.

A Babel-like structure format converter

In this section you’ll learn how to use the chemfp toolkit API to create a Babel-like structure file format converter. This section goes into more details of how to develop real-world software using chemfp.

Pat Walters and Matt Stahl started Babel in the 1990s as a command-line program to convert from one chemical structure format to another. This developed over the years, and after a major rewrite became the LGPL toolkit “OELib”, OpenEye’s first commercial chemistry toolkit. OpenEye’s next rewrite lead to OEChem, a proprietary chemistry toolkit. OELib was still available, and others continued to develop it. It became Open Babel, and structure file format conversion continues to be Open Babel’s forte.

A full Babel-like program includes features to add and remove hydrogens of different sorts, select or reject structures based on substructure or other features, add 2D or 3D coordinates, and more. You cannot use chemfp for that. All chemfp can do is read structure files into a given toolkit’s molecule object, and write molecule objects to a given format.

Even that basic ability is useful. I’ll explain how to write such a converter yourself. I’ll use as my example file the following, “example.smi”:

c1ccccc1O phenol
C methane
O=O molecular oxygen

Here’s a minimal conversion program to convert the above into “example.sdf”:

from chemfp import rdkit_toolkit as T  # use your toolkit of choice

reader = T.read_molecules("example.smi")
writer = T.open_molecule_writer("example.sdf")
writer.write_molecules(reader)

That code depends on Python’s garbage collection to close the output file handle. This is fine for a script, but a longer running program may want to have more explicit control over closing the file handle and use a context manager (see Reader and writer context managers):

from chemfp import rdkit_toolkit as T # use your toolkit of choice

with T.read_molecules("example.smi") as reader:
    with T.open_molecule_writer("example.sdf") as writer:
        writer.write_molecules(reader)

With that we have enough to build our first Babel program, which takes the input and output filenames on the command-line. I’ll call this program “cbabel.py”, for “chemfp babel”, and have it implement the command-line

usage: cbabel.py [-h] input_filename output_filename

I’ll use argparse from Python’s standard library to handle command-line argument processing. The “nargs=1” in the following says that the input_filename and output_filename must exist, and only one filename is allowed. Argparse will save those in a list of size 1, which is why I use [0] to get the actual string I’m interested in:

import argparse
from chemfp import rdkit_toolkit as T

parser = argparse.ArgumentParser(
    description = "A minimial chemical structure file converter"
    )
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

args = parser.parse_args()

with T.read_molecules(args.input_filename[0]) as reader:
    with T.open_molecule_writer(args.output_filename[0]) as writer:
        writer.write_molecules(reader)

I’ll convert the SMILES into canonical SMILES:

% python cbabel.py example.smi example.can
% cat example.can
Oc1ccccc1 phenol
C methane
O=O molecular oxygen

The only change is that the phenol went from c1ccccc1O to Oc1ccccc1.

I’ll add the ability to read from stdin and stdout. I’ll say that if the input filename is “-” then it will read from stdin, and if the output filename is “-” then it will write to stdout. (If you have a file named “-” then you’ll have to specify “./-” to read or write to it.):

import argparse
from chemfp import rdkit_toolkit as T

parser = argparse.ArgumentParser(
    description = "A minimial chemical structure file converter"
    )
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

args = parser.parse_args()

# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
    input_filename = None

output_filename = args.output_filename[0]
if output_filename == "-":
    output_filename = None

with T.read_molecules(input_filename) as reader:
    with T.open_molecule_writer(output_filename) as writer:
        writer.write_molecules(reader)

There’s a limitation with this! When the input or output format is None, chemfp can’t figure out the format based on the filename, so will assume that it’s a SMILES file. When I run the above I get SMILES output:

% python cbabel.py example.smi -
Oc1ccccc1 phenol
C methane
O=O molecular oxygen

But what if I want SDF output? I need a way to specify the input and output file formats on the command-line. I’ll use the -i and -o options to specify those:

import argparse
from chemfp import rdkit_toolkit as T

parser = argparse.ArgumentParser(
    description = "A minimial chemical structure file converter"
    )
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
                    help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
                    help="output format name", default=None)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

args = parser.parse_args()

# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
    input_filename = None

output_filename = args.output_filename[0]
if output_filename == "-":
    output_filename = None

with T.read_molecules(input_filename, args.input_format) as reader:
    with T.open_molecule_writer(output_filename, args.output_format) as writer:
        writer.write_molecules(reader)

Now I can specify that I want stdout to be in SDF format:

% python cbabel.py -o sdf example.smi - | head -5
phenol
     RDKit

  7  7  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0

In practice, required command-line arguments make life more difficult. For a simple program like this, required arguments are not a problem, but what if I want to add a command to list the available formats? That option doesn’t need an input or output filename, but argparse will enforce that requirement anyway.

There are a couple of ways to solve the problem. The correct one is to use an argparse Action but that’s complicated. An easier one for this case is to let “-” be the default input and output filename. That’s easily done by changing:

parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

to:

parser.add_argument("input_filename", nargs="?", default="-",
                    help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
                    help="output filename")

As a result I can add a new --help-formats argument:

parser.add_argument("--help-formats", action="store_true",
                    help="list the available file formats")

along with the handler for it, which prints information about the toolkit (its name and version string) and each of the formats. Some of the formats, like “smistring”, don’t have an I/O format (for that, use “smi”), so I need to filter those out. Also, some of the formats, like “inchikey”, are output only, and some of the toolkit have formats that they read but don’t write, so I give more details about those:

args = parser.parse_args()

if args.help_formats:
  print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
  for format in T.get_formats():
    if not format.supports_io: # skip formats like "smistring" and "inchistring"
      continue
    if not format.is_output_format:
      msg = " (input only)"
    elif not format.is_input_format:
      msg = " (output only)"
    else:
      msg = ""
    print(" %s%s" % (format.name, msg))
  raise SystemExit(0)

For my version of RDKit I get:

% python cbabel.py --help-formats
Available I/O formats for toolkit rdkit (RDKit/2020.03.1)
 smi
 can
 usm
 sdf
 fasta
 mol2 (input only)
 pdb
 xyz (output only)
 mae (input only)
 inchi
 inchikey (output only)

If I used openeye_toolkit instead of rdkit_toolkit I get:

Available I/O formats for toolkit openeye (OEChem/20191016)
 smi
 usm
 can
 sdf
 skc (input only)
 mol2
 mol2h
 sln (output only)
 mmod
 pdb
 xyz
 cdx
 mopac (output only)
 mf (output only)
 oeb
 inchi
 inchikey (output only)
 oez
 cif
 mmcif
 fasta
 csv
 json

The code so far requires RDKit, but chemfp supports OEChem, Open Babel, and CDK. Why not add the command-line argument --toolkit to specify an alternate toolkit?

I ‘ll tell argparse that there’s a new --toolkit argument, which defaults to “rdkit” and also allows “openeye” and “openbabel”:

parser.add_argument("--toolkit", metavar="NAME",
                    choices=("rdkit", "openeye", "openbabel", "cdk"),
                    help="toolkit name", default="rdkit")

I can no longer import the toolkit directly, which I did as:

from chemfp import rdkit_toolkit as T

because that line requires that RDKit be installed. Otherwise it will raise an ImportError exception. While that might be reasonable if the user wanted to use the rdkit toolkit, it’s not reasonable if the user wanted to use the Open Babel toolkit and didn’t care to know that RDKit isn’t available.

Instead of a direct import, I’ll use chemfp.get_toolkit() to get the named toolkit. It raises a ValueError with a useful error message if the toolkit isn’t available or is unknown. If that happens, I’ll exit, and use that message as the explanation:

import chemfp
     # ... skipped many lines
try:
    T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
    raise SystemExit("Cannot use toolkit %s: %s" % (
        args.toolkit, err))

After a bit of experimentation I found a small SMILES string which gives a different canonicalization for each of the supported toolkits, which I present as evidence that it really is using a different toolkit:

% echo "NCC(N)O example" | python cbabel.py --toolkit openbabel
NCC(O)N example
% echo "NCC(N)O example" | python cbabel.py --toolkit rdkit
NCC(N)O example
% echo "NCC(N)O example" | python cbabel.py --toolkit openeye
C(C(N)O)N example
% echo "NCC(N)O example" | python cbabel.py --toolkit cdk
C(C(N)O)N example

Here’s the final code, so you can see how everything works in context:

import argparse
import chemfp

parser = argparse.ArgumentParser(
    description = "A minimial chemical structure file converter"
    )
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
                    help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
                    help="output format name", default=None)
## parser.add_argument("input_filename", nargs=1, help="input filename")
## parser.add_argument("output_filename", nargs=1, help="output filename")
parser.add_argument("--toolkit", metavar="NAME",
                    choices=("rdkit", "openeye", "openbabel", "cdk"),
                    help="toolkit name", default="rdkit")
parser.add_argument("--help-formats", action="store_true",
                    help="list the available file formats")
parser.add_argument("input_filename", nargs="?", default="-",
                    help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
                    help="output filename")

args = parser.parse_args()

try:
    T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
    raise SystemExit("Cannot use toolkit %s: %s" % (
        args.toolkit, err))


if args.help_formats:
  print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
  for format in T.get_formats():
    if not format.supports_io: # skip formats like "smistring" and "inchistring"
      continue
    if not format.is_output_format:
      msg = " (input only)"
    elif not format.is_input_format:
      msg = " (output only)"
    else:
      msg = ""
    print(" %s%s" % (format.name, msg))
  raise SystemExit(0)

# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
    input_filename = None

output_filename = args.output_filename[0]
if output_filename == "-":
    output_filename = None

with T.read_molecules(input_filename, args.input_format) as reader:
    with T.open_molecule_writer(output_filename, args.output_format) as writer:
        writer.write_molecules(reader)

Amazing how the original four lines of code expands to 55. It would be even more if I added full error reporting instead of letting Python throw an exception on errors.

Speaking of errors, you may want to use hard-coded values of errors="ignore" or errors="report" to have the parser skip records that the toolkit doesn’t understand, or perhaps pass in that information as a command-line argument named --errors, with the possible choices of “strict”, “report”, or “ignore”.

You might also add the -R and -W options to set the reader args and writer args for the formats, but that’s more complicated than I wanted to describe in this context. See the next section for a description of how to do it.

argparse text settings to reader and writer args

In this section you’ll learn how to use argparse to handle reader args and writer args in the same style that chemfp does.

The previous section showed how to create a Babel-like structure format conversion program and how to use Python’s argparse library for command-line processing. That section was getting too long to describe how to support command-line configuration of the reader args and writer args. In this section I’ll start with a smaller version of the same code. This one requires an input filename and an output filename on the command-line, and lets the user specify the toolkit:

# I put this into a file named "convert.py"
import argparse
import chemfp

parser = argparse.ArgumentParser(
    description = "Experiment with -R and -W options"
    )
parser.add_argument("--toolkit", metavar="NAME",
                    choices=("rdkit", "openeye", "openbabel", "cdk"),
                    help="toolkit name", default="rdkit")
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

args = parser.parse_args()

T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]

with T.read_molecules(source) as reader:
  with T.open_molecule_writer(destination) as writer:
    writer.write_molecules(reader)

I’ll walk through the process of how to add support for the -R and -W options, to make it possible to say:

python convert.py example.smi example.can --toolkit rdkit -R delimiter=space -W allBondsExplicit=true

How to get from the command-line to reader and writer arguments

This requires a few conversions. I need to turn the command-line arguments into reader and writer text settings dictionaries, then convert the text settings into reader_args and and writer_args dictionaries, before finally passing the reader_args and writer_args to the molecule readers and writers. (See Convert text settings into reader and writer arguments for more details about converting text settings to reader and writer arguments.)

I’ll use argparse to place the -R and -W option values into separate lists of KEY=VALUE strings, and create a new function which splits them apart on the “=” to get a dictionary of text settings. Then I’ll use the Format object to convert the text settings into the correct reader_args and writer_args. The steps will look something like this:

>>> from chemfp import rdkit_toolkit as T
>>>
>>> format = T.get_format("smi")  # Specify the format and user-defined settings
>>> reader_settings = ["delimiter=space"]
>>> writer_settings = ["allBondsExplicit=true"]
>>>
>>> # The 'parse_text_settings' function doesn't yet exist. It will convert
... # the list of reader_settings into a dictionary of string values.
...
>>> reader_text_settings = parse_text_settings("-R", reader_settings)
>>> reader_text_settings
{'delimiter': 'space'}
>>>
>>> # Ask the format to turn the string values into string objects
...
>>> format.get_reader_args_from_text_settings(reader_text_settings)
{'delimiter': 'space'}
>>>
>>> # Do the same for the writer arguments
...
>>> writer_text_settings = parse_text_settings("-W", writer_settings)
>>> writer_text_settings
{'allBondsExplicit': 'true'}
>>> format.get_writer_args_from_text_settings(writer_text_settings)
{'allBondsExplicit': True}

For the actual code the input format may be different than the output format. By the way, if you look closely you’ll see how “allBondsExplicit” in the text settings has a value of “true”, and the string was converted to the Python object True to be a writer_arg.

To start! First, I need a way to read the list of -R and -W options. I’ll ask argparse to save them into a list, for later post-processing to get the right values:

parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
                    help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
                    help="specify a writer argument", default=[])

This will parse all of the -R terms, like “-R delimiter=space”, into the reader_settings list, and “-W allBondsExplicit=true” into the writer_settings list, such that:

args.reader_settings == ["delimiter=space"]
args.writer_settings == ["allBondsExplicit=true"]

For that matter, it will also support “-R abc”, and put it into the reader_settings list even though it doesn’t have a “=” in it. I also need to go through and figure out if any terms are incorrect, and report the problem. I’ll make a function for this, along with a parameter so any error message can report if a problem comes from the -R or -W command-line flag:

def parse_text_settings(flag, terms):
  text_settings = {}
  for term in terms:
    left, mid, right = term.partition("=")
    if mid != "=":
      parser.error("%s setting %r must be of the form KEY=VALUE" %
                   (flag, term))
    text_settings[left] = right
  return text_settings

reader_text_settings = parse_text_settings("-R", args.reader_settings)
writer_text_settings = parse_text_settings("-W", args.writer_settings)

This gives me two text settings dictionaries, where the keys and values are both strings. I’ll use the respective Format object to convert a text setting dictionary into the correct reader and writer arguments dictionary:

input_format = T.get_input_format_from_source(source)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)
output_format = T.get_output_format_from_destination(destination)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)

All that’s left is to pass the reader_args and writer_args to the reader and writer. Since I already have the input and output format objects, I’ll pass those in as well, rather than have them guess again based on the source and destination names:

with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
    with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
        writer.write_molecules(reader)

Converter with -R and -W support

Here’s how it looks when I put it all together:

# I put this into a file named "convert.py"
import argparse
import chemfp

parser = argparse.ArgumentParser(
    description = "Experiment with -R and -W options"
    )
parser.add_argument("--toolkit", metavar="NAME",
                    choices=("rdkit", "openeye", "openbabel", "cdk"),
                    help="toolkit name", default="rdkit")
parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
                    help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
                    help="specify a writer argument", default=[])
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")

def parse_text_settings(flag, terms):
  text_settings = {}
  for term in terms:
    left, mid, right = term.partition("=")
    if mid != "=":
      parser.error("%s setting %r must be of the form KEY=VALUE" %
                   (flag, term))
    text_settings[left] = right
  return text_settings


args = parser.parse_args()

T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]

input_format = T.get_input_format_from_source(source)
reader_text_settings = parse_text_settings("-R", args.reader_settings)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)

output_format = T.get_output_format_from_destination(destination)
writer_text_settings = parse_text_settings("-W", args.writer_settings)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)


with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
  with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
    writer.write_molecules(reader)

Let’s see it in action. I’ll ask RDKit to include all of the bonds in the output SMILES, including the aromatic bonds, and I’ll ask it to use the space character as the SMILES delimiter:

% python convert.py example.smi example_output.smi --toolkit rdkit -R delimiter=space -W allBondsExplicit=true
% cat example_output.smi
O-c1:c:c:c:c:c:1 phenol
C methane
O=O molecular

The lack of “oxygen” in “molecular oxygen” shows that the input SMILES reader used the “space” delimiter instead of the default “to-eol” delimiter, just as I requested.

The -R and -W settings can also be qualified. (See Qualified reader and writer parameters names.) I’ll have Open Babel and OEChem use different delimiter styles to get different results:

% python convert.py example.smi example_ob_output.smi --toolkit openbabel \
    -R "openbabel.*.delimiter=to-eol"  -R "openeye.*.delimiter=whitespace"
% cat example_ob_output.smi
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
%
% python convert.py example.smi example_oe_output.smi --toolkit openeye \
  -R "openbabel.*.delimiter=to-eol" -R "openeye.*.delimiter=whitespace"
% cat example_oe_output.smi
c1ccc(cc1)O phenol
C methane
O=O molecular

List the reader and writer arguments for the given formats

Finally, it’s difficult to remember all of the available settings for each input and output format. I’ll add a --list-args command-line option which shows the available options, and for each option show the current setting, along with an indicator if the current setting is the default value for that format or if the setting comes from the command-line option.

I need argparse to know about the new option:

parser.add_argument("--list-args", action="store_true",
                    help="list the available input and output options")

and for the rest I replace the last three lines of the earlier code with:

if args.help_args:
  # Make a helper function to display the arguments
  def report_args(format, msg, default_args, specified_args):
    print("%s %s:" % (format.name, msg))
    # Merge the arguments; command-line overrides defaults;
    all_args = default_args.copy()
    all_args.update(specified_args)
    for name, value in sorted(all_args.items()):
        # Was the name specified via -R/-W or is it a default?
        where = "from command-line" if name in specified_args else "default value"
        print("      %s: %r  (%s)" % (name, value, where))
  report_args(input_format, "reader arguments (-R)",
              input_format.get_default_reader_args(), reader_args)
  report_args(output_format, "writer arguments (-W)",
              output_format.get_default_writer_args(), writer_args)

else:
  with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
    with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
      writer.write_molecules(reader)

(See Get the default reader_args or writer_args for a format for more details on the default reader and writer arguments.)

With those changes, the output using the new --list-args is:

% python convert.py example.smi example_output.smi --toolkit rdkit \
?    -R delimiter=space -W allBondsExplicit=true --list-args
smi reader arguments (-R):
      delimiter: 'space'  (from command-line)
      has_header: False  (default value)
      sanitize: True  (default value)
smi writer arguments (-W):
      allBondsExplicit: True  (from command-line)
      allHsExplicit: False  (default value)
      canonical: True  (default value)
      cxsmiles: False  (default value)
      delimiter: None  (default value)
      isomericSmiles: True  (default value)
      kekuleSmiles: False  (default value)

Creating a specialized record parser

In this section you’ll learn how to make a specialized function to parse an record into a toolkit molecule. This function is somewhat faster than calling the more general purpose toolkit.parse_id_and_molecule() and might be used when you need to convert a lot of individual records into a molecule.

Sometimes you need to parse a lot of records which don’t come from a file. For example, substructure search is typically split into a screening stage based on substructure fingerprints, followed by the atom-by-atom substructure search. The screening stage returns identifiers and the substructure search takes molecules, so in between them is code to look up a record based on its id and convert the result to a molecule.

Assuming a database record API where “db[id]” returns the record for a given id, that lookup function might look like this:

def get_molecules(db, id_iter, toolkit, format, reader_args=None):
    for id in id_iter:
        record = db[id]
        mol = toolkit.parse_molecule(record, format, reader_args=reader_args)
        yield mol

(A more complex implementation should handle when the record id doesn’t exist, or can’t be converted into a molecule.)

This isn’t as fast as it could be. The toolkit.parse_molecule() function validates that the format and reader_args are correct and figures out the right parameters for the underlying toolkit code. It’s a waste of time to redo those checks for every single call.

The function also promises that the caller will get a new molecule each time. That promise isn’t needed for substructure screening. Timing tests with OEChem show that reusing the same molecule is faster than creating a new one. For example, this OEChem code:

mol = OEGraphMol()
for i in range(100000):
    OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")
    mol.Clear()

takes about 60% of the time as this code:

for i in range(100000):
    mol = OEGraphMol()
    OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")

(Bear in mind that this code isn’t doing aromaticity perception, which roughly halves the performance.)

The function toolkit.make_id_and_molecule_parser() returns a specialized function to parse records, based on the specified parameters:

>>> from chemfp import rdkit_toolkit as T
>>> parser = T.make_id_and_molecule_parser("smi")
>>> parser("c1ccccc1O phenol")
('phenol', <rdkit.Chem.rdchem.Mol object at 0x11254b7b0>)

For RDKit it’s about 10-15% faster to use the specialized function instead of the general purpose toolkit.parse_molecule():

>>> from chemfp import rdkit_toolkit as T
>>> import time
>>>
>>> smiles = "c1ccccc1Oc1ccccc1"
>>> if 1:
...     t1 = time.time()
...     for i in range(10000):
...         mol = T.parse_molecule(smiles, "smi")
...     print("Standard time:", time.time()-t1)
...
Standard time: 1.6667978763580322
>>> parser = T.make_id_and_molecule_parser("smi")
>>> if 1:
...     t1 = time.time()
...     for i in range(10000):
...         id, mol = parser(smiles)
...     print("Specialized time:", time.time()-t1)
...
Specialized time: 1.5279979705810547

The toolkit.make_id_and_molecule_parser() function parameters are almost identical to the ones in toolkit.parse_id_and_molecule(), and with the same meaning. The only difference is that make_id_and_molecule_parser does not support the record parameter. Instead, it returns a function which takes the record and returns the (id, toolkit molecule) pair:

>>> from chemfp import rdkit_toolkit
>>> parser = rdkit_toolkit.make_id_and_molecule_parser(
...     "smi", reader_args={"delimiter": "whitespace"}, errors="ignore")
>>> parser("c1ccccc1O methane 16.04246")
('methane', <rdkit.Chem.rdchem.Mol object at 0x11254bad0>)
>>> parser("Q q-ane")
[11:33:57] SMILES Parse Error: syntax error while parsing: Q
[11:33:57] SMILES Parse Error: Failed parsing SMILES 'Q' for input: 'Q'
('q-ane', None)

WARNING: The function that make_id_and_molecule_parser() returns may reuse the underlying molecule object. Calling the function again may change the molecule returned in previous call:

>>> from chemfp import openeye_toolkit
>>> parser = openeye_toolkit.make_id_and_molecule_parser("smi")
>>> id, mol = parser("C")
>>> mol.NumAtoms()
1
>>>
>>> parser("CCC")
(None, <openeye.oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x1151f5030> >)
3

RDKit doesn’t support molecule reuse so rdkit_toolkit returns a new molecule. Open Babel does support reuse and openbabel_toolkit will reuse the molecule. However, my tests using Open Babel show a barely detectable performance improvement if I reuse a molecule vs. creating a new one each time. Future versions of chemfp may change the default, and may add an implementation option to specify if a new molecule should be returned each time.

In multithreaded code you should create a new parser for each thread.

You might have noticed there is no “make_molecule_parser()”. While it would be useful, it takes time to develop, test, and document, and it wasn’t useful enough for this release. Let me know if you would like it in the future.

Molecule API: Get and set the molecule id

In this section you’ll learn how to get and set the molecule id for a toolkit molecule.

Sometimes you want to get or set toolkit molecule id. This should be pretty rare because the input routines all support a way to get the identifier in parallel with the molecule, and the output routines all support a way to specify an identifier.

One exception is if you read molecules from an SD file where you want to use one of the SD tag values as the identifier rather than the title line at the top of the record. This can occur with the ChEBI data set:

>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID")
>>> next(reader)
('CHEBI:90', <rdkit.Chem.rdchem.Mol object at 0x1152648f0>)
>>> id, mol = _
>>> id
'CHEBI:90'
>>> mol
<rdkit.Chem.rdchem.Mol object at 0x1152648f0>
>>> mol.GetProp("_Name")
''
>>> print(reader.location.record[:34])
b'\n  Marvin  01211310252D          \n'
>>> print(reader.location.record.decode("utf8")[:200])

  Marvin  01211310252D

 22 24  0  0  0  0            999 V2000
   -2.8644   -0.2905    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.8656   -1.1176    0.0000 C   0  0  0  0  0  0  0

Note: the location.record is a byte string. The decode("utf8") step is to make it easier to display under Python 3.

The above used the RDKit-specific way to get the special “_Name” property and show that it’s the empty string. The location object for the rdkit_toolkit SDF reader is able to show the raw text for the current record. In the above I used the location.record to show that the record indeed has no title line.

I might want to tie that id directly to the molecule. For example, a lot of people write code which assume that the molecule’s name or title is the identifier, because only ChEBI and a scant handful of other databases use an alternative convention. You can use chemfp to get the appropriate id, then set the correct molecular property.

If I know it’s an RDKit molecule then I could do:

mol.SetProp("_Name", id)

This is not portable. OEChem and Open Babel call this a “title”, and use the molecule’s “GetTitle()” and “SetTitle()” accession methods to get and set it. For those toolkits I would need to do:

mol.SetTitle(id)

As part of chemfp’s limited molecule API, each chemfp toolkit layer implements a portable helper function named “get_id()” to get the toolkit-appropriate “identifier”, and “set_id()” to set it. The following shows an example of converting the title of a SMILES record to upper-case, and generating the corresponding canonical SMILES:

>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel", "cdk"):
...   T = chemfp.get_toolkit(toolkit_name)
...   mol = T.parse_smi("c1ccccc1O phenol")
...   T.set_id(mol, T.get_id(mol).upper())
...   smiles = T.create_smi(mol)
...   print(toolkit_name, "->", repr(smiles))
...
rdkit -> 'Oc1ccccc1 PHENOL\n'
openeye -> 'c1ccc(cc1)O PHENOL\n'
openbabel -> 'Oc1ccccc1 PHENOL\n'
cdk -> 'C1=CC=C(C=C1)O PHENOL\n'

Please note that this could be written more succinctly by passing the id directly to the chemfp.toolkit.create_string() function, as:

>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel", "cdk"):
...   T = chemfp.get_toolkit(toolkit_name)
...   id, mol = T.parse_id_and_molecule("c1ccccc1O phenol", "smi")
...   smiles = T. create_string(mol, "smi", id=id.upper())
...   print(toolkit_name, "->", repr(smiles))
...
rdkit -> 'Oc1ccccc1 PHENOL\n'
openeye -> 'c1ccc(cc1)O PHENOL\n'
openbabel -> 'Oc1ccccc1 PHENOL\n'
cdk -> 'C1=CC=C(C=C1)O PHENOL\n'

Note: I may add support for an optional id_tag, as in:

T.get_id(mol, id_tag="ChEBI id")    # Currently not valid chemfp code!

If you think this would be useful, please let me know about your use case.

Finally, if you want the output record as a UTF-8 encoded byte string rather than a Unicode string then use chemfp.toolkit.create_bytes() instead of create_string().

Molecule API: Copy a molecule

In this section you’ll learn how to make a copy of a native toolkit molecule.

The chemfp file readers may clear and reuse the underlying toolkit molecule. This is a problem if you want to load all of the molecules from a data set into memory:

>>> from chemfp import openeye_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> mols = list(openeye_toolkit.read_molecules_from_string(content, "smi"))
>>> mols
[<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x109776d20> >,
<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x109776d20> >]
>>> mols[0] is mols[1]
True
>>> openeye_toolkit.create_string(mols[0], "smistring")
''
>>> openeye_toolkit.create_string(mols[1], "smistring")
''

You can see that the openeye_toolkit reader reuses the same OEGraphMol, and that the molecule is cleared at the end of parsing.

In the future there may be a reader_args parameter to tell the reader to make a new molecule for each term. Until that possible future happens, one work-around is to make a copy of the molecule using the respective chemfp toolkit’s toolkit.copy_molecule() function:

>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in openeye_toolkit.read_molecules_from_string(content, "smi")]
>>> mols
[<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10b31e930> >,
<oechem.OEGraphMol; proxy of <Swig Object of type 'OEGraphMolWrapper *' at 0x10b31e6f0> >]
>>> mols[0] is mols[1]
False
>>> T.create_string(mols[0], "smistring")
'C'
>>> T.create_string(mols[1], "smistring")
'O=O'

The various writers may also modify the molecule, for example, by temporarily changing the molecule id or by reperceiving aromaticity. If this is a problem then you can use the copy_molecule() as a way to work around it.

This is definitely a work-around solution because it’s currently impossible to know if a copy is needed or not. The fail-safe solution is to always copy, which will lead to extra copies and slower code when using the rdkit_toolkit. Other more complicated workarounds might be faster, but the real solution that I hope to implement in the future is to specify the requested behavior as a parameter.

Molecule API: Working with SD tags

In this section you’ll learn how to work with SD tag data.

Chemfp supports a limited cross-toolkit API for working with SD tags. You can get a value for a single tag, the list of all tags and values, and add (and potentially replace) a tag with a given name.

NOTE: This is not a general-purpose SD tag API.

The two main goals of the SD tag API are to get a tag’s value (most likely for use as an identifier) and to add a fingerprint or similarity search result to a molecule. This can be done with the toolkit’s toolkit.add_tag() and toolkit.get_tag() functions:

>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("O=O oxygen", "smi")
>>> T.add_tag(mol, "score", "0.9851")
>>> T.get_tag(mol, "score")
'0.9851'
>>> print(T.create_string(mol, "sdf"))
oxygen
     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
M  END
> <score>
0.9851

$$$$

If a given tag already exists then add_tag() may replace the existing value, or it may add a second tag with the same name. (Eg, rdkit_toolkit currently replaces an existing tag while openeye_toolkit creates a second entry.) Chemfp does no additional error checking, so please be careful about the use of “>” and newline characters in the tag value.

It is sometimes useful to get all of the tags and corresponding values. The toolkit’s toolkit.get_tag_pairs() function returns these as a list of 2-element tuples, where the first term is the tag name and the second is the value:

>>> T.add_tag(mol, "best_id", "ABC00000123")
>>> T.add_tag(mol, "text", "This continues\nacross multiple\nlines")
>>> T.get_tag_pairs(mol)
[('score', '0.9851'), ('best_id', 'ABC00000123'), ('text', 'This continues\nacross multiple\nlines')]
>>> print(T.create_string(mol, "sdf"))
oxygen
     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0
M  END
> <score>
0.9851

> <best_id>
ABC00000123

> <text>
This continues
across multiple
lines

$$$$

If there are multiple tags with the same name then get_tag() arbitrarily decides which value to return. The get_tag_pairs() function includes duplicates if the underlying toolkit supports it.

Add fingerprints to an SD file using a toolkit

In this section you’ll learn how to add a fingerprint as a tag to the structures in an SD file using a chemistry toolkit.

The FPS and FPB fingerprint file formats store the record id and the fingerprint, but not the original structure. The most common way to tie the structure to a fingerprint is to use an SD file, and store the fingerprint as one of the tag values. (Another is to create a SMILES file variant, also called a CSV file, with the fingerprint as a new column.)

The following will parse an SD file, and for each molecule it will compute the MACCS fingerprints and add the base64-encoded fingerprint to the molecule using the unimaginative tag name “FP”. It will save the results to the file named “example.sdf”, which is equally unimaginative:

import sys
import base64
import chemfp

# Portable code to convert a fingerprint to a string
# which the underlying toolkits will accept.
#
# b64encode returns a byte string, which is fine for
# all toolkits under Python 2.
if sys.version_info.major == 2:
  b64encode = base64.b64encode
else:
  # Under Python 3, RDKit and Open Babel accept a byte string.
  # OEChem does not. Always convert to Unicode.
  def b64encode(s):
    return base64.b64encode(s).decode("ascii")

# Select your toolkit of choice
fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
#fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")

T = fptype.toolkit

reader_args = {"rdkit.sdf.removeHs": False}
with T.read_molecules("Compound_099000001_099500000.sdf.gz",
                      reader_args=reader_args) as reader:
  with T.open_molecule_writer("example.sdf") as writer:
    for mol in reader:
      fp = fptype.compute_fingerprint(mol)
      T.add_tag(mol, "FP", b64encode(fp))
      writer.write_molecule(mol)

This is a very general purpose solution. It’s easy to change the fingerprint type, or switch the input to a SMILES file or other supported structure file format.

(Unfortunately, there is no general purpose base64 encoder which works across all toolkits and both Python 2 and Python 3. Hence the complicated code to do the right thing.)

What it doesn’t do is preserve all of the details of the input records. It converts the input record into a molecule, and back out to a new record, and the intermediate record doesn’t keep all of the details.

For example, if I use OpenEye-MACCS166 and compare the first record of the original with the first record of the transformed output then the diff comparison is:

2c2
<   -OEChem-04292009532D
---
>   -OEChem-06182011512D
221a222,224
> > <FP>
> AAAEAAAAMAABwEBOk+GQU9yga24b
>

This says that the second line changed, and three new lines were added at line 221

The second line contains a date stamp, so this isn’t a big change, and the three new lines are the ones I requested. This doesn’t look like much of a change, but that’s because OEChem was used to make the record in the first place. Open Babel and RDKit have their own set of differences from the OEChem output defaults. For example, RDKit will sort the SD tags alphabetically.

I wanted to compare the original OEChem-based PubChem record to the output record from RDKit. I commented/uncommented the fingerprint names to use RDKit instead of OEChem. When I did this originally (since fixed), I noticed that the atom and bond counts line changed.

The first problem I noticed, before I fixed it, is that the atom and bond counts line changed. The original record has 46 atoms and 49 bonds:

46 49  0     1  0  0  0  0  0999 V2000

while the RDKit output said there are only 28 atoms and 31 bonds:

28 31  0  0  1  0  0  0  0  0999 V2000

What happened is that RDKit by default will convert explicit hydrogens to implicit hydrogens as part of the input process, while OEChem does not.

I can disable that in RDKit using the removeHs reader_arg, which is in the code I showed earlier:

reader_args = {"rdkit.sdf.removeHs": False}

With removeHs disabled, the RDKit atom counts match the original atom counts. There are still a few differences in the molblock.

  • RDKit places a “0” in the obsolete 4th field of the counts line, while OEChem leaves it empty.
  • RDKit uses the CHG property block and does not include duplicate charge information in the atom line. The PubChem file only stores charge information in the atom line.
  • RDKit leaves the last three fields empty, while PubChem uses 0. These fields are respectively ‘obsolete’, used for “SSS queries only”, and used for “Reaction, Query”.

That aside, the MACCS fingerprints should be the same, right?

They are not. The RDKit (and Open Babel and CDK) MACCS keys implementations assume that all hydrogens are implicit. If there are explicit hydrogens then they will likely give a different fingerprint. If you run the above code using RDKit, with and without removeHs, you’ll see two different values for FP:

AAAEAAAAMAABxABOk+GwU9zhb24f   # RDKit, removeHs=False
AAAEAAAAMAABwABOk2GwUdzhZ24f   # RDKit, removeHs=True

I’m left with the unfortunate situation where I can’t preserve the explicit hydrogens without affecting the MACCS fingerprints. I think the right solution is to fix the SMARTS patterns that RDKit and others use (which is a goal of chemfp’s own RDMACCS fingerprints).

Another solution for this is to use the text_toolkit to preserve the input SDF record syntax, and combine it with a chemistry toolkit to get the molecule you want.