.. _toolkit_chapter:
====================
Toolkit API examples
====================
This chapter gives many examples of how to use the toolkit API. For an
overview of the toolkit API functions, see :mod:`chemfp.toolkit`. For
details about actual toolkit implementations, see
:mod:`chemfp.openeye_toolkit`, :mod:`chemfp.openbabel_toolkit`,
:mod:`chemfp.rdkit_toolkit`, and :mod:`chemfp.text_toolkit`.
Fingerprint search usually starts with a structure record, and not a
fingerprint. The functions
:func:`chemfp.read_molecule_fingerprints` and
:func:`chemfp.read_molecule_fingerprints_from_string` give a
quick way to read a file or string containing structure records as the
corresponding fingerprints.
Sometimes you want more control over the process. You might want to
generate multiple fingerprints for the same structure and not want to
reparse the structure record multiple times. Or you might want to
return the search results as extra fields to the query SDF record
instead of a simple list of values.
Chemfp uses a third-party chemistry toolkit to parse the records into
a molecule, or compute the fingerprint for a given molecule. It's not
hard to write your own Open Babel, OEChem/OEGraphSim, or RDKit code to
handle any of these or similar tasks. The problem comes in when you
want to mix multiple fingerprint types, like to compare the default
RDKit fingerprint to Open Babel's FP2 fingerprint. You end up writing
very different code for essentially the same fingerprint task.
There's an old saying in computer science; all problems can be solved
with another layer of indirection. The chemfp toolkit API follows that
tradition. It's a common API for molecule I/O which works across the
three supported toolkits. It's also my best effort at implementing a
next generation API.
Bear in mind that it is only an *I/O* API. Chemfp is a fingerprint
toolkit and it will not gain a common molecule API. For that, look
toward `Cinfony `_.
Get a chemfp toolkit
====================
In this section you'll learn how to load a "toolkit" -- a portable API
layer above the actual chemistry toolkit -- and how to check if a
toolkit is available and has a valid license.
Each toolkit I/O adapter is implemented as a chemfp submodule. If you
know the underlying chemistry toolkit is installed you can import the
adapter directly::
>>> from chemfp import openbabel_toolkit
>>> from chemfp import openeye_toolkit
>>> from chemfp import rdkit_toolkit
Use :func:`chemfp.get_toolkit_names` to get the available toolkit
names::
>>> chemfp.get_toolkit_names()
set(['openeye', 'rdkit', 'openbabel'])
This will try to import each module, which means it may take a second
or more depending on the shared library load time for your
system. (This overhead only occurs once.) The function returns a list
of the modules that could be loaded and have a valid license.
You can use :func:`chemfp.get_toolkit` to get the correct toolkit
module given a name; it raises a ValueError if the underlying toolkit
isn't installed or the toolkit name is unknown::
>>> chemfp.get_toolkit("rdkit")
>>> chemfp.get_toolkit("openeye")
>>> chemfp.get_toolkit("openbabel")
Existence isn't enough to know if you can use a toolkit. OEChem
requires a license. Each I/O adapter implements
:func:`chemfp.toolkit.is_licensed`. It returns True for Open Babel and
RDKit and the value of OEChemIsLicensed() for OEChem::
>>> from __future__ import print_function
>>> for name in chemfp.get_toolkit_names():
... T = chemfp.get_toolkit(name)
... print("Toolkit %r (%s) is licensed? %s" % (T.name, T.software, T.is_licensed()))
...
Toolkit 'openeye' (OEChem/20170208) is licensed? True
Toolkit 'rdkit' (RDKit/2016.09.3) is licensed? True
Toolkit 'openbabel' (OpenBabel/2.4.1) is licensed? True
(Thanks OpenEye for an no-cost developer license to their toolkit!)
There is currently no way to check if OEGraphSim is licensed; you'll
need to use native OpenEye code instead.
For fun I also showed the ``software`` attribute, which gives more
detailed information about the toolkit version in a standardized
format.
Finally, use :func:`chemfp.has_toolkit` to check if a toolkit is
available. In the following, I used one of my local testing
environments which has OEChem installed but not the other toolkits. (I
use `virtualenv `_ to create
and manage these environments; it's a very useful tool.)::
>>> chemfp.has_toolkit("openeye")
True
>>> chemfp.has_toolkit("openbabel")
False
>>> chemfp.has_toolkit("rdkit")
False
The other option is to catch the ValueError raised when trying to get
an unavailable toolkit::
>>> chemfp.get_toolkit("openeye")
>>> chemfp.get_toolkit("rdkit")
Traceback (most recent call last):
File "", line 1, in
File "chemfp/__init__.py", line 1823, in get_toolkit
raise ValueError("Unable to get toolkit %r: %s" % (toolkit_name, err))
ValueError: Unable to get toolkit 'rdkit': No module named rdkit
>>> chemfp.get_toolkit("cdk")
Traceback (most recent call last):
File "", line 1, in
File "chemfp/__init__.py", line 1845, in get_toolkit
raise ValueError("Toolkit %r is not supported" % (toolkit_name,))
ValueError: Toolkit 'cdk' is not supported
This is a bit more complicated to do, but does have the advantage of
giving access to an error message.
.. _about_smiles:
Parse and create SMILES
=======================
In this section you'll learn how to parse a SMILES into a molecule,
convert a molecule into a SMILES, and the difference between a SMILES
record and a SMILES string. You will need a chemistry toolkit for this
and most of the examples in this chapter.
The chemfp toolkit I/O API is the same across the different toolkits,
and examples with one will generally work with the other, except for
essential differences like the specific formats available, the
chemistry differences in how to interpret a record, the error
messages, and reader and writer arguments.
For most examples I'll use ``T`` as the toolkit module name, rather
than specify a specific toolkit. My examples will be based on RDKit,
but you can use any of the following, if available on your system::
# Choose one of these
from chemfp import openeye_toolkit as T
from chemfp import openbabel_toolkit as T
from chemfp import rdkit_toolkit as T
I'll parse the SMILES string for phenol as a toolkit molecule, then
convert the toolkit molecule into its canonical isomeric SMILES string
using :func:`chemfp.toolkit.create_string`::
>>> mol = T.parse_molecule("c1ccccc1O", "smistring")
>>> mol
>>> T.create_string(mol, "smistring")
'Oc1ccccc1'
The "smistring" format name means that the input is a SMILES
string. Chemfp follows the rule from the original SMILES paper that
the SMILES string ends at the first whitespace. The following is valid
across the chemfp toolkits API even if the underlying toolkit doesn't
accept the "junk" as part of a SMILES::
>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")
On the other hand, if you have a SMILES record, which is a SMILES
string followed by an id and possibly other fields, then use the "smi"
format name. That will parse the first characters as a SMILES string
and parse the rest of the input, up to the end of the line, as the
record id::
>>> mol = T.parse_molecule("c1ccccc1O junk", "smistring")
>>> T.get_id(mol)
>>> mol = T.parse_molecule("c1ccccc1O junk", "smi")
>>> T.get_id(mol)
'junk'
>>> mol = T.parse_molecule("c1ccccc1O flotsam and jetsam\nand more\n", "smi")
>>> T.get_id(mol)
'flotsam and jetsam'
I used the :func:`chemfp.toolkit.get_id` helper function. While chemfp
doesn't have a common molecule object, I found I do need a few
standard functions to manipulate toolkit molecules. Also,
:func:`.toolkit.parse_molecule` will only read the first record and ignore
trailing data, which is why the "and more" didn't affect anything.
Now that the molecule has an id, it's easy to see the difference
between the "smistring" and "smi" in the output string::
>>> T.create_string(mol, "smistring")
'Oc1ccccc1'
>>> T.create_string(mol, "smi")
'Oc1ccccc1 flotsam and jetsam\n'
Finally, you can pass an alternate *id* to the :func:`.toolkit.create_string`
function. One example of when this is useful is when your identifier
comes from one field of a database and the SMILES string from another,
and you want to combine the results to get an SDF record::
>>> T.create_string(mol, "smi", id="nothing to see here")
'Oc1ccccc1 nothing to see here\n'
WARNING: The toolkit implementation may temporarily change then
restore the toolkit molecule's own identifier in order to get the
correct output. This is not thread-safe.
Canonical, non-isomeric, and arbitrary SMILES
=============================================
In this section you'll learn the difference between the "smistring",
"canstring", and "usmstring" SMILES string formats and the "smi",
"can", and "usm" SMILES record formats. As with all examples which use
the generic ``T`` toolkit name, you'll need one of the supported
chemistry toolkits, and I'll use RDKit as my underlying toolkit.
The SMILES format supports many different ways to represent the same
molecule. "CO", "OC", "[OH][CH3]", and "C3.O3" are four different
SMILES strings for methanol. A canonicalization algorithm uses
additional rules to create a unique SMILES representation for a given
molecular graph. The different chemistry toolkit have different
canonicalization algorithms, so each toolkit will likely generate a
different canonical SMILES string for the same molecular graph.
There are multiple classes of canonical SMILES strings even in the
same toolkit. The original SMILES format did not handle isotopes,
chirality, or stereochemistry. The later extension to support these
was called "isomeric SMILES", to distinguish it from the original
SMILES.
Because of the history, when people asked a toolkit for "SMILES"
output they got non-isomeric non-canonical SMILES, while "canonical
SMILES" gave them "non-isomeric canonical". This caused subtle
usability errors. Many people, including people like me who should
have the experience to know better, expect canonical isomeric SMILES
by default. But for over 20 years nearly all of the toolkits followed
Daylight's lead in how they did things.
I learned about the problem when OEChem 2.0 broke with tradition and
fixed the mistake. It defined the default SMILES as canonical isomeric
SMILES. If you make the effort to ask for a canonical SMILES you get
canonical non-isomeric SMILES, and if you really want non-canonical,
non-isomeric SMILES you can ask for the "usm" format.
Year later I learned that that Open Babel did the right thing well
before OpenEye. Open Babel's "canonical" is isomeric SMILES, you must
specify the "i" option to not include isotopic or chiral markings, and
they don't even refer to "isomeric SMILES".
Chemfp follows OpenEye's naming convention. The "smistring" format
generates a canonical isomeric SMILES string, the "canstring" format
generates a canonical non-isomeric SMILES string, and the "usmstring"
format generates a non-canonical non-isomeric SMILES string::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> mol = T.parse_molecule("[235P].[238U]", "smistring")
>>> T.create_string(mol, "smistring")
u'[235P].[238U]'
>>> T.create_string(mol, "canstring")
u'[P].[U]'
>>> T.create_string(mol, "usmstring")
u'[P].[U]'
Here's evidence that the "usmstring" format is non-canonical::
>>> mol = T.parse_molecule("[238U].[235P]", "smistring")
>>> T.create_string(mol, "usmstring")
u'[U].[P]'
>>> T.create_string(mol, "smistring")
u'[235P].[238U]'
These conventions also apply when creating "smi", "can", and "usm"
strings::
>>> T.set_id(mol, "radioactive")
>>> T.create_string(mol, "smi")
u'[235P].[238U] radioactive\n'
>>> T.create_string(mol, "can")
u'[P].[U] radioactive\n'
>>> T.create_string(mol, "usm")
u'[U].[P] radioactive\n'
By the way, :func:`chemfp.toolkit.parse_molecule` doesn't distinguish
between "smi", "can" and "usm" as input SMILES records, nor between
"smistring", "canstring" and "usmstring". The format only makes a
difference for output. Later on you'll see how to specify *writer_args*
to have more fine-grained control over the output SMILES format. (See
:ref:`rdkit_reader_and_writer_args`,
:ref:`openeye_reader_and_writer_args`, and
:ref:`openbabel_reader_and_writer_args` for toolkit-specific
examples.)
Use *format* to create a record in SDF format
=============================================
In this section you'll learn how to convert a toolkit molecule into an
SDF record. This example will use the RDKit toolkit but the results
will be substantially the same for any of the three supported
chemistry toolkits.
To create an SDF record as a Unicode string, pass "sdf" as the *format* to
:func:`chemfp.toolkit.create_string`::
>>> from __future__ import print_function
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print(T.create_string(mol, "sdf"))
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
$$$$
Starting with chemfp 3.0, the ``create_string()`` function returns a
Unicode string, under both Python 2.7 and Python 3.5+::
>>> T.create_string(mol, "sdf")[:13]
u'\n RDKit '
In earlier versions of chemfp, ``create_string()`` returned a byte
string. This was the usual practice under Python 2.5 to 2.7. It was
fine for ASCII data, but caused problems with other characters, like
Greek letters in a compound name or a data item listing prices in with
the GBP or EUR symbol.
Python 3 makes a strong distinction between a byte string and a
Unicode string. Chemfp 3.x follows that lead by having
``create_string()`` return a Unicode string, and added the new
function :func:`chemfp.toolkit.create_bytes` to return a byte
string::
>>> T.create_bytes(mol, "sdf")[:13]
'\n RDKit '
Here I'll set the molecule's name to the lower-case Greek letter
'alpha', and show you the interactive output from Python 2.7::
>>> T.set_id(mol, u"\N{GREEK SMALL LETTER ALPHA}")
>>> T.create_string(mol, "sdf")[:13]
u'\u03b1\n RDKit '
>>> T.create_bytes(mol, "sdf")[:13]
'\xce\xb1\n RDKit'
>>> print(T.create_string(mol, "sdf")[:13])
α
RDKit
Here's the same output under Python 3.5::
>>> T.set_id(mol, u"\N{GREEK SMALL LETTER ALPHA}")
>>> T.create_string(mol, "sdf")[:13]
'α\n RDKit '
>>> T.create_bytes(mol, "sdf")[:13]
b'\xce\xb1\n RDKit'
>>> print(T.create_string(mol, "sdf")[:13])
α
RDKit
Use zlib record compression
===========================
In this section you'll learn about the "zlib" compression option for
single record parsers and byte string creation.
A record in SDF format can be large, but most of the content is
repetetive. Often it's better to store a zlib compressed record in a
database instead of the full record. When I use zlib to compress each
record of `Compound_027575001_027600000.sdf.gz
`_
I get a 4.5-fold compression. That is, the uncompressed records take
2704906 bytes, the individually compressed records take 594682
bytes, and the gzip compressed file takes 270419. (Gzip is nearly
twice as good as individually compressed records because it can
collect compression statistics across multiple records and build a
better prediction model.)
Chemfp supports a zlib compression option for the record-oriented
functions, though not the file-oriented functions. To enable it, add
".zlib" to the format string for
:func:`chemfp.toolkit.create_bytes`. Here you can see how adding that
suffix reduces the record size::
>>> from __future__ import print_function
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("CO", "smistring")
>>> print("uncompressed:", len(T.create_bytes(mol, "sdf")))
uncompressed: 228
>>> print("compressed:", len(T.create_bytes(mol, "sdf.zlib")))
compressed: 77
I'll complete a round-trip conversion by parsing the compressed SD
record to a molecule then converting it to a SMILES string::
>>> compressed = T.create_bytes(mol, "sdf.zlib")
>>> new_mol = T.parse_molecule(compressed, "sdf.zlib")
>>> T.create_string(new_mol, "smistring")
'CO'
The zlib option only works with ``create_bytes``; it does not work
with ``create_string`` because the latter only returns Unicode
strings, and it's possible for zlib to return something which isn't
valid Unicode. Here's what happens if you try to use it anyway::
>>> T.create_string(mol, "sdf.zlib")
Traceback (most recent call last):
File "", line 1, in
File "chemfp/rdkit_toolkit.py", line 428, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
File "chemfp/base_toolkit.py", line 1333, in create_string
raise ValueError("create_string() does not support compression. Use create_bytes()")
ValueError: create_string() does not support compression. Use create_bytes()
On the other hand, :func:`chemfp.toolkit.parse_molecule` takes both
Unicode strings and byte strings as input. It treats the byte strings
as being UTF-8 encoded.
Get a list of available formats and distinguish between input and output formats
================================================================================
In this section you'll learn how to get the list of available formats
for each object, and determine if a format can be used to get a
toolkit molecule from a string record, or convert a toolkit molecule
into a string record.
The toolkit's :func:`chemfp.toolkit.get_formats` function returns a
list of the available formats. On my computer RDKit supports 13
formats, OEChem 24, and Open Babel (showing off its heritage) supports
a whopping 189::
>>> from chemfp import rdkit_toolkit
>>> len(rdkit_toolkit.get_formats())
13
>>> rdkit_toolkit.get_formats()
[Format('rdkit/smi'), Format('rdkit/can'), Format('rdkit/usm'),
Format('rdkit/sdf'), Format('rdkit/smistring'),
Format('rdkit/canstring'), Format('rdkit/usmstring'),
Format('rdkit/molfile'), Format('rdkit/rdbinmol'),
Format('rdkit/inchi'), Format('rdkit/inchikey'),
Format('rdkit/inchistring'), Format('rdkit/inchikeystring')]
>>>
>>> from chemfp import openeye_toolkit
>>> len(openeye_toolkit.get_formats())
24
>>> openeye_toolkit.get_formats()
[Format('openeye/smi'), Format('openeye/usm'),
Format('openeye/can'), Format('openeye/sdf'),
Format('openeye/molfile'), Format('openeye/skc'),
Format('openeye/mol2'), Format('openeye/mol2h'),
Format('openeye/sln'), Format('openeye/mmod'),
Format('openeye/pdb'), Format('openeye/xyz'), Format('openeye/cdx'),
Format('openeye/mopac'), Format('openeye/mf'),
Format('openeye/oeb'), Format('openeye/inchi'),
Format('openeye/inchikey'), Format('openeye/smistring'),
Format('openeye/canstring'), Format('openeye/usmstring'),
Format('openeye/slnstring'), Format('openeye/inchistring'),
Format('openeye/inchikeystring')]
>>>
>>> from chemfp import openbabel_toolkit
>>> len(openbabel_toolkit.get_formats())
189
>>> openbabel_toolkit.get_formats()
[Format('openbabel/smi'), Format('openbabel/can'),
Format('openbabel/usm'), Format('openbabel/smistring'),
Format('openbabel/canstring'), Format('openbabel/usmstring'),
Format('openbabel/sdf'), Format('openbabel/inchi'),
Format('openbabel/inchikey'), Format('openbabel/inchistring'),
Format('openbabel/inchikeystring'), Format('openbabel/mp'),
Format('openbabel/gzmat'), Format('openbabel/txt'),
... many formats omitted ...
Format('openbabel/pdb')]
>>>
I'll use :func:`chemfp.toolkit.get_format`, which returns a
:class:`chemfp.base_toolkit.Format`, to get the "sdf" format for
OpenEye::
>>> sdf_format = openeye_toolkit.get_format("sdf")
>>> sdf_format.name
'sdf'
>>> sdf_format.toolkit_name
'openeye'
The "sdf" format can be used for both input and output in all toolkits::
>>> sdf_format.is_input_format, sdf_format.is_output_format
(True, True)
However, some formats are output only, like the InChIKey format
(assuming it's available for your toolkit)::
>>> inchi_fmt = openeye_toolkit.get_format("inchikey")
>>> inchi_fmt.is_input_format, inchi_fmt.is_output_format
(False, True)
On the other hand, some formats are input only, like Open Babel's
support for MOPAC's output format::
>>> mopout_fmt = openbabel_toolkit.get_format("mopout")
>>> mopout_fmt.is_input_format, mopout_fmt.is_output_format
(True, False)
Instead of asking for all available formats, you can ask for only the
input formats, or only the output formats, using
:class:`chemfp.toolkit.get_input_formats` or
:class:`chemfp.toolkit.get_output_formats`::
>>> from __future__ import print_function
>>> import chemfp
>>> for toolkit_name in ("openbabel", "openeye", "rdkit"):
... T = chemfp.get_toolkit(toolkit_name)
... print(toolkit_name, "has", len(T.get_input_formats()), "input formats")
... print(toolkit_name, "has", len(T.get_output_formats()), "output formats")
...
openbabel has 147 input formats
openbabel has 138 output formats
openeye has 16 input formats
openeye has 23 output formats
rdkit has 11 input formats
rdkit has 13 output formats
Determine the format for a given filename
=========================================
It's sometimes useful to know what format will be used for a given
filename. A filename can be used as a source for a reader or
destination for a writer, and a toolkit might understand a given
format when used as input but not as ouput, or vice-versa.
The function :func:`chemfp.toolkit.get_input_format_from_source`
returns a :class:`chemfp.base_toolkit.Format` for the given filename::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> T.get_input_format_from_source("abc.smi.gz")
Format('rdkit/smi.gz')
This is the same Format object you saw in the previous section. I
didn't mention the ``compression`` attribute in that discussion. It's
"gz" for gzip-ed files, and the empty string "" for uncompressed
files.
>>> fmt = T.get_input_format_from_source("abc.smi.gz")
>>> fmt.name
'smi'
>>> fmt.compression
'gz'
>>>
>>> fmt = T.get_input_format_from_source("abc.smi")
>>> fmt.name
'smi'
>>> fmt.compression
''
Asking for a supported format which isn't an input format raises a
ValueError exception::
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.get_input_format_from_source("example.inchikey")
Traceback (most recent call last):
File "", line 1, in
File "chemfp/openbabel_toolkit.py", line 143, in get_input_format_from_source
return _format_registry.get_input_format_from_source(source, format)
File "chemfp/base_toolkit.py", line 841, in get_input_format_from_source
format_config = self.get_input_format_config(register_name)
File "chemfp/base_toolkit.py", line 765, in get_input_format_config
% (self.external_name, register_name))
ValueError: Open Babel does not support 'inchikey' as an input format
even though "inchikey" is supported as an output format::
>>> openbabel_toolkit.get_output_format_from_destination("example.inchikey")
Format('openbabel/inchikey')
Yes, there's a different function to get the format name for a source
filename than for a destination filename. Maybe in the future I'll
support a generic ``get_format_from_filename()``; let me know if that
would be useful.
If you ask for a format which doesn't exist then the functions raises
a different ValueError exception::
>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist")
Traceback (most recent call last):
File "", line 1, in
.....
File "/Library/Python/2.7/site-packages/chemfp/base_toolkit.py", line 532, in get_format_config
% (self.external_name, register_name))
ValueError: Open Babel does not support the 'does-not-exist' format
I've found it useful to have a way to override the default guess. It's
amazing how many people use ".dat" for SMILES or SDF files, and ".txt"
files for SMILES. The format lookup functions support a second,
optional parameter, which is the format name to use.
>>> openbabel_toolkit.get_input_format_from_source("example.does-not-exist", "smi.gz")
Format('openbabel/smi.gz')
This exists so that code like::
if format is not None:
fmt = T.get_format(format)
else:
fmt = T.get_format_from_source(filename)
can be replaced with::
fmt = T.get_format_from_source(filename, format)
Working with a format object is useful when combined with format's
*reader_args* and *writer_arg* functions discussed in
:ref:`smiles_delimiter_reader_arg`
::
>>> fmt = openbabel_toolkit.get_input_format_from_source("input.smi")
>>> fmt.get_default_writer_args()
{'explicit_hydrogens': False, 'isomeric': True, 'delimiter': None, 'options': None, 'canonicalization': 'default'}
>>> fmt.get_writer_args_from_text_settings({
... "explicit_hydrogens": "true",
... "isomeric": "false",
... "delimiter": "tab"})
{'explicit_hydrogens': True, 'isomeric': False, 'delimiter': 'tab'}
.. _parse_id_and_molecule:
Parse the id and the molecule at the same time
==============================================
In this section you'll learn how to parse a structure record, as a
string, to extract both the identifier and the native molecule object.
Usually you want both the molecule and its id. You could parse the
molecule then use :func:`T.get_id(mol) ` to get
the id, but that's extra work, it leads to awkward looking code, and
is slower than having chemfp do the work for you when it parses the
molecule.
Instead, use :func:`chemfp.toolkit.parse_id_and_molecule`::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> T.parse_id_and_molecule("C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a", "smi")
(u'vitamin a', )
Note that the identifier is a Unicode string. This is new with chemfp
3.0. Earlier versions returned byte string instead.
If there is no id/title field then the id will either be None or the
empty string, depending on the toolkit and format::
>>> T.parse_id_and_molecule("C", "smi")
(None, )
Instead of testing for the empty string or None, your code you should
use "if not id:" to test for a missing id::
>>> id, mol = T.parse_id_and_molecule("C", "smi")
>>> if not id:
... print("Missing id!")
...
Missing id!
.. _parse_molecule_with_error:
Specify alternate error behavior
================================
In this section you'll learn how to use the *errors* parameter to have
:func:`chemfp.toolkit.parse_molecule` return None rather than raise an
exception, and to have it print a report about the failing molecule.
The string "Q" is not a valid SMILES string. All of the toolkits will
fail to parse it, and the chemfp toolkit I/O adapter by default raises
an exception when that happens::
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.parse_molecule("Q", "smistring")
Traceback (most recent call last):
File "", line 1, in
...
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q'
>>>
>>> rdkit_toolkit.parse_molecule("Q", "smistring")
[01:14:30] SMILES Parse Error: syntax error for input: 'Q'
Traceback (most recent call last):
File "", line 1, in
...
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: RDKit cannot parse the SMILES string 'Q'
>>>
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_molecule("Q", "smistring")
Warning: Problem parsing SMILES:
Warning: Q
Warning: ^
Traceback (most recent call last):
File "", line 1, in
...
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: OEChem cannot parse the smistring record: 'Q'
On the other hand, "[NH8]" is a valid SMILES, but RDKit by default
will reject it as chemically unreasonable, while OEChem and Open Babel
are less strict and treat it as a molecular graph rather than a
chemical molecule.
I'll write a program which checks which toolkits will parse "[NH8]"
.. code-block:: python
# I call this "check_NH8.py"
from __future__ import print_function
import chemfp
allowed = []; rejected = []
for name in chemfp.get_toolkit_names():
T = chemfp.get_toolkit(name)
try:
T.parse_molecule("[NH8]", "smistring")
except ValueError:
rejected.append(name)
else:
allowed.append(name)
print("Allowed:", allowed, "Rejected:", rejected)
.. code-block:: console
% python check_NH8.py
[01:25:49] Explicit valence for atom # 0 N, 8, is greater than permitted
Allowed: ['openeye', 'openbabel'] Rejected: ['rdkit']
I think the try/except/else is sometimes harder to understand than
returning an error value, because it's harder to see the control
flow. I can ask :func:`chemfp.toolkit.parse_molecule` to ``ignore``
errors, which causes it to return a None object rather than raise an
exception. turns the above loop into the following:
.. code-block:: python
for name in chemfp.get_toolkit_names():
T = chemfp.get_toolkit(name)
mol = T.parse_molecule("[NH8]", "smistring", errors="ignore")
if mol is None:
rejected.append(name)
else:
allowed.append(name)
The *errors* option is more useful in later sections, when parsing
multiple records.
The ``errors`` parameter can also take the value ``report``. Like
``ignore``, this will return a None when there is an error rather than
raise an exception. It will also write a consistent, cross-toolkit
error message to stderr, including the SMILES string that failed if
the input is a SMILES::
>>> for name in chemfp.get_toolkit_names():
... T = chemfp.get_toolkit(name)
... T.parse_molecule("Q", "smistring", errors="report")
... T.parse_molecule("[NH8]", "smistring", errors="report")
...
Warning: Problem parsing SMILES:
Warning: Q
Warning: ^
ERROR: OEChem cannot parse the smistring record: 'Q'. Skipping.
>
[01:31:22] SMILES Parse Error: syntax error for input: 'Q'
ERROR: RDKit cannot parse the SMILES string 'Q'. Skipping.
[01:31:22] Explicit valence for atom # 0 N, 8, is greater than permitted
ERROR: RDKit cannot parse the SMILES string '[NH8]'. Skipping.
ERROR: Open Babel cannot parse the SMILES 'Q'. Skipping.
>
The :func:`chemfp.toolkit.parse_id_and_molecule` function also takes
the ``errors`` parameter. If the structure could not be parsed then
the second component of the tuple (the molecule) will be None. The
first component (the id) may or or may not be None, depending on the
underlying implementation::
>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
[01:33:48] SMILES Parse Error: syntax error for input: 'Q'
(None, None)
>>>
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
Warning: Problem parsing SMILES:
Warning: Q q-ane
Warning: ^
(None, None)
>>>
>>> from chemfp import openbabel_toolkit
>>> openbabel_toolkit.parse_id_and_molecule("Q q-ane", "smi", errors="ignore")
('q-ane', None)
Future versions of chemfp may work to normalize this behavior, or let
the caller choose a specific behavior.
.. _smiles_delimiter_reader_arg:
Specify a SMILES delimiter through reader_args
==============================================
In this section you'll learn how to parse a SMILES record as a set of
delimited fields instead of the default of a SMILES string followed by
a title, and some of the limitations of chemfp's attempt at a
consistent cross-toolkit SMILES record parser.
You might think that the SMILES file format is well defined, but it
sadly isn't. Different toolkits have slightly different
interpretations for a SMILES record format. Consider the SMILES
record:
.. code-block:: none
C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a
The original Daylight definition is that a SMILES record is single
line, which starts with the SMILES string. The SMILES string ends with
the first whitespace character or the end of the line, and if there
was a whitespace character than the rest of the line is the
title. OpenEye follows this definition, as does chemfp. That's why the
previous example extracted "vitamin A" as the record id.
However, RDKit treats a SMILES file record as a space or tab separated
set of fields, where the first field is the SMILES, the second field
is the id/title and additional columns may store other
properties. RDKit would use "vitamin" as the record id for this
record. (RDKit can also be configured to interpret the first line as
column names. Chemfp does not currently support this option, though I
plan to have a cross-platform implementation in a future release.)
Chemfp normalizes the SMILES record parser API so that all toolkits by
default expect the Daylight format. Use the optional *reader_args*
dictionary to specify an alternate interpretation::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> smiles = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
>>> T.parse_id_and_molecule(smiles, "smi", reader_args={"delimiter": "whitespace"})
(u'vitamin', )
In this case I asked it to parse the record as a set of whitespace
delimited fields. If you have tab-separated fields, where a space
inside of a field is not part of the delimiter, then use the "tab"
delimiter::
>>> T.parse_id_and_molecule("O=O\tmolecular oxygen\t31.9988\n", "smi",
... reader_args={"delimiter": "tab"})
(u'molecular oxygen', )
The supported delimiters are:
* to-eol - (default) everything past the first whitespace is interpreted as the id/title;
* tab or "\\t" - the fields are tab-separated; the first field is the SMILES and the second the id;
* space or " " - the fields are space-separated;
* whitespace - the fields are whitespace-separated;
* native - use the native interpretation for the given toolkit;
While chemfp strives for cross-toolkit portability, it is not
perfect. Leading and trailing whitespace might not be supported, so
the first character of the SMILES record must also be the first
character of the SMILES string. Also, the toolkit is free to interpret
the first whitespace as the delimiter despite the *reader_args*
setting. You can see the difference between RDKit and OEChem toolkits
in the following, where the record is tab-separated but OEChem always
stops parsing a SMILES at the first whitespace::
>>> smiles = "O=O\tmolecular oxygen\n"
>>>
>>> from chemfp import openeye_toolkit
>>> openeye_toolkit.parse_id_and_molecule(smiles, "smi", reader_args={"delimiter": "space"})
(u'molecular', >)
>>>
>>> from chemfp import rdkit_toolkit
>>> rdkit_toolkit.parse_id_and_molecule(smiles, "smi", reader_args={"delimiter": "space"})
[01:38:59] SMILES Parse Error: syntax error for input: 'O=O molecular'
Traceback (most recent call last):
File "", line 1, in
...
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: RDKit cannot parse the SMILES 'O=O\tmolecular'
Neither the SMILES parser nor the other parsers validate the full
contents of the *reader_args* dictionary. Extra items are
ignored. This is deliberate because it lets you combine, say, SMILES
and SDF parameters in the same dictionary without needing to check the
specific format first.
To a lesser extent, it also makes it easier to specify parameters
which work across multiple toolkit versions. For example, the most
recent version of OEChem's SMILES parsers added a *quiet* option,
which chemfp will support in the future. Your code can have a
{"quiet": True} without first checking to see if this version of
chemfp is new enough to support the parameter.
WARNING: As a result, it's very easy to specify a key with a typo,
which is ignored, and not notice that it nothing happens.
WARNING #2: Really, I've been bitten by this a few times. Be extra
cautious to check that you are using the right keys.
.. _smiles_delimiter_writer_args:
Specify an output SMILES delimiter through writer_args
======================================================
In this section you'll learn how to create a SMILES record with a tab
character separating the SMILES from the title using the *writer_args*
parameter of :func:`chemfp.toolkit.create_string`.
By default ``create_string`` uses a space character to separate the
SMILES from the rest of the id::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>>
>>> mol = T.parse_molecule("O=O molecular oxygen\n", "smi")
>>> T.create_string(mol, "smi")
u'O=O molecular oxygen\n'
To use a tab character instead, pass in a *writer_args* dictionary
with a "delimiter" of "tab"::
>>> T.create_string(mol, "smi", writer_args={"delimiter": "tab"})
u'O=O\tmolecular oxygen\n'
The *writer_args* delimiter also accepts "whitespace", "space",
"to-eol" and the other values from *reader_args*. Only "tab" and "\\t"
will use a tab character as the delimiter; all of the the others will
use a space character.
Neither the SMILES writer nor the other writers validate the full
contents of the *writer_args* dictionary. Extra items are
ignored. This is deliberate because it lets you combine, say, SMILES
and SDF parameters in the same dictionary without needing to check the
specific format first. It also makes it easier to specify parameters
which work across multiple toolkit versions.
WARNING: As a result, it's very easy to specify a key with a typo,
which is ignored, and not notice that it nothing happens.
WARNING #2: Really, I've been bitten by this a few times. Be extra
cautious to check that you are using the right keys.
.. _rdkit_reader_and_writer_args:
RDKit-specific SMILES reader_args and writer_args
=================================================
In this section you'll learn how to pass toolkit-specific parameters
to the RDKit toolkit functions to parse and create a SMILES
string. You will need the RDKit toolkit.
Earlier I showed that RDKit by default does a sanitization check to
verify that the input is correct.
>>> from chemfp import rdkit_toolkit
>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", errors="ignore")
[01:46:17] Explicit valence for atom # 0 N, 8, is greater than permitted
>>> mol is None
True
The underlying RDKit code to parse a SMILES string, `MolFromSmiles
`_,
takes a *sanitize* parameter. The default, True, tells it to do the
sanitization step, while False disables it.
Use the *reader_args* dictionary to pass the sanitize parameter to the
underlying toolkit function::
>>> mol = rdkit_toolkit.parse_molecule("[NH8]", "smistring", reader_args={"sanitize": False})
>>> mol
>>> from rdkit import Chem
>>> Chem.MolToSmiles(mol)
'[NH8]'
Use the *writer_args* dictionary to pass toolkit-specific parameters
to RDKit's MolToSmiles::
>>> mol = rdkit_toolkit.parse_molecule("c1ccccc1[16OH]", "smistring")
>>> rdkit_toolkit.create_string(mol, "smistring")
u'[16OH]c1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
... writer_args={"isomericSmiles": False})
u'Oc1ccccc1'
>>> rdkit_toolkit.create_string(mol, "smistring",
... writer_args={"kekuleSmiles": True, "allBondsExplicit": True})
u'[16OH]-C1:C:C:C:C:C:1'
See :ref:`get_default_reader_and_writer_args` for a description of how
to get the default reader and writer arguments for a given format, and
use help(:func:`rdkit_toolkit.read_molecules <.rdkit_toolkit.read_molecules>`)
and help(:func:`rdkit_toolkit.open_molecule_writer
<.rdkit_toolkit.open_molecule_writer>`) to get a more human-readable description.
.. _openeye_reader_and_writer_args:
OpenEye-specific SMILES reader_args and writer_args
===================================================
In this section you'll learn how to pass toolkit-specific parameters
to the OEChem toolkit functions to parse and create a SMILES
string. You will need the OEChem toolkit. See the next section for
specific details about aromaticity.
By default the OEChem SMILES parser is tolerant of bad SMILES. I
believe it's too tolerant, because will gladly parse what I think are
invalid SMILES, like "``C-=C``"::
>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C-=C", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
u'C=C'
The developers at OpenEye recognize that pedantic folk like me
exist. The OEChem SMILES parser has a "strict" mode, which I can
enable in chemfp through the "flavor" parameter of the *reader_args*
dictionary::
>>> mol = openeye_toolkit.parse_molecule("C-=C", "smistring",
... reader_args={"flavor": "Strict"})
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning: ^
Traceback (most recent call last):
File "", line 2, in
.... lines omitted ....
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: OEChem cannot parse the smistring record: 'C-=C'
The underlying `OEParseSmiles()
`_
function takes the optional *strict* and *canon* parameters. Why does
chemfp use the term "flavor"? Why the capitalization for "Strict"?
Historically the low-level OEChem functions took individual
parameters, like the positional arguments *canon* and *strict*::
>>> mol = OEGraphMol()
>>> OEParseSmiles(mol, "C-=C", False, True)
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C-=C
Warning: ^
False
(I wrote "historically" because more recent versions have
format-specific options classes, like `OEParseSmilesOptions
`_
for SMILES. These collect all of the configuration options into a
single parameter, which is easier to pass around.)
On the other hand, the high-level molecule parsers take a single
"`flavor
`_"
integer value to specify the options for a given format. This flavor
is usually expressed as the union of a set of bitmasks. I'll show how
OEChem's Python API uses the flavor parameter.
The following OEChem code reads a SMILES file in the default
non-strict mode (with no specified flavor)::
% cat example.smi
C=-C bad
CCC good
% python
...
>>> from __future__ import print_function
>>> from openeye.oechem import *
>>> ifs = oemolistream("example.smi")
>>> for mol in ifs.GetOEGraphMols():
... print(mol.GetTitle(), mol.NumAtoms())
...
bad 2
good 3
while the following sets the SMILES flavor to use "strict" mode::
>>> ifs = oemolistream("example.smi")
>>> ifs.SetFlavor(OEFormat_SMI, OEIFlavor_SMI_Strict)
True
>>> for mol in ifs.GetOEGraphMols():
... print(mol.GetTitle(), mol.NumAtoms())
...
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C=-C bad
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
good 3
(You can see some terminology differences between me and OpenEye in
the warning message. The "Canonical" and "stereo" are only meaningful
as a description of the output format, not the input format, and I use
the traditional term "isomeric" while they highlight the more
important stereochemistry aspect. I also got confused because I
thought at first the "Canonical" had something to do with
OEIFlavor_SMI_Canon.)
I decided to base :mod:`chemfp openeye_toolkit
` API on the high-level "flavor" API of
OEChem, which is better documented and requires less work on my part
to implement than low-level functions. But I also decided to extend it
to support a string value, and not just an integer.
To explain how that works, I'll switch from describing *reader_args*
to *writer_args*, because raising an exception with the "Strict"
option gets boring, fast.
The OEChem SMILES output flavors are: ``OEOFlavor_SMI_AtomMaps``,
``OEOFlavor_SMI_AtomStereo``, .... and you know what? The
``OEOFlavor_SMI_`` prefix is part of what makes the flavors hard to
use in Python, so I'll omit the prefix in chemfp. The OEChem SMILES
output flavors are: ``AtomMaps``, ``AtomStero``, ``BondStereo``,
``Canonical``, ``ExtBonds``, ``Hydrogens``, ``ImpHCount``,
``Isotopes``, ``Kekule``, ``RGroups``, ``SmiMask``, and
``SuperAtoms``. There are also ``Default`` and ``DEFAULT`` which are
the bitwise union
``RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical``.
In chemfp you can specify the fields as a "|" or "," separated list of
flavor flags, without the prefix. Here are several different ways to
specify the default settings for isomeric canonical SMILES string
output::
>>> mol = openeye_toolkit.parse_molecule("[16O][*:1]", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring")
u'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": ""})
u'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default"})
u'[R1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "RGroups|Isotopes|AtomStereo|BondStereo|AtomMaps|Canonical"})
u'[R1][16O]'
These settings override any options which might be implied by the
format name. Thus, even though "smistring" is supposed to generate an
isomeric canonical SMILES, I can use the *writer_args* to remove the
isomeric component from the flavor::
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "RGroups|AtomStereo|BondStereo|AtomMaps|Canonical"})
u'[R1][O]'
While I used "|" as the separator, I can equally use ",", as in::
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Isotopes,Canonical"})
'*[16O]'
OEChem uses the bar as a bitwise-or operator which merges the
different flags. I added the comma as an alternative to the vertical
bar because chemfp has additional syntax for removing options. The
following removes the "RGroups" option from the isomeric and
non-isomerical formats defaults, but otherwise leaves the defaults
alone::
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default,-RGroups"})
u'[*:1][16O]'
>>>
>>> openeye_toolkit.create_string(mol, "canstring",
... writer_args={"flavor": "Default,-RGroups"})
u'[*:1][O]'
(The terms are evaluated from left to right, so you can delete a term
then add it back if you want.)
Writing this as ``Default|-RGroups`` caused the C programmer mind in
me to gasp in bewilderment. ("The bitwise-or with the negative of
the RGroups bitflags?!!")
You don't need to specify the OEChem flavor using a flavor string. You
can also specify it as an integer::
>>> from openeye.oechem import *
>>> (OEOFlavor_SMI_Isotopes|OEOFlavor_SMI_AtomStereo|OEOFlavor_SMI_BondStereo|
... OEOFlavor_SMI_AtomMaps|OEOFlavor_SMI_Canonical)
121
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": 121})
u'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": 0})
u'[O]*'
or (and this might be a bit excessive) as a string-encoded integer::
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "121"})
u'[*:1][16O]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "0"})
u'[O]*'
Chemfp tries to be helpful. It will include the list of available
flavor names in the exception if it doesn't understand what you gave
it:
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "chocolate"})
Traceback (most recent call last):
File "", line 2, in
File "chemfp/openeye_toolkit.py", line 428, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
...
File "chemfp/_openeye_toolkit.py", line 895, in parse_flavor
% (self.register_name, term, available_flavors))
ValueError: OEChem smi format does not support the 'chocolate'
flavor option. Available flavors are: AllBonds, AtomMaps,
AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens,
ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms
See :ref:`get_default_reader_and_writer_args` for a description of how
to get the default reader and writer arguments for a given format, and
use help(:func:`openeye_toolkit.read_molecules <.openeye_toolkit.read_molecules>`)
and help(:func:`openeye_toolkit.open_molecule_writer
<.openeye_toolkit.open_molecule_writer>`) to get a more human-readable description.
OpenEye-specific aromaticity
============================
In this section you'll learn how chemfp handles OpenEye's aromaticity
parameter. You will need the OEChem toolkit, and you should read the
previous section to understand some of the terminology.
Note: the OEGraphSim fingerprints are not affected by the
*aromaticity* of the reader because they ensure that the molecules are
always perceived using "openeye" aromaticity before generating the
fingerprint.
The OpenEye toolkit supports the "openeye", "daylight", "tripos",
"mdl", and "mmff" aromaticity models. In the high-level API, which is
meant for reading and writing files or file-like objects, the
aromaticity is an aspect of the flavor integer. If unspecified, OEChem
uses the appropriate default aromaticity model for that format. As a
result, aromaticity perception is required for both reading and
writing files.
The low-level API handles file processing and aromaticity perception
as distinct steps. This API can also process a single record directly,
while the high-level API requires wrapping the record in a file-like
object and then reading the first molecule from it.
The chemfp toolkit API is a high-level API for both files and records,
which means I had to implement record conversion routines on top of
OEChem's low-level API. Consequently, some of the details are
different between the file I/O and record I/O APIs; the most
significant being that the record I/O routines also support a "none"
aromaticity.
The following shows the default aromaticity proceessing in action::
>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring")
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[True, True, True, True, True, True]
Automatic aromaticity perception is normally the right thing to do,
because different toolkits and even different versions of the same
toolkit may have different ideas of what is aromatic, and it's best to
ensure that they are consistently interpreted.
Aromaticity perception isn't needed when you know that the input
aromaticity is correct and unambiguous. My timings show that
aromaticity perception takes about half of the time needed to parse a
SMILES string. If the string comes from a good data source, like a
database record where OEChem created the SMILES, then you can nearly
double the performance by omitting the perception step.
What does "ambiguous" mean? Consider azulene, which can be described
by the SMILES "c1ccc2cccc2cc1". The fusion bond is not aromatic, while
the peripheral bonds form a 10 pi electron system. In SMILES, an
unspecified bond means "single or aromatic". If one of the terminal
atoms is aliphatic then the bond must be a single bond. But as the
fusion bond in azulene shows, it's possible for an unspecified bond
with terminal aromatic atoms to still be non-aromatic. The above
SMILES is ambiguous, and OEChem needs to do a full aromaticity
analysis to determine that the fusion bond is not aromatic.
An unambiguous SMILES for azulene is "c1ccc-2cccc2cc1", where the
fusion bond is marked explicitly as a single bond. The SMILES parser
can use the simpler rule that an unspecified ring bond is aromatic
whenever both terminal atoms are aromatic, and not require the lengthy
aromatic perception step to determine that. OEChem generates
unambiguous SMILES, so if you know OEChem generated the SMILES then
you can recover the original aromaticity directly.
(As a side note, Daylight first introduced this in 4.71, and used
fluorene ("C1c2ccccc2-c3ccccc13") as the prototypical case. Daylight's
rule is to include the "-" for a single bond between two aromatic
atoms, while OEChem's rule is to include the "-" for a single bond
between two aromatic atoms and which is in a ring. Ring identification
is much easier than aromaticity perception.)
So where was I ... ah, right, specifing the aromaticity model. I
decided to separate aromaticity from the rest of the flavor flags, and
specify it with its own *reader_args* and *writer_args* field. It's
easiest to see using beneze in Kekule form::
>>> mol = openeye_toolkit.parse_molecule("C1=CC=CC=C1", "smistring",
... reader_args={"aromaticity": "none"})
>>>
>>> [bond.IsAromatic() for bond in mol.GetBonds()]
[False, False, False, False, False, False]
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"aromaticity": "none"})
'C1=CC=CC=C1'
NOTE: the aromaticity flags are volatile. If you don't specify the
"none" aromaticity model then :func:`chemfp.toolkit.create_string`
will reperceive aromaticity using the "openeye" aromaticity model and
possibly reassign the aromaticity flags.
>>> openeye_toolkit.create_string(mol, "smistring")
u'c1ccccc1'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"aromaticity": "none"})
u'c1ccccc1'
This is consistent with how OEChem's high-level operations also modify
the input molecule when creating output. I'm not fully happy with it.
OEChem also has a "ConstMolecule" version, so this detail may change
in the future.
.. _openbabel_reader_and_writer_args:
Open Babel-specific SMILES reader_args and writer_args
======================================================
In this section you'll learn how to pass toolkit-specific parameters
to the Open Babel toolkit functions to create a SMILES string. You
will need the Open Babel toolkit.
As far as I can tell, Open Babel does not have configuration options
to change the default SMILES parser, so chemfp has no toolkit-specific
*reader_args* for that toolkit. Open Babel does have configuration
options to change the default SMILES output routines. These can be set
in chemfp with the *writer_args* dictionary.
Open Babel uses an options string to change the configuration. The
string "``i U smilesonly``" generates non-isomeric SMILES output, where
the atom ordering is determined by the InChI's canonicalization
algorithm ("Universal SMILES"), and where the identifier is excluded
from the SMILES output.
Did you know all of that? I didn't. Some of these options are only
documented in the code. It's also difficult for chemfp to handle since
some of the options conflict with how chemfp thinks of things. For
example, chemfp is in charge of including the identifier, so it will
always enable "smilesonly", and it's difficult for the "cansmiles"
output, which is non-isomeric, to know if an options string wants to
override the default"i" option that it requires.
I ended up making my own *writer_args* API to have more explicit control
over the individual parameters:
* explicit_hydrogens - boolean
* isomeric - boolean
* canonicalization - a string like "default", "none", "universal", "anticanonical", or "inchified"
* options - the Open Babel options string (if you must use it; using
it may break things if you are not very careful.)
Here's an example of how to disable isomeric support for the
"smistring" output, which would normally generate an isomeric SMILES::
>>> from chemfp import openbabel_toolkit
>>> mol = openbabel_toolkit.parse_molecule("[16O]=O", "smistring")
>>> openbabel_toolkit.create_string(mol, "smistring")
u'[16O]=O'
>>> openbabel_toolkit.create_string(mol, "smistring",
... writer_args={"isomeric": False})
u'O=O'
I can also enable isomeric SMILES for the "canstring" format, which is
normally non-isomeric::
>>> openbabel_toolkit.create_string(mol, "canstring")
u'O=O'
>>> openbabel_toolkit.create_string(mol, "canstring",
... writer_args={"isomeric": True})
u'[16O]=O'
Open Babel supports several different canonicalization
algorithms. Perhaps the most unusual one is "anticanonical", which
uses random numbers for the atom ordering algorithm. The same molecule
can generate different SMILES strings across multiple calls, so it's
the antithesis of "canonical"::
>>> for i in range(5):
... print(openbabel_toolkit.create_string(mol, "smistring",
... writer_args={"canonicalization": "anticanonical"}))
...
[16O]=O
[16O]=O
O=[16O]
[16O]=O
[16O]=O
See :ref:`get_default_reader_and_writer_args` for a description of how
to get the default reader and writer arguments for a given format, and
use help(:func:`openbabel_toolkit.read_molecules <.openbabel_toolkit.read_molecules>`)
and help(:func:`openbabel_toolkit.open_molecule_writer
<.openbabel_toolkit.open_molecule_writer>`) to get a more human-readable description.
.. _get_default_reader_and_writer_args:
Get the default reader_args or writer_args for a format
=======================================================
In this section you'll learn how to get the default *reader_args* and
*writer_args* for a given format.
As you've seen, each toolkit format can have its own *reader_args* and
*writer_args* parameters, and chemfp layers its own format types (like
"smistring") on top of the native formats. It's easy to forget the
specific parameters for a given format, much less the default values.
The :meth:`~.Format.get_default_reader_args` and
:meth:`~.Format.get_default_writer_args` methods of the
:class:`.Format` object return the respective default arguments::
>>> from chemfp import rdkit_toolkit
>>> fmt = rdkit_toolkit.get_format("smi")
>>> fmt.get_default_reader_args()
{'delimiter': None, 'has_header': False, 'sanitize': True}
>>> fmt.get_default_writer_args()
{'isomericSmiles': True, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
You can sometimes use this information to see how chemfp maps its
format types to the toolkit parameters. In RDKit, the difference
between chemfp's "smi" and "can" formats is that *isomericSmiles* is
True for the first and False for the second::
>>> rdkit_toolkit.get_format("can").get_default_writer_args()
{'isomericSmiles': False, 'delimiter': None, 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
While writing this documentation I realized that the OEChem toolkit
shows neither the default flavor nor the default aromaticity for a
given format type. I will likely fix that in a future version of
chemfp.
.. _text_settings_to_reader_and_writer_args:
Convert text settings into reader and writer arguments
======================================================
In this section you'll learn how to convert text-based configuration
settings into the appropriate *reader_args* or *writer_args*
dictionary.
The *reader_args* and *writer_args* take native Python values,
including integers and booleans. In practice these will often be
defined in a configuration file, through command-line options, or as
CGI parameters. The :class:`.Format` methods
:meth:`~.Format.get_reader_args_from_text_settings` and
:meth:`~.Format.get_writer_args_from_text_settings` convert a
text-based settings dictionary into the appropriate arguments
dictionary with native Python objects as values. (These are methods of
the Format object, because the parameter details are format-specific.)
The following shows an example using the RDKit toolkit's "sdf" format
to get *reader_args* from a dictionary of text settings:
>>> from chemfp import rdkit_toolkit as T
>>>
>>> sdf_format = T.get_format("sdf")
>>> sdf_format.get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
>>>
>>> sdf_format.get_reader_args_from_text_settings({
... "strictParsing": "true",
... "removeHs": "False",
... "sanitize": "0"})
{'strictParsing': True, 'removeHs': False, 'sanitize': False}
The boolean setting parser converts "true", "True", and "1" to
Python's True, and "false", "False", and "0" to Python's
False. Otherwise it raises a ValueError.
The following shows an equivalent example for RDKit's SDF *writer_args*::
>>> sdf_format.get_default_writer_args()
{'kekulize': True, 'includeStereo': False}
>>> sdf_format.get_writer_args_from_text_settings({
... "kekulize": "false",
... "includeStereo": "True"})
{'kekulize': False, 'includeStereo': True}
WARNING: these functions will ignore unknown keys. This was done to
allow the text settings dictionary to contain settings for other
toolkits and formats. As a result, typos are harder to detect, because
they will be ignored.
See :ref:`argparse_text_settings` for an example of converting text
settings from the command-line into reader and writer arguments.
.. _multitoolkit_reader_and_writer_args:
Multi-toolkit reader_args and writer_args
=========================================
In this section you'll learn how to configure *reader_args* and
*writer_args* so the same dictionary can be used to configure multiple
toolkits and formats.
Sometimes you don't know which toolkit will be used for parsing, but
you do know that you want Open Babel, OEChem, and RDKit to act in
non-standard ways. For example, the choice of toolkit may depend on
the user-defined fingerprint type, or simply (as in the following
example) depend on user input.
The *reader_args* and *writer_args* will ignore unknown parameters,
which lets you combine arguments for different toolkits into a
single dictionary. As the toolkits use completely different parameter
names (except a couple, like "delimiter", which are supposed to act
the same for all toolkits), there's no conflict in the names for a
given format.
The following defines a *reader_args* dictionary and a *writer_args*
dictionary with parameters for each supported toolkit, then enters a
loop. The loop asks the user for a SMILES string, or the name of the
toolkit to use, or "q" to quit the loop. It will parse each SMILES
into a molecule, then generate a SMILES output, although with
decidedly strange parameters::
from __future__ import print_function
import chemfp
from chemfp import rdkit_toolkit as T # use your default toolkit of choice
try:
raw_input # Python 2 name
except NameError:
raw_input = input # Python 3
reader_args = {
"sanitize": False, # RDKit,
"flavor": "Default|Strict", # OEChem
"aromaticity": "none", # OEChem
}
writer_args = {
"kekuleSmiles": True, # RDKit
"canonicalization": "anticanonical", # Open Babel
"aromaticity": "daylight", # OEChem
}
print("Using", T.name, "toolkit")
while 1:
query = raw_input("SMILES, toolkit name, or 'q' to quit? ")
if not query or query == "q":
break
if query in ("rdkit", "openeye" ,"openbabel"):
try:
T = chemfp.get_toolkit(query)
except ValueError:
print("Toolkit %r not available" % (query,))
print("Using", T.name, "toolkit")
continue
mol = T.parse_molecule(query, "smistring", reader_args=reader_args, errors="ignore")
if mol is None:
print("Toolkit", T.name, "could not parse query as SMILES")
continue
smiles = T.create_string(mol, "smistring", writer_args=writer_args, errors="ignore")
if not smiles:
print("Toolkit", T.name, "could not convert the molecule to SMILES")
continue
print(" -->", smiles)
I saved the above to a script and then ran it. It starts using RDKit,
where I've set the reader's "sanitize" to False so RDKit won't
perceive aromaticity on input, and set the writer's "kekuleSmiles" to
show explicit aromatic bond types::
Using rdkit toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> OC1=CC=CC=C1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> OC1:C:C:C:C:C:1
I then switch to the OpenEye toolkit, show that it is operating with
"strict" added to the default reader flavor, and convert a couple of
SMILES to canonical SMILES to show the output uses the Daylight
aromaticity model instead of the default::
SMILES, toolkit name, or 'q' to quit? openeye
SMILES, toolkit name, or 'q' to quit? C==C
Warning: Problem parsing SMILES:
Warning: Bond without end atom.
Warning: C==C
Warning: ^
Toolkit openeye could not parse query as SMILES
Using openeye toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> c1ccc(cc1)O
Finally, I switched to the Open Babel toolkit and showed that it
generates "anti-canonical" SMILES, where the spanning tree priority
order for SMILES output is randomly assigned::
SMILES, toolkit name, or 'q' to quit? openbabel
Using openbabel toolkit
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? C1=CC=CC=C1O
--> c1ccc(cc1)O
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> Oc1ccccc1
SMILES, toolkit name, or 'q' to quit? c1ccccc1O
--> c1c(O)cccc1
SMILES, toolkit name, or 'q' to quit? q
See :ref:`argparse_text_settings` for an example of using
multi-toolkit reader_args and writer_args.
.. _qualified_parameter_names:
Qualified reader and writer parameters names
============================================
In this section you'll learn how to use qualified parameter names.
These give fine-grained control over the configuration options for
each toolkit and format.
The previous section pointed out that the three toolkits use different
parameter names, so for a given format you can combine the
toolkit-specific *reader_args* into one unified dictionary and
*writer_args* into another unified dictionary. However, within a
toolkit the same parameter name can be reused for different formats,
with different meanings.
This best example is for the :mod:`chemfp.openeye_toolkit`, where the
*reader_args* and *writer_args* for all formats support the "flavor"
and "aromaticity" parameters. The following shows examples where I
might use a different flavor for the SMILES and InChI outputs, to get
something other than the default representation::
>>> from chemfp import openeye_toolkit
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>>
>>> openeye_toolkit.create_string(mol, "smistring")
u'CC(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "smistring",
... writer_args={"flavor": "Default|ImpHCount"})
u'[CH3]C(=O)[O-]'
>>>
>>> openeye_toolkit.create_string(mol, "inchistring")
u'InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1'
>>> openeye_toolkit.create_string(mol, "inchistring",
... writer_args={"flavor": "Default|FixedHLayer"})
u'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'
Chemfp uses "qualified" parameter names to handle this situation. For
example, the qualified name "smistring.flavor" is the flavor parameter
for the smistring format::
>>> writer_args = {
... "smistring.flavor": "Default|ImpHCount",
... "inchistring.flavor": "Default|FixedHLayer",
... }
>>> mol = openeye_toolkit.parse_molecule("CC([O-])=O", "smistring")
>>> openeye_toolkit.create_string(mol, "smistring", writer_args=writer_args)
u'[CH3]C(=O)[O-]'
>>> openeye_toolkit.create_string(mol, "inchistring", writer_args=writer_args)
'InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1/fC2H3O2/q-1'
WARNING: there are six SMILES-related formats ("smi", "can", "usm",
"smistring", "canstring", and "usmstring") so to be complete you'll
need to specify values for all of them. There are also two
InChI-related formats ("inchi" and "inchistring").
A "fully qualified" name looks like "openeye.smistring.flavor". The
first term is the toolkit, the second the format name, and the last
the parameter name. At present there is no real need for fully
qualified names because the toolkits don't share any parameter names
except for a couple which are supposed to be identical across all
toolkits.
The following demonstration, which is more a parlor trick than
something useful, shows how to have each toolkit use a different
SMILES delimiter::
>>> from __future__ import print_function
>>> import chemfp
>>>
>>> reader_args = {
... "rdkit.smi.delimiter": "tab",
... "openbabel.smi.delimiter": "whitespace",
... "openeye.smi.delimiter": "to-eol",
... }
>>>
>>> for toolkit_name in ("rdkit", "openbabel", "openeye"):
... T = chemfp.get_toolkit(toolkit_name)
... id, mol = T.parse_id_and_molecule("C\tabc def\tghi", "smi",
... reader_args=reader_args)
... print(toolkit_name, "sees the id", repr(id))
...
rdkit sees the id u'abc def'
openbabel sees the id u'abc'
openeye sees the id u'abc def\tghi'
(As a reminder, the 'delimiter' implementation is not perfect. A
toolkit may accept the first whitespace after the SMILES term as a
valid delimiter even if it doesn't match the actual parameter, and a
toolkit may decide to stop parsing the SMILES term at the first
whitespace.)
The final type of qualified parameter looks like
"openeye.\*.aromaticity", where the first term is the toolkit name,
the second term is "\*", and the third term is the parameter
name. This is most useful if you want OEChem to enforce the same
aromaticity across all formats, or have the RDKit parsers ignore
sanitization, with configuration entries like::
{"openeye.*.aromaticity": "daylight",
"rdkit.*.sanitize": False}
However, as only OEChem supports "aromaticity" and only RDKit supports
"sanitize", you could also write this as simply::
{"aromaticity": "daylight",
"sanitize": False}
The reason qualifier exist, even if not currently needed, is because I
predict there will be parameter name conflicts in the future. That
possibility affects the API enough that I wanted a solution now.
.. _qualified_parameter_priorities:
Qualified parameter priorities
==============================
In this section you'll learn the priority order when multiple
terms try to specify the same parameter.
In the previous section you learned how "delimiter", "smi.delimiter",
"rdkit.*.delimiter" and "rdkit.smi.delimiter" can all be used to set
the delimiter style for RDKit's "smi" format. If more then one term is
specified, which one wins?
Chemfp checks for the parameters in the following order:
1. rdkit.smi.delimiter
2. rdkit.*.delimiter
3. smi.delimiter
4. delimiter
The parameter with the highest ranking determines the setting, as the
following shows::
>>> from chemfp import rdkit_toolkit as T
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"delimiter": "to-eol",
... "smi.delimiter": "whitespace"})
>>> id
u'methane'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"rdkit.*.delimiter": "to-eol",
... "smi.delimiter": "whitespace"})
>>> id
u'methane 16.04246'
>>> id, mol = T.parse_id_and_molecule("C methane 16.04246", "smi",
... reader_args={"rdkit.*.delimiter": "to-eol",
... "rdkit.smi.delimiter": "whitespace"})
>>> id
u'methane'
One way to remember it is the longest name has priority.
It can be confusing to have a large dictionary with multiple format
and toolkit qualifiers. The :meth:`~.Format.get_unqualified_reader_args`
and :meth:`~.Format.get_unqualified_writer_args` methods of
:class:`.Format` object will return the fully unqualified
*reader_args* and *writer_args* for that format::
>>> fmt = T.get_format("smi")
>>> fmt.get_unqualified_reader_args({
... "delimiter": "to-eol",
... "smi.delimiter": "whitespace",
... })
{'delimiter': 'whitespace', 'has_header': False, 'sanitize': True}
>>> fmt.get_unqualified_writer_args({
... "delimiter": "space",
... "smi.delimiter": "tab",
... })
{'isomericSmiles': True, 'delimiter': 'tab', 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True}
This can also be helpful if you think you made a typo; get the
unqualified *reader_args* and see if the result has the arguments you
think it should have.
Qualified names and text settings
==================================
In this section you'll learn how the qualified names also apply to
text settings.
Earlier you learned that text settings are string-based keys and
values, which might come from the command-line, a configuration file,
or some other text-based source. These need to be converted into
Python values before they can be used as reader_args or writer_args.
A :class:`.Format` object can convert a dictionary of text settings
into the correct argument dictionary. To get a Format object, ask the
toolkit for the format of the given name::
>>> from chemfp import rdkit_toolkit as T
>>> fmt = T.get_format("sdf")
>>> fmt.get_default_reader_args()
{'strictParsing': True, 'removeHs': True, 'sanitize': True}
The section :ref:`text_settings_to_reader_and_writer_args` showed
how to convert the text settings with unqualified names into a
*reader_args* dictionary::
>>> fmt.get_reader_args_from_text_settings({
... "strictParsing": "false",
... "removeHs": "false",
... })
{'strictParsing': False, 'removeHs': False}
The text settings dictionary also supports qualified parameter names,
including handling the priority resolution described in
:ref:`qualified_parameter_priorities`::
>>> fmt.get_reader_args_from_text_settings({
... "strictParsing": "false",
... "sdf.strictParsing": "true",
... "removeHs": "false",
... "rdkit.*.removeHs": "true",
... "rdkit.sdf.sanitize": "false",
... })
{'strictParsing': True, 'removeHs': True, 'sanitize': False}
If you stare at it for a bit you'll see that "sdf.strictParsing" has a
higher priority than "strictParsing" and "rdkit.*.removeHs" is higher
than "removeHs", which is how it's supposed to work.
.. _read_molecules_from_sd_file:
Read molecules from an SD file or stdin
=======================================
In this section you'll learn how to read an SD file and iterate
through its records as toolkit molecules. You will need
`Compound_027575001_027600000.sdf.gz
`_
from PubChem.
Time to get back to molecules! The
:func:`chemfp.toolkit.read_molecules` function reads molecules from a
structure file:
.. code-block:: python
from __future__ import print_function
from chemfp import rdkit_toolkit as T # use your toolkit of choice
for mol in T.read_molecules("Compound_014550001_014575000.sdf.gz"):
print(T.create_string(mol, "smistring"))
By default it uses the filename extension to figure out the format and
compression type. You can specify it yourself, if you wish, using the
*format* option:
.. code-block:: python
from __future__ import print_function
from chemfp import rdkit_toolkit as T # use your toolkit of choice
for mol in T.read_molecules("Compound_014550001_014575000.sdf.gz",
format="sdf.gz"):
print(T.create_string(mol, "smistring"))
Examples of valid format values are "smi", "can", and "usm" (but not
the \*string variants like "smistring", because those aren't
record-based formats), and "sdf", as well as gzip-compressed versions
like "smi.gz" and "sdf.gz".
(For Open Babel the ".gz" extension does nothing as Open Babel will
auto-detect and handle gzip compressed input.)
If the first parameter (the *source* parameter) is the Python None
value then the toolkit will read from stdin. As there's no filename,
chemfp can't look at the extension to figure out the format, so it
assumes the input is in "smi" format, that is, an uncompressed SMILES
file.
Therefore, to read an SD file from stdin you must specify the
format. The following program reads a gzip compressed SD file from
stdin, convert it to SMILES, and find the 10 most common characters
used in the SMILES strings:
.. code-block:: python
# This file is named 'count_smiles_characters.py'
from __future__ import print_function
from collections import Counter
from chemfp import rdkit_toolkit as T # use your toolkit of choice
symbol_counts = Counter()
for mol in T.read_molecules(None, "sdf.gz"):
smiles = T.create_string(mol, "smistring")
symbol_counts.update(smiles)
for symbol, count in symbol_counts.most_common(10):
print("%5d: %r" % (count, symbol))
Now to try it on a data set:
.. code-block:: console
% python count_smiles_characters.py < Compound_014550001_014575000.sdf.gz
50826: 'C'
38147: 'c'
21010: '('
21010: ')'
15316: 'O'
12676: '1'
9054: '='
7444: '2'
5755: '@'
5427: '['
(I double-checked; the next most common is indeed ']' with 5427
occurrences, which you probably expected. :)
.. _read_ids_and_molecules:
Read ids and molecules from an SD file at the same time
=======================================================
In this section you'll learn how to read an SD file and iterate
through its records as the two-element tuple of (id, molecule). You
will need the `Compound_027575001_027600000.sdf.gz
`_
from PubChem, which was used in the previous section.
In an earlier section, :ref:`parse_id_and_molecule`, you learned how
to parse a structure record to get both the identifier and the
molecule at the same time. The toolkit function
:func:`chemfp.toolkit.read_ids_and_molecules` is the equivalent for
reading from a structure file.
In the following example I'll use the RDKit toolkit to create a
tab-separated file with the id in the first column, the number of
carbon atoms in the second, and the SMILES in the third. For brevity,
I'll display only the first 10 records, which also gives a nice
example of when to use `itertools.islice `_:
.. code-block:: python
from __future__ import print_function
from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_014550001_014575000.sdf.gz"
reader = rdkit_toolkit.read_ids_and_molecules(filename)
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
smiles = rdkit_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
(See the next section for a description of how the line with the
``sum()`` works.)
Here's the output, and a spot check shows the carbon counts are correct:
.. code-block:: none
14550001 9 O=[N+]([O-])c1ccc(O)c(CSCCO)c1
14550002 9 Nc1ccc(O)c(CSCCO)c1
14550003 8 CSCc1cc(N)ccc1O
14550004 7 Cc1[nH]ncc1C(C)C
14550005 12 O=C([O-])c1ccc2cc(C(=O)[O-])ccc2c1.[K+].[K+]
14550010 10 O=C(O)CN(Cc1ccccc1)CP(=O)(O)O
14550011 10 CCO[Si]1(C)OC(=O)c2ccccc2O1
14550044 54 CCCCCCCCCCCCCCCCCC(=O)[O-].CCCCCCCCCCCCCCCCCC(=O)[O-].CCCCCCCCCCCCCCCCCC(=O)[O-].[Gd+3]
14550045 19 O=C(O)CCCCCCCCCCC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F
14550054 14 c1ccc(SCSCSc2ccccc2)cc1
What's fun is that RDKit and OEChem both implement ``mol.GetAtoms()``
and ``atom.GetAtomicNum()`` so to port the above from RDKit to OEChem
is trivial; replace ``rdkit_toolkit`` with ``openeye_toolkit``!
The Open Babel port isn't quite as easy because Open Babel has a
different way to get the atoms in a molecule. To make it easy to copy
and paste, here's the equivalent code for Open Babel::
from __future__ import print_function
from itertools import islice
from chemfp import openbabel_toolkit
filename = "Compound_014550001_014575000.sdf.gz"
reader = openbabel_toolkit.read_ids_and_molecules(filename)
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom_idx in range(mol.NumAtoms())
if mol.GetAtom(atom_idx+1).GetAtomicNum() == 6)
smiles = openbabel_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
.. _read_with_id_tag:
Read ids and molecules using an SD tag for the id
=================================================
In this section you'll learn how to use the *id_tag* to get the id
from one of the SD tags, rather than from the record's title. You will
need the `Compound_027575001_027600000.sdf.gz
`_
from PubChem, which was used in the previous section. I'll also
explain an idiom for how to count the number of records in an
iterator.
Sometimes you would rather use a tag value as the id rather than the
title line of the SDF record. This is critical for ChEBI data set and
older ChEMBL data sets, which leave the title line (mostly) blank. In
this case, use the *id_tag* to specify the tag to use.
The following example modifies the RDKit code from previous code to
use PUBCHEM_IUPAC_SYSTEMATIC_NAME as the id, rather than the title
line:
.. code-block:: python
from __future__ import print_function
from itertools import islice
from chemfp import rdkit_toolkit
filename = "Compound_014550001_014575000.sdf.gz"
reader = rdkit_toolkit.read_ids_and_molecules(filename, id_tag="PUBCHEM_IUPAC_SYSTEMATIC_NAME")
for id, mol in islice(reader, 0, 10):
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
smiles = rdkit_toolkit.create_string(mol, "smistring")
print(id, num_carbons, smiles, sep="\t")
The first 7 lines of output is:
.. code-block:: none
2-(2-hydroxyethylsulfanylmethyl)-4-nitro-phenol 9 O=[N+]([O-])c1ccc(O)c(CSCCO)c1
4-azanyl-2-(2-hydroxyethylsulfanylmethyl)phenol 9 Nc1ccc(O)c(CSCCO)c1
4-azanyl-2-(methylsulfanylmethyl)phenol 8 CSCc1cc(N)ccc1O
5-methyl-4-propan-2-yl-1H-pyrazole 7 Cc1[nH]ncc1C(C)C
dipotassium;naphthalene-2,6-dicarboxylate 12 O=C([O-])c1ccc2cc(C(=O)[O-])ccc2c1.[K+].[K+]
2-[(phenylmethyl)-(phosphonomethyl)amino]ethanoic acid 10 O=C(O)CN(Cc1ccccc1)CP(=O)(O)O
2-ethoxy-2-methyl-1,3,2-benzodioxasilin-4-one 10 CCO[Si]1(C)OC(=O)c2ccccc2O1
You might have found the "``sum(1 for atom in ....)``" a bit odd. I agree
with you. It is, however, the standard way in Python to count the
number of elements in the iterator which match a given condition. I'll
break it down so you can understand how it works.
A list comprehension iterates through each element in an iterator (in
the following it iterates over the characters in a string) and returns
a list::
>>> [c for c in "Hello"]
['H', 'e', 'l', 'l', 'o']
Add an "if" to it to operate on only a subset of the characters::
>>> [c for c in "Hello" if c != "l"]
['H', 'e', 'o']
I could use len() of this to get the number of non-"l" characters, but
that would require making a list only to throw it away. There's
another route to the same answer. To get there, use the value 1 for
each character rather than the character itself::
>>> [1 for c in "Hello" if c != "l"]
[1, 1, 1]
Then use sum() to sum the values, which in this case is also the
number of elements in the list::
>>> sum([1 for c in "Hello" if c != "l"])
3
Unlike len(), sum() only needs an iterator, not a list. I can replace
the list comprehension with a generator comprehension, to get::
>>> sum(1 for c in "Hello" if c != "l")
3
Going back to the RDKit/OEChem expression:
.. code-block:: python
num_carbons = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 6)
I hope you can see how this counts the number of atoms in the
molecule whose atomic number is 6. Or, if you want another way to
think of it, the expression is the same as:
.. code-block:: python
num_carbons = 0
for atom in mol.GetAtoms():
if atom.GetAtomicNum() == 6:
num_carbons += 1
.. _read_molecules_from_string:
Read from a string instead of a file
====================================
In this section you'll learn how to read molecules from a string
containing multiple SMILES records.
In the section :ref:`read_molecules_from_sd_file` you learned how to
read molecules from a structure file or stdin. Sometimes the input
structures come from a string. For example, if a web page has a form
with a text box, where users can paste in a set of SMILES or SDF
records and submit the form, then the web application on the server
will likely receive those records as a single string.
When the records are in a string instead of a file, use
:func:`chemfp.toolkit.read_molecules_from_string`. It's very similar
to :func:`chemfp.toolkit.read_molecules`, except that the first
parameter, *content*, is the string instead of the source filename,
and the second parameter, *format*, is required. (chemfp doesn't try
to auto-detect the format from the content.)
The following reads the records from a string containing two simple
SMILES records and prints the number of non-implicit atoms for each
one. I've included implementations for all three toolkits; use the
one(s) that are available to you:
.. code-block:: python
from __future__ import print_function
from chemfp import rdkit_toolkit
content = ("C methane 16.04246\n"
"O=O water 31.9988\n")
for mol in rdkit_toolkit.read_molecules_from_string(content, "smi"):
print("RDKit:", mol.GetNumAtoms())
from chemfp import openeye_toolkit
for mol in openeye_toolkit.read_molecules_from_string(content, "smi"):
print("OEChem:", mol.NumAtoms())
from chemfp import openbabel_toolkit
for mol in openbabel_toolkit.read_molecules_from_string(content, "smi"):
print("Open Babel:", mol.NumAtoms())
When I run the above (on a computer where all three supported toolkits
are installed), the above reports:
.. code-block:: none
RDKit: 1
RDKit: 2
OEChem: 1
OEChem: 2
Open Babel: 1
Open Babel: 2
I would like to improve the output a bit to also include the record id
in the output. The toolkit function
:func:`chemfp.toolkit.read_ids_and_molecules_from_string` is similar
to :func:`chemfp.toolkit.read_molecules_from_string` except that it
iterates through the (id, toolkit molecule) tuple rather than just the
molecule::
>>> from __future__ import print_function
>>> from chemfp import rdkit_toolkit
>>> content = ("C methane 16.04246\n"
... "O=O water 31.9988\n")
>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi"):
... print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane 16.04246' 1
RDKit: 'water 31.9988' 2
You can see that the default SMILES reader assumes the rest of the
line is the id. The file and string record readers take a
*reader_args* parameter just like
:func:`chemfp.toolkit.parse_id_and_molecule`. I'll specify the
"whitespace" delimiter so the parser uses only the second word as the
id::
>>> for (id, mol) in rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi",
... reader_args={"delimiter": "whitespace"}):
... print("RDKit:", repr(id), mol.GetNumAtoms())
...
RDKit: 'methane' 1
RDKit: 'water' 2
See :ref:`smiles_delimiter_reader_arg` for more details about setting
the "delimiter" *reader_args*.
The string readers, like the file readers, also support the *id_tag*
option to get the id from an SD tag instead of the title line. See
:ref:`read_with_id_tag` for more details about using the *id_tag*.
The reader may reuse molecule objects!
======================================
In this section you'll learn that the OEChem and Open Babel toolkits
reuse the same molecule object, which means you can't save a molecule
for later.
Suppose you want to read all of the molecules from a file into a
list. It's very tempting to write it as::
>>> import chemfp
>>> from chemfp import openeye_toolkit as T
>>> mols = list(T.read_molecules_from_string("C methane\nO water\n", "smi"))
This does not work for the openeye_toolkit or the openbabel_toolkit::
>>> mols
[ >,
>]
>>> T.create_string(mols[0], "smistring")
u''
>>> T.create_string(mols[1], "smistring")
u''
This is because the underlying reader for those two toolkits reuse the
same molecule object. You can see that in the above, which returns the
same OEGraphMol object (with id 0x10326ba40) for each record. The
reason why Open Eye decided to reuse the object is to get better
performance. Clearing the molecule object is faster than deleting it
and reallocating a new one.
In addition, the OEChem reader code does a "clear molecule" followed
by "read next record or stop". At the end of the file there is no
record, so the reader ends with a clear molecule. That explains why
the OEGraphMol produces an empty SMILES string for the last couple of
lines in the above code.
The only portable way to load a list of molecules is to use
:func:`chemfp.toolkit.copy_molecule`, as in::
>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in T.read_molecules_from_string("C methane\nO water\n", "smi")]
>>> mols
[ >,
>]
>>> T.create_string(mols[0], "smistring")
u'C'
>>> T.create_string(mols[1], "smistring")
u'O'
I don't really like this solution because the RDKit reader doesn't
need a copy, so the extra copy is pure overhead.
Future versions of chemfp will likely have a *reader_arg* to specify
if it's okay to reuse a molecule object or if a new one must be used
each time.
.. _write_molecules_to_smiles_file:
Write molecules to a SMILES file
================================
In this section you will learn how to write toolkit molecules into a
structure file. You will need
`Compound_014550001_014575000.sdf.gz
`_
from PubChem, which is a different PubChem file than what I used
in earlier sections.
Chemfp can write toolkit molecules to a file in a given format. I'll
start by making an RDKit molecule, though the same API works with Open
Babel and OEChem::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("c1ccccc1O phenol", "smi")
Use :func:`chemfp.toolkit.open_molecule_writer` to create a writer
object. By default it will look at the output filename extension to
figure out the format and compression type, and if that doesn't work
it defaults to SMILES output::
>>> writer = T.open_molecule_writer("example.smi")
The fingerprint writer has several methods to write a molecule to the
file. If you write a molecule by itself it will use the molecule's own
id (in this case, "phenol")::
>>> writer.write_molecule(mol)
Or, use :meth:`~.BaseMoleculeWriter.write_id_and_molecule` if you want to
specify an alternate id::
>>> writer.write_id_and_molecule("something else", mol)
WARNING: The toolkit implementation may temporarily change the toolkit
molecule's own identifier in order to get the correct output. You
should not alter the molecule's id in another thread while calling
this function.
Let's see if it worked, by closing the writer (otherwise some of the
output may be in an internal buffer) and reading the file::
>>> writer.close()
>>> print(open("example.smi").read())
Oc1ccccc1 phenol
Oc1ccccc1 something else
The :meth:`~.BaseMoleculeWriter.write_molecules` method is optimized
for passing in a list or iterator of molecule objects, and
:meth:`~.BaseMoleculeWriter.write_ids_and_molecules` is the equivalent
if you have (id, molecule) pairs. For example, the following converts
an SD file into a compressed SMILES file::
from chemfp import rdkit_toolkit as T # use your toolkit of choice
reader = T.read_molecules("Compound_014550001_014575000.sdf.gz")
writer = T.open_molecule_writer("example.smi.gz")
writer.write_molecules(reader)
# These are optional, but recommended. Even better would be
# to use the context manager described in the next section.
writer.close()
reader.close()
If you have a list (or iterator) of molecules, then use the
:meth:`~.BaseMoleculeWriter.write_molecules` method.
The open function also supports the *format* parameter, so you can
specify "smi" or "sdf.gz" some other combination of structure format
and compression type::
writer = T.open_molecule("wrong_extensions.smi", format="sdf.gz")
.. _context_managers:
Reader and writer context managers
==================================
In this section you'll learn how to use chemfp's readers and writers
to close the file, rather than depend on Python's garbage collector or
manual "close()". You will need `Compound_014550001_014575000.sdf.gz
`_
from PubChem.
In the previous section, :ref:`write_molecules_to_smiles_file`, you
learned how to convert an SD file into a SMILES file. At the end was a
small program with optional "close()" statements. These are optional
because Python's garbage collector and chemfp work together. When a
chemfp reader or writer is no longer needed, the garbage collector asks
chemfp to clean up, and chemfp closes the native toolkit's file object.
This is fine for a simple script or function, but sometimes you want
more control over when the file is closed. You can call the writer's
:meth:`~.BaseMoleculeWriter.close` method yourself, but it's really easy to
forget to do that.
Python supports "context managers", which carry out certain actions
when a block of code finishes. See :pep:`343` if you want the full
details. For chemfp you only need to know that the reader and writer
context managers will always close the file at the end of the block.
A normal Python file context manager works like this:
>>> with open("example.txt", "w") as outfile:
... outfile.write("I am here.\n")
...
>>> print(repr(open("example.txt").read()))
'I am here.\n'
If instead I use one file object to write the data and another to read
the file, without a flush() or close() by the writer, then there's a
syncronization problem::
>>> outfile = open("example.txt", "w")
>>> outfile.write("I am here.\n")
>>> print(repr(open("example.txt").read()))
''
Why does this print the empty string? The output text is still in an
internal buffer, which isn't written to the disk until the close
call::
>>> outfile.close()
>>> print(repr(open("example.txt").read())
'I am here.\n'
The same problem occurs with molecule output::
>>> from chemfp import rdkit_toolkit as T # can also use openbabel_toolkit
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> writer = T.open_molecule_writer("example.smi")
>>> writer.write_molecule(mol)
>>> open("example.smi").read()
''
>>> writer.close()
>>> open("example.smi").read()
'C=O carbon monoxide\n'
Note: this problem does not occur with the openeye_toolkit. Most
likely that toolkit always flushes its output buffers after each
molecule.
The chemfp readers and writers support a context manager, so you can
use the same solution you would for regular files:
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("C=O carbon monoxide", "smi")
>>> with T.open_molecule_writer("example.smi") as writer:
... writer.write_molecule(mol)
...
>>> open("example.smi").read()
'C=O carbon monoxide\n'
With the context manager concept firmly in mind, the following is the
way I prefer to write the conversion script from the previous section:
.. code-block:: python
from chemfp import rdkit_toolkit as T # use your toolkit of choice
with T.read_molecules("Compound_014550001_014575000.sdf.gz") as reader:
with T.open_molecule_writer("example.smi.gz") as writer:
writer.write_molecules(reader)
That said, if you really want to depend on the garbage collector, you
can also write it with one (or two) fewer lines:
.. code-block:: python
from chemfp import rdkit_toolkit as T # use your toolkit of choice
T.open_molecule_writer("example.smi.gz").write_molecules(
T.read_molecules("Compound_014550001_014575000.sdf.gz"))
Write molecules to stdout in a specified format
===============================================
In this section you'll learn how to specify the structure writer's
output format, and to write to stdout instead of to a file.
The function :func:`chemfp.toolkit.open_molecule_writer` supports a
*format* parameter, in case you don't want chemfp to determine the
output format and compression based on the filename extension.
For example, if the destination is None (instead of a filename) then
chemfp will write the output to stdout. Since Python's None object
doesn't have an extension, it will write the molecules as uncompressed
SMILES. If you want to write to stdout in SDF format you will have to
specify the output format, like the following::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> mol = T.parse_molecule("O=O molecular oxygen", "smi")
>>> with T.open_molecule_writer(None, "sdf") as writer:
... writer.write_molecule(mol)
...
molecular oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
$$$$
>>> with T.open_molecule_writer(None, "inchikey") as writer:
... writer.write_molecule(mol)
...
MYMOFIZGZYHOMD-UHFFFAOYSA-N molecular oxygen
Write molecules to a string (and a bit of InChI)
================================================
In this section you'll learn how to write toolkit molecules into
memory, and when finished to get the result as a string.
The previous sections showed examples of writing molecules to a file
or to stdout. Sometimes you want to save the records as a string;
perhaps to send a response for a web request or display the contents
in a text pane of a GUI. The function
:func:`chemfp.toolkit.open_molecule_writer_to_string` creates a
:class:`.MoleculeStringWriter` which stores the output records into
memory. Once the writer is closed, the memory contents can be
retrieved as a string with :meth:`.MoleculeStringWriter.getvalue`.
For a bit of variation, the following example uses the "inchi" output
format, and the :mod:`openbabel_toolkit `::
>>> from chemfp import openbabel_toolkit as T # use your toolkit of choice
>>> alanine = T.parse_molecule("O=C(O)[C@@H](N)C alanine", "smi")
>>> glycine = T.parse_molecule("C(C(=O)O)N glycine", "smi")
>>> writer = T.open_molecule_writer_to_string("inchi")
>>> writer.write_molecules([alanine, glycine])
>>> writer.close()
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 alanine
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5) glycine
You should know that there's no well-defined "inchi" file format, only
an InChI string. I decided to follow Open Babel's lead and say that
the "inchi" format has one record per line, where each line contains
the InChI string followed by a delimiter, followed by the id (if
available) on the rest of the line.
The InChI output *writer_args* supports an "include_id" parameter. The
default, True, includes the id, while the following example sets it to
False to have only the InChI string on the line::
>>> with T.open_molecule_writer_to_string("inchi",
... writer_args={"include_id": False}) as writer:
... writer.write_molecule(alanine)
... writer.write_molecule(glycine)
...
>>> print(writer.getvalue())
InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1
InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)
I also used the context manager so the code would be a bit shorter
and, I think, clearer. It's up to you to decide if
``write_molecules()`` with a 2-element list is clear than two
``write_molecule()`` lines.
.. _error_handling_in_molecule_reader:
Handling errors when reading molecules from a string
====================================================
In this section you'll learn how to ignore errors and improve error
reporting when reading from a string, rather then accept the default
of raising an exception and stopping. The examples will use a string
containing SMILES records, but the same principles apply to any
format.
If you've used the chemfp readers on real-world data sets you might
have noticed that the RDKit and Open Babel ones sometimes raise an
exception, saying that a given record could not be parsed. I'll
demonstrate with a string containing four SMILES records::
>>> content = ("C methane\n" +
... "CN(C)(C)(C)C pentavalent nitrogen\n" +
... "Q Q-ane\n" +
... "[U] uranium\n")
>>>
RDKit doesn't like the pentavalent nitrogen, and chemfp's
rdkit_toolkit stops processing at that record::
>>> from chemfp import rdkit_toolkit
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
[16:11:12] Explicit valence for atom # 1 N, 5, is greater than permitted
Traceback (most recent call last):
File "", line 2, in
File "chemfp/_rdkit_toolkit.py", line 286, in _iter_read_smiles_ids_and_molecules
error_handler.error("RDKit cannot parse the SMILES %s" % (_compat.myrepr(smiles),), location)
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: RDKit cannot parse the SMILES 'CN(C)(C)(C)C',
file '', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'
Open Babel doesn't care about the too-high valence on the nitrogen,
but doesn't like the non-SMILES in the third record::
>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
Traceback (most recent call last):
File "", line 2, in
File "chemfp/_openbabel_toolkit.py", line 894, in _iter_column_records
% (format_name, myrepr(smi_string)), location)
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: Open Babel cannot parse the SMILES 'Q', file '', line 3, record #3: first line is 'Q Q-ane'
To round things out, OEChem accepts pentavalent nitrogen and skips the
bad SMILES at a lower level than what chemfp uses, so there's no
exception::
>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules_from_string(content, "smi") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
Warning: Problem parsing SMILES:
Warning: Q Q-ane
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
uranium
I'll emphasize that point. The openeye_toolkit uses OEChem's
high-level reader, which provides no information about if OEChem
skipped a record with a failure. Chemfp therefore cannot provide more
information about the failures, whether as an exception or an improved
error message.
I'm certain that nearly everyone wants the reader to ignore the few
records that can't be parsed by the underlying toolkit. The readers
and writers support the *errors* option. The default value of "strict"
tells chemfp to raise an exception when it detects a parse failure,
and "ignore" tells it to ignore the error and go on to the next
record::
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="ignore") as reader:
... for id, mol in reader:
... print(id)
...
methane
[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted
[16:13:45] SMILES Parse Error: syntax error for input: 'Q'
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="ignore") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
uranium
The "strict" default comes from my long-held belief that it's better
to be strict first, and detect problems early, than to let them
intrude. My resolve is weakening, because it's been rare to find that
I can make use of that information. The biggest counter-example is
when I specify one format but the file is actually in another format,
in which case the reader skips a lot of garbage. For example, a SMILES
reader, pointed to a SD file or a compressed SMILES file, will try
hard to make sense of the data and end up ignoring almost
everything. I haven't decided if I will change the default policy.
I've also found that the toolkits aren't that helpful at identifying
which record failed. Take a look at the RDKit warning::
[16:13:45] Explicit valence for atom # 1 N, 5, is greater than permitted
It says that I did this in the late afternoon, and the reason for the
failure, but says very little about the record with the problem.
To help improve this, and to send still more garbage, err, I mean
helpful messages to stderr, chemfp supports a "report" *errors*
value. It's the same as "ignore" except that it also displays more
details about the failure location::
>>> with rdkit_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="report") as reader:
... for id, mol in reader:
... print(id)
...
methane
[16:14:52] Explicit valence for atom # 1 N, 5, is greater than permitted
ERROR: RDKit cannot parse the SMILES 'CN(C)(C)(C)C', file '', line 2, record #2: first line is 'CN(C)(C)(C)C pentavalent nitrogen'. Skipping.
[16:14:52] SMILES Parse Error: syntax error for input: 'Q'
ERROR: RDKit cannot parse the SMILES 'Q', file '', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
>>> with openbabel_toolkit.read_ids_and_molecules_from_string(
... content, "smi", errors="report") as reader:
... for id, mol in reader:
... print(id)
...
methane
pentavalent nitrogen
ERROR: Open Babel cannot parse the SMILES 'Q', file '', line 3, record #3: first line is 'Q Q-ane'. Skipping.
uranium
The quality of the error message depends on the toolkit and the
format. The best messages are for the Open Babel and RDKit SMILES
readers and InChI readers, because I decided to have chemfp identify
the records for those formats itself, instead of using the underlying
toolkits to read the file. Chemfp still uses the underlying toolkit to
convert the individual record into a native toolkit molecule.
I did this because I found the the SMILES and InChI reader performance
was the same, and by writing my own parsers I had the ability to
report line numbers and improve the error messages.
The examples so far used the ``read_ids_and_molecules_from_string``
function. The ``read_molecules_from_string`` function also supports the
``errors`` option, with the same meaning.
>>> sizes = []
>>> with openbabel_toolkit.read_molecules_from_string(
... content, "smi", errors="report") as reader:
... for mol in reader:
... sizes.append(mol.NumAtoms())
...
ERROR: Open Babel cannot parse the SMILES 'Q', file '', line 3, record #3: first line is 'Q Q-ane'. Skipping.
>>> sizes
[1, 6, 1]
Handling errors when reading molecules from a file
==================================================
In this section you'll learn how to how to ignore errors and improve
error reporting when reading from SD file, rather then accept the
default of raising an exception and stopping. The examples will use an
SD file, but the same principles apply to any format.
In the previous section you learned that when the readers encounter a
error, the default behavior is to raise a Python exception and how to
use the ``error`` parameter to ignore those errors or to provide a
more detailed error report.
The file-based readers, :func:`chemfp.toolkit.read_molecules` and
:func:`chemfp.toolkit.read_ids_and_molecules`, can be configured the
same way, that is:
.. code-block:: python
# When there is an error, raise an exception and stop (this is the default)
T.read_molecules(filename)
T.read_molecules(filename, errors="strict")
T.read_ids_and_molecules(filename)
T.read_ids_and_molecules(filename, errors="strict")
# When there is an error, go on to the next record
T.read_molecules(filename, errors="ignore")
T.read_ids_and_molecules(filename, errors="ignore")
# When there is an error, print an error message to stderr then
# go on to the next record
T.read_molecules(filename, errors="report")
T.read_ids_and_molecules(filename, errors="report")
To show it in action, I'll construct an SD file with three
records. The first will contain a trivalent oxygen, the second a
corrupt record, and the third will be atomic nitrogen. I'll use OEChem
to help me make the file.
.. code-block:: python
from chemfp import openeye_toolkit as T
mol1 = T.parse_molecule("O#C trivalent", "smi") # RDKit won't like this
mol2 = T.parse_molecule("[U] Q-record", "smi") # I'll corrupt this record
mol3 = T.parse_molecule("[N] nitrogen", "smi") # This one is fine
with T.open_molecule_writer_to_string("sdf") as writer:
writer.write_molecules([mol1, mol2, mol3])
content = writer.getvalue()
# replace the "U" with the nonsense "Qq"
content = content.replace("U ", "Qq")
# Save
open("bad_data.sdf", "w").write(content)
Here's what the output file ``bad_data.sdf`` it looks like, so you can
copy&paste if you wish:
.. code-block:: none
trivalent
-OEChem-04251716112D
2 1 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 0 0 0 0
M END
$$$$
Q-record
-OEChem-04251716112D
1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 Qq 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
nitrogen
-OEChem-04251716112D
1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 N 0 0 0 0 0 15 0 0 0 0 0 0
M END
$$$$
I'll try to read that file using the native RDKit reader, which skips
records it can't parse::
>>> from rdkit import Chem
>>> reader = Chem.ForwardSDMolSupplier("bad_data.sdf")
>>> ids = [mol.GetProp("_Name") for mol in reader if mol is not None]
[04:27:44] Explicit valence for atom # 0 O, 3, is greater than permitted
[04:27:44] ERROR: Could not sanitize molecule ending on line 8
[04:27:44] ERROR: Explicit valence for atom # 0 O, 3, is greater than permitted
[04:27:44]
****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 90 in file /Users/dalke/ftps/rdkit-Release_2016_09_3/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
[04:27:44] Unexpected error hit on line 14
[04:27:44] ERROR: moving to the begining of the next molecule
>>> ids
['nitrogen']
As expected, RDKit could only extract one record of the three. It
helpfully points out the line number of the records it couldn't parse
(lines 8 and 14)
Now I'll do the same using chemfp's ``rdkit_toolkit`` interface and
the default error handler, which is ``strict``::
>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules("bad_data.sdf"):
... ids.append(id)
...
[15:41:07] Explicit valence for atom # 0 O, 3, is greater than permitted
Traceback (most recent call last):
File "", line 1, in
File "chemfp/_rdkit_toolkit.py", line 1252, in _iter_read_sdf_structures
error_handler.error("Could not parse molecule block", location)
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'
It stops at the first error and raise an exception. The exception
contains some information about the error location, including the
filename, line number, record number, and the contents of the first
line of the file.
How does chemfp get that information? Under the covers chemfp uses its
own parser, from the :mod:`.text_toolkit` to read each record, then
passes that record to RDKit to turn the record into a molecule. This
gives chemfp a bit more control over error reporting. Originally this
was also faster than using RDKit's own ``ForwardSDMolSupplier``, but
now chemfp is about 10% slower. A future implementation may offer a
run-time choice of which implementation to use, in case you want
better performance at the expense of less detailed error information.
Pass in either "``ignore``" or "``report``" as the ``errors`` option if
you want chemfp to skip records with an error keep on processing. I'll
use "``report``" to show what the error reporting looks like::
>>> from chemfp import rdkit_toolkit
>>> ids = []
>>> for id, mol in rdkit_toolkit.read_ids_and_molecules(
... "bad_data.sdf", errors="report"):
... ids.append(id)
...
[15:50:23] Explicit valence for atom # 0 O, 3, is greater than permitted
ERROR: Could not parse molecule block, file 'bad_data.sdf', line 1, record #1: first line is 'trivalent'. Skipping.
[15:50:23]
****
Post-condition Violation
Element 'Qq' not found
Violation occurred on line 90 in file /Users/dalke/ftps/rdkit-Release_2016_09_3/Code/GraphMol/PeriodicTable.h
Failed Expression: anum > -1
****
ERROR: Could not parse molecule block, file 'bad_data.sdf', line 10, record #2: first line is 'Q-record'. Skipping.
>>> ids
[u'nitrogen']
RDKit's own error messages from ForwardSDMolSupplier, like "Unexpected
error hit on line 14" / "moving to the begining of the next molecule",
have disappeared, because chemfp handles record extraction. The
sanitization error message about explicit valence remains because
RDKit still does that work.
Note also that under Python 2.7 chemfp returns a Unicode string for
the id, rather than the byte string that the native RDKit API returns.
That was RDKit. What about Open Babel?
>>> from __future__ import print_function
>>> from chemfp import openbabel_toolkit
>>> with openbabel_toolkit.read_ids_and_molecules(
... "bad_data.sdf", "sdf", errors="strict") as reader:
... for id, mol in reader:
... print("Read", repr(id), "first atom:", mol.GetAtom(1).GetAtomicNum())
...
Read u'trivalent' first atom: 8
==============================
*** Open Babel Warning in GetAtomicNum
Cannot understand the element label Qq.
Read u'Q-record' first atom: 0
Read u'nitrogen' first atom: 7
Open Babel reads all three records even in strict mode, though that
odd "Qq" atom causes it to print a warning message. (It is hard for
chemfp to capture that warning message, so I don't.) Interestingly, it
turns that atom into a "*" atom, with atomic number 0. To double
check, I'll read the list of molecules, then write them all out as
SMILES::
>>> mols = []
>>> with openbabel_toolkit.read_molecules("bad_data.sdf") as reader:
... mols = [openbabel_toolkit.copy_molecule(mol) for mol in reader]
...
==============================
*** Open Babel Warning in GetAtomicNum
Cannot understand the element label Qq.
>>> len(mols)
3
>>> with openbabel_toolkit.open_molecule_writer(None, "smi") as writer:
... writer.write_molecules(mols)
...
C#[O] trivalent
* Q-record
[N] nitrogen
OEChem also parses that "Qq" record as an atom with atomic number of
0, though it doesn't even give me a warning message::
>>> from chemfp import openeye_toolkit
>>> with openeye_toolkit.read_ids_and_molecules(
... "bad_data.sdf", errors="strict") as reader:
... for id, mol in reader:
... print("Read", repr(id), [a.GetAtomicNum() for a in mol.GetAtoms()])
...
Read u'trivalent' [8, 6]
Read u'Q-record' [0]
Read u'nitrogen' [7]
I totally didn't expect the toolkits to parse an unknown atom type
like "Qq"!
In any case, OEChem will skip records which it could not parse, and
there's no easy way for chemfp to get that information, so in practice
the "strict" and "report" options are meaningless.
Ignore errors in create_string() and create_bytes()
===================================================
In this section you'll learn how to ignore errors when converting a
molecule into a string or byte record.
Some molecules cannot be represented in some formats. The easiest
example is the molecule from the SMILES "\*", which contains a single
atom with the atomic number 0 and cannot be represented in InChI::
>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> T.create_string(mol, "smistring")
u'[*]'
>>> T.create_string(mol, "inchistring")
Traceback (most recent call last):
File "", line 1, in
File "chemfp/rdkit_toolkit.py", line 428, in create_string
return _toolkit.create_string(mol, format, id, writer_args, errors)
....
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: RDKit cannot create the InChI string
By default the :func:`chemfp.toolkit.create_string()` and
:func:`chemfp.toolkit.create_bytes()` functions will raise an
exception if the molecule cannot be converted into the given record
format. Use the ``errors`` parameter to specify that behavior. Just
like with file reading, the default value is "``strict``", "``ignore``"
will return None if there was an error, and "``report``" will return
None and also print some information about the failure to
stderr. (Error reporting is more useful for writing
The following uses "``ignore``"::
>>> from __future__ import print_function
>>> import chemfp
>>> for toolkit in ("openbabel", "rdkit", "openeye"):
... T = chemfp.get_toolkit(toolkit)
... mol = T.parse_molecule("*", "smistring")
... result = T.create_string(mol, "inchistring", errors="ignore")
... print(toolkit, "returned", repr(result))
...
==============================
*** Open Babel Warning in InChI code
#0 :Unknown element(s): Xx
==============================
*** Open Babel Error in InChI code
InChI generation failed
openbabel returned None
[18:10:46] ERROR: Unknown element(s): *
rdkit returned None
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
openeye returned None
The following uses "``report``". You can see the only addition is the
new line 'ERROR: Open Babel cannot create the InChI string. Skipping.'
For a bit of variation, I also changed things to use ``create_bytes``
instead of ``create_string``::
>>> from chemfp import openbabel_toolkit as T
>>> mol = T.parse_molecule("*", "smistring")
>>> result = T.create_bytes(mol, "inchistring", errors="report")
==============================
*** Open Babel Warning in InChI code
#0 :Unknown element(s): Xx
==============================
*** Open Babel Error in InChI code
InChI generation failed
ERROR: Open Babel cannot create the InChI string. Skipping.
>>> result is None
True
Ignore errors when writing molecules
====================================
In this section you'll learn how to ignore errors and improve error
reporting when writing a file, rather than accept the default of
raising an exception and stopping. You will need a copy of
`ChEBI_lite.sdf.gz
`_.
It's not unusal for there to be a few input records which cannot be
parsed into a molecule. It's much less common to come across a
molecule which cannot be turned into a record. The SMILES and SD file
formats are able to handle a wide range of chemistry. Even R-groups,
which can't directly be expressed as SMILES, can be represented in one
of several conventions, like ``[*:1]`` for R1.
There are no such conventions for InChI. As you saw in the previous
section, it's easy to make a molecule to InChI converter fail if the
structure contains a "\*" atom.
The functions :func:`chemp.toolkit.open_molecule_writer`,
:func:`chemp.toolkit.open_molecule_writer_to_string`, and
:func:`chemp.toolkit.open_molecule_writer_to_bytes` return a molecule
writer. This can be used to write a single molecule at a time, or to
write molecule multiples from an iterator.
What happens if I try to convert the ChEBI file into an InChI file?
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> writer = T.open_molecule_writer("chebi.inchi")
>>> writer.write_molecules(reader)
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
Traceback (most recent call last):
File "", line 1, in
File "chemfp/base_toolkit.py", line 265, in write_molecules
_compat.raise_tb(err[0], err[1])
File "chemfp/_openeye_toolkit.py", line 354, in _gen_write_inchi_structures
error_handler.error(errmsg, location)
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: OEChem cannot create the InChI string, file 'chebi.inchi', record #3
The third record could not be converted to an InChI string, and the
warning message that OEChem printed to the termal shows that the
molecule contained a wildcard atom, that is, the "*" atom. But, did it
really?
::
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(3):
... mol = next(reader)
... print(T.create_string(mol, "smistring"))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[R1]
That shows "R1", not an R-group. What's going on? "R1" isn't even a
valid SMILES.
This is an OEChem extension to SMILES. The default output SMILES
flavor includes the flag "RGroups", which
[c]ontrols whether atoms with atomic number zero (as determined by
the OEAtomBase::GetAtomicNum method), and a non-zero map index (as
determined by the OEAtomBase::GetMapIdx method) should be displayed
using the [R1] notation. In this notation, the integer value
following the R corresponds to the atom’s map index. When this flag
isn’t set, such atoms are written in the Daylight convention [\*:1].
-- `OEChem documenation `_
I'll redo the loop but this time disable the RGroup using the
writer_args option to set the flavor to "``Default,-RGroups``", that is,
the default value but without RGroups being set::
>>> reader = T.read_molecules("ChEBI_lite.sdf.gz")
>>> for i in range(5):
... mol = next(reader)
... print(T.create_string(mol, "smistring",
... writer_args={"flavor": "Default,-RGroups"}))
...
c1cc(c(cc1[C@@H]2[C@@H](Cc3c(cc(cc3O2)O)O)O)O)O
C[C@]12CC[C@H](C1)C(C2=O)(C)C
*C(=O)OC(CO)CO[*:1]
C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1C[C@H](C2=O)O)O
c1cc(c(c(c1)Cl)C#N)Cl
That indeed gives ``[*:1]`` which is the wildcard atom that InChI
complains about.
The molecule writers support the same ``errors`` option as the
molecule readers. The default value is "``strict``", which means to
raise an exception. To ignore errors, use "``ignore``", and to ignore
errors except to report a message to standard out, use "``report``".
>>> from chemfp import openeye_toolkit as T # use your toolkit of choice
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID", errors="ignore")
>>> writer = T.open_molecule_writer("chebi.inchi", errors="report")
>>> writer.write_ids_and_molecules(reader)
The first few and last few lines of output are:
.. code-block:: none
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #3. Skipping.
Warning: Unsupported Sgroup information ignored
Warning: Unsupported Sgroup information ignored
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #13. Skipping.
Warning: Stereochemistry corrected on atom number 2 of
Warning: Unable to create InChI from molecule '' with wild card atoms: OEAtomBase::GetAtomicNum() == 0.
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #133. Skipping.
...
ERROR: OEChem cannot create the InChI string, file 'chebi.inchi', record #94443. Skipping.
Warning: Stereochemistry corrected on atom number 8 of
Warning: Stereochemistry corrected on atom number 13 of
Warning: Stereochemistry corrected on atom number 36 of
where the lines starting "ERROR: OEChem" come from chemfp, and the
others come from OEChem at a lower-level. (Alas, the "report" isn't as
helpful as it should be. I would like it to include the output id in
the error message, but all it gives is the record number. Perhaps it
will be in the next release?)
All told, there were 94633 of which 88839 could be written out. I got
these numbers from the writer's ``location`` property (see
:ref:`location_recno`, below). Its ``recno`` is the number of records
sent to the writer, and ``output_recno`` is the number of records
actually written::
>>> writer.location.recno
94633
>>> writer.location.output_recno
88839
Reader and writer format metadata
=================================
In this section you'll learn about the format ``metadata`` attribute
of the readers and writers. You will need
`Compound_014550001_014575000.sdf.gz
`_
from PubChem if you want to reproduce this for yourself.
Each reader and writer has a ``metadata`` attribute, which stores some
information about the parameters used to open it::
>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_molecules("Compound_014550001_014575000.sdf.gz")
>>> reader.metadata
FormatMetadata(filename='Compound_014550001_014575000.sdf.gz',
record_format='sdf', args={'strictParsing': True, 'removeHs': True, 'sanitize': True})
>>> writer = T.open_molecule_writer(None, "sdf")
>>> writer.metadata
FormatMetadata(filename='', record_format='sdf', args={'kekulize': True, 'includeStereo': False})
The metadata for a structure reader and writer is a
:class:`chemfp.base_toolkit.FormatMetadata` instances, and not the
:class:`chemfp.Metadata` for a fingerprint reader and writer.
The :attr:`~.FormatMetadata.filename` attribute is best effort at a
string representation of the source or destination. It can either be
the original filename (if there is one), the strings "" or
"" for stdin/stout, the string "" if reading or
writing to memory, the source or destination's "name" attribute if a
file object, or None if all else fails.
The :attr:`~.FormatMetadata.record_format` attribute is the format
name for the record, which is the same as the input file format except
without any compression. As you can see in the above example, the
"sdf.gz" reader has a ``record_format`` of "sdf". This parameter is
useful when you want use the :mod:`.text_toolkit` to extract records
because you pass the text reader's record format as the *format* for
the chemistry toolkit's :func:`.toolkit.parse_molecule`.
The :attr:`~.FormatMetadata.args` attribute is the processed
reader_args or writer_args, without any namespacing. For now it's
mostly available for debugging purposes, so you can see how the
toolkit layer actually processed your arguments. In the future there
will be a way to turn this into a text settings dictionary.
.. _location_recno:
Location information: filename, record_format, recno and output_recno
=====================================================================
In this section you'll learn the basics of the
:class:`chemfp.io.Location` API, you'll learn how to get the location
object for each reader and writer, and you'll learn about the
``recno`` and ``output_recno`` location attributes.
(See the next section for details about the ``lineno``, ``offsets``,
``record``, and other location properties which are not available for
every toolkit format.)
The reader and writers track information about the current state of
the reader and writer. Some of this information is more generally
useful, and available through the ``location`` attribute of each
reader and writer::
>>> from chemfp import rdkit_toolkit as T # use your toolkit of choice
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_molecules_from_string(content, "smi")
>>> reader.location
Location('')
>>> loc = reader.location
>>> loc.filename
''
>>> loc.record_format
'smi'
If there is no actual filename then ``filename`` is "" for
string-based I/O, "" when reading from stdin, and ""
when writing to stdout. (The latter two occur when the source or
destination parameter, respectively, are None.) The ``record_format``
is the record format name, without any compression suffix::
>>> writer = T.open_molecule_writer("example.sdf.gz")
>>> writer.location.filename
'example.sdf.gz'
>>> writer.location.record_format
'sdf'
>>> writer.close()
All of the toolkit readers and writers support the ``recno`` location
property, which is the number of records which have been read or
written. A recno of 0 means that no records have been read::
>>> reader = T.read_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> loc.recno
0
>>> next(reader)
>>> loc.recno
1
>>> next(reader)
>>> loc.recno
2
While you could use the ``recno`` property for simple enumeration, as
in the folllowing::
>>> from __future__ import print_function
>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
... loc = reader.location
... for id, mol in reader:
... print("record number:", loc.recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen
I would prefer that you write it with the "`enumerate()
`_"
function, as in::
>>> with T.read_ids_and_molecules_from_string(content, "smi") as reader:
... for recno, (id, mol) in enumerate(reader, 1):
... print("record number:", recno, "id:", id)
...
record number: 1 id: methane
record number: 2 id: oxygen
The enumerate() function is both faster and more expected for this
sort of code. The ``recno`` property exists more to help with error
reporting, and to report summary information, like::
>>> print("Read", reader.location.recno, "records")
Read 2 records
The output writers distinguish between ``recno``, which is the number
of molecules that chemfp tried to save, and ``output_recno``, which is
the number of molecules that could actually be saved. This occurs
because some molecules cannot be written to a given format, like the
SMILES "\*" which has no InChI representation::
>>> from chemfp import openbabel_toolkit
>>> writer = openbabel_toolkit.open_molecule_writer("example.inchi")
>>> parse_molecule = openbabel_toolkit.parse_molecule
>>> writer.write_molecule(parse_molecule("c1ccccc1O", "smistring"))
>>> writer.location.recno
1
>>> writer.location.output_recno
1
>>> writer.write_molecule(parse_molecule("*", "smistring"))
==============================
*** Open Babel Warning in InChI code
#0 :Unknown element(s): Xx
==============================
*** Open Babel Error in InChI code
InChI generation failed
Traceback (most recent call last):
File "", line 1, in
File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/base_toolkit.py", line 253, in write_molecule
_compat.raise_tb(err[0], err[1])
File "", line 1, in raise_tb
File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/_openbabel_toolkit.py", line 1314, in _gen_write_delimited_structures
% (format_name,), location)
File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "", line 1, in raise_tb
chemfp.ParseError: Open Babel cannot create the InChI string, file 'example.inchi', record #2
>>> writer.location.recno
2
>>> writer.location.output_recno
1
.. _location_record_position:
Location information: record position and content
=================================================
In this section you'll learn how to get position information for each
record and information about the content of each record. You will need
the RDKit toolkit or Open Babel toolkit. (Unfortunately for me, OEChem
doesn't have a way to get this information, and my hybrid parser with
improved error reporting proved to be much slower than OEChem's native
performance.) You will also need `Compound_014550001_014575000.sdf.gz
`_
from PubChem.
(See the previous section for details about the ``filename``,
``record_format``, ``recno`` and ``output_recno`` location properties,
which are available for every toolkit format.)
Sometimes you want to know where a record is located in a file. You
might want to report that the unusable record started on line 12345 of
a given file, or you might want to index a file to implement random
access lookup.
The underlying toolkits do not implement this functionality. Instead,
chemfp includes its own SMILES and SDF file readers. These know enough
about the formats to extract a single record, then pass the record to
the toolkit to turn into a molecule. This lets chemfp track the line
number of the start of the record, its byte range, the text of the
current record, and other details.
Timings show that the hybrid parser for the SMILES formats are no
slower than the native RDKit and Open Babel readers, and that the
hybrid SDF parser a bit slower than RDKit's native parser (about 10%)
and slower than Open Babel's native parser. In all cases, OEChem
native parsers leave chemfp in the dust.
As a consequence, the :mod:`.rdkit_toolkit` and
:mod:`.openbabel_toolkit` SMILES readers track more detailed record
information, but the :mod:`.openeye_toolkit` one does not. (The
:mod:`.text_toolkit` of course always tracks that information.) Here
is an example which works for rdkit_toolkit and openbabel_toolkit::
>>> from __future__ import print_function
>>> from chemfp import openbabel_toolkit as T # or rdkit_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> reader = T.read_ids_and_molecules_from_string(content, "smi")
>>> loc = reader.location
>>> for id, mol in reader:
... print("id:", repr(id), "lineno:", loc.lineno, "byte range:", loc.offsets)
... print(" record content:", repr(loc.record))
... print(" first line:", repr(loc.first_line))
...
id: 'methane' lineno: 1 byte range: (0, 10)
record content: b'C methane\n'
first line: 'C methane'
id: 'oxygen' lineno: 2 byte range: (10, 21)
record content: b'O=O oxygen\n'
first line: 'O=O oxygen'
>>> content[0:10]
'C methane\n'
>>> content[10:21]
'O=O oxygen\n'
(Note: if the input record is a Unicode string then it will be
converted into a UTF-8 encoded byte string. The start and end
positions are coordinates in the encoded byte string, not the text
string.)
The :class:`location <.Location>` instance of the rdkit_toolkit SDF
reader gives access to many details about the current parser state::
>>> from chemfp import rdkit_toolkit
>>> reader = rdkit_toolkit.read_molecules("Compound_014550001_014575000.sdf.gz")
>>> next(reader)
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 4227)
>>> reader.location.first_line
'14550001'
>>> next(reader)
>>> reader.location.lineno
166
>>> reader.location.offsets
(4227, 8399)
>>> reader.location.first_line
'14550002'
The openbabel_toolkit and openeye_toolkit implementations by default
don't track this level of detail, because their native readers are
faster than when I can manage in a hybrid reader. Consequently, those
values are None::
>>> from chemfp import openbabel_toolkit
>>> reader = openbabel_toolkit.read_molecules("Compound_014550001_014575000.sdf.gz")
>>> next(reader)
>
>>> print(reader.location.lineno)
None
>>> print(reader.location.offsets)
None
>>> print(reader.location.first_line)
None
There is experimental support to use Open Babel in hybrid mode. The
*reader_args* supports an "implementation" option. The default of
None, or "openbabel", tells chemfp to use Open Babel's native parser,
while specifying "chemfp" tells it to use chemfp's own SDF record
parser::
>>> openbabel_toolkit.get_format("sdf").get_default_reader_args()
{'implementation': None, 'perceive_0d_stereo': False, 'perceive_stereo': False, 'options': None}
>>> reader = openbabel_toolkit.read_molecules("Compound_014550001_014575000.sdf.gz",
... reader_args={"implementation": "chemfp"})
>>> next(reader)
>
>>> reader.location.lineno
1
>>> reader.location.offsets
(0, 4227)
>>> reader.location.first_line
'14550001'
If user-defined selection of the back-end implementation works well, I
may add similar support for the openeye_toolkit, for those who want
the increased level of location detail despite the large performance
impact.
The RDKit "sdf" reader always uses the hybrid. This is for historical
reasons. The hybrid solution was once always faster than the native
ForwardSDMolSupplier. That has since changed, and ForwardSDMolSupplier
is about 10% faster. At some point I will add an 'implementation'
option so you can switch between performance and improved error
reporting.
.. _toolkit_error_handler:
Writing your own error handler (Experimental)
=============================================
In this section you'll learn how to write your own error handler. This
is an advanced topic. Bear in mind that this is highly experimental
and very likely to change. I hope you can provide feedback about how
to improve it.
In earlier sections you learned that when the errors parameter is
"strict", the parser will raise an exception if there's a problem with
a record. When it's "ignore", the record parsers return None as the
molecule, while the file and string readers skip the failing
record. When it's "report", the result is the same as "ignore" except
that extra information about the failure is written to stderr.
The errors parameter can also take an object which implements the
"errors()" method as in the following:
.. code-block:: python
import sys
class OopsHandler(object):
def error(self, msg, location=None):
if location is None:
sys.stderr.write("Oops! %s. Skipping.\n" % (msg,))
else:
sys.stderr.write("Oops! %s, %s. Skipping.\n" % (msg, location.where()))
The ``msg`` is a string describing the error, and ``location``
contains the :class:`chemfp.io.Location` for the given record. Here's
what it looks like in action::
>>> from __future__ import print_function
>>> import sys
>>> from chemfp import rdkit_toolkit as T
>>> T.parse_molecule("Q", "smistring", errors=OopsHandler())
>>> T.parse_molecule("Q", "smistring", errors=OopsHandler())
[13:21:58] SMILES Parse Error: syntax error for input: 'Q'
Oops! RDKit cannot parse the SMILES string 'Q'. Skipping.
>>> for mol in T.read_molecules_from_string("Q Q-ane\nC methane\n", "smi",
... errors=OopsHandler()):
... print("Processed", mol)
...
[13:22:47] SMILES Parse Error: syntax error for input: 'Q'
Oops! RDKit cannot parse the SMILES 'Q', file '', line 1, record #1: first line is 'Q Q-ane'. Skipping.
Processed
The location's :meth:`~.chemfp.io.Location.where` method tries to give useful
information based on the location's filename, line number, record
number, and the first line of the record (up to the first 40
characters).
It's easy to see how to modify this to send the errors to a logger, or
save them up to display in a GUI.
For the hybrid parsers, which give access to the raw record, you can
do more advanced processing, like extract the title lines of any SDF
record which RDKit can't handle. The following will make an
SDF-formatted string containing three records, where the second record
is a 5-valent nitrogren that RDKit can't parse. It will then try to
parse the string, and store the ids for records which couldn't be
parsed.
.. code-block:: python
from __future__ import print_function
from rdkit import Chem
from chemfp import rdkit_toolkit
# Use RDKit to make an SD file which RDKit cannot parse.
methane = rdkit_toolkit.parse_molecule("C methane", "smi")
# Bypass normal sanitization so RDKit will read 5-valent nitrogens
pentavalent_n = rdkit_toolkit.parse_molecule("CN(C)(C)(C)C pentavalent N",
"smi", reader_args={"sanitize": False})
Chem.SanitizeMol(pentavalent_n, Chem.SanitizeFlags.SANITIZE_SETHYBRIDIZATION)
oxygen = rdkit_toolkit.parse_molecule("O=O oxygen", "smi")
# Use the three molecules to make an SD file as a string
with rdkit_toolkit.open_molecule_writer_to_string("sdf") as writer:
writer.write_molecules([methane, pentavalent_n, oxygen])
sdf_content = writer.getvalue()
# User-defined error handler
class CaptureIds(object):
def __init__(self):
self.ids = []
def error(self, msg, location):
self.ids.append(location.first_line)
capture_ids = CaptureIds()
for mol in rdkit_toolkit.read_molecules_from_string(sdf_content, "sdf",
errors=capture_ids):
pass
print("Could not parse:", capture_ids.ids)
The content of ``sdf_content`` is:
.. code-block:: none
methane
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
$$$$
pentavalent N
RDKit
6 5 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
2 4 1 0
2 5 1 0
2 6 1 0
M END
$$$$
oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
$$$$
and the output from the above is:
.. code-block:: none
[13:26:45] Explicit valence for atom # 1 N, 5, is greater than permitted
Could not parse: [u'pentavalent N']
The :ref:`fingerprint type documentation `
includes another example of how to write an error handler.
A Babel-like structure format converter
=======================================
In this section you'll learn how to use the chemfp toolkit API to
create a Babel-like structure file format converter. This section goes
into more details of how to develop real-world software using chemfp.
Pat Walters and Matt Stahl started Babel in the 1990s as a
command-line program to convert from one chemical structure format to
another. This developed over the years, and after a major rewrite
became the LGPL toolkit "OELib", OpenEye's first commercial chemistry
toolkit. OpenEye's next rewrite lead to OEChem, a proprietary
chemistry toolkit. OELib was still available, and others continued to
develop it. It became Open Babel, and structure file format conversion
is still Open Babel's forte.
A full Babel-like program includes features to add and remove
hydrogens of different sorts, select or reject structures based on
substructure or other features, add 2D or 3D coordinates, and
more. You cannot use chemfp for that. All chemfp can do is read
structure files into a given toolkit's molecule object, and write
molecule objects to a given format.
Even that basic ability is useful. I'll explain how to write such a
converter yourself. I'll use as my example file the following,
"example.smi":
.. code-block:: none
c1ccccc1O phenol
C methane
O=O molecular oxygen
Here's a minimal conversion program to convert the above into "example.sdf"::
from chemfp import rdkit_toolkit as T # use your toolkit of choice
reader = T.read_molecules("example.smi")
writer = T.open_molecule_writer("example.sdf")
writer.write_molecules(reader)
That code depends on Python's garbage collection to close the output
file handle. This is fine for a script, but a longer running program
may want to have more explicit control over closing the file handle
and use a context manager (see :ref:`context_managers`)::
from chemfp import rdkit_toolkit as T # use your toolkit of choice
with T.read_molecules("example.smi") as reader:
with T.open_molecule_writer("example.sdf") as writer:
writer.write_molecules(reader)
With that we have enough to build our first Babel program, which takes
the input and output filenames on the command-line. I'll call this
program "cbabel.py", for "chemfp babel", and have it implement the
command-line
.. code-block:: none
usage: cbabel.py [-h] input_filename output_filename
I'll use `argparse
`_ from Python's
standard library to handle command-line argument processing. The
"``nargs=1``" in the following says that the *input_filename* and
*output_filename* must exist, and only one filename is
allowed. Argparse will save those in a list of size 1, which is why I
use [0] to get the actual string I'm interested in::
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
with T.read_molecules(args.input_filename[0]) as reader:
with T.open_molecule_writer(args.output_filename[0]) as writer:
writer.write_molecules(reader)
I'll convert the SMILES into canonical SMILES:
.. code-block:: console
% python cbabel.py example.smi example.can
% cat example.can
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
The only change is that the phenol went from ``c1ccccc1O`` to
``Oc1ccccc1``.
I'll add the ability to read from stdin and stdout. I'll say that if
the input filename is "-" then it will read from stdin, and if the
output filename is "-" then it will write to stdout. (If you have a
file named "-" then you'll have to specify "./-" to read or write to
it.):
.. code-block:: python
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename) as reader:
with T.open_molecule_writer(output_filename) as writer:
writer.write_molecules(reader)
There's a problem with this! When the input or output format is None,
chemfp can't figure out the format based on the filename, so will
assume that it's a SMILES file. When I run the above I get SMILES
output:
.. code-block:: console
% python cbabel.py example.smi -
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
But what if I want SDF output? I need a way to specify the input and
output file formats on the command-line. I'll use the :option:`-i` and
:option:`-o` options to specify those:
.. code-block:: python
from __future__ import print_function
import argparse
from chemfp import rdkit_toolkit as T
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
help="output format name", default=None)
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename, args.input_format) as reader:
with T.open_molecule_writer(output_filename, args.output_format) as writer:
writer.write_molecules(reader)
Now I can specify that I want stdout to be in SDF format:
.. code-block:: console
% python x.py -o sdf example.smi - | head -5
phenol
RDKit
7 7 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
In practice, required command-line arguments make life more
difficult. For a simple program like this, required arguments are not
a problem, but what if I want to add a command to list the available
formats? That option doesn't need an input or output filename, but
argparse will enforce that requirement anyway.
There are a couple of ways to solve the problem, but the easiest is to
let "-" be the default input and output filename. That's easily done
by changing:
.. code-block:: python
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
to:
.. code-block:: python
parser.add_argument("input_filename", nargs="?", default="-",
help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
help="output filename")
As a result I can add a new :option:`--list-formats` argument::
parser.add_argument("--list-formats", action="store_true",
help="list the available file formats")
along with the handler for it, which prints information about the
toolkit (its name and version string) and each of the formats. Some of
the formats, like "smistring", don't have an I/O format (for that, use
"smi"), so I need to filter those out. Also, some of the formats, like
"inchikey", are output only, and some of the toolkit have formats that
they read but don't write, so I give more details about those::
args = parser.parse_args()
if args.list_formats:
print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
for format in T.get_formats():
if not format.supports_io: # skip formats like "smistring" and "inchistring"
continue
if not format.is_output_format:
msg = " (input only)"
elif not format.is_input_format:
msg = " (output only)"
else:
msg = ""
print(" %s%s" % (format.name, msg))
raise SystemExit(0)
For my version of RDKit I get:
.. code-block:: console
% python cbabel.py --list-formats
Available I/O formats for toolkit rdkit (RDKit/2016.09.3)
inchikey (output only)
usm
sdf
can
smi
inchi
If I used openeye_toolkit instead of rdkit_toolkit I get:
.. code-block:: none
Available I/O formats for toolkit openeye (OEChem/20170208)
mol2h
xyz
mopac (output only)
cdx
usm
smi
inchikey (output only)
skc (input only)
mol2
sdf
mmod
inchi (output only)
oeb
mf (output only)
sln (output only)
can
pdb
The code so far requires RDKit, but chemfp supports OEChem and Open
Babel. Why not add the command-line argument :option:`--toolkit` to
specify an alternate toolkit?
I 'll tell argparse that there's a new :option:`--toolkit` argument,
which defaults to "rdkit" and also allows "openeye" and "openbabel"::
parser.add_argument("--toolkit", metavar="NAME", choices=("rdkit", "openeye", "openbabel"),
help="toolkit name", default="rdkit")
I can no longer import the toolkit directly, which I did as::
from chemfp import rdkit_toolkit as T
because that line requires that RDKit be installed. Otherwise it will
raise an ImportError exception. While that might be reasonable if the
user wanted to use the rdkit toolkit, it's not reasonable if the user
wanted to use the Open Babel toolkit and didn't care to know that
RDKit isn't available.
Instead of a direct import, I'll use :func:`chemfp.get_toolkit` to get
the named toolkit. It raises a ValueError with a useful error message
if the toolkit isn't available or is unknown. If that happens, I'll
exit, and use that message as the explanation::
import chemfp
# ... skipped many lines
try:
T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
raise parser.error(str(err))
After a bit of experimentation I found a small SMILES string which
gives a different canonicalization for each of the supported toolkits,
which I present as evidence that it really is using a different
toolkit:
.. code-block:: console
% echo "NCC(N)O example" | python cbabel.py --toolkit openbabel
NCC(O)N example
% echo "NCC(N)O example" | python cbabel.py --toolkit rdkit
NCC(N)O example
% echo "NCC(N)O example" | python cbabel.py --toolkit openeye
C(C(N)O)N example
Here's the final code, so you can see how everything works in
context:
.. code-block:: python
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "A minimial chemical structure file converter"
)
parser.add_argument("--toolkit", metavar="NAME", choices=("rdkit", "openeye", "openbabel"),
help="toolkit name", default="rdkit")
parser.add_argument("-i", metavar="FORMAT", dest="input_format",
help="input format name", default=None)
parser.add_argument("-o", metavar="FORMAT", dest="output_format",
help="output format name", default=None)
parser.add_argument("input_filename", nargs="?", default="-",
help="input filename")
parser.add_argument("output_filename", nargs="?", default="-",
help="output filename")
parser.add_argument("--list-formats", action="store_true",
help="list the available file formats")
args = parser.parse_args()
try:
T = chemfp.get_toolkit(args.toolkit)
except ValueError as err:
raise parser.error(str(err))
if args.list_formats:
print("Available I/O formats for toolkit %s (%s)" % (T.name, T.software))
for format in T.get_formats():
if not format.supports_io: # skip formats like "smistring" and "inchistring"
continue
if not format.is_output_format:
msg = " (input only)"
elif not format.is_input_format:
msg = " (output only)"
else:
msg = ""
print(" %s%s" % (format.name, msg))
raise SystemExit(0)
# Support "-" as stdin/stdout by mapping it to None,
# which tells chemfp to use stdin/stout
input_filename = args.input_filename[0]
if input_filename == "-":
input_filename = None
output_filename = args.output_filename[0]
if output_filename == "-":
output_filename = None
with T.read_molecules(input_filename, args.input_format) as reader:
with T.open_molecule_writer(output_filename, args.output_format) as writer:
writer.write_molecules(reader)
Amazing how the original four lines of code expands to 55. It would be
even more if I added full error reporting instead of letting Python
throw an exception on errors.
Speaking of errors, you may want to use hard-coded values of
``errors="ignore"`` or ``errors="report"`` to have the parser skip
records that the toolkit doesn't understand, or perhaps pass in that
information as a command-line argument named :option:`--errors`, with
the possible choices of "strict", "report", or "ignore".
You might also add the :option:`-R` and :option:`-W` options to set
the reader args and writer args for the formats, but that's more
complicated than I wanted to describe in this context. See the next
section for a description of how to do it.
.. _argparse_text_settings:
argparse text settings to reader and writer args
================================================
In this section you'll learn how to use argparse to handle reader args
and writer args in the same style that chemfp does.
The previous section showed how to create a Babel-like structure
format conversion program and how to use Python's `argparse
`_ library for
command-line processing. That section was getting too long to describe
how to support command-line configuration of the reader args and
writer args. In this section I'll start with a smaller version of the
same code. This one requires an input filename and an output filename
on the command-line, and lets the user specify the toolkit::
# I put this into a file named "convert.py"
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "Experiment with -R and -W options"
)
parser.add_argument("--toolkit", metavar="NAME", choices=("rdkit", "openeye", "openbabel"),
help="toolkit name", default="rdkit")
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
args = parser.parse_args()
T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]
with T.read_molecules(source) as reader:
with T.open_molecule_writer(destination) as writer:
writer.write_molecules(reader)
I'll walk through the process of how to add support for the
:option:`-R` and :option:`-W` options, to make it possible to say:
.. code-block:: console
python convert.py example.smi example.can --toolkit rdkit -R delimiter=space -W allBondsExplicit=true
How to get from the command-line to reader and writer arguments
---------------------------------------------------------------
This requires a few conversions. I need to turn the command-line
arguments into reader and writer text settings dictionaries, then
convert the text settings into reader_args and and writer_args
dictionaries, before finally passing the reader_args and writer_args
to the molecule readers and writers. (See
:ref:`text_settings_to_reader_and_writer_args` for more details about
converting text settings to reader and writer arguments.)
I'll use argparse to place the :option:`-R` and :option:`-W` option
values into separate lists of ``KEY=VALUE`` strings, and create a new
function which splits them apart on the "``=``" to get a dictionary of
text settings. Then I'll use the :class:`.Format` object to convert the
text settings into the correct *reader_args* and *writer_args*. The steps
will look something like this::
>>> from chemfp import rdkit_toolkit as T
>>>
>>> format = T.get_format("smi") # Specify the format and user-defined settings
>>> reader_settings = ["delimiter=space"]
>>> writer_settings = ["allBondsExplicit=true"]
>>>
>>> # Using a function yet to be defined, convert the list of
... # reader_settings into a dictionary of string values
...
>>> reader_text_settings = parse_text_settings("-R", reader_settings)
>>> reader_text_settings
{'delimiter': 'space'}
>>>
>>> # Ask the format to turn the string values into string objects
...
>>> format.get_reader_args_from_text_settings(reader_text_settings)
{'delimiter': 'space'}
>>>
>>> # Do the same for the writer arguments
...
>>> writer_text_settings = parse_text_settings("-W", writer_settings)
>>> writer_text_settings
{'allBondsExplicit': 'true'}
>>> format.get_writer_args_from_text_settings(writer_text_settings)
{'allBondsExplicit': True}
For the actual code the input format may be different than the output
format. By the way, if you look closely you'll see how
"allBondsExplicit" in the text settings has a value of "true", and the
string was converted to the Python object True to be a writer_arg.
To start! First, I need a way to read the list of :option:`-R` and :option:`-W`
options. I'll ask argparse to save them into a list, for later
post-processing to get the right values::
parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
help="specify a writer argument", default=[])
This will parse all of the :option:`-R` terms, like "``-R delimiter=space``", into
the *reader_settings* list, and "``-W allBondsExplicit=true``" into the
*writer_settings* list, such that::
args.reader_settings == ["delimiter=space"]
args.writer_settings == ["allBondsExplicit=true"]
For that matter, it will also support "``-R abc``", and put it into
the *reader_settings* list even though it doesn't have a "``=``" in
it. I also need to go through and figure out if any terms are
incorrect, and report the problem. I'll make a function for this,
along with a parameter so any error message can report if a problem
comes from the :option:`-R` or :option:`-W` command-line flag::
def parse_text_settings(flag, terms):
text_settings = {}
for term in terms:
left, mid, right = term.partition("=")
if mid != "=":
parser.error("%s setting %r must be of the form KEY=VALUE" %
(flag, term))
text_settings[left] = right
return text_settings
reader_text_settings = parse_text_settings("-R", args.reader_settings)
writer_text_settings = parse_text_settings("-W", args.writer_settings)
This gives me two text settings dictionaries, where the keys and
values are both strings. I'll use the respective :class:`.Format` object
to convert a text setting dictionary into the correct reader and
writer arguments dictionary::
input_format = T.get_input_format_from_source(source)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)
output_format = T.get_output_format_from_destination(destination)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)
All that's left is to pass the *reader_args* and *writer_args* to the
reader and writer. Since I already have the input and output format
objects, I'll pass those in as well, rather than have them guess again
based on the source and destination names::
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
Converter with -R and -W support
--------------------------------
Here's how it looks when I put it all together:
.. code-block:: python
# I put this into a file named "convert.py"
import argparse
import chemfp
parser = argparse.ArgumentParser(
description = "Experiment with -R and -W options"
)
parser.add_argument("--toolkit", metavar="NAME", choices=("rdkit", "openeye", "openbabel"),
help="toolkit name", default="rdkit")
parser.add_argument("-R", metavar="KEY=VALUE", dest="reader_settings", action="append",
help="specify a reader argument", default=[])
parser.add_argument("-W", metavar="KEY=VALUE", dest="writer_settings", action="append",
help="specify a writer argument", default=[])
parser.add_argument("input_filename", nargs=1, help="input filename")
parser.add_argument("output_filename", nargs=1, help="output filename")
def parse_text_settings(flag, terms):
text_settings = {}
for term in terms:
left, mid, right = term.partition("=")
if mid != "=":
parser.error("%s setting %r must be of the form KEY=VALUE" %
(flag, term))
text_settings[left] = right
return text_settings
args = parser.parse_args()
T = chemfp.get_toolkit(args.toolkit)
source = args.input_filename[0]
destination = args.output_filename[0]
input_format = T.get_input_format_from_source(source)
reader_text_settings = parse_text_settings("-R", args.reader_settings)
reader_args = input_format.get_reader_args_from_text_settings(reader_text_settings)
output_format = T.get_output_format_from_destination(destination)
writer_text_settings = parse_text_settings("-W", args.writer_settings)
writer_args = input_format.get_writer_args_from_text_settings(writer_text_settings)
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
Let's see it in action. I'll ask RDKit to include all of the bonds in
the output SMILES, including the aromatic bonds, and I'll ask it to
use the space character as the SMILES delimiter::
% python convert.py example.smi example_output.smi --toolkit rdkit -R delimiter=space -W allBondsExplicit=true
% cat example_output.smi
O-c1:c:c:c:c:c:1 phenol
C methane
O=O molecular
The lack of "oxygen" in "molecular oxygen" shows that the input SMILES
reader used the "space" delimiter instead of the default "to-eol"
delimiter, just as I requested.
The :option:`-R` and :option:`-W` settings can also be qualified. (See
:ref:`qualified_parameter_names`.) I'll have Open Babel and OEChem use
different delimiter styles to get different results::
% python convert.py example.smi example_ob_output.smi --toolkit openbabel \
-R "openbabel.*.delimiter=to-eol" -R "openeye.*.delimiter=whitespace"
% cat example_ob_output.smi
Oc1ccccc1 phenol
C methane
O=O molecular oxygen
%
% python convert.py example.smi example_oe_output.smi --toolkit openeye \
-R "openbabel.*.delimiter=to-eol" -R "openeye.*.delimiter=whitespace"
% cat example_oe_output.smi
c1ccc(cc1)O phenol
C methane
O=O molecular
List the reader and writer arguments for the given formats
----------------------------------------------------------
Finally, it's difficult to remember all of the available settings for
each input and output format. I'll add a :option:`--list-args`
command-line option which shows the available options, and for each
option show the current setting, along with an indicator if the
current setting is the default value for that format or if the setting
comes from the command-line option.
I need argparse to know about the new option::
parser.add_argument("--list-args", action="store_true",
help="list the available input and output options")
and for the rest I replace the last three lines of the earlier code
with:
.. code-block:: python
if args.list_args:
# Make a helper function to display the arguments
def report_args(format, msg, default_args, specified_args):
print("%s %s:" % (format.name, msg))
# Merge the arguments; command-line overrides defaults;
all_args = default_args.copy()
all_args.update(specified_args)
for name, value in sorted(all_args.items()):
# Was the name specified via -R/-W or is it a default?
where = "from command-line" if name in specified_args else "default value"
print(" %s: %r (%s)" % (name, value, where))
report_args(input_format, "reader arguments (-R)",
input_format.get_default_reader_args(), reader_args)
report_args(output_format, "writer arguments (-W)",
output_format.get_default_writer_args(), writer_args)
else:
with T.read_molecules(source, input_format, reader_args=reader_args) as reader:
with T.open_molecule_writer(destination, output_format, writer_args=writer_args) as writer:
writer.write_molecules(reader)
(See :ref:`get_default_reader_and_writer_args` for more details on the
default reader and writer arguments.)
With those changes, the output using the new :option:`--list-args` is:
.. code-block:: console
% python convert.py example.smi example_output.smi --toolkit rdkit \
? -R delimiter=space -W allBondsExplicit=true --list-args
smi reader arguments (-R):
delimiter: 'space' (from command-line)
has_header: False (default value)
sanitize: True (default value)
smi writer arguments (-W):
allBondsExplicit: True (from command-line)
canonical: True (default value)
delimiter: None (default value)
isomericSmiles: True (default value)
kekuleSmiles: False (default value)
Creating a specialized record parser
====================================
In this section you'll learn how to make a specialized function to
parse an record into a toolkit molecule. This function is somewhat
faster than calling the more general purpose
:func:`.toolkit.parse_id_and_molecule` and might be used when you need
to convert a lot of individual records into a molecule.
Sometimes you need to parse a lot of records which don't come from a
file. For example, substructure search is typically split into a
screening stage based on substructure fingerprints, followed by the
atom-by-atom substructure search. The screening stage returns
identifiers and the substructure search takes molecules, so in between
them is code to look up a record based on its id and convert the
result to a molecule.
Assuming a database record API where "``db[id]``" returns the record
for a given id, that lookup function might look like this:
.. code-block:: python
def get_molecules(db, id_iter, toolkit, format, reader_args=None):
for id in id_iter:
record = db[id]
mol = toolkit.parse_molecule(record, format, reader_args=reader_args)
yield mol
(A more complex implementation should handle when the record id
doesn't exist, or can't be converted into a molecule.)
This isn't as fast as it could be. The :func:`.toolkit.parse_molecule`
function validates that the *format* and *reader_args* are correct and
figures out the right parameters for the underlying toolkit code. It's
a waste of time to redo those checks for every single call.
The function also promises that the caller will get a new molecule
each time. That promise isn't needed for substructure
screening. Timing tests with OEChem show that reusing the same
molecule is faster than creating a new one. For example, this OEChem
code::
mol = OEGraphMol()
for i in xrange(100000):
OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")
mol.Clear()
takes about 60% of the time as this code::
for i in xrange(100000):
mol = OEGraphMol()
OEParseSmiles(mol, "c1ccccc1Oc1ccccc1")
(Bear in mind that this code isn't doing aromaticity perception, which
roughly halves the performance.)
The function :func:`.toolkit.make_id_and_molecule_parser` returns a
specialized function to parse records, based on the specified
parameters::
>>> from chemfp import rdkit_toolkit as T
>>> parser = T.make_id_and_molecule_parser("smi")
>>> parser("c1ccccc1O phenol")
(u'phenol', )
For RDKit it's about 10-15% faster to use the specialized function
instead of the general purpose :func:`.toolkit.parse_molecule`::
>>> from __future__ import print_function
>>> from chemfp import rdkit_toolkit as T
>>> import time
>>>
>>> smiles = "c1ccccc1Oc1ccccc1"
>>> if 1:
... t1 = time.time()
... for i in xrange(10000):
... mol = T.parse_molecule(smiles, "smi")
... print("Standard time:", time.time()-t1)
...
Standard time: 2.32303786278
>>> parser = T.make_id_and_molecule_parser("smi")
>>> if 1:
... t1 = time.time()
... for i in xrange(10000):
... id, mol = parser(smiles)
... print("Specialized time:", time.time()-t1)
...
Specialized time: 2.74086713791
The :func:`.toolkit.make_id_and_molecule_parser` function parameters
are almost identical to the ones in
:func:`.toolkit.parse_id_and_molecule`, and with the same meaning. The
only difference is that ``make_id_and_molecule_parser`` does not
support the *record* parameter. Instead, it returns a function which
takes the *record* and returns the (id, toolkit molecule) pair::
>>> from chemfp import rdkit_toolkit
>>> parser = rdkit_toolkit.make_id_and_molecule_parser(
... "smi", reader_args={"delimiter": "whitespace"}, errors="ignore")
>>> parser("c1ccccc1O methane 16.04246")
(u'methane', )
>>> parser("Q q-ane")
[14:23:02] SMILES Parse Error: syntax error for input: 'Q'
(None, None)
WARNING: The function that ``make_id_and_molecule_parser()`` returns
may reuse the underlying molecule object. Calling the function again
may change the molecule returned in previous call::
>>> from chemfp import openeye_toolkit
>>> parser = openeye_toolkit.make_id_and_molecule_parser("smi")
>>> id, mol = parser("C")
>>> mol.NumAtoms()
1
>>>
>>> parser("CCC")
(None, >)
>>> mol.NumAtoms()
3
RDKit doesn't support molecule reuse so rdkit_toolkit returns a new
molecule. Open Babel does support reuse and openbabel_toolkit will
reuse the molecule. However, my tests using Open Babel show a barely
detectable performance improvement if I reuse a molecule vs. creating
a new one each time. Future versions of chemfp may change the default,
and may add an implementation option to specify if a new molecule
should be returned each time.
In multithreaded code you should create a new parser for each thread.
You might have noticed there is no "``make_molecule_parser()``". While
it would be useful, it takes time to develop, test, and document, and
it wasn't useful enough for this release. Let me know if you would
like it in the future.
Molecule API: Get and set the molecule id
=========================================
In this section you'll learn how to get and set the molecule id for a
toolkit molecule.
Sometimes you want to get or set toolkit molecule id. This should be
pretty rare because the input routines all support a way to get the
identifier in parallel with the molecule, and the output routines all
support a way to specify an identifier.
One exception is if you read molecules from an SD file where you want
to use one of the SD tag values as the identifier rather than the
title line at the top of the record. This can occur with the ChEBI
data set::
>>> from chemfp import rdkit_toolkit as T
>>> reader = T.read_ids_and_molecules("ChEBI_lite.sdf.gz", id_tag="ChEBI ID")
>>> next(reader)
(u'CHEBI:90', )
>>> id, mol = _
>>> id
u'CHEBI:90'
>>> mol
>>> mol.GetProp("_Name")
''
>>> print(reader.location.record[:200])
Marvin 01211310252D
22 24 0 0 0 0 999 V2000
-2.8644 -0.2905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.8656 -1.1176 0.0000 C 0 0 0 0 0 0 0
>>>
Note: in Python 3, the location.record is a byte string and the last
output line is:
>>> print(reader.location.record[:200])
b'\n Marvin 01211310252D \n\n 22 24 0 0 0 0 999 V2000
\n -2.8644 -0.2905 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n -2.
8656 -1.1176 0.0000 C 0 0 0 0 0 0 0 '
The above used the RDKit-specific way to get the special "_Name"
property and show that it's the empty string. The location object for
the rdkit_toolkit SDF reader is able to show the raw text for the
current record. In the above I used the location.record to show that
the record indeed has no title line.
I might want to tie that id directly to the molecule. For example, a
lot of people write code which assume that the molecule's name or
title is the identifier, because only ChEBI and a scant handful of
other databases use an alternative convention. You can use chemfp to
get the appropriate id, then set the correct molecular property.
If I know it's an RDKit molecule then I could do::
mol.SetProp("_Name", id)
This is not portable. OEChem and Open Babel call this a "title", and
use the molecule's "GetTitle()" and "SetTitle()" accession methods to
get and set it. For those toolkits I would need to do::
mol.SetTitle(id)
As part of chemfp's limited molecule API, each chemfp toolkit layer
implements a portable helper function named "get_id()" to get the
toolkit-appropriate "identifier", and "set_id()" to set it. The
following shows an example of converting the title of a SMILES record
to upper-case, and generating the corresponding canonical SMILES::
>>> from __future__ import print_function
>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel"):
... T = chemfp.get_toolkit(toolkit_name)
... mol = T.parse_molecule("c1ccccc1O phenol", "smi")
... T.set_id(mol, T.get_id(mol).upper())
... smiles = T.create_string(mol, "smi")
... print(toolkit_name, "->", repr(smiles))
...
rdkit -> u'Oc1ccccc1 PHENOL\n'
openeye -> u'c1ccc(cc1)O PHENOL\n'
openbabel -> u'Oc1ccccc1 PHENOL\n'
Please note that this could be written more succinctly by passing the
``id`` directly to the :func:`chemfp.toolkit.create_string()`
function, as::
>>> from __future__ import print_function
>>> import chemfp
>>> for toolkit_name in ("rdkit", "openeye", "openbabel"):
... T = chemfp.get_toolkit(toolkit_name)
... id, mol = T.parse_id_and_molecule("c1ccccc1O phenol", "smi")
... smiles = T. create_string(mol, "smi", id=id.upper())
... print(toolkit_name, "->", repr(smiles))
...
rdkit -> 'Oc1ccccc1 PHENOL\n'
openeye -> 'c1ccc(cc1)O PHENOL\n'
openbabel -> 'Oc1ccccc1 PHENOL\n'
Note: I may add support for an optional id_tag, as in::
T.get_id(mol, id_tag="ChEBI id") # Currently not valid chemfp code!
If you think this would be useful, please let me know about your use
case.
Finally, if you want the output record as a UTF-8 encoded byte string
rather than a Unicode string then use
:func:`chemfp.toolkit.create_bytes()` instead of ``create_string()``.
Molecule API: Copy a molecule
==============================
In this section you'll learn how to make a copy of a native toolkit
molecule.
The chemfp file readers may clear and reuse the underlying toolkit
molecule. This is a problem if you want to load all of the molecules
from a data set into memory::
>>> from chemfp import openeye_toolkit
>>> content = "C methane\nO=O oxygen\n"
>>> mols = list(openeye_toolkit.read_molecules_from_string(content, "smi"))
>>> mols
[ >,
>]
>>> mols[0] is mols[1]
True
>>> openeye_toolkit.create_string(mols[0], "smistring")
''
>>> openeye_toolkit.create_string(mols[1], "smistring")
''
You can see that the openeye_toolkit reader reuses the same
OEGraphMol, and that the molecule is cleared at the end of parsing.
In the future there may be a *reader_args* parameter to tell the reader
to make a new molecule for each term. Until that possible future
happens, one work-around is to make a copy of the molecule using the
respective chemfp toolkit's :func:`.toolkit.copy_molecule` function::
>>> from chemfp import openeye_toolkit as T
>>> mols = [T.copy_molecule(mol) for mol in openeye_toolkit.read_molecules_from_string(content, "smi")]
>>> mols
[ >,
>]
>>> mols[0] is mols[1]
False
>>> T.create_string(mols[0], "smistring")
'C'
>>> T.create_string(mols[1], "smistring")
'O=O'
The various writers may also modify the molecule, for example, by
temporarily changing the molecule id or by reperceiving
aromaticity. If this is a problem then you can use the copy_molecule()
as a way to work around it.
This is definitely a work-around solution because it's currently
impossible to know if a copy is needed or not. The fail-safe solution
is to always copy, which will lead to extra copies and slower code
when using the rdkit_toolkit. Other more complicated workarounds might
be faster, but the real solution that I hope to implement in the
future is to specify the requested behavior as a parameter.
Molecule API: Working with SD tags
==================================
In this section you'll learn how to work with SD tag data.
Chemfp supports a limited cross-toolkit API for working with SD
tags. You can get a value for a single tag, the list of all tags and
values, and add (and potentially replace) a tag with a given name.
NOTE: This is not a general-purpose SD tag API.
The two main goals of the SD tag API are to get a tag's value (most
likely for use as an identifier) and to add a fingerprint or
similarity search result to a molecule. This can be done with the
toolkit's :func:`.toolkit.add_tag` and :func:`.toolkit.get_tag`
functions::
>>> from chemfp import rdkit_toolkit as T
>>> mol = T.parse_molecule("O=O oxygen", "smi")
>>> T.add_tag(mol, "score", "0.9851")
>>> T.get_tag(mol, "score")
u'0.9851'
>>> print(T.create_string(mol, "sdf"))
oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
>
0.9851
$$$$
If a given tag already exists then ``add_tag()`` may replace the
existing value, or it may add a second tag with the same name. (Eg,
rdkit_toolkit currently replaces an existing tag while openeye_toolkit
creates a second entry.) Chemfp does no additional error checking, so
please be careful about the use of ">" and newline characters in the
tag value.
It is sometimes useful to get all of the tags and corresponding
values. The toolkit's :func:`.toolkit.get_tag_pairs` function returns
these as a list of 2-element tuples, where the first term is the tag
name and the second is the value::
>>> T.add_tag(mol, "best_id", "ABC00000123")
>>> T.add_tag(mol, "text", "This continues\nacross multiple\nlines")
>>> T.get_tag_pairs(mol)
[(u'score', u'0.9851'), (u'best_id', u'ABC00000123'), (u'text', u'This continues\nacross multiple\nlines')]
>>> print(T.create_string(mol, "sdf"))
oxygen
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
M END
>
0.9851
>
ABC00000123
>
This continues
across multiple
lines
$$$$
If there are multiple tags with the same name then ``get_tag()``
arbitrarily decides which value to return. The ``get_tag_pairs()``
function includes duplicates if the underlying toolkit supports it.
.. _add_fingerprint_as_sd_tag:
Add fingerprints to an SD file using a toolkit
==============================================
In this section you'll learn how to add a fingerprint as a tag to the
structures in an SD file using a chemistry toolkit.
The FPS and FPB fingerprint file formats store the record id and the
fingerprint, but not the original structure. The most common way to
tie the structure to a fingerprint is to use an SD file, and store the
fingerprint as one of the tag values. (Another is to create a SMILES
file variant, also called a CSV file, with the fingerprint as a new
column.)
The following will parse an SD file, and for each molecule it will
compute the MACCS fingerprints and add the base64-encoded fingerprint
to the molecule using the unimaginative tag name "FP". It will save
the results to the file named "example.sdf", which is equally
unimaginative:
.. code-block:: python
import sys
import base64
import chemfp
# Portable code to convert a fingerprint to a string
# which the underlying toolkits will accept.
#
# b64encode returns a byte string, which is fine for
# all toolkits under Python 2.
if sys.version_info.major == 2:
b64encode = base64.b64encode
else:
# Under Python 3, RDKit and Open Babel accept a byte string.
# OEChem does not. Always convert to Unicode.
def b64encode(s):
return base64.b64encode(s).decode("ascii")
# Select your toolkit of choice
fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
#fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
#fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
T = fptype.toolkit
reader_args = {"rdkit.sdf.removeHs": False}
with T.read_molecules("Compound_014550001_014575000.sdf.gz",
reader_args=reader_args) as reader:
with T.open_molecule_writer("example.sdf") as writer:
for mol in reader:
fp = fptype.compute_fingerprint(mol)
T.add_tag(mol, "FP", b64encode(fp))
writer.write_molecule(mol)
This is a very general purpose solution. It's easy to change the
fingerprint type, or switch the input to a SMILES file or other
supported structure file format.
(Unfortunately, there is no general purpose base64 encoder which works
across all toolkits and both Python 2 and Python 3. Hence the
complicated code to do the right thing.)
What it doesn't do is preserve all of the details of the input
records. It converts the input record into a molecule, and back out to
a new record, and the intermediate record doesn't keep all of the
details.
For example, if I use OpenEye-MACCS166 and compare the first
record of the original with the first record of the transformed output
then the diff comparison is:
.. code-block:: diff
2c2
< -OEChem-03291708342D
---
> -OEChem-04301702502D
164a165,167
> >
> AACACAAAgUBgAOImoJBhQt+OKnwb
>
This says that the second line changed, and three new lines were added
at line 164.
The second line contains a date stamp, so this isn't a big change, and
the three new lines are the ones I requested. This doesn't look like
much of a change, but that's because OEChem was used to make the
record in the first place. Open Babel and RDKit have their own set of
differences from the OEChem output defaults. For example, RDKit will
sort the SD tags alphabetically.
I wanted to compare the original OEChem-based PubChem record to the
output record from RDKit. I commented/uncommented the fingerprint
names to use RDKit instead of OEChem. When I did this originally
(since fixed), I noticed that the atom and bond counts line changed.
The first problem I noticed,
before I fixed it, is that the atom and bond counts line changed. The
original record has 26 atoms and bonds:
.. code-block:: none
26 26 0 0 0 0 0 0 0999 V2000
while the RDKit output said there are only 15 atoms and bonds:
.. code-block:: none
15 15 0 0 0 0 0 0 0 0999 V2000
What happened is that RDKit by default will convert explicit hydrogens
to implicit hydrogens as part of the input process, while OEChem does
not.
I can disable that in RDKit using the *removeHs* reader_arg, which is
in the code I showed earlier::
reader_args = {"rdkit.sdf.removeHs": False}
With *removeHs* disabled, the RDKit atom counts match the original
atom counts. There are still a few differences in the molblock.
* RDKit places a "0" in the obsolete 4th field of the counts line,
while OEChem leaves it empty.
* RDKit uses the CHG property block and does not include duplicate
charge information in the atom line. The PubChem file only stores
charge information in the atom line.
* RDKit leaves the last three fields empty, while PubChem
uses 0. These fields are respectively 'obsolete', used for "SSS
queries only", and used for "Reaction, Query".
That aside, the MACCS fingerprints should be the same, right?
They are not. The RDKit (and Open Babel and CDK) MACCS keys
implementations assume that all hydrogens are implicit. If there are
explicit hydrogens then they will likely give a different
fingerprint. If you run the above code using RDKit, with and without
*removeHs*, you'll see two different values for FP::
AACAAAAAgUBgAKImoBBhSl/Orn0f # RDKit, removeHs=True
AACAAAAAgUBgAKMmoJBhal/Orn0f # RDKit, removeHs=False
See :ref:`maccs_and_hydrogens` for a more detailed description of the
problem.
I'm left with the unfortunate situation where I can't preserve the
explicit hydrogens without affecting the MACCS fingerprints. I think
the right solution is to fix the SMARTS patterns that RDKit and others
use (which is a goal of chemfp's own RDMACCS fingerprints).
Another solution for this is to use the :ref:`text_toolkit
` to preserve the input SDF record syntax, and
combine it with a chemistry toolkit to get the molecule you want.