.. _fingerprint_family_and_type_examples:
.. highlight:: pycon

====================================
Fingerprint family and type examples
====================================

This chapter describes how to use the fingerprint family and
fingerprint type API added in chemfp 2.0.


Fingerprint families and types
==============================

In this section you'll learn the difference between a fingerprint
family and a fingerprint type. You will need
`Compound_014550001_014575000.sdf.gz
<ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_014550001_014575000.sdf.gz>`_
from PubChem to work through all of the examples.


Chemfp distinguishes between a "fingerprint family" and a "fingerprint
type." A fingerprint family describes the general approach for doing a
fingerprint, like "the OpenEye path-based fingerprint method", while a
fingerprint type describes the specific parameters used for a given
approach, such as "the OpenEye path-based fingerprint method using
path lengths between 0 and 5 bonds, where the atom types are based on
the atomic number and aromaticity, and the bond type is based on the
bond order, mapped to a 256 bit fingerprint."

(In object-oriented terms, a fingerprint family is the class and a
fingerprint type is an instance of the class.)

I'll use :func:`chemfp.get_fingerprint_family` to get the
:class:`.FingerprintFamily` for "OpenEye-Path". On the laptop where I'm
writing the documentation, this resolves to what chemfp calls version
"2"::

    >>> from __future__ import print_function
    >>> import chemfp 
    >>> family = chemfp.get_fingerprint_family("OpenEye-Path")
    >>> family
    FingerprintFamily(<OpenEye-Path/2>)

The fingerprint family can be called like a function to return a
:class:`.FingerprintType`. If you call it with no arguments it will
use the default parameters for that family. I'll do that, then use
:meth:`~.FingerprintType.get_type` to get the fingerprint type string,
which is the canonical representation of the fingerprint family name,
version, and parameters::

    >>> fptype = family()
    >>> fptype.get_type()
    'OpenEye-Path/2 numbits=4096 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral'

A 4096 bit fingerprint is rather large. I'll make a new OpenEye-Path
fingerprint type, but this time with only 256 bits. That's small
enough that the resulting fingerprint will fit on a line of
documentation. All of the other parameters will be unchanged::

  >>> fptype = family(numbits=256)
  >>> fptype
  <chemfp.openeye_types.OpenEyePathFingerprintType_v2 object at 0x10b9c4e90>
  >>> print(fptype.get_metadata())
  #num_bits=256
  #type=OpenEye-Path/2 numbits=256 minbonds=0 maxbonds=5 atype=Arom|AtmNum|Chiral|EqHalo|FCharge|HvyDeg|Hyb btype=Order|Chiral
  #software=OEGraphSim/2.2.6 (20170208) chemfp/3.1
  #date=2017-09-16T13:56:20

This time I used :meth:`.FingerprintType.get_metadata` to give
information about the fingerprint. This returns a new
:class:`.Metadata` instance which describes the fingerprint type, and
if you print a Metadata it displays the metadata information as an FPS
header.

Once you have the fingerprint type you can create fingerprints,
including directly from a SMILES string, as in the following::

  >>> from chemfp import bitops
  >>> fp = fptype.parse_molecule_fingerprint("c1ccccc1O", "smistring")
  >>> bitops.hex_encode(fp)
  '0012250160901000080c002810000400201000900054880442000e8040201000'

and from a structure file::

  >>> for id, fp in fptype.read_molecule_fingerprints("Compound_014550001_014575000.sdf.gz"):
  ...   print(id, bitops.hex_encode(fp))
  ... 
  14550001 5ae8f4bbfcda6a66fdbfc2ab9045ecde36b055e3ca56f10477a18df6fd1ebb06
  14550002 5ac8f4fafcce6b657d3f82a79145aacca65015e34a56c00777880db27d8ef006
  14550003 78c8f17a7cce6b657d3782a59105a2c4a64115c34a5ec04773a80fb2758cd006
  14550004 2683e056c28a20882ba8d410304184514213c0300209c3e0eb8241b280008102
           ...

For more examples of using ``get_metadata`` see
:ref:`merging_multiple_fingerprints`.

Even though I used the fingerprint family to get the type, I did that
more for pedagogical reasons. Most times you can get the fingerprint
type directly using :func:`chemfp.get_fingerprint_type`. You can call
it using a fingerprint type string or by passing in the parameters in
the optional second parameter::

    >>> fptype = chemfp.get_fingerprint_type("OpenEye-Path numbits=256")
    >>> fptype = chemfp.get_fingerprint_type("OpenEye-Path", {"numbits": 256})

See :ref:`get_fingerprint_type_section` for examples on how to use
``get_fingerprint_type``.

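To make the class-and-instance analogy from earlier in this section
concrete, here is a toy model of the idea. It is only a sketch, not
chemfp's implementation: ``ToyFamily``, ``ToyType`` and the
"Toy-Path/1" name are hypothetical and exist only for illustration.
Calling the family merges the keyword arguments over the defaults and
returns a type whose ``get_type()`` lists every parameter, including
the ones left at their default values:

.. code-block:: python

    class ToyType(object):
        """One specific set of parameters for a family (hypothetical)."""
        def __init__(self, family, params):
            self.family = family
            self.params = params   # list of (name, value) pairs, in family order

        def get_type(self):
            # Family name followed by every parameter, defaults included.
            terms = ["%s=%s" % (name, value) for (name, value) in self.params]
            return " ".join([self.family.name] + terms)


    class ToyFamily(object):
        """A general fingerprint approach and its defaults (hypothetical)."""
        def __init__(self, name, defaults):
            self.name = name
            self.defaults = defaults   # list of (name, default value) pairs

        def __call__(self, **kwargs):
            known = set(name for (name, _) in self.defaults)
            for name in kwargs:
                if name not in known:
                    raise ValueError(
                        "Unsupported fingerprint parameter name %r" % (name,))
            # Use the caller's value when given, otherwise the default.
            params = [(name, kwargs.get(name, default))
                      for (name, default) in self.defaults]
            return ToyType(self, params)


    toy_family = ToyFamily(
        "Toy-Path/1", [("numbits", 4096), ("minbonds", 0), ("maxbonds", 5)])
    default_type = toy_family()
    small_type = toy_family(numbits=256)
    print(default_type.get_type())
    print(small_type.get_type())

The two printed strings differ only in ``numbits``, which mirrors how
``family()`` and ``family(numbits=256)`` behaved in the OpenEye-Path
example.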

Fingerprint family
==================

In this section you'll learn about the attributes and methods of a
fingerprint family.

The :func:`.get_fingerprint_family` function takes the fingerprint
family name (with or without a version) and returns a
:class:`.FingerprintFamily` instance::

    >>> import chemfp
    >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")

It will raise a ValueError if you ask for a fingerprint family or
version which doesn't exist::

  >>> chemfp.get_fingerprint_family("whirl")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/__init__.py", line 1912, in get_fingerprint_family
      return _family_registry.get_family(family_name)
    File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/types.py", line 1205, in get_family
      raise err
  ValueError: Unknown fingerprint type 'whirl'
  >>> chemfp.get_fingerprint_family("RDKit-Fingerprint/1")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/__init__.py", line 1912, in get_fingerprint_family
      return _family_registry.get_family(family_name)
    File "/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/types.py", line 1205, in get_family
      raise err
  ValueError: Unable to use RDKit-Fingerprint/1: This version of RDKit does not support the RDKit-Fingerprint/1 fingerprint

The fingerprint family has several attributes to ask for the name or
parts of the name::

    >>> family
    FingerprintFamily(<RDKit-Fingerprint/2>)
    >>> family.name
    'RDKit-Fingerprint/2'
    >>> (family.base_name, family.version)
    ('RDKit-Fingerprint', '2')

It also has a ``toolkit`` attribute, which is the underlying chemfp
toolkit that can create molecules for this fingerprint::

    >>> family.toolkit
    <module 'chemfp.rdkit_toolkit' from 'chemfp/rdkit_toolkit.pyc'>
    >>> family.toolkit.name
    'rdkit'

See the chapter :ref:`toolkit_chapter` for many examples of how to
use a toolkit.

The :meth:`~.FingerprintFamily.get_defaults` method returns the
default arguments used to create a fingerprint type, which is handy
when you've forgotten what all of the arguments are::

    >>> family.get_defaults()
    {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

If you call the family as a function, you'll get a
:class:`.FingerprintType`. You can check to see that the fingerprint
type's keyword arguments match the defaults::

    >>> fptype = family()
    >>> fptype.fingerprint_kwargs
    {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

Call the fingerprint family with keyword arguments to use something
other than the default parameters::

    >>> fptype = family(fpSize=1024, maxPath=6)
    >>> fptype.fingerprint_kwargs
    {'maxPath': 6, 'fpSize': 1024, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

If you have the keyword arguments as a dictionary you can use the
"\*\*" syntax to apply the dictionary as keyword arguments, but I
think it's clearer to call the :meth:`.FingerprintFamily.from_kwargs`
method to create the fingerprint type::

    >>> kwargs = {"fpSize": 512, "maxPath": 5}
    >>> fptype = family(**kwargs) # Acceptable
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'
    >>> fptype = family.from_kwargs(kwargs)  # Better
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=5 fpSize=512 nBitsPerHash=2 useHs=1'

(Currently ``family(**kwargs)`` forwards the call to
``family.from_kwargs(kwargs)`` so there is a slight performance
advantage to using ``from_kwargs()``.)

Sometimes the fingerprint parameters come from a string, for example,
from command-line arguments or a web form. In chemfp a dictionary of
text keys and values is called "text settings". The fingerprint
family has a helper function to process them and create a kwargs
dictionary with the correct data types as values::

    >>> family.get_kwargs_from_text_settings({
    ...    "fpSize": "128",
    ...    "nBitsPerHash": "1",
    ... })
    {'maxPath': 7, 'fpSize': 128, 'nBitsPerHash': 1, 'minPath': 1, 'useHs': 1}


Note: This method is not as advanced as the :meth:`corresponding code
in the toolkit Format API <.Format.get_reader_args_from_text_settings>`. 
It does not understand namespaces. It will also raise an exception if
called with an unsupported parameter::

    >>> family.get_kwargs_from_text_settings({
    ...    "unsupported parameter": "-12.34",
    ... })
    Traceback (most recent call last):
        ....
    ValueError: Unsupported fingerprint parameter name 'unsupported parameter'

If you have text settings then you probably want to call
:func:`chemfp.get_fingerprint_type_from_text_settings` directly instead of
going through the fingerprint family::

    >>> fptype = chemfp.get_fingerprint_type_from_text_settings("RDKit-Fingerprint",
    ...       {"fpSize": "512", "nBitsPerHash": "3", "maxPath": "6"})
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=6 fpSize=512 nBitsPerHash=3 useHs=1'

See :ref:`get_fingerprint_type_from_text_settings` for more examples
of how to use this function.

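If the idea of text settings is new, the following sketch may help. It
is a rough, pure-Python illustration of what a text-settings
conversion has to do, not chemfp's actual code. The function name
``toy_kwargs_from_text_settings`` is hypothetical, the defaults are
copied from the ``get_defaults()`` output shown earlier, and it
assumes every parameter is an integer, which holds for the
RDKit-Fingerprint parameters above but not for every family:

.. code-block:: python

    # Default kwargs for RDKit-Fingerprint, copied from the output above.
    DEFAULTS = {"minPath": 1, "maxPath": 7, "fpSize": 2048,
                "nBitsPerHash": 2, "useHs": 1}

    def toy_kwargs_from_text_settings(settings, defaults=DEFAULTS):
        """Convert {name: text} to {name: int}, filling in the defaults."""
        kwargs = dict(defaults)
        for name, text in settings.items():
            if name not in defaults:
                raise ValueError(
                    "Unsupported fingerprint parameter name %r" % (name,))
            # Assumption: every parameter of this family is an integer.
            kwargs[name] = int(text)
        return kwargs

    kwargs = toy_kwargs_from_text_settings({"fpSize": "128", "nBitsPerHash": "1"})
    print(sorted(kwargs.items()))

In real code, prefer the chemfp functions described in this section,
since they know the correct data type for each parameter.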

Fingerprint family discovery
============================

In this section you'll learn how to get the available fingerprint
families, both as a set of name strings and a list of
FingerprintFamily instances.

Even though chemfp knows about the OpenEye fingerprints, those
fingerprints might not be available on your system if you don't have
OEChem and OEGraphSim installed and licensed. Chemfp has a discovery
system which will probe to see which fingerprint types are available
and determine their version numbers.

If you just want the available family names, use
:func:`chemfp.get_fingerprint_family_names`::

  >>> import chemfp
  >>> chemfp.get_fingerprint_family_names()
  set(['RDKit-Torsion', 'OpenEye-Path', 'OpenBabel-FP2',
  'OpenBabel-FP3', 'OpenBabel-FP4', 'RDKit-Avalon', 'RDMACCS-RDKit',
  'RDKit-Morgan', 'OpenEye-MACCS166', 'RDMACCS-OpenEye',
  'RDKit-MACCS166', 'OpenBabel-MACCS', 'ChemFP-Substruct-RDKit',
  'ChemFP-Substruct-OpenEye', 'OpenEye-Circular', 'RDKit-Fingerprint',
  'OpenEye-Tree', 'ChemFP-Substruct-OpenBabel', 'RDMACCS-OpenBabel',
  'RDKit-AtomPair', 'RDKit-Pattern'])

Bear in mind that this might take a few seconds to run, since it will
try to load the Python packages for each supported toolkit.  (Once
done, that list is cached so subsequent calls are fast.)

The function returns a set of base names, which don't contain the
version information. Most likely you want to sort it before displaying
it more nicely::

  >>> from __future__ import print_function 
  >>> for name in sorted(chemfp.get_fingerprint_family_names()):
  ...   print(name)
  ... 
  ChemFP-Substruct-OpenBabel
  ChemFP-Substruct-OpenEye
  ChemFP-Substruct-RDKit
  OpenBabel-FP2
  OpenBabel-FP3
  OpenBabel-FP4
  OpenBabel-MACCS
  OpenEye-Circular
  OpenEye-MACCS166
  OpenEye-Path
  OpenEye-Tree
  RDKit-AtomPair
  RDKit-Avalon 
  RDKit-Fingerprint
  RDKit-MACCS166
  RDKit-Morgan
  RDKit-Pattern
  RDKit-Torsion
  RDMACCS-OpenBabel
  RDMACCS-OpenEye
  RDMACCS-RDKit

On my desktop, where I do all of the testing, I have many `virtualenv
<https://virtualenv.pypa.io/en/latest/>`_ installations so I can test
different combinations of Python and toolkit versions. I'll run chemfp
in one of the OpenEye-only virtualenv installations and show that it
only knows about the OEChem/OEGraphSim fingerprint types::

  >>> from __future__ import print_function
  >>> import chemfp 
  >>> print("\n".join(sorted(chemfp.get_fingerprint_family_names())))
  ChemFP-Substruct-OpenEye
  OpenEye-Circular
  OpenEye-MACCS166
  OpenEye-Path
  OpenEye-Tree
  RDMACCS-OpenEye

It's still possible to get a list of all fingerprint family names,
including those which aren't actually available for the given Python
installation, by setting the *include_unavailable* parameter to True::

  >>> print("\n".join(sorted(chemfp.get_fingerprint_family_names(include_unavailable=True))))
  ChemFP-Substruct-OpenBabel
  ChemFP-Substruct-OpenEye
  ChemFP-Substruct-RDKit
  OpenBabel-FP2
  OpenBabel-FP3
  OpenBabel-FP4
  OpenBabel-MACCS
  OpenEye-Circular
  OpenEye-MACCS166
  OpenEye-Path
  OpenEye-Tree
  RDKit-AtomPair
  RDKit-Avalon 
  RDKit-Fingerprint
  RDKit-MACCS166
  RDKit-Morgan
  RDKit-Pattern
  RDKit-Torsion
  RDMACCS-OpenBabel
  RDMACCS-OpenEye
  RDMACCS-RDKit

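Because the function returns a set, ordinary set operations show which
families chemfp knows about but cannot use in the current
installation. The sketch below hard-codes the two listings from the
OpenEye-only virtualenv so that it runs without chemfp. With chemfp
installed, the two sets would instead come from
``chemfp.get_fingerprint_family_names()`` and
``chemfp.get_fingerprint_family_names(include_unavailable=True)``:

.. code-block:: python

    # Copied from the OpenEye-only listing above.
    available = set([
        "ChemFP-Substruct-OpenEye", "OpenEye-Circular", "OpenEye-MACCS166",
        "OpenEye-Path", "OpenEye-Tree", "RDMACCS-OpenEye",
    ])

    # Copied from the include_unavailable=True listing above.
    all_names = set([
        "ChemFP-Substruct-OpenBabel", "ChemFP-Substruct-OpenEye",
        "ChemFP-Substruct-RDKit", "OpenBabel-FP2", "OpenBabel-FP3",
        "OpenBabel-FP4", "OpenBabel-MACCS", "OpenEye-Circular",
        "OpenEye-MACCS166", "OpenEye-Path", "OpenEye-Tree",
        "RDKit-AtomPair", "RDKit-Avalon", "RDKit-Fingerprint",
        "RDKit-MACCS166", "RDKit-Morgan", "RDKit-Pattern", "RDKit-Torsion",
        "RDMACCS-OpenBabel", "RDMACCS-OpenEye", "RDMACCS-RDKit",
    ])

    # Families that are known but not usable in this installation.
    unavailable = all_names - available
    for name in sorted(unavailable):
        print(name)

This prints the Open Babel and RDKit based families, which is what you
would expect from an installation that only has the OpenEye toolkits.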

The list of base names is pretty useful, but sometimes you want more
details, like the specific version number, and the default number of
bits. The :class:`.FingerprintFamily` includes the attributes to get
the :attr:`~.FingerprintFamily.name` and
:attr:`~.FingerprintFamily.version` but it doesn't have a way to get
the default number of bits. Instead, I'll use the FingerprintFamily to
make a :class:`.FingerprintType` with the default parameters, then ask
the new fingerprint type its :attr:`number of bits <.FingerprintType.num_bits>`.

This means I need a list of FingerprintFamily instances, which is
conveniently available from
:func:`chemfp.get_fingerprint_families`. (Remember, this may take a
few seconds the first time it's called, because it tries to load all
of the available fingerprints. Once determined, this information is
cached.)

As a result, you can make a list of all available fingerprint methods
and their default number of bits with the following::

  >>> for family in chemfp.get_fingerprint_families():
  ...   print(family.name, family().num_bits)
  ... 
  ChemFP-Substruct-OpenBabel/1 881
  ChemFP-Substruct-OpenEye/1 881
  ChemFP-Substruct-RDKit/1 881
  OpenBabel-FP2/1 1021
  OpenBabel-FP3/1 55
  OpenBabel-FP4/1 307
  OpenBabel-MACCS/2 166
  OpenEye-Circular/2 4096
  OpenEye-MACCS166/3 166
  OpenEye-Path/2 4096
  OpenEye-Tree/2 4096
  RDKit-AtomPair/2 2048
  RDKit-Avalon/1 512
  RDKit-Fingerprint/2 2048
  RDKit-MACCS166/2 166
  RDKit-Morgan/1 2048
  RDKit-Pattern/2 2048
  RDKit-Torsion/2 2048
  RDMACCS-OpenBabel/2 166
  RDMACCS-OpenEye/2 166
  RDMACCS-RDKit/2 166

The output here is a bit fancy. If you only want the version
information then you could just look at the list, since a family's
`repr <https://docs.python.org/2/library/functions.html#func-repr>`_
shows the versioned name::

  >>> chemfp.get_fingerprint_families()
  [FingerprintFamily(<ChemFP-Substruct-OpenBabel/1>),  FingerprintFamily(<ChemFP-Substruct-OpenEye/1>),
  FingerprintFamily(<ChemFP-Substruct-RDKit/1>),  FingerprintFamily(<OpenBabel-FP2/1>),
  FingerprintFamily(<OpenBabel-FP3/1>),  FingerprintFamily(<OpenBabel-FP4/1>),
  FingerprintFamily(<OpenBabel-MACCS/2>),  FingerprintFamily(<OpenEye-Circular/2>),
  FingerprintFamily(<OpenEye-MACCS166/3>),  FingerprintFamily(<OpenEye-Path/2>),
  FingerprintFamily(<OpenEye-Tree/2>),  FingerprintFamily(<RDKit-AtomPair/2>),
  FingerprintFamily(<RDKit-Avalon/1>),  FingerprintFamily(<RDKit-Fingerprint/2>),
  FingerprintFamily(<RDKit-MACCS166/2>),  FingerprintFamily(<RDKit-Morgan/1>),
  FingerprintFamily(<RDKit-Pattern/2>),  FingerprintFamily(<RDKit-Torsion/2>),
  FingerprintFamily(<RDMACCS-OpenBabel/2>),  FingerprintFamily(<RDMACCS-OpenEye/2>),
  FingerprintFamily(<RDMACCS-RDKit/2>)]

On the other hand, that's a rather dense block of text.

Finally, use :func:`chemfp.has_fingerprint_family` to test if a
fingerprint family is available::

  >>> chemfp.has_fingerprint_family("OpenEye-Tree")
  True
  >>> chemfp.has_fingerprint_family("OpenEye-Tree/2")
  True
  >>> chemfp.has_fingerprint_family("OpenEye-Tree/1")
  False

It understands both versioned and unversioned names.

.. _get_fingerprint_type_section:

get_fingerprint_type() and get_type()
=====================================

In this section you'll learn how to get a fingerprint type given its
type string, and how to specify fingerprint parameters as a
dictionary.

The easiest way to get a specific :class:`.FingerprintType` is with
:func:`chemfp.get_fingerprint_type`::

    >>> import chemfp
    >>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint")
    >>> fptype
    <chemfp.rdkit_types.RDKitFingerprintType_v2 object at 0x10cfedb10>

The fingerprint type has a :meth:`.FingerprintType.get_type` method,
which returns the canonical fingerprint type string::

    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1'

This is canonical because chemfp ensures that all fingerprint types
with the same parameter values have the same type string.

I left out the version number in the fingerprint name, so chemfp gives
me the most recent supported version. I could have included the version
in the name, which is useful if you want to prevent a version mismatch
between your data sets. If the version doesn't exist, the function
will raise a ValueError::

    >>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint/2")
    >>> chemfp.get_fingerprint_type("RDKit-Fingerprint/1")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "chemfp/__init__.py", line 1984, in get_fingerprint_type
        return types.registry.get_fingerprint_type(type, fingerprint_kwargs)
      File "chemfp/types.py", line 1233, in get_fingerprint_type
        raise ValueError("Unable to use %s: %s" % (name, reason))
    ValueError: Unable to use RDKit-Fingerprint/1: This version of
    RDKit does not support the RDKit-Fingerprint/1 fingerprint

I can also specify some or all of the parameters myself in the type
string, instead of accepting the default values::

    >>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024 maxPath=6")
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=6 fpSize=1024 nBitsPerHash=2 useHs=1'

You can also pass in the parameters as a Python dictionary, though you
still need at least the base name of the fingerprint family::

    >>> fp_kwargs = {
    ...   "maxPath": 6,
    ...   "fpSize": 512,
    ... }
    >>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint", fp_kwargs)
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=1 maxPath=6 fpSize=512 nBitsPerHash=2 useHs=1'

If a parameter is specified in both the type string and the dictionary
then the dictionary value will be used::

    >>> fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=2",
    ...                                      {"fpSize": 128})
    >>> fptype.get_type()
    'RDKit-Fingerprint/2 minPath=2 maxPath=7 fpSize=128 nBitsPerHash=2 useHs=1'

.. _get_fingerprint_type_from_text_settings:

Create a fingerprint using text settings
========================================

In this section you'll learn how to get a fingerprint type using text
settings.

The fingerprint keyword arguments ("kwargs") are a dictionary whose
keys are fingerprint parameter names and whose values are native
Python objects for those parameters. Here is a fingerprint kwargs
dictionary for the RDKit-Fingerprint::

  {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

Text settings are a dictionary where the dictionary keys are still
parameter names but where the dictionary values are string-encoded
parameter values. Here are the equivalent text settings for the above
kwargs dictionary::

  {'maxPath': '7', 'fpSize': '2048', 'nBitsPerHash': '2', 'minPath': '1', 'useHs': '1'}

A text settings dictionary typically comes from command-line
parameters or a configuration file, where everything is a string. The
fingerprint family has a method to convert text settings to kwargs::

  >>> import chemfp
  >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
  >>> kwargs = family.get_kwargs_from_text_settings({"fpSize": "4096"})
  >>> kwargs
  {'maxPath': 7, 'fpSize': 4096, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}

The kwargs can then be used to get the specified fingerprint type from
the family::

  >>> fptype = family.from_kwargs(kwargs)
  >>> fptype
  <chemfp.rdkit_types.RDKitFingerprintType_v2 object at 0x100f68610>
  >>> fptype.get_type()
  'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

It's a bit tedious to go through all those steps to process some text
settings. Instead, call
:func:`chemfp.get_fingerprint_type_from_text_settings`::

  >>> fptype = chemfp.get_fingerprint_type_from_text_settings(
  ...                     "RDKit-Fingerprint", {"fpSize": "4096"})
  >>> fptype.get_type()
  'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

The parameters in the text settings have priority should the
fingerprint type string and the text settings both specify the same
parameter name, as in this example where the fingerprint type string
specifies a 1024 bit fingerprint while the text settings specify a
4096 bit fingerprint::

  >>> fptype = chemfp.get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024")
  >>> fptype.get_type()
  'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1'
  >>> 
  >>> fptype = chemfp.get_fingerprint_type_from_text_settings(
  ...            "RDKit-Fingerprint fpSize=1024", {"fpSize": "4096"})
  >>> fptype.get_type()
  'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1'

At present there is no support for parameter namespaces, and unknown
parameter names will raise an exception::

  >>> fptype = chemfp.get_fingerprint_type_from_text_settings(
  ...            "RDKit-Fingerprint", {"fpSize": "4096", "spam": "eggs"})
  Traceback (most recent call last):
    File "<stdin>", line 2, in <module>
    File "chemfp/__init__.py", line 1329, in get_fingerprint_type_from_text_settings
      return types.registry.get_fingerprint_type_from_text_settings(type, settings)
    File "chemfp/types.py", line 868, in get_fingerprint_type_from_text_settings
      raise ValueError("Error with type %r: %s" % (type, err))
  ValueError: Error with type 'RDKit-Fingerprint': Unsupported fingerprint parameter name 'spam'

This may change in the future; let me know what's best for you.

For now, if you want to remove unexpected names from a dictionary then
use the fingerprint family's :meth:`~.FingerprintFamily.get_defaults`
to get the default kwargs as a dictionary, and use the keys to filter
out the unknown parameters::

  >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint")
  >>> defaults = family.get_defaults()
  >>> defaults
  {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1}
  >>> settings = {"maxPath": "8", "unknown": "mystery"}
  >>> new_settings = dict((k, v) for (k,v) in settings.items() if k in defaults)
  >>> new_settings
  {'maxPath': '8'}

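If you also want to report which names were dropped, the same idea can
be wrapped in a small helper. This is a sketch using only plain
dictionaries; ``split_text_settings`` is a hypothetical name, and the
defaults are hard-coded from the ``get_defaults()`` output above so
the example runs without a toolkit:

.. code-block:: python

    def split_text_settings(settings, defaults):
        """Return (supported, unsupported) dictionaries.

        A name is supported if it is one of the keys of `defaults`.
        """
        supported = {}
        unsupported = {}
        for name, value in settings.items():
            if name in defaults:
                supported[name] = value
            else:
                unsupported[name] = value
        return supported, unsupported

    # Defaults copied from the get_defaults() output above.
    defaults = {"maxPath": 7, "fpSize": 2048, "nBitsPerHash": 2,
                "minPath": 1, "useHs": 1}
    settings = {"maxPath": "8", "unknown": "mystery"}

    supported, unsupported = split_text_settings(settings, defaults)
    print(supported)
    print(unsupported)

The ``supported`` dictionary is what you would pass on as text
settings, while ``unsupported`` can be logged or shown to the user.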

FingerprintType properties and methods
======================================

In this section you'll learn about the :class:`.FingerprintType`
properties and methods.

I'll start by getting OpenEye's tree fingerprint using the default parameters::

  >>> fptype = chemfp.get_fingerprint_type("OpenEye-Tree")
  >>> fptype
  <chemfp.openeye_types.OpenEyeTreeFingerprintType_v2 object at 0x10a64be10>
  >>> fptype.get_type()
  'OpenEye-Tree/2 numbits=4096 minbonds=0 maxbonds=4 atype=Arom|AtmNum|Chiral|FCharge|HvyDeg|Hyb btype=Order'

The "OpenEye-Tree/2" is the fingerprint :attr:`~.FingerprintType.name`,
which is decomposed into the :attr:`~.FingerprintType.base_name` "OpenEye-Tree"
and the :attr:`~.FingerprintType.version` "2"::

  >>> fptype.name
  'OpenEye-Tree/2'
  >>> fptype.base_name, fptype.version
  ('OpenEye-Tree', '2')

The number of bits for the fingerprint is :attr:`~.FingerprintType.num_bits`, and
:attr:`~.FingerprintType.fingerprint_kwargs` holds the fingerprint
parameters as a dictionary of Python values::

    >>> fptype.num_bits
    4096
    >>> fptype.fingerprint_kwargs
    {'maxbonds': 4, 'numbits': 4096, 'atype': 63, 'minbonds': 0, 'btype': 1}

Each fingerprint type has a :attr:`~.FingerprintType.toolkit`, which
is the chemfp toolkit that can make molecules used as input to the
fingerprint type. (This would be None if there were no toolkit.) Given
a fingerprint type it's easy to figure out the :attr:`.toolkit.name`
of the toolkit it's associated with::

    >>> fptype.toolkit.name
    'openeye'

The :attr:`~.FingerprintType.software` attribute gives information
about the software used to generate the fingerprint. For RDKit and
Open Babel this is the same as the :attr:`.toolkit.software`
string. On the other hand, OpenEye distributes OEChem and OEGraphSim
as two different libraries. These map quite naturally to chemfp's
concepts of fingerprint type and toolkit, so the "software" field for
its fingerprint type and toolkit differ::

    >>> fptype.software
    'OEGraphSim/2.2.6 (20170208) chemfp/3.1'    
    >>> fptype.toolkit.software
    'OEChem/20170208'

Finally, :meth:`.FingerprintType.get_fingerprint_family` returns the
fingerprint family for a given fingerprint type::

    >>> fptype.get_fingerprint_family()
    FingerprintFamily(<OpenEye-Tree/2>)

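The relationship between the type string, the name, the base name and
the version can be illustrated with plain string handling. The sketch
below is not how chemfp parses type strings. ``split_type_string`` is
a hypothetical helper which assumes the terms are separated by
whitespace and that each parameter has the form ``name=value``, as in
the type strings shown in this chapter. The parameter values are left
as text:

.. code-block:: python

    def split_type_string(type_string):
        """Split a type string into (name, base_name, version, text settings)."""
        terms = type_string.split()
        name = terms[0]
        # The version is optional, as in "RDKit-Fingerprint".
        base_name, _, version = name.partition("/")
        text_settings = {}
        for term in terms[1:]:
            key, _, value = term.partition("=")
            text_settings[key] = value
        return name, base_name, (version or None), text_settings

    type_string = ("OpenEye-Tree/2 numbits=4096 minbonds=0 maxbonds=4 "
                   "atype=Arom|AtmNum|Chiral|FCharge|HvyDeg|Hyb btype=Order")
    name, base_name, version, text_settings = split_type_string(type_string)
    print("%s %s %s" % (name, base_name, version))
    print(text_settings["atype"])

Note that the values here are still strings such as
``"Arom|AtmNum|Chiral|FCharge|HvyDeg|Hyb"``, while
``fingerprint_kwargs`` holds the decoded Python values.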

Convert a structure record to a fingerprint
===========================================

In this section you'll learn how to use a fingerprint type to convert
a structure record into a fingerprint.

The :class:`.FingerprintType` method
:meth:`~.FingerprintType.parse_molecule_fingerprint` parses a
structure record and returns the fingerprint as a byte string. The
following uses Open Babel to get the MACCS fingerprint for phenol::

    >>> import chemfp 
    >>> from chemfp import bitops
    >>> fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
    >>> fptype
    <chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v2 object at 0x10cfedc10>
    >>> fp = fptype.parse_molecule_fingerprint("c1ccccc1O", "smistring")
    >>> fp
    '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e'
    >>> bitops.hex_encode(fp)
    '00000000000000000000000000000140004480101e'

(Under Python 3 the fingerprint is a ``bytes`` object, so the
second-to-last output line will be shown with the ``b''`` prefix.)

The parameters to ``parse_molecule_fingerprint()`` are identical to
the toolkit's :func:`~.toolkit.parse_molecule` function. For example,
the following shows that the SMILES "Q" raises a
:class:`chemfp.ParseError` with the default *errors* mode, and returns
None when *errors* is "ignore"::

  >>> fptype.parse_molecule_fingerprint("Q", "smistring")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "chemfp/types.py", line 984, in parse_molecule_fingerprint
      mol = self.toolkit.parse_molecule(content, format, reader_args=reader_args, errors=errors)
           .... lines omitted ... 
    File "chemfp/io.py", line 87, in error
      _compat.raise_tb(ParseError(msg, location), None)
    File "<string>", line 1, in raise_tb
  chemfp.ParseError: Open Babel cannot parse the SMILES 'Q'
  >>> fptype.parse_molecule_fingerprint("Q", "smistring", errors="ignore") is None
  True

See :ref:`about_smiles` for information about using
``parse_molecule()`` and the distinction between "smistring", "smi"
and other SMILES formats. See :ref:`parse_molecule_with_error` for
more about the *errors* parameter.

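A fingerprint is an ordinary byte string, so for simple inspection the
standard library is enough. The following sketch hard-codes the 21
byte MACCS fingerprint for phenol shown above, so that it runs without
chemfp or a toolkit, then hex-encodes it and counts the number of set
bits. It should reproduce the hex string that ``bitops.hex_encode``
gave above:

.. code-block:: python

    import binascii

    # The phenol MACCS fingerprint from above: 166 bits, stored in 21 bytes.
    fp = b"\x00" * 14 + b"\x01@\x00D\x80\x10\x1e"

    # Hex-encode the bytes and convert the result to text.
    hex_fp = binascii.hexlify(fp).decode("ascii")
    print(hex_fp)

    # Count the number of bits which are set.
    popcount = sum(bin(byte).count("1") for byte in bytearray(fp))
    print(popcount)

    # The hex-encoded text converts back to the same byte string.
    assert binascii.unhexlify(hex_fp) == fp

For anything performance sensitive, use the functions in
``chemfp.bitops`` instead of this pure-Python approach.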

Convert a structure record to an id and fingerprint
===================================================

In this section you'll learn how to use a fingerprint type to extract
the id from a structure record, convert the structure record into a
fingerprint, and return the (id, fingerprint) pair.

The previous section showed how to convert a structure record into a
fingerprint. Sometimes you'll also want the identifier. The
:class:`.FingerprintType` method
:meth:`~.FingerprintType.parse_id_and_molecule_fingerprint` does both
in the same call::

  >>> fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
  >>> fptype.parse_id_and_molecule_fingerprint("c1ccccc1O phenol", "smi")
  (u'phenol', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00\x04\x00\x10\x1a')

(If the identifier is not present then the function may return None or
the empty string, depending on the format and underlying
implementation.)

The parameters to ``parse_id_and_molecule_fingerprint`` are identical
to the :func:`.toolkit.parse_id_and_molecule` function. For example,
the following shows the difference in using two different delimiter
types in the *reader_args*::

  >>> record = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"
  >>> fptype.parse_id_and_molecule_fingerprint(record, "smi", reader_args={"delimiter": "to-eol"})
  (u'vitamin a', '\x00\x00\x00\x08\x00\x00\x02\x00\x02\n\x02\x80\x04\x98\x0c\x00\x00\x140\x14\x18')
  >>> fptype.parse_id_and_molecule_fingerprint(record, "smi", reader_args={"delimiter": "space"})
  (u'vitamin', '\x00\x00\x00\x08\x00\x00\x02\x00\x02\n\x02\x80\x04\x98\x0c\x00\x00\x140\x14\x18')

The *id_tag* and *errors* parameters are also supported, though I
won't give examples. See :ref:`read_with_id_tag` to learn how to use
the *id_tag* and :ref:`smiles_delimiter_reader_arg` and
:ref:`multitoolkit_reader_and_writer_args` for examples of using
*reader_args*.

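The difference between the two delimiter styles can be modeled with
``str.split``. This is only a rough sketch of the behavior seen in the
output above, not chemfp's parsing code, and it ignores details such
as tabs and trailing newlines:

.. code-block:: python

    record = "C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a"

    # "to-eol" style: the id is everything after the first whitespace.
    smiles, id_to_eol = record.split(None, 1)

    # "space" style: the id is only the second space-separated field.
    fields = record.split(" ")
    id_space = fields[1]

    print(repr(id_to_eol))
    print(repr(id_space))

The first id keeps the embedded space and the second stops at it,
which matches the 'vitamin a' and 'vitamin' identifiers returned
above.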

Make a specialized id and molecule fingerprint parser
=====================================================

In this section you'll learn how to make a specialized function for
computing the fingerprints given many individual structure records.

Sometimes the structure input comes as a set of individual strings,
with one record per string. For example, the input might come from a
database query, where the cursor returns each field of each row as its
own term, and you want to convert each of them into a fingerprint.

One way to do this is through successive calls to
:meth:`.FingerprintType.parse_molecule_fingerprint`::

    >>> from __future__ import print_function
    >>> import chemfp 
    >>> from chemfp import bitops
    >>> 
    >>> smiles_list = ["C", "O=O", "C#N"]
    >>> 
    >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
    >>> for smiles in smiles_list:
    ...     fp = fptype.parse_molecule_fingerprint(smiles, "smistring")
    ...     print(bitops.hex_encode(fp), smiles)
    ... 
    000000000000000000000000000000000000008000 C
    000000000000000000000000200000080000004008 O=O
    000000000001000000000000000000000000000001 C#N

There is some overhead in this because the parameters, like *format*
("smistring" in this case) are (re)validated for each call, and
sometimes extra work is done to ensure that the call is
thread-safe. (The overhead is higher if there are complex reader args,
and if the underlying fingerprinter is very fast.)

Another solution is to use
:meth:`.FingerprintType.make_id_and_molecule_fingerprint_parser` to create a
specialized parser function for a given set of parameters. The
parameters are only validated once, and the returned parser function
takes only the record as input and returns the (id, fingerprint)
pair::

    >>> import chemfp
    >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
    >>> id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser("smi")
    >>> id_and_fp_parser("c1ccccc1O phenol")
    (u'phenol', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e')

The parameters to ``make_id_and_molecule_fingerprint_parser`` are
identical to :func:`.toolkit.make_id_and_molecule_parser`.

I'll use the new function to parse the ``smiles_list`` from earlier::

    >>> from __future__ import print_function
    >>> import chemfp
    >>> from chemfp import bitops
    >>> 
    >>> smiles_list = ["C", "O=O", "C#N"]
    >>> 
    >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
    >>> id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser("smistring")
    >>> 
    >>> for smiles in smiles_list:
    ...     id, fp = id_and_fp_parser(smiles)
    ...     print(bitops.hex_encode(fp), smiles)
    ... 
    000000000000000000000000000000000000008000 C
    000000000000000000000000200000080000004008 O=O
    000000000001000000000000000000000000000001 C#N

For OpenEye-MACCS166, creating and using a specialized parser is about
15% faster than calling ``parse_molecule_fingerprint()`` each time when
the query is icosane (C20H42). For OpenBabel-MACCS it's about 5%, and
for RDKit-MACCS166 it's around 1%.

You might be tempted to assume there's a constant Python overhead and
use the above numbers to judge the performance of the underlying
toolkit. This won't give accurate answers. Chemfp makes certain
threading guarantees which aren't always directly mapped to the
underlying toolkit. This can require extra overhead.

In addition, RDKit's native MACCS implementation maps key 1 to bit 1,
while the other toolkits and chemfp map key 1 to bit 0. Chemfp
normalizes RDKit-MACCS by shifting all of the bits left, and this
translation code hasn't yet been optimized.
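
The renumbering itself is simple to picture. The following is a
pure-Python sketch of the idea, not chemfp's actual translation code.
``shift_maccs_keys`` is a hypothetical helper (it is not part of
chemfp) which takes RDKit-style bit positions, where key *k* is bit
*k*, and packs them into the byte layout used in the examples above,
where key *k* is bit *k*-1 and bit 0 is the lowest bit of the first
byte:

.. code-block:: python

    import binascii

    def shift_maccs_keys(rdkit_on_bits, num_bits=166):
        """Pack RDKit-style MACCS bit positions into a chemfp-style byte string."""
        fp = bytearray((num_bits + 7) // 8)
        for rdkit_bit in rdkit_on_bits:
            if rdkit_bit == 0:
                continue  # bit 0 does not correspond to any MACCS key
            chemfp_bit = rdkit_bit - 1  # key 1 moves from bit 1 to bit 0
            fp[chemfp_bit // 8] |= 1 << (chemfp_bit % 8)
        return bytes(fp)

    # Key 160 is the only key set for methane in the examples above.
    print(binascii.hexlify(shift_maccs_keys([160])).decode("ascii"))

This prints ``000000000000000000000000000000000000008000``, which
matches the methane fingerprint shown earlier.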

You may have noticed that there's a ``parse_molecule_fingerprint()``
and a ``make_id_and_molecule_fingerprint_parser()`` but there isn't a
``parse_id_and_molecule_fingerprint()`` or
``make_molecule_fingerprint_parser()``. This is simply a matter of
time. I haven't needed those functions, they are quite easy to emulate
given what's available, and I was getting bored of writing test cases.

Let me know if they would be useful for your code.
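
As a rough illustration of how little code the emulation needs, here
is one possible sketch. The two names are the missing ones from the
paragraph above, written as stand-alone helpers which take the
fingerprint type as the first argument. They are not part of chemfp.
Extra keyword arguments are passed through unchanged, on the
assumption that you only use ones which
``make_id_and_molecule_fingerprint_parser`` accepts:

.. code-block:: python

    def make_molecule_fingerprint_parser(fptype, format, **kwargs):
        """Return a function which maps a record to its fingerprint."""
        id_and_fp_parser = fptype.make_id_and_molecule_fingerprint_parser(
            format, **kwargs)

        def parse_fingerprint(record):
            id, fp = id_and_fp_parser(record)
            return fp  # drop the identifier
        return parse_fingerprint

    def parse_id_and_molecule_fingerprint(fptype, record, format, **kwargs):
        """Return the (id, fingerprint) pair for a single record."""
        # Builds a new parser on every call, so this is convenient, not fast.
        parser = fptype.make_id_and_molecule_fingerprint_parser(format, **kwargs)
        return parser(record)

    # Possible use, given the 'fptype' from the examples above:
    #   fp_parser = make_molecule_fingerprint_parser(fptype, "smistring")
    #   fp = fp_parser("C#N")

The second helper revalidates the parameters each time, so it has the
same kind of overhead as ``parse_molecule_fingerprint()``.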


Read a structure file and compute fingerprints
==============================================

In this section you'll learn how to use a fingerprint type to read a
structure file, compute fingerprints for each one, and iterate over
the resulting (id, fingerprint) pairs. You will need
`Compound_014550001_014575000.sdf.gz
<ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_014550001_014575000.sdf.gz>`_
from PubChem.

The :meth:`~.FingerprintType.read_molecule_fingerprints` method of a
:class:`.FingerprintType` reads a structure file and computes the
fingerprint for each molecule. It will also extract the record
identifier. It returns an iterator of the (id, fingerprint) pairs. For
example, the following uses OEChem/OEGraphSim to compute the MACCS166
fingerprint for a PubChem file, and prints the identifier, the number
of keys set in the fingerprint, and the hex-encoded fingerprint:

.. code-block:: python 

    from __future__ import print_function 
    import chemfp
    from chemfp import bitops
    
    # Uncomment the fingerprint type you want to use.
    fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
    #fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
    #fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
    for id, fp in fptype.read_molecule_fingerprints("Compound_014550001_014575000.sdf.gz"):
        print("%s %3d %s" % (id, bitops.byte_popcount(fp), bitops.hex_encode(fp)))

The first few lines of output (excluding OEChem warnings) are:

.. code-block:: none

  14550001  46 00008008000081406000e226a0906142df8e2a7c1b
  14550002  35 00000008000000000000ea06809021425d862a7c1b
  14550003  25 00000008000000000000c812008005425084283c1b
  14550004  23 0000000000000800118204a00080800300b0780813
  14550005  16 00000000040001000000000010010004800803523a

However, in most cases you should use the top-level helper function
:func:`chemfp.read_molecule_fingerprints`, which does the fingerprint
type lookup and the call to ``read_molecule_fingerprints``:

.. code-block:: python

    from __future__ import print_function 
    import chemfp
    from chemfp import bitops
    
    for id, fp in chemfp.read_molecule_fingerprints("OpenEye-MACCS166",
                                                    "Compound_014550001_014575000.sdf.gz"):
        print("%s %3d %s" % (id, bitops.byte_popcount(fp), bitops.hex_encode(fp)))

The helper function accepts both a type string, as shown here, and a
Metadata object. On the other hand, the helper function does not
support fingerprint kwargs, so in that case you have to go through the
FingerprintType.
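
A sketch of that second case follows. ``read_with_fingerprint_kwargs``
is a hypothetical helper, not a chemfp function. It assumes the
fingerprint family accepts the fingerprint parameters as keyword
arguments when it is called, and the ``numbits=256`` in the usage
comment is likewise an assumed parameter name for a path-based family:

.. code-block:: python

    def read_with_fingerprint_kwargs(family, source, **fingerprint_kwargs):
        """Make a fingerprint type with non-default parameters, then read 'source'."""
        # Assumption: the family takes the parameters as keyword arguments.
        fptype = family(**fingerprint_kwargs)
        return fptype.read_molecule_fingerprints(source)

    # Possible use:
    #   family = chemfp.get_fingerprint_family("OpenEye-Path")
    #   reader = read_with_fingerprint_kwargs(
    #       family, "Compound_014550001_014575000.sdf.gz", numbits=256)
    #   for id, fp in reader:
    #       ...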

The ``read_molecule_fingerprints`` method takes the same parameters as
the :func:`toolkit.read_ids_and_molecules`, including *id_tag*,
*errors*, and *location*. I won't cover those details again here. 
Instead, see :ref:`read_ids_and_molecules`.

Structure-based fingerprint reader location
===========================================

In this section you'll learn more about the ``location`` attribute of
the structure-based fingerprint iterator returned by
``read_molecule_fingerprints`` and
``read_molecule_fingerprints_from_string``.

Four related functions implement structure-based fingerprint readers:

- :func:`chemfp.read_molecule_fingerprints`
- :func:`chemfp.read_molecule_fingerprints_from_string`
- :meth:`.FingerprintType.read_molecule_fingerprints`
- :meth:`.FingerprintType.read_molecule_fingerprints_from_string`

They all return a :class:`.FingerprintIterator`. Just like with the
:class:`.BaseMoleculeReader` classes, the FingerprintIterator has a
:class:`location <.Location>` attribute that can be used to get more
information about the internal reader state. The toolkit section has
more details about how to get the current record number (see
:ref:`location_recno`) and, if supported by the parser implementation
for a format, the line number and byte ranges for the record (see
:ref:`location_record_position`).

It's also possible to get the current molecule object using the
location's "mol" attribute. This isn't so important for the toolkit
API since all of the molecule readers return the molecule object. It's
more useful in the fingerprint iterator, which doesn't.

NOTE: accessing the molecule this way is somewhat slow, because it
requires several Python function calls. It should mostly be used for
error reporting; the following is meant as an example of use, and not
a recommended best practice.

The following uses the location's ``mol`` to report the SMILES string
for every molecule whose MACCS fingerprint sets fewer than 5 keys:

.. code-block:: python 
  
    from __future__ import print_function
    import chemfp
    from chemfp import bitops
    
    from openeye.oechem import OECreateSmiString
    
    fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
    with fptype.read_molecule_fingerprints("Compound_014550001_014575000.sdf.gz") as reader:
        location = reader.location
        for id, fp in reader:
            popcount = bitops.byte_popcount(fp)
            if popcount >= 5:
                continue
            smiles = OECreateSmiString(location.mol)
            print("%s %3d %s" % (id, popcount, smiles))

The output from the above is:

.. code-block:: none 

  14550474   3 [Mg+2].[Ca+2]
  14567810   4 [B]=CO
  14574228   4 F[In]
  14574262   3 [Ga].[Ga].[Ga].[Ga].[Ga].[Ir].[Ir].[Ir]
  14574264   3 [Co].[Ga]
  14574265   3 [Ga].[Ga].[Pt]
  14574267   3 [Ga].[Pt]
  14574635   4 [Mg+2].[Al+3]
  14574653   4 [Na+].[Na+].[Na+].[PH2-]

The above code imports and calls OECreateSmiString directly. The
cross-toolkit solution is only slightly more complicated. I need to
use the fingerprint type object to get the underlying "toolkit", which
is a portability layer on top of the actual cheminformatics toolkit
with functions to parse a string into a molecule and vice versa::

  >>> import chemfp
  >>> fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
  >>> fptype.toolkit
  <module 'chemfp.openeye_toolkit' from '/Users/dalke/cvses/cfp-3x/docs/tmp/chemfp/openeye_toolkit.py'>
  >>> T = fptype.toolkit
  >>> mol = T.parse_molecule("OC", "smistring") 
  >>> T.create_string(mol, "smistring")
  'CO'

I'll use the toolkit's ``create_string()`` method to make the SMILES
string for each molecule which passes the filter:

.. code-block:: python

  from __future__ import print_function
  import chemfp
  from chemfp import bitops
    
  fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
  T = fptype.toolkit
    
  with fptype.read_molecule_fingerprints("Compound_014550001_014575000.sdf.gz") as reader:
      location = reader.location
      for id, fp in reader:
          popcount = bitops.byte_popcount(fp)
          if popcount >= 5:
              continue
          smiles = T.create_string(location.mol, "smistring")
          print("%s %3d %s" % (id, popcount, smiles))

When should you use a toolkit-specific API, and when should you use
the portable one?

That depends on you. There's definitely a portability vs. performance
tradeoff because the new ``create_string`` function will always
require an extra function call over the native API. If you work with a
given toolkit a lot then you're going to be more familiar with it than
this brand new chemfp API. Plus, calling a function to create another
function is somewhat unusual.

On the other hand, it's trivial to change the above code to work with
any of the fingerprint types that chemfp supports.
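
For instance, the loop could be wrapped in a function which takes the
fingerprint type as a parameter. This is only a sketch.
``iter_sparse_smiles`` is a hypothetical name, and the popcount is
computed in plain Python so that the function depends only on the
methods and attributes shown above:

.. code-block:: python

    def iter_sparse_smiles(fptype, source, max_keys=5):
        """Yield (id, popcount, SMILES) for fingerprints with fewer than max_keys bits set."""
        T = fptype.toolkit
        with fptype.read_molecule_fingerprints(source) as reader:
            location = reader.location
            for id, fp in reader:
                # Count the on-bits one byte at a time.
                popcount = sum(bin(byte).count("1") for byte in bytearray(fp))
                if popcount < max_keys:
                    yield id, popcount, T.create_string(location.mol, "smistring")

    # Possible use with any supported fingerprint type:
    #   fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
    #   for id, popcount, smiles in iter_sparse_smiles(
    #           fptype, "Compound_014550001_014575000.sdf.gz"):
    #       print("%s %3d %s" % (id, popcount, smiles))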


Read fingerprints from a string containing structures
=====================================================
 
In this section you'll learn how to use a fingerprint type to read a
string containing a set of structure records, compute fingerprints for
each one, and iterate over the resulting (id, fingerprint) pairs.

The :meth:`~.FingerprintType.read_molecule_fingerprints_from_string`
method of the :class:`.FingerprintType` takes as input a string
containing structure records and returns an iterator over the (id,
fingerprint) pairs::

    >>> from __future__ import print_function
    >>> import chemfp
    >>> from chemfp import bitops
    >>> fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
    >>> content = "C methane\n" + "CC ethane\n"
    >>> reader = fptype.read_molecule_fingerprints_from_string(content, "smi")
    >>> for (id, fp) in reader:
    ...   print(id, bitops.hex_encode(fp))
    ... 
    methane 000000000000000000000000000000000000008000
    ethane 000000000000000000000000000000000000108000
    >>> 

In most cases you should use the top-level helper function
:func:`chemfp.read_molecule_fingerprints_from_string`, which is
slightly easier to call:

.. code-block:: python

    from __future__ import print_function
    import chemfp
    from chemfp import bitops
    content = ("C methane\n"
               "CC ethane\n")
    reader = chemfp.read_molecule_fingerprints_from_string("OpenBabel-MACCS",
                                                           content, "smi")
    for (id, fp) in reader:
        print(id, bitops.hex_encode(fp)) 

The helper function accepts both a type string, as shown here, and a
:class:`Metadata` object. The helper function does not support
fingerprint kwargs so in that case you must go through the fingerprint
type.

The method takes the same parameters as
:func:`toolkit.read_ids_and_molecules_from_string`, including the
*id_tag*, *errors*, *location*, and *reader_args*. See
:ref:`read_molecules_from_string` for more about that function.
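
If the records come from somewhere other than a file, such as the rows
of a database query, one option is to assemble them into a single
SMILES-file string first and pass that string to the reader. The
following sketch shows only the assembly step. ``make_smi_content`` is
a hypothetical helper, and the (id, SMILES) row layout is an
assumption about what your query returns:

.. code-block:: python

    def make_smi_content(rows):
        """Convert (id, smiles) rows into the text of a SMILES file, one record per line."""
        lines = []
        for id, smiles in rows:
            # Whitespace in the SMILES, or a line break in the id, would
            # corrupt the one-record-per-line layout.
            if (not smiles or any(c.isspace() for c in smiles)
                    or "\n" in id or "\r" in id):
                raise ValueError("cannot store record %r" % ((id, smiles),))
            lines.append("%s %s\n" % (smiles, id))
        return "".join(lines)

    content = make_smi_content([("methane", "C"), ("ethane", "CC")])
    # 'content' is now "C methane\nCC ethane\n", the same string as in
    # the examples above, and can be passed in with the "smi" format.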


Structure-based fingerprint reader errors
=========================================

In this section you'll learn how to use the *errors* option for the
"read molecule fingerprints" functions, including how to use the
experimental support for a callback error handler.

The four structure reader functions
(:func:`chemfp.read_molecule_fingerprints`,
:func:`chemfp.read_molecule_fingerprints_from_string`,
:meth:`.FingerprintType.read_molecule_fingerprints`, and
:meth:`.FingerprintType.read_molecule_fingerprints_from_string`) take
the standard *errors* option. By default it is "strict", which means
that it raises an exception when there are errors, and stops
processing::

  >>> from __future__ import print_function
  >>> import chemfp
  >>> from chemfp import bitops
  >>> content = ("C methane\n" +
  ...            "Q Q-ane\n" +
  ...            "O=O molecular oxygen\n")
  >>> with chemfp.read_molecule_fingerprints_from_string(
  ...           "RDKit-MACCS166", content, "smi") as reader:
  ...   for (id, fp) in reader:
  ...     print(id, bitops.hex_encode(fp))
  ... 
  methane 000000000000000000000000000000000000008000
  [02:19:12] SMILES Parse Error: syntax error for input: 'Q'
  Traceback (most recent call last):
          ....
    File "chemfp/io.py", line 87, in error
      _compat.raise_tb(ParseError(msg, location), None)
    File "<string>", line 1, in raise_tb
  chemfp.ParseError: RDKit cannot parse the SMILES 'Q', file '<string>', line 2, record #2: first line is 'Q Q-ane'

The default is "strict" because you should be the one to decide if you
really want to ignore errors, not me. Specify ``errors="ignore"`` to
ignore errors, or use "report" to have chemfp write its own error
messages to stderr::

  >>> with chemfp.read_molecule_fingerprints_from_string(
  ...           "RDKit-MACCS166", content, "smi", errors="ignore") as reader:
  ...   for (id, fp) in reader:
  ...     print(id, bitops.hex_encode(fp))
  ...
  methane 000000000000000000000000000000000000008000
  [02:21:36] SMILES Parse Error: syntax error for input: 'Q'
  molecular oxygen 000000000000000000000000200000080000004008

Of course, this depends on the underlying toolkit implementation. Some
toolkit/format combinations don't let chemfp know there was an error,
such as most of the OEChem-based formats.

.. _fingerprint_error_handler:

Experimental error handler
==========================

In this section you'll learn about the experimental API for writing
your own error handler.

In the previous section you learned about the "strict", "report", and
"ignore" error handlers. What if you want something different? Chemfp
has an experimental feature where the *errors* can be any object with
the method "error(message, location)". You might send the results to a
log file, or display it in a GUI, ... or send it to a speech
synthesizer and hear all of the error messages go by.

NOTE: This error handler API is experimental and may change in the
future.

The following creates an error handler which counts the number of
errors, and for each one reports the error number, the filename (which
is "<string>" if the input is from a string), and the error message::

  >>> class ErrorCounter(object): 
  ...     def __init__(self):
  ...         self.num_errors = 0
  ...     def error(self, message, location):
  ...         self.num_errors += 1
  ...         print("Failure #%d from file %r: %s" % (
  ...                self.num_errors, location.filename, message))
  ... 
  >>> error_handler = ErrorCounter()
  >>> # ... use  'content' from the previous section 
  >>> with chemfp.read_molecule_fingerprints_from_string(
  ...           "RDKit-MACCS166", content, "smi", errors=error_handler) as reader:
  ...     for (id, fp) in reader:
  ...         print(id, bitops.hex_encode(fp))
  ... 
  methane 000000000000000000000000000000000000008000
  [02:26:02] SMILES Parse Error: syntax error for input: 'Q'
  Failure #1 from file '<string>': RDKit cannot parse the SMILES 'Q'
  molecular oxygen 000000000000000000000000200000080000004008

Let me know if you use the API and have ideas for improvements.

The :ref:`toolkit documentation <toolkit_error_handler>` includes
another example of how to write an error handler.
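
A handler does not have to print anything. As a further sketch, the
following hypothetical ``ErrorCollector`` saves each error so it can
be examined after the reader finishes. It assumes the location passed
to the handler has the same ``filename`` and ``recno`` attributes as
the reader's location:

.. code-block:: python

    class ErrorCollector(object):
        """Remember each error instead of printing it."""
        def __init__(self):
            self.errors = []

        def error(self, message, location):
            # Copy the values now, because the location describes the
            # reader's current state and will change as reading continues.
            self.errors.append((location.filename, location.recno, message))

    # Possible use, with 'content' from the previous section:
    #   collector = ErrorCollector()
    #   with chemfp.read_molecule_fingerprints_from_string(
    #           "RDKit-MACCS166", content, "smi", errors=collector) as reader:
    #       fps = list(reader)
    #   print(len(collector.errors), "records could not be parsed")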


Compute a fingerprint for a native toolkit molecule
===================================================

In this section you'll learn how to compute a fingerprint given a
toolkit molecule.

All of the previous sections assumed the inputs were structure
record(s), either as a string or from a file. What if you already have
a native toolkit molecule and want to compute its fingerprint?  In
that case, use the :meth:`.FingerprintType.compute_fingerprint`
method::

    >>> import chemfp
    >>> fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS")
    >>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
    >>> mol
    <openbabel.OBMol; proxy of <Swig Object of type 'OpenBabel::OBMol *' at 0x10d2bf510> >
    >>> fptype.compute_fingerprint(mol)
    '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e'

This can be useful when you want to compute multiple fingerprint types
for the same molecule. For example, I'll compare Open Babel's MACCS
implementation with chemfp's own MACCS implementation for Open Babel:

.. code-block:: python

    from __future__ import print_function
    import chemfp
    from chemfp import openbabel_toolkit as T
    from chemfp import bitops
    
    fptype1 = chemfp.get_fingerprint_type("OpenBabel-MACCS")
    fptype2 = chemfp.get_fingerprint_type("RDMACCS-OpenBabel")
    
    with T.read_ids_and_molecules("Compound_014550001_014575000.sdf.gz") as reader:
        for id, mol in reader:
            fp1 = fptype1.compute_fingerprint(mol)
            fp2 = fptype2.compute_fingerprint(mol)
            if fp1 != fp2:
                bits1 = set(bitops.byte_to_bitlist(fp1))
                bits2 = set(bitops.byte_to_bitlist(fp2))
                print(id, "in OB:", sorted(bits1-bits2), "in RDMACCS:", sorted(bits2-bits1))
            else:
                print(id, "equal")

Almost half (2186 of 5167) of the output lines were of the form::

    14574962 in OB: [] in RDMACCS: [124]

I was curious, so I investigated the differences. Key 125 (the MACCS
keys start at 1 while chemfp bit indexing starts at 0) is defined as
"Aromatic Ring > 1". Open Babel doesn't support this bit because it
only allows key definitions based on SMARTS, and this query cannot be
represented as SMARTS.

This is also why there are 90 records where chemfp's RDMACCS finds bit
165/key 166 ("more than one fragment"). That can be expressed as the
SMARTS "(\*).(\*)" but when the MACCS definitions were added to Open
Babel it didn't understand component level groupings, so that pattern
was omitted, and Open Babel will always generate a 0 for it. Always,
that is, until someone implements it. (Might that be you?)

For the record, 2756 of the records matched exactly, 2186 set bit 124
in RDMACCS, 90 set bit 165 in RDMACCS, 123 set both bit 124 and 165 in
RDMACCS, and 1 set bit 111 in Open Babel's MACCS but *not* in RDMACCS
while setting bit 124 in RDMACCS but *not* in Open Babel. I haven't
investigated why PubChem record 14559073 gives this difference.
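
A summary like this can be built by counting each distinct difference
pattern rather than reading the printed lines. The following is a
sketch of one way to do that. Both function names are hypothetical,
and ``on_bits`` is a plain-Python stand-in for the
``bitops.byte_to_bitlist`` call used above:

.. code-block:: python

    from collections import Counter

    def on_bits(fp):
        """Return the set of on-bit positions, where bit 0 is the lowest bit of byte 0."""
        return set(8 * i + j
                   for i, byte in enumerate(bytearray(fp))
                   for j in range(8) if byte & (1 << j))

    def tally_differences(fp_pairs):
        """Count how often each (only-in-first, only-in-second) bit pattern occurs."""
        counts = Counter()
        for fp1, fp2 in fp_pairs:
            bits1, bits2 = on_bits(fp1), on_bits(fp2)
            key = (tuple(sorted(bits1 - bits2)), tuple(sorted(bits2 - bits1)))
            counts[key] += 1
        return counts

    # Possible use: collect (fp1, fp2) pairs in the loop above, then
    #   for (only1, only2), n in tally_differences(pairs).most_common():
    #       print(n, "in OB:", only1, "in RDMACCS:", only2)

Identical fingerprints are counted under the key ``((), ())``.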

Note: ``compute_fingerprint()`` is thread-safe. If an underlying
chemistry toolkit object is not thread-safe then chemfp will duplicate
that object before computing the fingerprint.
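
Because of that guarantee, the bound method can be handed to a thread
pool. The following sketch uses the standard library's
``concurrent.futures`` (Python 3). The helper name is hypothetical,
and it assumes every molecule in the list is an independent object
which no other thread modifies. Whether it runs any faster than a
plain loop depends on the underlying toolkit and is not something this
sketch demonstrates:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    def compute_fingerprints_in_threads(fptype, mols, max_workers=4):
        """Compute one fingerprint per molecule, preserving the input order."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map() returns the results in the same order as the inputs.
            return list(pool.map(fptype.compute_fingerprint, mols))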


Fingerprint many native toolkit molecules
=========================================

In this section you'll learn how to generate a fingerprint given many
native toolkit molecules.

Sometimes you have a list of molecules and you want to compute
fingerprints for each one. In the following I'll load 5167 molecules
from an SD file using OEChem::

    >>> import chemfp
    >>> 
    >>> fptype = chemfp.get_fingerprint_type("OpenEye-MACCS166")
    >>> T = fptype.toolkit
    >>> 
    >>> with T.read_molecules("Compound_014550001_014575000.sdf.gz") as reader:
    ...     mols = [T.copy_molecule(mol) for mol in reader]
    ...
                 ... various OEChem warnings omitted ...
    >>> len(mols)
    5167

NOTE: for performance reasons, some of the toolkit implementations
will reuse a molecule object. I call :func:`.toolkit.copy_molecule` to
force a copy of each one. A future version of chemfp will likely
support a new *reader_args* parameter to ask the reader implementation
to always return a new molecule.

You know from the previous section how to compute the fingerprint one
molecule at a time using
:meth:`.FingerprintType.compute_fingerprint`::

    >>> fps = [fptype.compute_fingerprint(mol) for mol in mols]

You can also process all of them at once using
:meth:`.FingerprintType.compute_fingerprints`::

    >>> fps = list(fptype.compute_fingerprints(mols))

The plural in the name ``compute_fingerprints()`` is the hint that it
can take multiple molecules. It returns a generator, so I used
Python's ``list()`` to convert it to an actual list.
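
The generator can also be consumed directly. As a small sketch, the
following hypothetical helper pairs each identifier with the
fingerprint for the matching molecule. It assumes ``ids`` and ``mols``
are parallel sequences of the same length:

.. code-block:: python

    def iter_ids_and_fingerprints(fptype, ids, mols):
        """Yield (id, fingerprint) pairs for parallel sequences of ids and molecules."""
        fps = fptype.compute_fingerprints(mols)
        # On Python 3, zip() takes one fingerprint at a time from the generator.
        for id, fp in zip(ids, fps):
            yield id, fp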

Why call ``compute_fingerprints`` instead of ``compute_fingerprint``?
The main reason is that it expresses your intent more clearly than
setting up a for-loop. But to be honest, the original reason was that
I expected it would be faster than calling the ``compute_fingerprint``
many times, because the underlying code could skip some overhead.

By design, ``compute_fingerprint`` is thread-safe, which means chemfp
sometimes makes extra objects to keep that promise. On the other hand,
``compute_fingerprints``, which processes a sequential series of
molecules, can reuse internal objects across the series instead of
creating new ones. In principle this should be a bit faster. In
practice, nearly all of the time is spent in generating the
fingerprint. Even with a faster fingerprint like OpenEye-Path, the
timing difference is well under 1%, and not enough to be interesting.


Make a specialized molecule fingerprinter
=========================================

In this section you'll learn how to make a specialized function to
compute a fingerprint for a molecule. However, there is very little
reason for you to use this function.

The :meth:`.FingerprintType.compute_fingerprint` method is
thread-safe. Some of the underlying toolkit implementations can use
code which isn't thread-safe. For example, OEGraphSim writes its
fingerprint information to an OEFingerPrint instance, and replaces its
previous value. A thread-safe implementation would make a new
OEFingerPrint for each call, while a non-thread-safe implementation
could reuse a single instance and save a small bit of allocation
overhead.

The :meth:`.FingerprintType.make_fingerprinter` method returns a
non-thread-safe fingerprinter function, which is potentially faster
because it doesn't need to keep the thread-safe promise.

Here's an example of the two APIs. First, a bit of preamble to get
things set up with a couple of molecules::

    >>> import chemfp
    >>> from chemfp import bitops
    >>> 
    >>> fptype = chemfp.get_fingerprint_type("OpenBabel-FP2")
    >>> mol1 = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring")
    >>> mol2 = fptype.toolkit.parse_molecule("O=O", "smistring")

The thread-safe API calls the ``compute_fingerprint()`` method::

    >>> bitops.byte_popcount(fptype.compute_fingerprint(mol1))
    12
    >>> bitops.byte_popcount(fptype.compute_fingerprint(mol2))
    1

The non-thread-safe version uses ``make_fingerprinter`` to create a
new fingerprinter function, which I've assigned to *calc_fingerprint*,
and then call directly::

    >>> calc_fingerprint = fptype.make_fingerprinter()
    >>> bitops.byte_popcount(calc_fingerprint(mol1))
    12
    >>> bitops.byte_popcount(calc_fingerprint(mol2))
    1

The keen-eyed will note that I could have written the first code as::

    >>> compute_fingerprint = fptype.compute_fingerprint
    >>> bitops.byte_popcount(compute_fingerprint(mol1))
    12
    >>> bitops.byte_popcount(compute_fingerprint(mol2))
    1

and gotten the same answer, which means there is little API need for a
special "make_fingerprinter()" function, except for performance.

I timed the performance. Even in the worst case that I could find
(Open Babel's FP2 fingerprint), the performance boost was a paltry
2.5%. Otherwise it was about 1%. This is not enough to warrant using
this function.

(Why do I leave it in? Probably because of the hard work I put into
writing it, and because I like the principle behind it. Perhaps I have
hopes that the performance difference will be more apparent on
multi-threaded benchmarks, which I haven't evaluated.)
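
If you do want to try the non-thread-safe function from several
threads, each thread needs its own fingerprinter. The following is a
sketch of one way to arrange that with ``threading.local``.
``PerThreadFingerprinter`` is a hypothetical class, not part of
chemfp, and it makes no claim about the resulting performance:

.. code-block:: python

    import threading

    class PerThreadFingerprinter(object):
        """Give each thread its own function from fptype.make_fingerprinter()."""
        def __init__(self, fptype):
            self._fptype = fptype
            self._local = threading.local()

        def __call__(self, mol):
            calc = getattr(self._local, "calc", None)
            if calc is None:
                # First call in this thread: create a private fingerprinter.
                calc = self._local.calc = self._fptype.make_fingerprinter()
            return calc(mol)

    # Possible use:
    #   calc_fingerprint = PerThreadFingerprinter(fptype)
    #   fp = calc_fingerprint(mol1)   # may be called from any thread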