.. highlight:: pycon

.. _using_api:

############################################
Fingerprints and fingerprint search examples
############################################

The chemfp command-line programs use a Python library called
chemfp. Portions of the API are in flux and subject to change. The
stable portions of the API which are open for general use are
documented in :ref:`chemfp-api`.

The API includes:

- low-level Tanimoto, Tversky, and popcount operations

- Tanimoto and Tversky search algorithms based on threshold and/or k-nearest neighbors

- routines for reading and writing fingerprints
   
- a cross-toolkit molecule I/O API
   
- a cross-toolkit fingerprint type API

The following chapters give examples of how to use the API, starting
with fingerprints, fingerprint I/O, and fingerprint search.

Python 2 vs. Python 3
=========================

Python 2.7 support by the core Python developers ended at the start
of 2020. The Python 2 series has reached its effective
end-of-life. It's time for you to migrate code to Python 3.

If you are writing new code which uses chemfp then you really should
start using Python 3. OpenEye stopped shipping a Python 2.7 version of
OEChem by the end of 2017, and Open Babel and RDKit stopped Python 2.7
support by 2019. Chemfp 3.5 is the last version of the commercial
chemfp development track which will support Python 2.

If you have code which works under Python 2 and you want it to work on
Python 3, then there are two main options. In some cases you can
re-write all the incompatible code, so the result works under Python 3
but not Python 2. However, that can be too big of a step.

Another option is to port your code to the subset of Python which
works under both Python 2 and Python 3. While this is more work
overall, the steps are smaller, and it's possible to develop new
features while gradually doing the port.

A goal of the chemfp 3 series is to help with that migration. It
supports both Python 2.7 and Python 3.6 or later, with the same
API.

This documentation is written with that second option in mind. The
examples are shown in Python 3, but the same code will work under
Python 2.7. The only differences are in the output, which I'll detail
in the next section.

Unicode and byte strings
=========================

In chemfp 3.x, the record identifier is a Unicode string while the
fingerprint is a byte string. Earlier versions of chemfp treated both
identifiers and fingerprints as byte strings. To make things more
confusing, Python 2 and Python 3 use different ways to input and
denote Unicode and binary strings.

Under Python 2, normal strings are byte strings, while Unicode strings
are represented with the ``u""`` syntax::

  >>> "This is a byte string"     # Python 2
  'This is a byte string'
  >>> u"This is a Unicode string"
   u'This is a Unicode string'  

Under Python 3, normal strings are Unicode strings, while byte strings
are represented with the ``b""`` syntax::
     
  >>> b"This is a byte string"    # Python 3
  b'This is a byte string'
  >>> "This is a Unicode string"
  'This is a Unicode string'     
     
Python 2.7 understands the ``b""`` notation, and Python 3 understands
the ``u""`` notation, so the portable way to represent a Unicode
identifier and binary fingerprint is to be explicit about the string
type::

  >>> id = u"España"   # Works in Python 2.7 and Python 3
  >>> fp = b"\x00A!\xff"

While the data types are the same, the output representations are
different on the two versions of Python::

  >>> (id, fp)    # Python 2.7
  (u'Espa\xf1a', '\x00A!\xff')
  
  >>> (id, fp)    #  Python 3
  ('España', b'\x00A!\xff')

The output in these examples will be from Python 3. Unless otherwise
stated, the equivalent output in Python 2.7 differs only in the
prefix.
  

Hex representation of a binary fingerprint
============================================

In Python 2 it is easy to turn a byte string into a hex-encoded
string::

  >>> fp = b"\x00A!\xff"  # Python 2.7
  >>> fp.encode("hex")
  '004121ff'

The more direct route (and faster) is to use the `binascii.hexlify
<https://docs.python.org/2/library/binascii.html#binascii.hexlify>`_
function::

  >>> import binascii   # Python 2.7
  >>> binascii.hexlify(fp)
  '004121ff'

In Python 3 it's even easier to turn a byte string into a hex-encoded
string::

  >>> fp = b"\x00A!\xff"  # Python 3
  >>> fp.hex()
  '004121ff'

but that is not portable. Nor does ``fp.encode("hex")`` work, because
in Python 3 byte strings do not have an ``encode()`` method::

  >>> fp.encode("hex")  # Python 3
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  AttributeError: 'bytes' object has no attribute 'encode'

If you want a byte string as output then the portable solution is to
use hexlify::

  >>> import binascii  # Python 3
  >>> binascii.hexlify(fp)
  b'004121ff'

However, on Python 2.7 I often want the hex-encoded version as a byte
("normal") string, while on Python 3 I want it as a ("normal") Unicode
string, because I use hex strings for text output.

Python does not offer a portable solution, but chemfp does, in the
:mod:`chemfp.bitops` module, named :func:`hex_encode <.hex_encode>` ::

  >>> from chemfp import bitops  # Python 2 and Python 3
  >>> bitops.hex_encode(b"\x00A!\xff")
  '004121ff'

The variant :func:`hex_encode_as_bytes <.hex_encode_as_bytes>` returns
a byte string, and I think is easier to remember than
``binascii.hexlify``::

  >>> bitops.hex_encode_as_bytes(b"\x00A!\xff")
  b'004121ff'  

Byte and hex fingerprints
=========================

In this section you'll learn how chemfp stores fingerprints and some
of the low-level bit operations on those fingerprints.

chemfp stores fingerprints as byte strings. Here are two 8 bit
fingerprints::

  >>> fp1 = b"A"
  >>> fp2 = b"B"

The :mod:`chemfp.bitops` module contains functions which work on byte
fingerprints. Here's the :func:`byte Tanimoto <.byte_tanimoto>` of those
two fingerprints::

  >>> from chemfp import bitops
  >>> bitops.byte_tanimoto(fp1, fp2)
  0.3333333333333333

To understand why, you have to know that ASCII character "A" has the
value 65, and "B" has the value 66. The bit representation is::

   "A" = 01000001   and   "B" = 01000010

so their intersection has 1 bit and the union has 3, giving a Tanimoto
of 1/3 or 0.3333333333333333 as stored in Python's 64 bit floating
point value.

You can compute the Tanimoto between any two byte strings with the
same length, as in::

  >>> bitops.byte_tanimoto(b"apples&", b"oranges")
  0.58333333333333334

You'll get a ValueError if they have different lengths::

  >>> bitops.byte_tanimoto(b"ABC", b"A")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: byte fingerprints must have the same length

     
The :func:`Tversky index <.byte_tversky>` is also available. The
default values for alpha and beta are 1.0, which is identical to the
Tanimoto::

  >>> bitops.byte_tversky(b"apples&", b"oranges")
  0.5833333333333334 
  >>> bitops.byte_tversky(b"apples&", b"oranges", 1.0, 1.0)
  0.5833333333333334

Using alpha = beta = 0.5 gives the Dice index::
    
  >>> bitops.byte_tversky(b"apples&", b"oranges", 0.5, 0.5)
  0.7368421052631579

In chemfp, the alpha and beta may be between 0.0 and 100.0,
inclusive. Values outside that range will raise a ValueError::

  >>> bitops.byte_tversky(b"A", b"B", 0.2, 101)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: beta must be between 0.0 and 100.0, inclusive

.. highlight:: none

Most fingerprints are not as easy to read as the English ones I showed
above. They tend to look more like::

    P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11

which is hard to read. I usually show hex-encoded fingerprints. The above
fingerprint in hex is::

    503140844b1a4e000a01a610985c1011

.. highlight:: pycon 
    
which is simpler to read. I'll use :func:`hex_encode <.hex_encode>` as
the portable way to convert a byte fingerprint to a string under
Python 2 and Python 3::

  >>> bitops.hex_encode(b"apples&")   # Portable (returns a native string)
  '6170706c657326'
  >>> bitops.hex_encode(b"oranges")
  '6f72616e676573'    
  >>> bitops.hex_decode(b"416e64726577")  # (returns a byte string)
  b'Andrew'

If you do not need to support Python 2.7 then it's easier to use the
Python3 specific ".hex()" and "fromhex()" methods of byte strings::

  >>> b"apples&".hex()   # Python 3 only!
  '6170706c657326'
  >>> b"oranges".hex()   # Python 3 only!
  '6f72616e676573'
  >>> bytes.fromhex("416e64726577")   # Python 3 only!
  b'Andrew'  


Most of the byte functions in the bitops module have an equivalent hex
version, like :func:`.bitops.hex_tanimoto` which is the hex equivalent for
:func:`.bitops.byte_tanimoto`::

  >>> bitops.hex_tanimoto("6170706c657326", "6f72616e676573")
  0.5833333333333334 
  >>> bitops.hex_tanimoto(u"6170706c657326", u"6f72616e676573")
  0.5833333333333334 
  >>> bitops.hex_tanimoto(b"6170706c657326", b"6f72616e676573")
  0.5833333333333334

These functions accept both byte strings and Unicode strings.

Even though hex-encoded fingerprints are easier to read than raw
bytes, it can still be hard to figure out that which bit is set in the
hex fingerprint "00001000" (which is the byte fingerprint
"``\x00\x00\x10\x00``"). For what it's worth, bit number 20 is set, where
bit 0 is the first bit.

You can get the list of "on" bits with the
:func:`.bitops.byte_to_bitlist` function::

  >>> bitops.byte_to_bitlist(b"P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11")
  [4, 6, 8, 12, 13, 22, 26, 31, 32, 33, 35, 38, 41, 43, 44, 49, 50,
  51, 54, 65, 67, 72, 81, 82, 85, 87, 92, 99, 100, 103, 106, 107,
  108, 110, 116, 120, 124]

That's a lot of overhead if you only want to tell if, say, bit 41 is
set. For that case use :func:`.bitops.byte_contains_bit`::

  >>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 41)
  True
  >>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 42)
  False

The :func:`.bitops.byte_from_bitlist` function creates a fingerprint given a
list of 'on' bits. By default it generates a 1024 bit fingerprint, which
is a bit too long for this documentation. I'll use 64 bits instead::

  >>> bitops.byte_from_bitlist([0], 64)
  b'\x01\x00\x00\x00\x00\x00\x00\x00'

The bit positions folded based on the modulo of the fingerprint size,
so bit 65 is mapped to bit 1, as in the following::

  >>> bitops.byte_from_bitlist([0, 65], 64)
  b'\x03\x00\x00\x00\x00\x00\x00\x00'
  >>> bitops.byte_to_bitlist(bitops.byte_from_bitlist([0, 65], 64))
  [0, 1]


The bitops module includes other low-level functions which work on
byte fingerprints, as well as corresponding functions which work on
hex fingerprints. (Hex-encoded fingerprints are decidedly second-class
citizens in chemfp, but they are citizens.) The byte-based functions
are:

* :func:`byte_contains <.byte_contains>` - test if the first fingerprint is contained in the second
* :func:`byte_contains_bit <.byte_contains_bit>` - test if a specified fingerprint bit is on
* :func:`byte_difference <.byte_difference>` - return a fingerprint which is the difference (xor) of two fingerprints
* :func:`byte_from_bitlist <.byte_from_bitlist>` - create a fingerprint given 'on' bit positions
* :func:`byte_intersect <.byte_intersect>` - return a fingerprint which is the intersection of two fingerprints
* :func:`byte_intersect_popcount <.byte_intersect_popcount>` - intersection popcount between two fingerprints
* :func:`byte_popcount <.byte_popcount>` - fingerprint popcount
* :func:`byte_tanimoto <.byte_tanimoto>` - Tanimoto similarity between two fingerprints 
* :func:`byte_tversky <.byte_tversky>` - Tversky index between two fingerprints 
* :func:`byte_to_bitlist <.byte_to_bitlist>` - get a list of the 'on' bit positions
* :func:`byte_union <.byte_union>` - return a fingerprint which is the union of two fingerprints
* :func:`hex_encode <.hex_encode>` - hex encode a byte string, returns the native string type
* :func:`hex_encode_as_bytes <.hex_encode>` - hex encode a byte string, returns a byte string 
 
The hex-based functions are:

* :func:`hex_contains <.hex_contains>` - test if the first hex fingerprint is contained in the second
* :func:`hex_contains_bit <.hex_contains>` - test if  a specified hex fingerprint bit is on
* :func:`hex_difference <.byte_difference>` - return a fingerprint which is the difference (xor) of two hex fingerprints
* :func:`hex_from_bitlist <.hex_from_bitlist>` - create a fingerprint given 'on' bit positions  in a hex fingerprint
* :func:`hex_intersect <.hex_intersect>` - return a fingerprint which is the intersection of two hex fingerprints
* :func:`hex_intersect_popcount <.hex_intersect_popcount>` - intersection popcount between two hex fingerprints
* :func:`hex_isvalid <.hex_isvalid>` - test if the string is a hex-encoded fingerprint
* :func:`hex_popcount <.hex_popcount>` - hex fingerprint popcount
* :func:`hex_tanimoto <.hex_tanimoto>` - Tanimoto similarity between two hex fingerprints 
* :func:`hex_tversky <.hex_tversky>` - Tversky index between two hex fingerprints
* :func:`hex_to_bitlist <.hex_to_bitlist>` - get a list of the 'on' bit positions  in a hex fingerprint
* :func:`hex_union <.byte_union>` - return a fingerprint which is the union of two hex fingerprints
* :func:`hex_decode <.hex_decode>` - convert a hex-encoded string into a byte string

There are two functions which compare a byte fingerprint to a hex
fingerprint. These are somewhat faster than the pure hex version
because they don't need to verify that the query fingerprint contain
only hex characters:

* :func:`byte_hex_tanimoto <.byte_hex_tanimoto>` - Tanimoto similarity between a byte and a hex fingerprint
* :func:`byte_hex_tversky <.byte_hex_tversky>` - Tversky index between a byte and a hex fingerprint


Fingerprint reader and metadata
===============================

In this section you'll learn the basics of the fingerprint reader
classes and fingerprint metadata.

A fingerprint record is the fingerprint plus an identifier. In chemfp,
a :class:`fingerprint reader <.FingerprintReader>` is an object which
supports iteration through fingerprint records. There some fingerprint
readers, like the :class:`.FingerprintArena` also support direct
record lookup.

That's rather abstract, so let's work with a few real examples. You'll
need to create a copy of the "pubchem_targets.fps" file generated in
:ref:`pubchem_fingerprints` in order to follow along.

Here's how to open an FPS file::

  >>> import chemfp
  >>> reader = chemfp.open("pubchem_targets.fps")

Every fingerprint collection has a metadata attribute with details
about the fingerprints. It comes from the header of the FPS file. You
can view the metadata in Python repr format:

  >>> reader.metadata
  Metadata(num_bits=881, num_bytes=111, type='CACTVS-E_SCREEN/1.0 extended=2',
  aromaticity=None, sources=['Compound_048500001_049000000.sdf.gz'],
  software='CACTVS/unknown', date='2020-05-11T14:35:11')

In chemfp 3.x the ``type``, ``software``, ``date`` and the source
filenames are Unicode strings. In earlier versions of chemfp these
were byte strings.
 
I added a few newlines to make that easier to read, but I think it's
easier still to view it in string format, which matches the format of
the FPS header::

  >>> from __future__ import print_function  # Only needed in Python 2
  >>> print(reader.metadata)
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-11T14:35:11

(The ``print`` statement in Python 2 was replaced with a ``print``
function in Python 3. The special `future statement
<https://docs.python.org/2/reference/simple_stmts.html#future>`_ tells
Python 2 to use the new print function syntax of Python 3.)

All fingerprint collections support iteration. Each step of the
iteration returns the fingerprint identifier and the fingerprint byte
string. Since I know the 6th record has the id 14550010, I can write a
simple loop which stops with that record::

  >>> from chemfp import bitops
  >>> for (id, fp) in reader:
  ...   print(id, "starts with", bitops.hex_encode(fp)[:20])
  ...   if id == u"48500199":
  ...     break
  ...
  48500020 starts with 07de0500000000000000
  48500053 starts with 07de0c00000000000000
  48500091 starts with 07de8c00000000000000
  48500092 starts with 07de0d00020000000000
  48500110 starts with 075e0c00000000000000
  48500164 starts with 07de0c00000000000000
  48500177 starts with 03de0500000800000000
  48500199 starts with 07de0c00000000000000

Fingerprint collections also support :meth:`iterating via arenas
<.FingerprintReader.iter_arenas>`, and several support Tanimoto search
methods.


Working with a FingerprintArena
===============================

In this section you'll learn about the FingerprintArena fingerprint
collection and how to iterate through subarenas in a collection.

Chemfp supports two format types. The FPS format is designed to be
easy to read and write, but searching through it requires a linear
scan of the disk, which can only be done once. If you want to do many
queries then it's best to load the FPS data into memory as a
:class:`.FingerprintArena`.

Use :func:`chemfp.load_fingerprints` to load fingerprints into an
arena::

  >>> from __future__ import print_function # Only needed for Python 2
  >>> import chemfp
  >>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> print(arena.metadata)
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-11T14:35:11
    
The fingerprints can come from an FPS file, as in this example, or
from an FPB file. The FPB format is much more complex internally, but
can be loaded directly and quickly into a FingerprintArena, also with
the same function::

  >>> arena = chemfp.load_fingerprints("pubchem_targets.fpb")

An arena implements the fingerprint collection API, so you can do
things like iterate over an arena and get the id/fingerprint pairs::

  >>> from chemfp import bitops
  >>> for id, fp in arena:
  ...     print(id, "with popcount", bitops.byte_popcount(fp))
  ...     if id == u"48656867":
  ...         break
  ...
  48942244 with popcount 33
  48941399 with popcount 39
  48940284 with popcount 40
  48943050 with popcount 40
  48656359 with popcount 41
  48656867 with popcount 42    

If you look closely you'll notice that the fingerprint record order
has changed from the previous section, and that the population counts
are suspiciously non-decreasing. By default :func:`.load_fingerprints`
on an FPS file reorders the fingerprints into a data structure which
is faster to search, though you can disable that with the *reorder*
parameter if you want the fingerprints to be the same as the input
order.

The :class:`.FingerprintArena` has new capabilities. You can ask it
how many fingerprints it contains, get the list of identifiers, and
look up a fingerprint record given an index::

  >>> len(arena)
  14967
  >>> list(arena.ids[:5])
  ['48942244', '48941399', '48940284', '48943050', '48656359']
  >>> id, fp = arena[6]
  >>> id
  '48839855'
  >>> arena[-1][0]  # the identifier of the last record in the arena
  '48985180'
  >>> bitops.byte_popcount(arena[-1][1])  # its fingerprint
  253


An arena supports iterating through subarenas. This is like having a
long list and being able to iterate over sublists. Here's an example
of iterating over the arena to get subarenas of size 2000 (excepting
the last), and print information about each subarena::

  >>> for subarena in arena.iter_arenas(2000):
  ...   print(subarena.ids[0], len(subarena))
  ...
  48942244 2000
  48629741 2000
  48848217 2000
  48873983 2000
  48575094 2000
  48531270 2000
  48806978 2000
  48584671 967    
  >>> arena[0][0]
  '48942244'
  >>> arena[2000][0]
  '48629741'

To help demonstrate what's going on, I showed the first id of each
record along with the main arena ids for records 0 and 2000, so you
can verify that they are the same.

Arenas are a core part of chemfp. Processing one fingerprint at a time
is slow, so the main search routines expect to iterate over query
arenas, rather than query fingerprints.

That's why the FPSReaders -- and all chemfp fingerprint collections --
also support the :meth:`chemfp.FingerprintReader.iter_arenas`
method. Here's an example of reading 25 records at a time from the
targets file::

  >>> queries = chemfp.open("pubchem_queries.fps")
  >>> for arena in queries.iter_arenas(25):
  ...   print(len(arena))
  ...
  25
  25
        <deleted additional lines saying '25'>
  25
  25
  1

Those add up to 10826, which you can verify is the number of structures
in the original source file.

If you have a :class:`.FingerprintArena` then you can also use
Python's slice notation to make a subarena::

  >>> queries = chemfp.load_fingerprints("pubchem_queries.fps")
  >>> queries[10:15]
  <chemfp.arena.FingerprintArena object at 0x552c10>
  >>> queries[10:15].ids
  ['99110546', '99110547', '99123452', '99123453', '99133437']
  >>> queries.ids[10:15]  # a different way to get the same list
  ['99110546', '99110547', '99123452', '99123453', '99133437']

The big restriction is that slices can only have a step size
of 1. Slices like ``[10:20:2]`` and ``[::-1]`` aren't supported. If
you want something like that then you'll need to make a new arena
instead of using a subarena slice. (Hint: pass the list of indices
to the :meth:`arena's copy method <.FingerprintArena.copy>`.)

In case you were wondering, yes, you can use ``iter_arenas`` and the
the other FingerprintArena methods on a subarena::

  >>> queries[10:15][1:3].ids
  ['99110547', '99123452']
  >>> queries.ids[11:13]
  ['99110547', '99123452']


.. _create_arena:

Create an arena with user-specified fingerprints
================================================

In this section you'll learn how to create an arena containing
user-specified fingerprint data.

Most of the examples in this manual use fingerprints created by a
cheminformatics toolkit or extracted from an SD file. Chemfp accepts
any byte string as a fingerprint, which includes, for example, novel
fingerprint types which you have created for your own research.

The first parameter of the :func:`.load_fingerprints` function can be
any iterator which returns a sequence of Unicode identifier and byte
string fingerprint. For example, if you have three fingerprint records
where each fingerprint contains 32-bits of data, like this::

  >>> data = [(u"ID1", b"\xc4\xa7\xd2\x1e"),
  ...         (u"ID2", b"\x04\x82\xd6\x08"),
  ...         (u"ID3", b"\xc1\xa3\xd2\x1e")]

then you can pass the list directly to `load_fingerprints`, along with a
Metadata instance to tell chemfp the fingerprint size and type::

  >>> import chemfp
  >>> arena = chemfp.load_fingerprints(data,
  ...   chemfp.Metadata(num_bytes=4, type="Example/19"))
  >>> len(arena)
  3

What if the fingerprint data comes from a file which isn't in FPS
format? The :mod:`chemfp.bitops` and :mod:`chemfp.encodings` modules
contains functions which can help with the conversion. Suppose each
line in the file contains an id followed by a list of bit indices for
the on bits::

  >>> lines = ["ID1 0 1 9 10 11 14 15 16 18 19 43\n",
  ...          "ID2 0 1 2 9 10 11 12 14 18 19 20 43\n"]

The following function reads the lines, parses the id and bit list,
converts the bitlist into a 64-bit byte string, and yields the
id/fingerprint pairs::

  >>> def get_id_and_fp(lines):
  ...   for line in lines:
  ...     fields = line.split()
  ...     bitlist = [int(bit) for bit in fields[1:]]
  ...     yield fields[0], bitops.byte_from_bitlist(bitlist, 64)
  ...
  >>> for id, fp in get_id_and_fp(lines):
  ...   print(id, repr(fp))
  ...
  ID1 b'\x03\xce\r\x00\x00\x08\x00\x00'
  ID2 b'\x07^\x1c\x00\x00\x08\x00\x00'

Here's one way to use the function to create an arena::

  >>> arena = chemfp.load_fingerprints(get_id_and_fp(lines),
  ...           metadata=chemfp.Metadata(num_bits=64))
  >>>
  >>> len(arena)
  n2
  >>> arena.get_fingerprint(0)
  b'\x03\xce\r\x00\x00\x08\x00\x00'

It's a bit cumbersome to pass the metadata into `load_fingerprints`
when the parser already knows that information, but there's a better
way. If no metadata is passed to the `load_fingerprints` function then
the function will try to get it from the `metadata` attribute of the
first function. That's why you get an exception if you omit the
metadata::

  >>> arena = chemfp.load_fingerprints(get_id_and_fp(lines))
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "chemfp/__init__.py", line 550, in load_fingerprints
      alignment=alignment)
    File "chemfp/arena.py", line 849, in fps_to_arena
      metadata = fps_reader.metadata
  AttributeError: 'generator' object has no attribute 'metadata'

Instead, wrap the metadata and id/fingerprint iterator inside of
a :class:`.FingerprintIterator` utility class::
    
  >>> def read_bitlist_format(lines):
  ...   return chemfp.FingerprintIterator(
  ...       chemfp.Metadata(num_bits=64),
  ...       get_id_and_fp(lines))
  ...

The result can be passed directly to `load_fingerprints`::

  >>> arena = chemfp.load_fingerprints(read_bitlist_format(lines))
  >>> len(arena)
   2
  >>> arena[1]
  ('ID2', b'\x07^\x1c\x00\x00\x08\x00\x00')

The FingerprintIterator also implements the
:meth:`.FingerprintReader.save` method, which can be used to save the
fingerprints to an FPS or FPB file. See the next section for more details.
    
.. _save_fingerprints:

Save a fingerprint arena
========================

In this section you'll learn how to save an arena in FPS and FPB
formats.

This is probably the easiest section. If you have an arena (or any
:class:`.FingerprintReader`), like::

  >>> import chemfp
  >>> queries = chemfp.load_fingerprints("pubchem_queries.fps")

then you can save it to an FPS file using the
:meth:`.FingerprintReader.save` method and a filename ending with
".fps". (You'll also get an FPS file if you specify an unknown
extension.)::

  >>> queries.save("example.fps")

If the filename ends with ".fps.gz" then the file will be saved as a
gzip-compressed FPS file, and if the filename ends with ".fpb.zst" and
the zstandard Python package is installed, then the file will be saved
as a zstandard-compressed FPS file.

Finally, if the name ends with ".fpb", as in::

  >>> queries.save("example.fpb")

then the result will be in FPB format. The save() method can also save
gzip- and zstandard-compressed FPB files.

The ``save`` method supports a second option, *format*, should you for
some odd reason want the format to be different than what's implied by
the filename extension::

  >>> queries.save("example.fpb", "fps")  # save in FPS format

The ``save`` method supports a third option, *level*, which specifies
the compression level. This should be an integer appropriate for the
compression library. The string aliases "min", "default", and "max"
are mapped to the appropriate compression level for the given format:
"min" is 1; "default" is 9 for gzip and 3 for zstandard; "max" is 9
for gzip and 19 for zstandard.


How to use query fingerprints to search for similar target fingerprints
=======================================================================

In this section you'll learn how to do a Tanimoto search using the
previously created PubChem fingerprint files for the queries and the
targets from :ref:`pubchem_fingerprints`.

It's faster to search an arena, so I'll load the target fingerprints:

  >>> from __future__ import print_function   # Only for Python 2.7
  >>> import chemfp 
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> len(targets)
  14967

and open the queries as an FPSReader.

  >>> queries = chemfp.open("pubchem_queries.fps")

I'll use :func:`chemfp.threshold_tanimoto_search` to find, for each
query, all hits which are at least 0.7 similar to the query.

  >>> for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
  ...   print(query_id, len(hits), list(hits)[:2])
  ...
  99000039 641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571)]
  99000230 373 [(2747, 0.703030303030303), (3608, 0.7041420118343196)]
  99002251 270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709)]
  99003537 523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392)]
  99003538 523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392)]
  99005028 131 [(772, 0.7589285714285714), (796, 0.7522123893805309)]
  99005031 131 [(772, 0.7589285714285714), (796, 0.7522123893805309)]
  99006292 308 [(805, 0.7058823529411765), (808, 0.7)]
  99006293 308 [(805, 0.7058823529411765), (808, 0.7)]
  99006597 0 []
        # ... many lines omitted ...

I'm only showing the first two hits for the sake of space. It seems
rather pointless to show all 641 hits of query id 99000039.

However, there's a subtle problem here. The "list(hits)" returns a
list of (index, score) tuples when the targets are an arena, and (id,
score) tuples when the targets are a FPS reader. (I'll talk about that
more in the next section for how that works.) It's best to always
specify how you want the results. In my case I always want the
identifiers and the scores so I'll use
:meth:`hits.get_ids_and_scores() <.SearchResult.get_ids_and_scores>`,
like this:

.. code-block:: python
  
  from __future__ import print_function # Only for Python 2
  import chemfp
  targets = chemfp.load_fingerprints("pubchem_targets.fps")
  queries = chemfp.open("pubchem_queries.fps")
  for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
      print(query_id, len(hits), hits.get_ids_and_scores()[:2])

which gives as output::
  
  99000039 641 [('48528698', 0.7085714285714285), ('48529189', 0.7371428571428571)]
  99000230 373 [('48737535', 0.703030303030303), ('48502523', 0.7041420118343196)]
  99002251 270 [('48857943', 0.7006369426751592), ('48846196', 0.7088607594936709)]
  99003537 523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392)]
  99003538 523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392)]
  99005028 131 [('48659090', 0.7589285714285714), ('48657042', 0.7522123893805309)]
  99005031 131 [('48659090', 0.7589285714285714), ('48657042', 0.7522123893805309)]
  99006292 308 [('48976796', 0.7058823529411765), ('48542022', 0.7)]
  99006293 308 [('48976796', 0.7058823529411765), ('48542022', 0.7)]
  99006597 0 []
          # ... many lines omitted ...

What you don't see in either case is that the implementation uses the
:meth:`chemfp.FingerprintReader.iter_arenas` interface on the queries
so that it processes one subarena at a time. There's a tradeoff
between a large arena, which is faster because it doesn't often go
back to Python code, or a small arena, which uses less memory and is
more responsive. You can change the tradeoff using the *arena_size*
parameter.

If all you need is the count of the hits at or above a given threshold
then use :func:`chemfp.count_tanimoto_hits`::

  >>> queries = chemfp.open("pubchem_queries.fps")
  >>> for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.7):
  ...     print(query_id, count)
  ... 
  99000039 641
  99000230 373
  99002251 270
  99003537 523
  99003538 523
  99005028 131
  99005031 131
  99006292 308
  99006293 308
  99006597 0
       # ... many lines omitted ...

Or, if you only want the k=2 nearest neighbors to each target within
that same threshold of 0.7 then use :func:`chemfp.knearest_tanimoto_search`::

  >>> queries = chemfp.open("pubchem_queries.fps")
  >>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.7):
  ...   print(query_id, hits.get_ids_and_scores())
  ...
  99000039 [('48503376', 0.8784530386740331), ('48503380', 0.8729281767955801)]
  99000230 [('48563034', 0.8588235294117647), ('48731730', 0.8522727272727273)]
  99002251 [('48798046', 0.8109756097560976), ('48625236', 0.8106508875739645)]
  99003537 [('48997075', 0.9035532994923858), ('48997697', 0.8984771573604061)]
  99003538 [('48997075', 0.9035532994923858), ('48997697', 0.8984771573604061)]
  99005028 [('48651160', 0.8288288288288288), ('48848576', 0.8166666666666667)]
  99005031 [('48651160', 0.8288288288288288), ('48848576', 0.8166666666666667)]
  99006292 [('48945841', 0.9652173913043478), ('48737522', 0.8793103448275862)]
  99006293 [('48945841', 0.9652173913043478), ('48737522', 0.8793103448275862)]
  99006597 []
           # ... many lines omitted ...


How to search an FPS file
=========================

In this section you'll learn how to search an FPS file directly,
without loading it into a FingerprintArena. You'll need the previously
created PubChem fingerprint files for the queries and the targets from
:ref:`pubchem_fingerprints`.

The previous example loaded the fingerprints into a
:class:`.FingerprintArena`. That's the fastest way to do multiple
searches. Sometimes you only want to do one or a couple of queries. It
seems rather excessive to read the entire targets file into an
in-memory data structure before doing the search when you could search
while processing the file.

For that case, use an FPSReader as the targets file. Here I'll get the
first two records from the queries file and use it to search the
targets file::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
  >>> query_arena
  <chemfp.arena.FingerprintArena object at 0x11039c850>
  >>> len(query_arena)
  2
    
That first line is complicated. It opens the file and iterates over
its fingerprint records two at a time as arenas. The next() returns
the first of these arenas, so that line is a way of saying "get the
first two records as an arena".    

Here are the k=5 closest hits against the targets file::

  >>> targets = chemfp.open("pubchem_targets.fps")
  >>> for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
  ...   print("** Hits for", query_id, "**")
  ...   for hit in hits.get_ids_and_scores():
  ...     print("", hit)
  ...
  ** Hits for 99000039 **
   ('48503376', 0.8784530386740331)
   ('48503380', 0.8729281767955801)
   ('48732162', 0.8595505617977528)
   ('48520532', 0.8540540540540541)
   ('48985130', 0.8449197860962567)
  ** Hits for 99000230 **
   ('48563034', 0.8588235294117647)
   ('48731730', 0.8522727272727273)
   ('48583483', 0.8411764705882353)
   ('48563042', 0.8352941176470589)
   ('48935653', 0.8333333333333334)    

To make it easier to see, here's the code in a single chunk:

.. code-block:: python
    
  from __future__ import print_function # Only for Python 2
  import chemfp
  query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
  targets = chemfp.load_fingerprints("pubchem_targets.fps")
  for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
      print("**Hits for", query_id, "**")
      for hit in hits.get_ids_and_scores():
          print("", hit)

	    
Remember that the FPSReader reads an FPS file. Once you've done a
search, the file is read, and you can't do another search. (Well, you
*can*; but it will return empty results.) You'll need to reopen the
file to reuse the file, or reseek the file handle to the start
position and pass the handle to a new FPSReader.

Each search processes *arena_size* query fingerprints at a time. You
will need to increase that value if you want to search more than that
number of fingerprints with this method.

How do to a Tversky search using the Dice weights
=================================================

In this section you'll learn how to search a set of fingerprints using
the more general Tversky parameters, without loading it into a
FingerprintArena. You'll need the previously created PubChem
fingerprint files for the queries and the targets from
:ref:`pubchem_fingerprints`.

Chemfp-2.1 added support for Tversky searches. The Tversky index
supports weights for the superstructure and substructure terms to the
similarity. Some people like the Dice index, which is the Tversky
index with alpha = beta = 0.5, so here are a couple of ways to search
the targets based on the Dice index.

The previous two sections did a Tanimoto search by using
:func:`chemfp.knearest_tanimoto_search`. The Tversky search uses
:func:`chemfp.knearest_tversky_search`, which shouldn't be much of a
surprise. Just like the Tanimoto search code, it can take a
fingerprint arena or an FPS reader as the targets.

The first example loads all of the targets into an arena, then
searches using each of the queries:

.. code-block:: python
  
  from __future__ import print_function # Only for Python 2
  import chemfp
  queries = chemfp.open("pubchem_queries.fps")
  targets = chemfp.load_fingerprints("pubchem_targets.fps")
  for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5, 
                                    threshold=0.0, alpha=0.5, beta=0.5):
      print("**Hits for", query_id, "**")
      for hit in hits.get_ids_and_scores():
          print("", hit)

The first two output records are::

  **Hits for 99000039 **
   ('48503376', 0.9352941176470588)
   ('48503380', 0.9321533923303835)
   ('48732162', 0.9244712990936556)
   ('48520532', 0.9212827988338192)
   ('48985130', 0.9159420289855073)
  **Hits for 99000230 **
   ('48563034', 0.9240506329113924)
   ('48731730', 0.9202453987730062)
   ('48583483', 0.9137380191693291)
   ('48563042', 0.9102564102564102)
   ('48935653', 0.9090909090909091)
 
On the other hand, the following reads the first two queries into an
arena, then searches the targets as an FPS file, *without* loading all
of the targets into memory at once::

  import chemfp
  queries = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
  targets = chemfp.open("pubchem_targets.fps")
  for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5, 
                                    threshold=0.0, alpha=0.5, beta=0.5):
      print("** Hits for", query_id, "**")
      for hit in hits.get_ids_and_scores():
          print("", hit)

Not surprisingly, this gives the same output as before::
	    
  ** Hits for 99000039 **
   ('48503376', 0.9352941176470588)
   ('48503380', 0.9321533923303835)
   ('48732162', 0.9244712990936556)
   ('48520532', 0.9212827988338192)
   ('48985130', 0.9159420289855073)
  ** Hits for 99000230 **
   ('48563034', 0.9240506329113924)
   ('48731730', 0.9202453987730062)
   ('48583483', 0.9137380191693291)
   ('48563042', 0.9102564102564102)
   ('48935653', 0.9090909090909091)

  
FingerprintArena searches returning indices instead of ids
===========================================================

In this section you'll learn how to search a
:class:`.FingerprintArena` and use hits based on integer indices
rather than string ids.

The previous sections used a high-level interface to the Tanimoto and
Tversky search code. Those are designed for the common case where you
just want the query id and the hits, where each hit includes the
target id.

Working with strings is actually rather inefficient in both speed and
memory. It's usually better to work with indices if you can, and in
the next section I'll show how to make a distance matrix using this
interface.

The index-based search functions are in the :mod:`chemfp.search` module.
They can be categorized into three groups, with Tanimoto and Tversky
versions for each group:

  1. Count the number of hits:

    * :func:`chemfp.search.count_tanimoto_hits_fp` - search an arena using a single fingerprint (Tanimoto)

    * :func:`chemfp.search.count_tanimoto_hits_arena` - search an arena using another arena (Tanimoto)

    * :func:`chemfp.search.count_tanimoto_hits_symmetric` - search an arena using itself (Tanimoto)

    * :func:`chemfp.search.count_tversky_hits_fp` - search an arena using a single fingerprint (Tversky)

    * :func:`chemfp.search.count_tversky_hits_arena` - search an arena using another arena (Tversky)

    * :func:`chemfp.search.count_tversky_hits_symmetric` - search an arena using itself (Tversky)
      
  2. Find all hits at or above a given threshold, sorted arbitrarily:

    * :func:`chemfp.search.threshold_tanimoto_search_fp` - search an arena using a single fingerprint (Tanimoto)

    * :func:`chemfp.search.threshold_tanimoto_search_arena` - search an arena using another arena (Tanimoto)

    * :func:`chemfp.search.threshold_tanimoto_search_symmetric` - search an arena using itself (Tanimoto)

    * :func:`chemfp.search.threshold_tversky_search_fp` - search an arena using a single fingerprint (Tversky)

    * :func:`chemfp.search.threshold_tversky_search_arena` - search an arena using another arena (Tversky)

    * :func:`chemfp.search.threshold_tversky_search_symmetric` - search an arena using itself (Tversky)


  3. Find the k-nearest hits at or above a given threshold, sorted by decreasing similarity:

    * :func:`chemfp.search.knearest_tanimoto_search_fp` - search an arena using a single fingerprint (Tanimoto)

    * :func:`chemfp.search.knearest_tanimoto_search_arena` - search an arena using another arena (Tanimoto)

    * :func:`chemfp.search.knearest_tanimoto_search_symmetric` - search an arena using itself (Tanimoto)

    * :func:`chemfp.search.knearest_tversky_search_fp` - search an arena using a single fingerprint (Tversky)

    * :func:`chemfp.search.knearest_tversky_search_arena` - search an arena using another arena (Tversky)

    * :func:`chemfp.search.knearest_tversky_search_symmetric` - search an arena using itself (Tversky)

The functions ending "_fp" take a query fingerprint and a target
arena. The functions ending "_arena" take a query arena and a target
arena. The functions ending "_symmetric" use the same arena as both
the query and target.

In the following example, I'll use the first 5 fingerprints of a data
set to search the entire data set. To do this, I load the data set as
an arena, extract the first 5 records as a sub-arena, and do the
search.

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> from chemfp import search
  >>> queries = next(chemfp.open("pubchem_queries.fps").iter_arenas(5))
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> results = search.threshold_tanimoto_search_arena(queries, targets, threshold=0.7)

The :func:`.search.threshold_tanimoto_search_arena` call finds the
target fingerprints which have a similarity score of at least 0.7
compared to the query.

You can iterate over the results (which is a :class:`.SearchResults`)
to get the list of hits for each of the queries. The order of the
results is the same as the order of the records in the query::

  >>> for hits in results:
  ...   print(len(hits), hits.get_ids_and_scores()[:3])
  ...
  641 [('48528698', 0.7085714285714285), ('48529189', 0.7371428571428571), ('48937990', 0.7039106145251397)]
  373 [('48737535', 0.703030303030303), ('48502523', 0.7041420118343196), ('48560268', 0.7)]
  270 [('48857943', 0.7006369426751592), ('48846196', 0.7088607594936709), ('48855282', 0.710691823899371)]
  523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392), ('48529584', 0.705)]
  523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392), ('48529584', 0.705)]

The results object don't store the query id. Instead, you have to know
that the results are in the same order as the input as the query
arena, so you can match the query arena's ``id`` attribute, which
contains the list of fingerprint identifiers, to each result::

  >>> for query_id, hits in zip(queries.ids, results):
  ...   print("Hits for", query_id)
  ...   for hit in hits.get_ids_and_scores()[:3]:
  ...     print("", hit)
  ... 
  Hits for 99000039
   ('48528698', 0.7085714285714285)
   ('48529189', 0.7371428571428571)
   ('48937990', 0.7039106145251397)
  Hits for 99000230
   ('48737535', 0.703030303030303)
   ('48502523', 0.7041420118343196)
   ('48560268', 0.7)
  Hits for 99002251
   ('48857943', 0.7006369426751592)
   ('48846196', 0.7088607594936709)
   ('48855282', 0.710691823899371)
  Hits for 99003537
   ('48542237', 0.7230769230769231)
   ('48739065', 0.7085427135678392)
   ('48529584', 0.705)
  Hits for 99003538
   ('48542237', 0.7230769230769231)
   ('48739065', 0.7085427135678392)
   ('48529584', 0.705)
 
What I really want to show is that you can get the same data only
using the offset index for the target record instead of its id. The
result from a Tanimoto search with a query arena is a
:class:`.SearchResults`. Iterating over the results gives a
:class:`.SearchResult` object, with methods like
:meth:`.SearchResult.get_indices_and_scores`,
:meth:`.SearchResult.get_ids`, and :meth:`.SearchResult.get_scores`::

  >>> for hits in results:
  ...   print(len(hits), hits.get_indices_and_scores()[:3])
  ... 
  641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571), (4576, 0.7039106145251397)]
  373 [(2747, 0.703030303030303), (3608, 0.7041420118343196), (3777, 0.7)]
  270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709), (3185, 0.710691823899371)]
  523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
  523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
  >>> 
  >>> targets.ids[0]
  '48942244'
  >>> targets.ids[1]
  '48941399'
  >>> targets.ids[3619]
  '48528698'
  >>> targets.ids[4302]
  '48529189'

I did a few id lookups given the target dataset to show you that the
index corresponds to the identifiers from the previous code.

These examples iterated over each individual :class:`.SearchResult` to
fetch the ids and scores, or indices and scores. Another possibility
is to ask the :class:`.SearchResults` collection to iterate directly
over the list of fields you
want. :meth:`.SearchResults.iter_indices_and_scores`, for example,
iterates through the ``get_indices_and_score`` of each SearchResult.

::

  >>> for row in results.iter_indices_and_scores():
  ...   print(len(row), row[:3])
  ... 
  641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571), (4576, 0.7039106145251397)]
  373 [(2747, 0.703030303030303), (3608, 0.7041420118343196), (3777, 0.7)]
  270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709), (3185, 0.710691823899371)]
  523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
  523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]

This was added to get a bit more performance out of chemfp and because
the API is sometimes cleaner one way and sometimes cleaner the
other. Yes, I know that the Zen of Python recommends that "there
should be one-- and preferably only one --obvious way to do it." Oh
well.

Access the SearchResult scores and indices as a NumPy array
===========================================================

In this section you'll learn how to access the scores and indices for
the search result hits as a NumPy array. This returns a `structured
array <https://numpy.org/doc/stable/user/basics.rec.html>`_
view of the underlying search result hits. You will need to install `NumPy
<http://numpy.scipy.org/>`_ for the following to work. See the next
section if you want to use ctypes or Python's buffer API.

The following reads the targets file created in
:ref:`pubchem_fingerprints`, randomly selects a query fingerprint
(using the seed 123), then finds the Tanimoto score between that query
and all of the fingerprints in the targets file::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> targets = chemfp.load_fingerprints ("pubchem_targets.fps")
  >>> query_fp = targets.fingerprints.random_choice(rng=123)
  >>> result = targets.threshold_tanimoto_search_fp(query_fp, threshold=0.0)

The :meth:`.SearchResult.get_scores` method returns the result's
scores as a `Python array object
<https://docs.python.org/3/library/array.html>`_, which gives a
compact way to store a list of doubles using a stock Python
installation::

  >>> result.get_scores()[:5]
  array('d', [0.22772277227722773, 0.3, 0.31, 0.24761904761904763, 0.45054945054945056])

However, this array is a copy of the underlying scores. Making the
copy doubles the amont of memory used, and reduces the performance
somewhat as it must create the new array and copy the values over to
it.

Most people who use chemfp also use NumPy, which has its own array
type, along with a large number of functions which work on NumPy
arrays. The :meth:`.SearchResult.as_numpy_array` method returns a
NumPy array *view* of the underlying search result hits. The term
"view" means that the NumPy array doesn't have its own storage space
but uses the same storage space as the search result object.

.. WARNING::
    Do not use the NumPy view of a search result after the search
    result has been cleared or freed up as that will likely cause your
    program to segfault. IN ADDITION, do not modify the contents
    of the array as it may cause chemfp's optimized sort algorithms
    to fail.

The view is implemented as a `structured array
<https://numpy.org/doc/stable/user/basics.rec.html>`_ with access to
the `index` and `score` arrays::

  >>> arr = result.as_numpy_array()  
  >>> arr
  array([(    0, 0.22772277), (    1, 0.3       ), (    2, 0.31      ), ...,
         (14964, 0.30350195), (14965, 0.33333333), (14966, 0.30798479)],
        dtype=[('index', '<i4'), ('score', '<f8')])
  >>> arr["score"][:5]
  array([0.22772277, 0.3       , 0.31      , 0.24761905, 0.45054945])
  >>> arr["index"][:5]
  array([0, 1, 2, 3, 4], dtype=int32)
  
I'll reorder by decreasing score and show that the NumPy array values
also change::

  >>> result.reorder("decreasing-score")
  >>> arr["score"][:5]
  array([1.        , 1.        , 0.97849462, 0.96808511, 0.96808511])

This can be further similified using the short-cut method
:meth:`.SearchResult.get_scores_as_numpy_array`:: 

  >>> result.get_scores_as_numpy_array()
  array([1.        , 1.        , 0.97849462, ..., 0.13432836, 0.13235294,
         0.12977099])
  >>> result.get_indices_as_numpy_array()
  array([856, 857, 907, ..., 334, 371, 217], dtype=int32)
        
These scores can be passed directly to something like matplotlib for
plotting, which in this case shows the distribution of similarity
scores for the given query::

  >>> from matplotlib import pyplot
  >>> pyplot.plot(result.get_scores_as_numpy_array())
  [<matplotlib.lines.Line2D object at 0x122c0b190>]
  >>> pyplot.show()


Access the SearchResult scores and indices as buffer or ctypes structure
========================================================================

In this section you'll learn how to access the scores and indices for
the search result hits as a ctypes structure. This returns an
array-like view of the underlying search result hits using built-in
Python methods, which may be useful if NumPy isn't available or if
NumPy's startup overhead is too high. See the previous section if you
want to use NumPy.

The following reads the targets file created in
:ref:`pubchem_fingerprints`, randomly selects a query fingerprint
(using the seed 123), then finds the Tanimoto score between that query
and all of the fingerprints in the targets file::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> targets = chemfp.load_fingerprints ("pubchem_targets.fps")
  >>> query_fp = targets.fingerprints.random_choice(rng=123)
  >>> result = targets.threshold_tanimoto_search_fp(query_fp, threshold=0.0)

The previous section showed how to access the scores and identifiers
as NumPy arrays. Chemfp does not depend on NumPy, but several part of
the chemfp API are sped up by having a way to get a low-overhead view
of the underlying hits from Python. The
:meth:`.SearchResult.as_ctypes` method returns a `ctypes view
<https://docs.python.org/3/library/ctypes.html>`_ of that data::

  >>> hits = result.as_ctypes()
  >>> hits
  <chemfp.search.Hit_Array_14967 object at 0x10cf9fac0>
  >>> len(hits)
  14967
  >>> hits[0]
  Hit(index=856, score=1.000000)
  >>> hits[len(hits)//4]
  Hit(index=6071, score=0.557047)
  >>> hits[len(hits)//4].score
  0.5570469798657718
  >>> result.reorder("increasing-score")
  >>> hits[0]
  Hit(index=217, score=0.129771)

The raw bytes of the search result hits are available via Python's
`buffer protocol <https://docs.python.org/3/c-api/buffer.html>`_ via
the :meth:`.SearchResult.as_buffer` method. The following show that
each hit takes up 12 bytes of space.::

  >>> buf = result.as_buffer()
  >>> buf
  <memory at 0x123597f40>
  >>> len(buf)
  179604  
  >>> len(buf) / len(result)
  12.0

.. WARNING::
    Do not use the ctypes or buffer views of a search result after the
    search result has been cleared or freed up as that will likely
    cause your program to segfault. IN ADDITION, do not modify the
    contents of those views as it may cause chemfp's optimized sort
    algorithms to fail.

I'll be surprised if anyone uses this function directly. If you do,
let me know why!
  

Access the fingerprint arena bytes as a NumPy array
===================================================

In this section you'll learn how to access the arena's fingerprint
data as a NumPy array. This returns a byte view of the underlying
arena data structure. If you want the fingerprint bits as 0 or 1
values, see the next section. You will need to install `NumPy
<http://numpy.scipy.org/>`_ for the following to work.

A :class:`.FingerprintArena` stores the fingerprints in a contiguous
block of memory. Each fingerprint is stored as the first
``arena.num_bytes`` bytes of a field containing ``arena.storage_size``
bytes of memory. If ``num_bytes`` is smaller than ``storage_size`` then
the field is 0-padded, that is, the remaining bytes are set to 0.

If you work with Python code then you can use chemfp's Python API to
access the fingerprints. But what if you want to access the
fingerprints from a C extension? More specifically, what if you want
to access the fingerprints from NumPy, which contains a lot of
optimized routines for analyzing matrix-like data?

The :meth:`.FingerprintArena.to_numpy_array` method returns a
read-only view of the fingerprint data as a 2D NumPy array with
``len(arena)`` rows and ``arena.storage_size`` columns. Each element of
the matrix is an unsigned 8 bit integer, that is, a byte.

The matrix is a "view" of the data, meaning that it uses the same
contiguous block of memory that the arena uses.

.. WARNING::
    Do not use the NumPy view of an arena from an FPB file after the file
    has been closed as that will likely cause your program to segfault.

Here is an example using MACCS fingerprints for ChEBI 187 generated by
RDKit::

  >>> import chemfp
  >>> arena = chemfp.load_fingerprints("chebi_maccs.fps")
  >>> arr = arena.to_numpy_array()
  >>> arr
  array([[  0,   0,   0, ...,   0,   0,   0],
         [  0,   0,   0, ...,   0,   0,   0],
         [  0,   0,   0, ...,   0,   0,   0],
         ...,
         [  0,  16, 128, ...,   0,   0,   0],
         [  0,  16, 128, ...,   0,   0,   0],
         [  0,   0, 128, ...,   0,   0,   0]], dtype=uint8)
  >>> arr.shape
  (107207, 24)

While it isn't chemically meaningful, I'll sum the bytes down the rows::

  >>> arr.sum(axis=0)
  array([  490116,   204316,   601303,  1485108,   968167,  2407708,
          2464853,  2392025,  6600791,  5761640,  5880625, 10715664,
          8568501, 12248444, 11166730, 13371871, 12146087, 13559574,
         17746237, 20894627,  2761788,        0,        0,        0], dtype=uint64)

The last three values are 0 because of the 0-padding. By default
chemfp uses 64-bit alignment, which means 192 bits or 24 bytes for the
166-bit MACCS key fingerprints, even though only 21 bytes are
needed.

If the 0 padding is a problem then you can use NumPy indexing to make
a new NumPy array which only contains the actual fingerprint bytes::

  >>> unpadded_arr = arr[:,:arena.num_bytes]
  >>> unpadded_arr
  array([[  0,   0,   0, ...,   0,   0,   0],
         [  0,   0,   0, ...,   0,   0,   0],
         [  0,   0,   0, ...,   0,   0,   0],
         ...,
         [  0,  16, 128, ..., 255, 255,  31],
         [  0,  16, 128, ..., 255, 255,  31],
         [  0,   0, 128, ..., 255, 255,  31]], dtype=uint8)
  >>> unpadded_arr.shape
  (107207, 21)
  >>> unpadded_arr.sum(axis=0)
  array([  490116,   204316,   601303,  1485108,   968167,  2407708,
          2464853,  2392025,  6600791,  5761640,  5880625, 10715664,
          8568501, 12248444, 11166730, 13371871, 12146087, 13559574,
         17746237, 20894627,  2761788], dtype=uint64)


Access the fingerprint bits as a NumPy array
============================================

In this section you'll learn how to access the arena's fingerprint bit
values as a NumPy array. This returns a new array containing the
values 0 or 1. If you want a view of the underlying arena bytes, see
the previous section. You will need to install `NumPy
<http://numpy.scipy.org/>`_ for the following to work.

Some people use fingerprint bit values as descriptors for clustering
or other machine learning algorithm. The
:meth:`.FingerprintArena.to_numpy_array` method returns a 2D array
containing bit values. The array contains ``len(arena)`` rows. By
default it returns one column for each fingerprint bit.

Here is an example using MACCS fingerprints for ChEBI 187 generated by
RDKit::

  >>> import chemfp
  >>> arena = chemfp.load_fingerprints("chebi_maccs.fps")
  >>> bitarr = arena.to_numpy_bitarray()
  >>> bitarr.shape
  (107207, 166)

This is a normal NumPy array, so the usual NumPy methods work. For
example, here are the bits for the fingerprint at index 1000::

  >>> bitarr[1000]
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0], dtype=uint8)

and here are the number of occurrences of each bit::

  >>> bitarr.sum(axis=0)
  array([     0,      2,    514,     31,     22,     53,    264,   3663,
            452,    254,   4309,    215,    405,    326,    147,   1235,
           2239,    256,   3632,    177,    407,   4063,   1112,   2929,
           2780,   3946,    423,   2597,  11284,   2798,    481,   8993,
           9241,   3179,    720,   6701,   9993,   8846,   2306,   2387,
           2426,   9607,  10337,  11172,   2605,   1655,   6263,  13749,
          18487,  12349,  10391,   7077,  29244,  28915,  11422,   1557,
          50525,  11758,  10016,  11804,  11876,  20436,   1786,   9572,
          29439,  16754,  12719,    843,  15222,   4294,   4281,  45510,
          13238,  20715,  36963,  14132,  24909,   5101,  26283,  25017,
          18533,  47630,  33626,  13009,  41392,  33512,  13809,  22733,
          56840,  50194,  58465,  50896,  36946,  20788,  57521,  38904,
          51137,  48868,  26627,  46736,  49780,  28319,   7660,  44893,
          51824,  41062,  16770,  41733,  59879,  54221,  56176,  42384,
          39082,  19636,  44832,  45619,  56784,  54437,   6965,  58186,
          50183,  46442,  45119,  36041,      0,  48938,  64600,  55153,
          58571,  28754,  62726,  71980,  41656,  18204,  31671,  61932,
          72978,  56210,  69568,  68414,  40483,  60826,  64600,  45469,
          62761,  81488,  55865,  56584,  57693,  69504,  57550,  78234,
          75151,  83824,  79623,  71747,  89966,  73571,  93265,  78099,
          77122,  66637,  83354, 100861,  88193,      0], dtype=uint64)
            
While the default returns the bits for each fingerprint, you can use
the transpose to get which fingerprints indices contain a given
bit.

For example, there are only 5 fingerprints which set the fifth
bit. Key 5 is defined as "Lanthanide"
and implemented as the SMARTS pattern:
``[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]``. Which fingerprints
contain a lanthanide?

  >>> bitarr.T[4].nonzero()
  (array([  334,   335,   338,   339,   340,   444,   455,   553,   554,
          1135,  1169,  1739,  1863,  3194,  3263,  3264,  3595,  4257,
          6573,  6574, 42598, 45728]),)
        
To make that useful I need the compound ids, so I'll use the indices
to get the ids from the arena::

  >>> [arena.ids[idx] for idx in bitarr.T[4].nonzero()[0]]
  ['CHEBI:33330', 'CHEBI:33331', 'CHEBI:33341', 'CHEBI:33342',
  'CHEBI:33343', 'CHEBI:52622', 'CHEBI:52635', 'CHEBI:49962',
  'CHEBI:49978', 'CHEBI:63020', 'CHEBI:134455', 'CHEBI:139502',
  'CHEBI:32234', 'CHEBI:77566', 'CHEBI:134436', 'CHEBI:134440',
  'CHEBI:53479', 'CHEBI:139496', 'CHEBI:50950', 'CHEBI:51000',
  'CHEBI:59824', 'CHEBI:139501']

Picking out a few of these:

* `CHEBI:33330
  <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:33330>`_ - scandium atom
* `CHEBI:33331
  <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:33331>`_ - yttrium atom
* `CHEBI:139502
  <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI: 139502>`_ - calcium titanate
* `CHEBI: 139501
  <https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:139501>`_ - titanium(IV) bis(ammonium lactato)dihydroxide

so at least I wasn't able to find a false positive!

The above example created the entire bit array but only used the third
column. If you only want the third column then it's faster to pass an
explicit list of the bit columns you want to `to_numpy_bitarray`::

  >>> arena.to_numpy_bitarray([4])
  array([[0],
         [0],
         [0],
         ...,
         [0],
         [0],
         [0]], dtype=uint8)
  >>> arena.to_numpy_bitarray([2]).sum()
  22
  
You can ask for more than one bit column. The following computes the
Pearson product-moment correlation coefficients between columns 163
and 158 (column 163 has the most often set bit, and 158 has the second
most often)::

  >>> bitarr = arena.to_numpy_bitarray([163, 158])
  >>> bitarr
  array([[0, 0],
         [0, 0],
         [0, 0],
         ...,
         [1, 1],
         [1, 1],
         [1, 1]], dtype=uint8)  
  >>> import numpy
  >>> numpy.corrcoef(bitarr, rowvar=0)
  array([[ 1.        ,  0.64876171],
         [ 0.64876171,  1.        ]])

When this section was originally written, extracting 1 column with
`to_numpy_bitarray` was about 20x faster than extracting all of the
columns and selecting just the desired column. The break-even point
for 166 bits was around 45 columns.

       
Computing a distance matrix for clustering
==========================================

In this section you'll learn how to compute a distance matrix using
the chemfp API. The next section shows an alternative way to get the
similarity matrix.

chemfp does not do clustering. There's a huge number of tools which
already do that. A goal of chemfp in the future is to provide some
core components which clustering algorithms can use.

That's in the future, because I know little about how people want to
cluster with chemfp. Right now you can use the following to build a
distance matrix and pass that to one of those tools. (I'll use a
distance matrix of 1 - the similarity matrix.)

Since we're using the same fingerprint arena for both queries and
targets, we know the distance matrix will be symmetric along the
diagonal, and the diagonal terms will be 1.0. The
:func:`chemfp.search.threshold_tanimoto_search_symmetric`
functions can take advantage of the symmetry for a factor of two
performance gain. There's also a way to limit it to just the upper
triangle, which cuts the memory use in half.

Most of those tools use `NumPy <http://numpy.scipy.org/>`_, which is a
popular third-party package for numerical computing. You will need to
have it installed for the following to work.

.. code-block:: python

  import numpy  # NumPy must be installed
  from chemfp import search
  
  # Compute distance[i][j] = 1-Tanimoto(fp[i], fp[j])
  
  def distance_matrix(arena):
      n = len(arena)
      
      # Start off a similarity matrix with 1.0s along the diagonal
      similarities = numpy.identity(n, "d")
      
      ## Compute the full similarity matrix.
      # The implementation computes the upper-triangle then copies
      # the upper-triangle into lower-triangle. It does not include
      # terms for the diagonal.
      results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.0)
      
      # Copy the results into the NumPy array.
      # NOTE: see below for an implementation which is much faster.
      for row_index, row in enumerate(results.iter_indices_and_scores()):
          for target_index, target_score in row:
              similarities[row_index, target_index] = target_score
      
      # Return the distance matrix using the similarity matrix
      return 1.0 - similarities


With the distance matrix in hand, it's easy to cluster. The `SciPy
<http://scipy.org/>`_ package contains many clustering algorithms, as
well as an adapter to generate a `matplotlib
<http://matplotlib.sourceforge.net/>`_ graph. I'll use it to compute a
single linkage clustering using 100 randomly selected fingerprints:

.. code-block:: python
  
  from __future__ import print_function  # Only for Python 2
  import chemfp
  from scipy.cluster.hierarchy import linkage, dendrogram

   # ... insert the 'distance_matrix' function definition here ...
  
  dataset = chemfp.load_fingerprints("pubchem_queries.fps")
  dataset = dataset.sample(100)  # select 100
  distances  = distance_matrix(dataset)
  
  linkage_matrix = linkage(distances, "single")
  dendrogram(linkage_matrix,
             orientation="right",
             labels = dataset.ids)
  
  import pylab
  pylab.show()

NOTE: The above code created an empty NumPy array then filled it in
with the scores. This is slow because much of the work is in Python.

Another possibility is to convert the results into a SciPy compressed
sparse row matrix (see the next section), then turn that sparse array
into a NumPy array. The following `distance_matrix` version is about
5x faster than the earlier one, even though it makes an intermediate
csr matrix, because more of the work is done at the C level::
  
  def distance_matrix(arena):
      n = len(arena)
      
      ## Compute the full similarity matrix.
      # The implementation computes the upper-triangle then copies
      # the upper-triangle into lower-triangle. It does not include
      # terms for the diagonal.
      results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.0)
      
      # Extract the results as a SciPy compressed sparse row matrix
      csr = results.to_csr()
      # Convert it to a NumPy array
      similarities = csr.toarray()
      # Fill in the diagonal
      numpy.fill_diagonal(similarities, 1)
      
      # Return the distance matrix using the similarity matrix
      return 1.0 - similarities


Convert SearchResults to a SciPy csr matrix
===========================================

In this section you'll learn how to convert a SearchResults object
into a SciPy compressed sparse row matrix.

In the previous section you learned how to use the chemfp API to
create a NumPy similarity matrix, and convert that into a distance
matrix. The result is a dense matrix, and the amount of memory goes as
the square of the number of structures.

If you have a reasonably high similarity threshold, like 0.7, then
most of the similarity scores will be zero. Internally the
:class:`.SearchResults` object only stores the non-zero values for
each row, along with an index to specify the column. This is a common
way to compress sparse data.

SciPy has its own
`compressed sparse row ("csr") matrix
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html>`_
data type, which can be used as input to many of the
`scikit-learn clustering algorithms
<http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster>`_.

If you want to use those algorithms, call the
:meth:`.SearchResults.to_csr` method to convert the SearchResults
scores (and only the scores) into a csr matrix. The rows will be in
the same order as the SearchResult (and the original queries), and
the columns will be in the same order as the target arena, including
its ids.

I don't know enough about scikit-learn to give a useful example. (If
you do, let me know!) Instead, I'll start by doing an NxM search of
two sets of fingerprints::

  from __future__ import print_function # Only for Python 2
  import chemfp
  from chemfp import search
  
  queries = chemfp.load_fingerprints("pubchem_queries.fps")
  targets = chemfp.load_fingerprints("pubchem_targets.fps")
  results = search.threshold_tanimoto_search_arena(queries, targets, threshold = 0.8)

The SearchResults attribute :attr:`~.SearchResults.shape` describes the
number of rows and columns::
  
  >>> results.shape
  (10826, 14967)
  >>> len(queries)
  10826
  >>> len(targets)
  14967
  >>> >>> results[426].get_indices_and_scores()
  [(133, 0.85), (153, 0.8064516129032258)]

I'll turn it into a SciPy csr::

  >>> csr = results.to_csr()
  >>> csr
  <10826x14967 sparse matrix of type '<class 'numpy.float64'>'
      with 369471 stored elements in Compressed Sparse Row format>
  >>> csr.shape
  (10826, 14967)

and look at the same row to show it has the same indices and scores::

  >>> csr[426]
  <1x14967 sparse matrix of type '<class 'numpy.float64'>'
      with 2 stored elements in Compressed Sparse Row format>    
  >>> csr[426].indices
  array([133, 153], dtype=int32)
  >>> csr[6].data
  array([ 0.85      ,  0.80645161])


Taylor-Butina clustering
========================

For the last clustering example, here's my (non-validated) variation
of the `Butina algorithm from JCICS 1999, 39, 747-750 <http://www.chemomine.co.uk/dbclus-paper.pdf>`_.
See also http://www.redbrick.dcu.ie/~noel/R_clustering.html . You
might know it as Leader clustering.

.. highlight:: python 

First, for each fingerprint find all other fingerprints with a
threshold of 0.8::

  from __future__ import print_function # Only for Python 2
  import chemfp
  from chemfp import search
    
  arena = chemfp.load_fingerprints("pubchem_targets.fps")
  results = search.threshold_tanimoto_search_symmetric(arena, threshold = 0.8)


Sort the results so that fingerprints with more hits come first. This
is more likely to be a cluster centroid. Break ties arbitrarily by the
fingerprint id; since fingerprints are ordered by the number of bits
this likely makes larger structures appear first::

  # Reorder so the centroid with the most hits comes first.
  # (That's why I do a reverse search.)
  # Ignore the arbitrariness of breaking ties by fingerprint index
  results = sorted( (  (len(indices), i, indices)
                            for (i, indices) in enumerate(results.iter_indices())  ),
                    reverse=True)


Apply the leader algorithm to determine the cluster centroids and the singletons::


  # Determine the true/false singletons and the clusters
  true_singletons = []
  false_singletons = []
  clusters = []
    
  seen = set()
  for (size, fp_idx, members) in results:
      if fp_idx in seen:
          # Can't use a centroid which is already assigned
          continue
      seen.add(fp_idx)
      
      # True singletons have no neighbors within the threshold
      if not members:
          true_singletons.append(fp_idx)
          continue
      
      # Figure out which ones haven't yet been assigned
      unassigned = set(members) - seen
      
      if not unassigned:
          false_singletons.append(fp_idx)
          continue
        
        # this is a new cluster
        clusters.append( (fp_idx, unassigned) )
        seen.update(unassigned)

Once done, report the results::

  print(len(true_singletons), "true singletons")
  print("=>", " ".join(sorted(arena.ids[idx] for idx in true_singletons)))
  print()
  
  print(len(false_singletons), "false singletons")
  print("=>", " ".join(sorted(arena.ids[idx] for idx in false_singletons)))
  print()
  
  # Sort so the cluster with the most compounds comes first,
  # then by alphabetically smallest id
  def cluster_sort_key(cluster):
      centroid_idx, members = cluster
      return -len(members), arena.ids[centroid_idx]
      
  clusters.sort(key=cluster_sort_key)
  
  print(len(clusters), "clusters")
  for centroid_idx, members in clusters:
      print(arena.ids[centroid_idx], "has", len(members), "other members")
      print("=>", " ".join(arena.ids[idx] for idx in members))

.. highlight:: pycon 

The algorithm is quick for this small data set.

Out of curiosity, I tried this on 100,000 compounds selected
arbitrarily from PubChem. It took 35 seconds on my desktop (a 3.2 GHZ
Intel Core i3) with a threshold of 0.8. In the Butina paper, it took
24 hours to do the same, although that was with a 1024 bit fingerprint
instead of 881. It's hard to judge the absolute speed differences of a
MIPS R4000 from 1998 to a desktop from 2011, but it's less than the
factor of about 2000 you see here.

More relevent is the comparison between these numbers for the 1.1
release compared to the original numbers for the 1.0 release. On my
old laptop, may it rest it peace, it took 7 minutes to compute the
same benchmark. Where did the roughly 16-fold peformance boost come
from? Money. After 1.0 was released, Roche funded various
optimizations, including taking advantage of the symmetery (2x) and
using hardware POPCNT if available (4x). Roche and another company
helped fund the OpenMP support, and when my desktop reran this
benchmark it used 4 cores instead of 1.

The wary among you might notice that 2*4*4 = 32x faster, while I
said the overall code was only 16x faster. Where's the factor of 2x
slowdown? It's in the Python code! The
:func:`chemfp.search.threshold_tanimoto_search_symmetric` step took only 13 seconds. The
remaining 22 seconds was in the leader code written in Python. To
make the analysis more complicated, improvements to the chemfp API
sped up the clustering step by about 40%.

With chemfp 1.0 version, the clustering performance overhead was minor
compared to the full similarity search, so I didn't keep track of
it. With chemfp 1.1, those roles have reversed!

The most recent version now is chemfp 3.4, which is about 20%
faster than chemfp 1.4 for this benchmark. And of course the hardware
is faster still.

MinMax Diversity Selection using RDKit
======================================

In this section you'll learn how to do diversity selection using
RDKit's MaxMin picker. You will also learn how to convert chemfp
fingerprints into RDKit fingerprints. You will need to install `RDKit
<http://rdkit.org/>`_ for the following to work. You will also need to
download a `dataset of benzodiazepine structures
<http://dalkescientific.com/writings/benzodiazepine.sdf.gz>`_.

Diversity selection finds elements which are unlike each other. One
way to implement diversity selection is to cluster all of the
compounds then pick a compound from each cluster, but this requires
quadratic time to compute the similarity/distance matrix.

Chemfp does not implement diversity selection, though it may be added
in the future if there is enough demand. I recommend people use the
optimized version of the `MaxMin
<https://onlinelibrary.wiley.com/doi/abs/10.1002/qsar.200290002>`_
from RDKit, which does diversity selection without needing to compute
the full matrix.

While it is possible to have RDKit's MaxMinPicker use native chemfp
fingerprints, there is a huge performance overhead (about 100x!)
because every fingerprint distance requires a Python function call. It
is far faster to convert chemfp fingerprints to RDKit fingerprints so
that all of the processing can be done in C.

I'll start with an example of selecting 100 diverse fingerprints from
the `benzodiazepine data set
<http://dalkescientific.com/writings/benzodiazepine.sdf.gz>`_.  The
first step is to generate fingerprints. I'll use :ref:`rdkit2fps
<rdkit2fps>` to generate RDKit Morgan fingerprints.

.. code-block:: console
  
  % rdkit2fps --morgan benzodiazepine.sdf.gz -o benzodiazepine_morgan2.fps.gz

and then use the chemfp Python API to load the fingerprints. I'll use
``reorder=False`` so the arena fingerprints are in the same order as
the input file. (The order isn't important for this case, but may be
important if you, say, merge two data sets together where you know you
want to keep the first data set and select diverse compounds from the
second.)

.. code-block:: python
  
  >>> import chemfp
  >>> arena = chemfp.load_fingerprints("benzodiazepine_morgan2.fps.gz",
  ...      reorder=False)

The next step is to read the FPS file and convert the chemfp
fingerprints into RDKit fingerprints. This is easy because RDKit
function ``CreateFromBinaryText`` converts a chemfp fingerprint, which
is just a byte string, into the equivalent ``ExplicitBitVect``
fingerprint.

  >>> from rdkit import DataStructs
  >>> rdkit_fps = [DataStructs.CreateFromBinaryText(fp) for fp in arena.fingerprints]

The :attr:`~.FingerprintArena.fingerprints` attribute was added in
chemfp 3.4. For older chemfp versions use::

  >>> rdkit_fps = [DataStructs.CreateFromBinaryText(fp) for id, fp in arena]

Finally, use RDKit to pick 100 diverse record indices::

  >>> from rdkit import SimDivFilters
  >>> picker = SimDivFilters.MaxMinPicker()
  >>> picks = picker.LazyBitVectorPick(rdkit_fps, len(rdkit_fps), 100)
  >>> len(picks)
  100
  >>> list(picks)
  [10879, 8375, 2390, 4683, 3549, 6257, 9194, 9953, 96, 6860, 8016,
  6034, 3197, 4213, 5762, 2323, 7531, 9894, 12279, 3398, 4607, 4827,
  2874, 1608, 3234, 6128, 8710, 7691, 3006, 4898, 4372, 11609, 11401,
  10614, 3861, 1295, 6936, 6192, 7121, 11577, 5092, 2523, 4926, 4614,
  4956, 8762, 2261, 9184, 11666, 2828, 7767, 12027, 5000, 6126, 6266,
  6097, 7966, 9208, 8064, 1327, 6241, 3392, 5730, 7744, 8485, 9299,
  358, 5332, 4434, 2935, 8405, 5480, 4648, 1665, 5848, 9053, 5735,
  6583, 8407, 1706, 5347, 11779, 12022, 2598, 8378, 3565, 7394, 4888,
  10454, 6611, 11472, 2146, 6101, 295, 6632, 6717, 2442, 5638, 5372,
  8279]

The indices match the arena order, so you can use ``arena.ids`` to get
the corresponding id for each index; in this case, PubChem ids::

  >>> arena.ids[10879]
  '22984485'

The RDKit MaxMinPicker also lets you initialize the pick list with a
set of indicies. This is useful if you have a in-house compound data
set X and want to select N diverse fingerprints from a vendor data
set Y. That algorithm might look like:

.. code-block:: python

  import chemfp
  from rdkit import DataStructs, SimDivFilters
  
  have_arena = chemfp.load_fingerprints("X.fps", reorder=False)
  want_arena = chemfp.load_fingerprints("Y.fps", reorder=False)

  # Merge the two fingerprint sets together, but keep track
  # of which came from X.
  fps = [DataStructs.CreateFromBinaryText(fp) for fp in have_arena.fingerprints]
  num_have = len(fps)
  fps.extend(DataStructs.CreateFromBinaryText(fp) for fp in want_arena.fingerprints)
  
  # Do the picking
  num_to_pick = 100
  picker = SimDivFilters.MaxMinPicker()
  have_ids = list(range(num_have))
  picks = picker.LazyBitVectorPick(fps, len(fps), num_have+num_to_pick, have_ids)
  newly_picked = picks[-num_to_pick:]
  want_indices = [idx-num_have for idx in newly_picked]
  
  # Report the picked compounds
  print("Compound to evaluate:")
  for idx in want_indices:
    print(want_arena.ids[idx])

To learn more about the RDKit MaxMin picker and how to use it, see
`Roger Sayle's slides from the 2017 RDKit User Group meeting
<https://github.com/rdkit/UGM_2017/blob/master/Presentations/Sayle_RDKitDiversity_Berlin17.pdf>`_
and `Tim Dudgeon's commentary <http://rdkit.blogspot.com/2017/11/revisting-maxminpicker.html>`_.

Configuring OpenMP threads
==========================

In this section you'll learn about chemfp and OpenMP threads,
including how to set the number of threads to use.

OpenMP is an API for shared memory multiprocessing programming. Chemfp
uses it to parallelize the similarity search algorithms. Support for
OpenMP is a compile-time option for chemfp, and can be disabled with
:option:`--without-openmp` in setup.py. Versions 4.2 of gcc (released
in 2007) and later support it, as do other compilers, though chemfp
has only been tested with gcc.

Chemfp uses one thread per query fingerprint. This means that single
fingerprint queries are not parallelized. There is no performance gain
even if four cores are available.

(A note about nomenclature: a CPU can have one core, or it can have
several cores. A single processor computer has one CPU while a
multiprocessor computer has several CPUs. I think some cores can even
run multiple threads. So it's possible to have many more hardware
threads than CPUs.)

Chemfp uses multiple threads when there are many queries, which occurs
when using a query arena against a target arena. These search methods
include the high-level API in the top-level chemfp module (like
'knearest_tanimoto_search'), and the arena search function in
chemfp.search.

By default, OpenMP and therefore chemfp will use four threads::

  >>> import chemfp
  >>> chemfp.get_num_threads()
  4

You can change this through the standard OpenMP environment variable
OMP_NUM_THREADS in the shell:

.. code-block:: console
  
  % env OMP_NUM_THREADS=2 python
  Python 3.7.4 (default, Aug 13 2019, 15:17:50)
  [Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
  Type "help", "copyright", "credits" or "license" for more information.  
  >>> import chemfp
  >>> chemfp.get_num_threads()
  2

or you can specify the number of threads directly using
set_num_threads():

  >>> chemfp.set_num_threads(3)
  >>> chemfp.get_num_threads()
  3

If you specify 0 or 1 thread then chemfp will not use OpenMP at all
and stick with a single-threaded implementation. (You probably want to
disable OpenMP in multi-threaded programs like web servers. See the
next section for details.)

Throwing more threads at a task doesn't always make it faster. My old
desktop has one CPU with two cores, so it's pointless to have more
than two OpenMP threads running, as you can see from some timings:

.. code-block:: none
  
  threshold_tanimoto_search_symmetric (threshold=0.8) (desktop)
  #threads  time (in s)
     1       22.6
     2       13.1
     3       12.3
     4       12.9
     5       12.6

On the other hand, my old laptop has 1 CPU with four cores, and while
my desktop beats my laptop with single threaded peformance, once I
have three cores going, my laptop is faster:

.. code-block:: none
  
  threshold_tanimoto_search_symmetric (threshold=0.8) (laptop)
  #threads  time (in s)
     1       27.4
     2       14.6
     3       10.3
     4        8.2
     5        9.0

How many cores/hardware threads are available? That's a really good
question. chemfp implements :func:`chemp.get_max_threads`, but that
doesn't seem to do what I want. So don't use it, and I'll figure out a
real solution in a future release.


OpenMP and multi-threaded applications
======================================

In this section you'll learn some of the problems of mixing OpenMP and
multi-threaded code.

Do not use OpenMP and POSIX threads on a Mac. It will crash. This
includes Django, which is a multi-threaded web server. In
multi-threaded code on a Mac you must either tell chemfp to be
single-threaded, using::

  chemfp.set_num_threads(1)

or figure out some way to put the chemfp search code into its own
process space, which is a much harder solution.

Other OSes will let you mix POSIX and OpenMP threads, but life gets
confusing. Might your web server handle three search requests at the
same time? If so, should all of those get four OpenMP threads, so that
12 threads are running in total? Can your hardware handle that many
threads?

It may be better to have chemfp not use OpenMP threads when under a
multi-threaded system, or have some way to limit the number of chemfp
search tasks running at the same time. Figuring out the right solution
will depend on your hardware and requirements.


Fingerprint Substructure Screening (experimental)
=================================================

In this section you'll learn how to find target fingerprints which
contain the query fingerprint bit patterns as a subset. Bear in mind
that this is an experimental API.

Substructure search often uses a screening step to remove obvious
mismatches before doing the subgraph isomorphism. One way is to
generate a binary fingerprint such that if a query molecule is a
substructure of a target molecule then the corresponding query
fingerprint is completely contained in the target fingerprint, that
is, the target fingerprint must have 'on' bits for all of the query
fingerprints which have 'on' bits.

I'll start by loading a fingerprint arena with four fingerprints,
where the identifiers are Unicode strings and the fingerprint are byte
strings of length 1, with the binary form shown to the right::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp 
  >>> from chemfp import bitops
  >>> arena = chemfp.load_fingerprints([
  ...    (u"A1", b"\x44"),   # 0b01000100
  ...    (u"B2", b"\x6c"),   # 0b01101100
  ...    (u"C3", b"\x95"),   # 0b10010101
  ...    (u"D4", b"\xea"),   # 0b11101010
  ...   ], chemfp.Metadata(num_bits=8))
  >>> for id, fp in arena:
  ...   print(bitops.hex_encode(fp), id)
  ... 
  44 A1
  6c B2
  95 C3
  ea D4

I could use :func:`.bitops.byte_contains` to search for fingerprints
in a loop, in this case with a query fingerprint which requires that
the 7th bit be set (they must fit the pattern ``0b*1******``)::

  >>> query_fingerprint = b"\x40"   # 0b01000000
  >>> bitops.hex_encode(query_fingerprint)
  '40'
  >>> for id, target_fingerprint in arena:
  ...   if bitops.byte_contains(query_fingerprint, target_fingerprint):
  ...     print(id)
  ... 
  A1
  B2
  D4

This is slow because it uses Python to do almost all of the
work. Instead, use :func:`.contains_fp` from the :mod:`chemfp.search`
module, which is faster because it's all implemented in C::

  >>> from chemfp import search
  >>> result = search.contains_fp(query_fingerprint, arena)
  >>> result
  <chemfp.search.SearchResult object at 0x10195e090>
  >>> print(result.get_ids())
  ['A1', 'B2', 'D4']

This is the same :class:`.SearchResult` instance that the similarity
search code returns, though the scores are all 0.0::

  >>> result.get_ids_and_scores()
  [('A1', 0.0), ('B2', 0.0), ('D4', 0.0)]
  
This API is experimental and likely to change. Please provide
feedback. While I don't think the current call parameters will change,
I might have it return the Tanimoto score (or Hamming distance?)
instead of 0.0. Or I might have a way to compute new scores given a
SearchResult.

I also plan to support start/end parameters, to search only a subset
of the arena.

There's also a :func:`.search.contains_arena` function which takes a
query arena instead of only a query fingerprint as the query, and
returns a :class:`.SearchResults`::

  >>> results = search.contains_arena(arena, arena)
  >>> results
  <chemfp.search.SearchResults object at 0x10195c2b8>
  >>> for result in results:
  ...   print(result.get_ids_and_scores())
  ... 
  [('A1', 0.0), ('B2', 0.0)]
  [('B2', 0.0)]
  [('C3', 0.0)]
  [('D4', 0.0)]
  
I don't think the NxN version of the "contains" search is all that
useful, so there's no function for that case.

The implementation doesn't yet support OpenMP, ``contains_arena()`` is
only slightly faster than multiple calls to ``contains_fp()``.


Substructure screening with RDKit
=================================

In this section you'll learn how to use `RDKit's pattern fingerprint
<http://rdkit.org/docs/source/rdkit.Chem.rdmolops.html#rdkit.Chem.rdmolops.PatternFingerprint>`_
for substructure screening.

RDKit has a fingerprint tuned for substructure search, though it's
marked as 'experimental' and subject to change. This is the "pattern"
fingerprint.

I'll use it to make a screen for one of the PubChem files. Normally
you would start with something like:

.. code-block:: console
  
  % rdkit2fps --pattern Compound_048500001_049000000.sdf.gz -o pubchem_screen.fpb

but that only gives me the identifiers and fingerprints. I want to
show some of the struture as well, so I'll do a bit of a cheat - I'll
have an augmented identifier which is the PubChem id, followed by a
space, followed by the SMILES string.

I can do this because chemfp supports almost anything as the
"identifier", except newline, tab, and the NUL character, and because
I don't need to support id lookup.

However, I have to write Python code to generate the augmented
identifiers:

.. code-block:: python
  
  import chemfp
  
  fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
  T = fptype.toolkit
  
  with chemfp.open_fingerprint_writer("pubchem_screen.fpb", fptype.get_metadata()) as writer:
    for id, mol in T.read_ids_and_molecules("Compound_048500001_049000000.sdf.gz"):
      smiles = T.create_string(mol, "smistring")  # use the isomeric SMILES string
      fp = fptype.compute_fingerprint(mol)
      # Create an "identifier" of the form:
      #    PubChem id + " " + canonical SMILES string
      writer.write_fingerprint(id + " " + smiles, fp)


Now that I have the screen, I'll write some code to actually do the
screen. I'll make this be an interactive prompt, which asks for the
query SMILES string (or "quit" or "exit" to quit), parses the SMILES
to a molecule, generates the fingerprint, does the screen, and
displays the first 10 results:

.. code-block:: python
  
  from __future__ import print_function # Only for Python 2
  import itertools
  import chemfp
  import chemfp.search
  
  fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
  T = fptype.toolkit
  
  screen = chemfp.load_fingerprints("pubchem_screen.fpb")
  print("Loaded", len(screen), "screen fingerprints")
  
  while 1:
    # Ask for the query SMILES string
    query = input("Query? ")  # use "raw_input()" for Python 2.7
    if query in ("quit", "exit"):
      break
    
    # See if it's a valid SMILES
    mol = T.parse_molecule(query, "smistring", errors="ignore")
    if mol is None:
      print("Could not parse query")
      continue
    
    # Compute the fingerprint and do the substructure screeening
    fp = fptype.compute_fingerprint(mol)
    result = chemfp.search.contains_fp(fp, screen)
    
    # Print the results, up to 10.
    n = len(result)
    if n > 10:
      print(len(result), "matches. First 10 displayed")
      n = 10
    else:
      print(len(result), "matches.")
    
    for augmented_id in itertools.islice(result.iter_ids(), 0, n):
      id, smiles = augmented_id.split()
      print(id, "=>", smiles)
    print()

(In case you haven't seen it before, the "`itertools.islice() <https://docs.python.org/2/library/itertools.html#itertools.islice>`_" gives me
an easy way to get up to the first N items from an iterator.)

I'll try out the above code:

.. code-block:: none
  
  Loaded 5208 screen fingerprints
  Query? c1ccccc1
  12376 matches. First 10 displayed
  48650571 => CCCOCC(=O)NCc1ccccc1
  48672998 => CCCOCC(=O)NOCc1ccccc1
  48845178 => C=C(Br)CNC(=S)Nc1ccccc1
  48548090 => CCNC(=O)N/N=C/c1ccc(C)cc1
  48654127 => CCCOCC(=O)NCCSc1ccccc1
  48548029 => CCNC(=O)N/N=C/c1cccc(C)c1
  48685277 => COCC(C)CNC(=O)c1ccccc1
  48915892 => CNC(=O)NCCc1ccccc1Br
  48653583 => CCCOCC(=O)N(C)c1ccccc1
  48650670 => CCCOCC(=O)Nc1cccc(C)c1  
  
  Query? c1ccccc1O
  4946 matches. First 10 displayed
  48548137 => CCNC(=O)N/N=C/c1cccc(OC)c1
  48651969 => CCCOCC(=O)NCc1cccc(OC)c1
  48980706 => CCCCNC(=O)CCCc1ccc(OC)cc1
  48661290 => CCCOCC(=O)Nc1cccc(OCC)c1
  48653813 => CCCOCC(=O)NCCOc1cccc(C)c1
  48651499 => CCCOCC(=O)NCc1ccccc1OC
  48981063 => COc1ccc(CCCC(=O)NCC(C)C)cc1
  48659995 => CCCOCC(=O)Nc1cccc(OCC#N)c1
  48916672 => CCCCCCOc1cccc(/C=N/NC(N)=O)c1
  48653272 => CCCOCC(=O)NCCc1ccccc1OC

  Query? c1ccccc1I
  10 matches.
  48731386 => Cc1cc(CNC(=O)c2ccc(I)cc2)on1
  48671550 => NC(=O)Cc1ccc(OCC(=O)Nc2ccc(I)cc2)cc1
  48731482 => Cc1cc(CNC(=O)c2cccc(I)c2)on1
  48731331 => Cc1cc(CNC(=O)c2ccccc2I)on1
  48741344 => CN(C)C(=O)c1cccc(NC(=O)Nc2ccc(I)cc2)c1
  48584231 => O=C(Nc1cccc(COCC2CC2)c1)c1ccccc1I
  48688164 => CC1CN(C(=O)c2ccc(I)cc2)CC(C)(C)O1
  48688205 => CC1CN(C(=O)c2cccc(I)c2)CC(C)(C)O1
  48946427 => N#CC1CCN(S(=O)(=O)c2ccc(I)cc2)CC1
  48522115 => CC1(C)COCCN1C(=O)c1ccc(F)cc1I
  
  Query? Fc1c(F)c(F)c(F)c(F)c1F
  3 matches.
  48759600 => O=C(Nc1cccnc1)Nc1c(F)c(F)c(F)c(F)c1F
  48980959 => Cc1cccc2cc(C(=O)Nc3ccc(F)cc3F)oc12
  48981022 => Cc1cccc2cc(C(=O)Nc3c(F)cccc3F)oc12

  Query? quit

Looks reasonable.

It's not hard to add full substructure matching, but it requires
toolkit-specific code. Chemfp doesn't try to abstract that detail, and
I'm not sure it should be part of chemfp. Instead, I'll write some
RDKit-specific code. Chemfp uses native toolkit molecules, so there's
actually only a single line of RDKit code.

I'll also completely rewrite the code so it takes the query string on
the command-line, reports all of the screening results, identifies the
true positives, and then does a brute-force verification that the
screen results are correct. Oh, and report statistics:

.. code-block:: python
  
  # This program is called 'search.py'
  from __future__ import print_function # Only for Python 2
  import sys
  import chemfp
  import chemfp.search
  from chemfp import rdkit_toolkit as T # Will only work with RDKit
  import time
  
  fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
  
  screen = chemfp.load_fingerprints("pubchem_screen.fpb")
  if len(sys.argv) != 2:
    raise SystemExit("Usage: %s <smiles>" % (sys.argv[0],))
  
  query_smiles = sys.argv[1]
  
  start_time = time.time()
  try:
    query_mol = T.parse_molecule(query_smiles, "smistring")
  except ValueError as err:
    raise SystemExit(str(err))
  
  # Compute the fingerprint and do the substructure screeening
  fp = fptype.compute_fingerprint(query_mol)
  result = chemfp.search.contains_fp(fp, screen)
  search_time = time.time()
  
  num_matches = 0
  
  for augmented_id in result.get_ids():
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):   # RDKit specific!
      print(id, "matches", smiles)
      num_matches += 1
    else:
      print(id, "       ", smiles)
  report_time = time.time()

  # Report the results  
  print()
  print("= Screen search =")
  print("num targets:", len(screen))
  print("screen size:", len(result))
  print("num matches:", num_matches)
  print("screenout: %.1f%%" % (100.0 * (len(screen)-len(result)) / len(screen),))
  if len(result) == 0:
    precision = 100.0
  else:
    precision = (100.0*num_matches) / len(result)
  print("precision: %.1f%%" % (precision,))
  print("screen time: %.2f" % (search_time - start_time,))
  print("atom-by-atom-search and report time: %.2f" % (report_time - search_time,))
  print("total time: %.2f" % (report_time - start_time,))

  # Reduce the computations without any screening
  num_actual = 0
  actual_start_time = time.time()
  for augmented_id in screen.ids:
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):   # RDKit specific!
      num_actual += 1
  actual_end_time = time.time()
  
  print()
  print("= Brute force search =")
  print("num matches:", num_actual)
  print("time to test all molecules: %.2f" % (actual_end_time - actual_start_time,))
  print("screening speedup: %.1f" % ((actual_end_time - actual_start_time) / (report_time - start_time),))

Here's the output with 'c1ccccc1O' on the command-line:

.. code-block:: console 
  
  % python search.py c1ccccc1O
  48548137 matches CCNC(=O)N/N=C/c1cccc(OC)c1
  48651969 matches CCCOCC(=O)NCc1cccc(OC)c1
  48980706 matches CCCCNC(=O)CCCc1ccc(OC)cc1
  48661290 matches CCCOCC(=O)Nc1cccc(OCC)c1
  48653813 matches CCCOCC(=O)NCCOc1cccc(C)c1
    ... many lines omitted ...
  48930672 matches CS(=O)(=O)c1ccc(Oc2nc(C3CC3)nc3sc4c(c23)CCCC4)cc1
  48673774 matches COc1ccc(C)cc1-n1ccc(C(=O)N2CCc3[nH]c4ccccc4c3C2)n1
  48551088 matches Cc1cc(C)c(CN2C(=O)NC3(CCOc4ccccc43)C2=O)c(C)c1
  48944841 matches CC(C)(CNS(=O)(=O)CC12CCC(CC1=O)C2(C)C)c1ccc2c(c1)OCO2
  48729925 matches O=C(Cn1c(-c2ccccc2)noc1=O)Nc1ccc2c(c1)OC1(CCCC1)O2  
  
  = Screen search =
  num targets: 14967
  screen size: 4946
  num matches: 4943
  screenout: 67.0%
  precision: 99.9%
  screen time: 0.00
  atom-by-atom-search and report time: 2.99
  total time: 3.00

  = Brute force search =
  num matches: 4943
  time to test all molecules: 5.00
  screening speedup: 1.7

It's a relief to see that the versions with and without the screen
give the same number of matches!

Next, 'c1ccccc1I' (that's iodobenzene):

.. code-block:: console 
  
  % python search.py 'c1ccccc1I'
  48731386 matches Cc1cc(CNC(=O)c2ccc(I)cc2)on1
  48671550 matches NC(=O)Cc1ccc(OCC(=O)Nc2ccc(I)cc2)cc1
  48731482 matches Cc1cc(CNC(=O)c2cccc(I)c2)on1
  48731331 matches Cc1cc(CNC(=O)c2ccccc2I)on1
  48741344 matches CN(C)C(=O)c1cccc(NC(=O)Nc2ccc(I)cc2)c1
  48584231 matches O=C(Nc1cccc(COCC2CC2)c1)c1ccccc1I
  48688164 matches CC1CN(C(=O)c2ccc(I)cc2)CC(C)(C)O1
  48688205 matches CC1CN(C(=O)c2cccc(I)c2)CC(C)(C)O1
  48946427 matches N#CC1CCN(S(=O)(=O)c2ccc(I)cc2)CC1
  48522115 matches CC1(C)COCCN1C(=O)c1ccc(F)cc1I
  
  = Screen search =
  num targets: 14967
  screen size: 10
  num matches: 10
  screenout: 99.9%
  precision: 100.0%
  screen time: 0.01
  atom-by-atom-search and report time: 0.01
  total time: 0.02
  
  = Brute force search =
  num matches: 10
  time to test all molecules: 5.17
  screening speedup: 281.4

Now for some bad news. Try '[Pu]'. This doesn't screen out many
structures yet has no matched. I'll report the search statistics:

.. code-block:: none 
  
  = Screen search =
  num targets: 14967
  screen size: 14967
  num matches: 0
  screenout: 0.0%
  precision: 0.0%
  screen time: 0.00
  atom-by-atom-search and report time: 8.40
  total time: 8.40
  
  = Brute force search =
  num matches: 0
  time to test all molecules: 5.24
  screening speedup: 0.6

That's horrible! It's slower! What happened is that '[Pu]' generates a
fingerprint with only two bits set:

.. code-block:: console 
  
  % echo '[Pu] plutonium' | rdkit2fps --pattern --fpSize 1024
  #FPS1
  #num_bits=1024
  #type=RDKit-Pattern/4 fpSize=1024
  #software=RDKit/2019.09.1 chemfp/3.4
  #date=2020-05-13T12:12:48
  00000000200000000000000000000000000000000000000000000000000000000000000000000000
  00008000000000000000000000000000000000000000000000000000000000000000000000000000
  00000000000000000000000000000000000000000000000000000000000000000000000000000000
  0000000000000000	plutonium

You know, that's really hard to see. I'll use a bit of perl to replace
the zeros with "."s::

  % echo '[Pu] plutonium' | python ../rdkit2fps --pattern --fpSize 1024 \
  ?      | perl -pe 's/0/./g'
  #FPS1
  #num_bits=1.24
  #type=RDKit-Pattern/4 fpSize=1.24
  #software=RDKit/2.19..9.1 chemfp/3.4
  #date=2.2.-.5-13T12:15:19
  ........2.......................................................................
  ....8...........................................................................
  ................................................................................
  ................	plutonium

Ha! And it converted zeros in the header lines to "." (and it would
have converted any zeros in the identifier). I'll just omit the header
lines in the following.

Unfortunately, so many other structures also set those two bits that
it isn't an effective screen for plutonium.


Reading structure fingerprints using a toolkit
==============================================

In this section you'll learn how to use a chemistry toolkit to compute
fingerprints from a given structure file.

What happens if you're given a structure file and you want to find the
two nearest matches in an FPS file? You'll have to generate the
fingerprints for the structures in the structure file, then do the
comparison.

For this section you'll need to have a chemistry toolkit. I'll use the
"chebi_maccs.fps" file generated in :ref:`chebi_fingerprints` as the
targets, and the PubChem file `Compound_099000001_099500000.sdf.gz
<ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/Compound_099000001_099500000.sdf.gz>`_
as the source of query structures::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> from chemfp import search
  >>> targets = chemfp.load_fingerprints("chebi_maccs.fps")
  >>> queries = chemfp.read_molecule_fingerprints(targets.metadata, "Compound_099000001_099500000.sdf.gz")
  >>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.0):
  ...   print(query_id, "=>", end=" ")
  ...   for (target_id, score) in hits.get_ids_and_scores():
  ...     print("%s %.3f" % (target_id, score), end=" ")
  ...   print()
  ... 
  99000039 => CHEBI:116650 0.870 CHEBI:105034 0.812
  99000230 => CHEBI:120636 0.840 CHEBI:127468 0.839
  99002251 => CHEBI:92604 0.756 CHEBI:92191 0.733
  99003537 => CHEBI:112376 0.745 CHEBI:32193 0.696
  99003538 => CHEBI:112376 0.745 CHEBI:32193 0.696
       ... many, many lines omitted ...

That's it! Pretty simple, wasn't it? I didn't even need to explicitly
specify which toolkit I wanted to use because the
:func:`.read_molecule_fingerprints` got that information from the
arena's :class:`.Metadata`.

The new function is :func:`chemfp.read_molecule_fingerprints`, which
reads a structure file and generates the appropriate fingerprints for
each one. The first parameter of this is the metadata used to
configure the reader. In my case it's::

  >>> print(targets.metadata)
  #num_bits=166
  #type=OpenBabel-MACCS/2
  #software=OpenBabel/3.0.0 chemfp/3.4
  #source=ChEBI_lite.sdf.gz
  #date=2020-05-12T10:09:35

The metadata's "type" told chemfp which toolkit to use to read
molecules, and how to generate fingerprints from those
molecules.

You can pass in your own metadata as the first parameter to
``read_molecule_fingerprints``, and as a shortcut, if you pass in a
string then it will be used as the fingerprint type.

For examples, if you have OpenBabel installed then you can do::

  >>> import chemfp 
  >>> from chemfp import bitops 
  >>> reader = chemfp.read_molecule_fingerprints("OpenBabel-MACCS", "Compound_099000001_099500000.sdf.gz")
  >>> for i, (id, fp) in enumerate(reader):
   ...   print(id, bitops.hex_encode(fp))
   ...   if i == 3:
   ...     break
   ... 
  99000039 000004000000300001c0004e9361b041dce1676e1f
  99000230 000000800100649f0445a7fe2aeab1eb8f6bdfff1f
  99002251 00000000001132000088004985601140dce4e3fe1f
  99003537 00000000200020000156149a906994830c3159ae1f


If you have OEChem and OEGraphSim installed and licensed then you can do::

  >>> import chemfp
  >>> from chemfp import bitops
  >>> reader = chemfp.read_molecule_fingerprints("OpenEye-MACCS166", "Compound_099000001_099500000.sdf.gz")
  >>> for i, (id, fp) in enumerate(reader):
  ...   print(id, bitops.hex_encode(fp))
  ...   if i == 3:
  ...     break
  ... 
  99000039 000004000000300001c0404e93e19053dca06b6e1b
  99000230 000000880100648f0445a7fe2aeab1738f2a5b7e1b
  99002251 00000000001132000088404985e01152dca46b7e1b
  99003537 00000000200020000156149a90e994938c30592e1b
  
If you have RDKit installed then you can do::

  >>> import chemfp
  >>> from chemfp import bitops
  >>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_099000001_099500000.sdf.gz")
  >>> for i, (id, fp) in enumerate(reader):
  ...   print(id, bitops.hex_encode(fp))
  ...   if i == 3:
  ...     break
  ... 
  99000039 000004000000300001c0004e9361b051dce1676e1f
  99000230 000000800100649f0445a7fe2aeab1fb8f6bdfff1f
  99002251 00000000001132000088004985601150dce4e3fe1f
  99003537 00000000200020000156149a906994930c3159ae1f

And if you have the CDK JAR file on your CLASSPATH and the JPype
Python/Java adapter installed then you can do::

  >>> import chemfp
  >>> from chemfp import bitops
  >>> reader = chemfp.read_molecule_fingerprints("CDK-MACCS", "Compound_099000001_099500000.sdf.gz")
  >>> for i, (id, fp) in enumerate(reader):
  ...   print(id, bitops.hex_encode(fp))
  ...   if i == 3:
  ...     break
  ... 
  99000039 000004000000300001c4004e83e1b053dce16f6e1f
  99000230 000000800100649f0645a7fe2aeab1eb8febffff1f
  99002251 00000000001132000088004985e01172dce4ebfe1f
  99003537 00000000200020000356149a80e994930cb179ae1f


Select a fingerprint subset using a list of indices
===================================================

In this section you'll learn how to make a new arena given a list of
indices for the fingerprints to select from an old arena.

For this section, one example will use indices will be a randomly
selected subset of the indices in the fingerprint. If that's your
goal, see the next section, :ref:`sample_N_elements` to learn how to
use :meth:`.FingerprintArena.sample`. If you want to split the arena
into a training set and a test set, see the section after that,
:ref:`train_and_test_split` which shows how to use
:meth:`.FingerprintArena.train_test_split`.
      
A :class:`.FingerprintArena` slice creates a subarena. Technically
speaking, this is a "view" of the original data. The subarena doesn't
actually copy its fingerprint data from the original arena. Instead,
it uses the same fingerprint data, but keeps track of the start and
end position of the range it needs. This is why it's not possible to
slice with a step size other than +1.

This also means that memory for a large arena won't be freed until
all of its subarenas are also removed.

You can see some evidence for this because a :class:`.FingerprintArena` stores
the entire fingerprint data as a set of bytes named ``arena``::

  >>> import chemfp
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps") 
  >>> subset = targets[10:20]
  >>> targets.arena is subset.arena
  True

This shows that the *targets* and *subset* share the same raw data
set. At least it does to me, the person who wrote the code.

You can ask an arena or subarena to make a :meth:`copy
<chemfp.arena.FingerprintArena.copy>`. This allocates new memory for
the new arena and copies all of its fingerprints there.

::

  >>> new_subset = subset.copy()
  >>> len(new_subset) == len(subset)
  True
  >>> new_subset.arena is subset.arena
  False
  >>> subset[7][0]
  '48637548'
  >>> new_subset[7][0]
  '48637548'


The :meth:`.FingerprintArena.copy` method can do more than just copy
the arena. You can give it a list of indices (or an iterable) and it
will only copy those fingerprints::

  >>> three_targets = targets.copy([3112, 0, 1234])
  >>> three_targets.ids
  ['48942244', '48568841', '48628197']
  >>> [targets.ids[3112], targets.ids[0], targets.ids[1234]]
  ['48628197', '48942244', '48568841']

Are you confused about why the identifiers aren't in the same order?
That's because when you specify indicies, the copy automatically
reorders them by popcount and stores the popcount information. This
requires a bit extra overhead to sort, but makes future searches
faster. Use ``reorder=False`` to leave the order unchanged

  >>> my_ordering = targets.copy([3112, 0, 1234], reorder=False)
  >>> my_ordering.ids
  ['48628197', '48942244', '48568841']

Suppose you want to partition the data set into two parts; one
containing the fingerprints at positions 0, 2, 4, ... and the other
containing the fingerprints at positions 1, 3, 5, .... The `range()`
function returns iterator for the right length, and you can have it start
from either 0 or 1 and count by twos, like this::

  >>> list(range(0, 10, 2))
  [0, 2, 4, 6, 8]
  >>> list(range(1, 10, 2))
  [1, 3, 5, 7, 9]

so the following will create the correct indices and from that the
correct arena subsets::

  >>> evens = targets.copy(range(0, len(targets), 2))
  >>> odds = targets.copy(range(1, len(targets), 2))
  >>> len(evens)
  7484
  >>> len(odds)
  7483

(Use :meth:`.FingerprintArena.train_test_split` if you want to select
two disjoint subsets selected at random without replacement.)

What about getting a random subset of the data? I want to select *m*
records at random, without replacement, to make a new data set. (See
the next section for a better way to do this using
:meth:`.FingerprintArena.sample`.)

You can see this just means making a list with *m* different index
values. Python's built-in `random.sample
<https://docs.python.org/3/library/random.html#random.sample>`_
function makes this easy::

  >>> import random
  >>> random.sample("abcdefgh", 3)
  ['b', 'h', 'f']
  >>> random.sample("abcdefgh", 2)
  ['d', 'a']
  >>> random.sample([5, 6, 7, 8, 9], 2)
  [7, 9]
  >>> help(random.sample)
  Help on method sample in module random:
    
  sample(population, k) method of random.Random instance
      Chooses k unique random elements from a population sequence or set.
       ...
      To choose a sample in a range of integers, use range as an argument.
      This is especially fast and space efficient for sampling from a
      large population:   sample(range(10000000), 60)

The last line of the help points out what do next!::

  >>> random.sample(range(len(targets)), 5)
  [610, 2850, 705, 1402, 2635]
  >>> random.sample(range(len(targets)), 5)
  [1683, 2320, 1385, 2705, 1850]

(Note: on Python 2.7 you'll need to use "xrange()" not "range()".)

Putting it all together, and here's how to get a new arena containing
100 randomly selected fingerprints, without replacement, from the
*targets* arena::

  >>> sample_indices = random.sample(range(len(targets)), 100)
  >>> sample = targets.copy(indices=sample_indices)
  >>> len(sample)
  100

But really, see the next section for an easier way to do this.

.. _sample_N_elements:

Sample N fingerprints at random
===============================

In this section you'll learn how to select a random subset of the
fingerprints in an arena.

The previous section showed how to use the
:meth:`.FingerprintArena.copy` method to create a new arena containing
a randomly selected subset of the fingerprints in an arena. This
required writing some code to specify the randomly samples indices.

Chemfp 3.4.1 added the method :meth:`.FingerprintArena.sample` which
lets you make a random sample using a single call::

  >>> import chemfp
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> sample_arena = targets.sample(10000)
  >>> len(sample_arena)
  10000  
  >>> sample_arena.ids[:5]
  ['48941399', '48940284', '48943050', '48656867', '48839855']

If you do the sample a few times you'll see that many of the elements
occur often::

  >>> targets.sample(10000).ids[:5]
  ['48942244', '48656867', '48966209', '48946425', '48946734']
  >>> targets.sample(10000).ids[:5]
  ['48942244', '48940284', '48656359', '48839855', '48946668']
  >>> targets.sample(10000).ids[:5]
  ['48942244', '48940284', '48656359', '48656867', '48839855']
  >>> targets.sample(10000).ids[:5]
  ['48940284', '48656359', '48656867', '48839855', '48946668']

This is for two reasons. First, the sample size is about 2/3rds of the
size of the the data set::
  
  >>> len(targets)
  14967

which means there's a roughly 2/3rds chance that a given record will
be in the sample. Second, by default the sampled fingerprints are reordered
by popcount when making the arena, which means many of the first few
identifiers are the same.

Set `reorder` to `False` to keep the fingerprints in random sample
order::

  >>> targets.sample(10000, reorder=False).ids[:5]
  ['48830242', '48946583', '48559359', '48836764', '48692192']
  >>> targets.sample(10000, reorder=False).ids[:5]
  ['48868183', '48703234', '48577832', '48913224', '48659805']
  >>> targets.sample(10000, reorder=False).ids[:5]
  ['48965603', '48596355', '48691077', '48688289', '48940955']
  >>> targets.sample(10000, reorder=False).ids[:5]
  ['48560433', '48933559', '48662000', '48958077', '48675138']

Remember that similarity search performance is better if the the
fingerprints are sorted by popcount.

The above examples used `num_samples=10000`. If `num_samples` is an
integer, then it's used as the number of samples to make. (Chemfp
raises a ValueError if the size is negative or too large.) If
`num_samples` is a float between 0.0 and 1.0 inclusive then it's used
as the fraction of the dataset to sample. For example, the following
samples 10% of the arena, rounded down::

  >>> len(targets.sample(0.1))
  1496

If no `rng` is given then the underlying implementation uses Python's
`random.sample
<http://docs.python.org/3/library/random.html#random.sample>`__
function. That in turn uses a random number
generator (RNG) which was initialized with a hard-to-guess seed.

If you need a reproducible sample, you can pass in an integer `rng`
value. This is used to seed a new RNG for the sampling. In the
following example, using the same seed always returns the same
fingerprints::

  >>> targets.sample(2, rng=123).ids
  ['48651340', '48778262']
  >>> targets.sample(2, rng=123).ids
  ['48651340', '48778262']
  >>> targets.sample(2, rng=789).ids
  ['48693989', '48507089']
  >>> targets.sample(2, rng=789).ids
  ['48693989', '48507089']

Another option is pass in a
`random.Random()
<https://docs.python.org/3/library/random.html#random.Random>`_
instance, which will be used directly as the RNG::
  
  >>> import random
  >>> my_rng = random.Random(123)
  >>> targets.sample(2, rng=my_rng).ids
  ['48651340', '48778262']
  >>> targets.sample(2, rng=my_rng).ids
  ['48730072', '48908385']
  >>> targets.sample(2, rng=my_rng).ids
  ['48690445', '48502715']

This may be useful if you need to make several random samples, want
reproducibility, and only want to specify one RNG seed. (Be aware that
Python's RNG may be subject to change across different versions of
Python.)

.. _train_and_test_split:

Split into training and test sets
=================================

In this section you'll learn how to split an arena into two disjoint
arenas, which can be then be used as a training set and a test set.

The previous section, :ref:`sample_N_elements` showed how to use
chemfp to select N fingerprints at random from an arena. Sometimes you
need two randomly selected subsets, with no overlap between the
two. For example, one might be used as a training set and the other as
a test set.

Chemfp 3.4.1 added the method
:meth:`.FingerprintArena.train_test_split` which does that. You give
it the number of fingerprints you want in the training set and/or the
test set, and it returns two arenas; the first is the training set and
the second is the test set::
  
  >>> import chemfp
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> len(targets)
  14967
  >>> train_arena, test_arena = targets.train_test_split(train_size=10, test_size=5)
  >>> len(train_arena)
  10
  >>> len(test_arena)
  5

This function is modeled on the scikit learn function
`train_test_split()
<https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>`_
, which allows for the sizes to be specified as an integer number or a
floating point fraction.

If a specified size is an integer, it is interpreted at the number of
fingerprints to have in the corresponding set. If a specified size is
a float between 0.0 and 1.0 inclusive then it's interpreted as the
fraction of fingerprints to select. For example, the following puts
10% of the fingerprints into the training arena and 20 fingerprints 

  >>> train_arena, test_arena = targets.train_test_split(train_size=0.1, test_size=20)
  >>> len(train_arena), len(test_arena)
  (1496, 20)

If you don't specify the test or arena size then the training set gets
75% of the fingerprints and the test set gets the rest::

  >>> train_arena, test_arena = targets.train_test_split()
  >>> len(train_arena), len(test_arena)
  (11226, 3741)

If only one of `train_size` or `test_size` is specified then the
other value is interpreted as the complement size, so the entire arena
is split into the two sets. In the following, 75% of the fingerprints
are put into the training arena and 25% into the test arena::

  >>> train_arena, test_arena = targets.train_test_split(train_size=0.75)
  >>> len(train_arena), len(test_arena)
  (11225, 3742)

It is better to let chemp figure out the complement size than to
specify both sizes as a float because integer rounding may cause a
fingerprint to be left out (the test arena size is 3741 in the
following when it should be 3742)::

  >>> train_arena, test_arena = targets.train_test_split(train_size=0.75, test_size=0.25)
  >>> len(train_arena), len(test_arena)
  (11225, 3741)

By default, after the random sampling the fingerprints in each set are
reordered by population count and indexed for fast similarity search.

  >>> from chemfp import bitops
  >>> train_arena, test_arena = targets.train_test_split(10, 10)
  >>> [bitops.byte_popcount(train_arena.get_fingerprint(i)) for i in range(10)]
  [71, 118, 119, 145, 146, 159, 162, 167, 176, 196]
  >>> [bitops.byte_popcount(test_arena.get_fingerprint(i)) for i in range(10)]
  [87, 116, 117, 121, 129, 131, 139, 183, 184, 193]

To keep the fingerprints in random sample order, specify
`reorder=False`::

  >>> train_arena, test_arena = targets.train_test_split(10, 10, reorder=False)
  >>> [bitops.byte_popcount(train_arena.get_fingerprint(i)) for i in range(10)]
  [118, 53, 170, 110, 138, 169, 129, 125, 129, 151]
  >>> [bitops.byte_popcount(test_arena.get_fingerprint(i)) for i in range(10)]
  [172, 167, 123, 152, 147, 162, 156, 197, 45, 151]  

The `rng` parameter affects how the fingerprints are samples. By
default (if `rng=None`), Python's default RNG is used. If `rng` is an
integer then it's used as the seed for a new `random.Random()
<https://docs.python.org/3/library/random.html#random.Random>`_ instance. Otherwise
it's assumed to be an RNG object and its `sample()
<https://docs.python.org/3/library/random.html#random.sample>`_ method
is used to make the sample.

The `rng` parameter here works the same as in
:meth:`.FingerprintArena.sample` so for examples see the previous
section, :ref:`sample_N_elements`.


Don't reorder an arena by popcount
==================================

In this section you'll learn about why you might want to store your
fingerprints in specific order, rather than being ordered by
population count.

The previous section showed how to make an arena where the
fingerprints are in a user-specified order::

  >>> import chemfp
  >>> targets = chemfp.load_fingerprints("pubchem_targets.fps") 
  >>> [targets.ids[i] for i in [3112, 0, 1234]]
  ['48628197', '48942244', '48568841']
  >>> targets.copy([3112, 0, 1234], reorder=False).ids
  ['48628197', '48942244', '48568841']
  >>> targets.copy([3112, 0, 1234], reorder=True).ids
  ['48942244', '48568841', '48628197']

If the ``reorder`` option is not specified, the fingerprints in the
new arena will be in popcount order. Similarity search is faster when
the arena is in popcount order because it lets chemfp make an index of
the different regions, based on popcount, and use that for sublinear
search.

Why would someone want search to be slower?

Sometimes data organization is more important. For one client I
developed a SEA implementation, where I compared a set of query
fingerprints to about 50 other sets of target fingerprint sets. The
largest set had only few thousand fingerprints, so the overall search
was fast without a popcount index.

I could have stored each target data set as its own file, but that
would have resulted in about 50 data files to manage, in addition to
the original fingerprint file and the configuration file containing
the information about which identifiers are in which set.

Instead, I stored all of the target data sets in a single FPB file,
where the fingerprints for the first set came first, then the
fingerprints for the second set, and so on. I also made a range file
to store the set name and the start/end range of that set in the FPB
file. This reduced 50 files down to two, which was much easier to
manage.

It's a bit fiddly to go through the details of how this works, because
it requires set membership information which is a bit complicated to
extract and which won't be used for the rest of this
documentation. Instead of walking though an example here, I'll refer
you to my essay `ChEMBL target sets association network
<http://www.dalkescientific.com/writings/diary/archive/2017/03/27/chembl_target_sets_association_network.html>`_.

You can use the subranges directly as an arena slice, like
``arena[54:91]`` as the target. This will work, but as I said earlier,
the search time will be slower because the sublinear algorithm
requires a popcount index.

If you need that search performance then during load time make a copy
of the slice, as in ``arena[54:91].copy(reorder=True)``, and use that
as the target.

A few paragraphs ago I wrote that "I stored all of the target data
sets in a single FPB file." When you load an FPB format, the
fingerprint order will be exactly as given in the file. However, if
you load fingerprints from an FPS file, the fingerprints are by
default reordered. For example, given this data set:

.. code-block:: none
  
  % cat unordered_example.fps
  #FPS1
  0001	Record1
  ffee	Record2
  00f0	Record3

I'll load it into chemfp and show that by default the records are in
the order 1, 3, 2::

  >>> import chemfp
  >>> chemfp.load_fingerprints("unordered_example.fps").ids
  ['Record1', 'Record3', 'Record2']

On the other hand, if I ask it to not reorder then the records are in
the input order, which is 1, 2, 3::

  >>> chemfp.load_fingerprints("unordered_example.fps", reorder=False).ids
  ['Record1', 'Record2', 'Record3']

In short, if you want to preserve the fingerprint order as given in
the input file then use the ``reorder=False`` argument in
:func:`chemfp.load_fingerprints`.

    
Look up a fingerprint with a given id
=====================================

In this section you'll learn how to get a fingerprint record with a
given id. You will need the "pubchem_targets.fps" file generated in
:ref:`pubchem_fingerprints` in order to do this yourself.

All fingerprint records have an identifier and a
fingerprint. Identifiers should be unique. (Duplicates are allowed, and
if they exist then the lookup code described in this section will
arbitrarily decide which record to return. Once made, the choice will
not change.)

Let's find the fingerprint for the record in "pubchem_targets.fps"
which has the identifier "14564126". One solution is to iterate
over all of the records in a file, using the FPS reader::

  >>> import chemfp
  >>> for id, fp in chemfp.open("pubchem_targets.fps"):
  ...   if id == "48500164":
  ...     break
  ... else:
  ...   raise KeyError("%r not found" % (id,))
  ...
  >>> id, fp[:5]
  ('48500164', b'\x07\xde\x0c\x00\x00')

I used the somewhat obscure ``else`` clause to the ``for`` loop. If the
``for`` finishes without breaking, which would happen if the identifier
weren't present, then it will raise an exception saying that it
couldn't find the given identifier.

If the fingerprint records are already in a :class:`.FingerprintArena` then
there's a better solution. Use the :meth:`.FingerprintArena.get_fingerprint_by_id`
method to get the fingerprint byte string, or None if the
identifier doesn't exist::

  >>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> fp = arena.get_fingerprint_by_id("48500164")
  >>> fp[:5]
  b'\x07\xde\x0c\x00\x00'
  >>> missing_fp = arena.get_fingerprint_by_id("does-not-exist")
  >>> missing_fp
  >>> missing_fp is None
  True

Internally this does about what you think it would. It uses the
arena's ``id`` list to make a lookup table mapping identifier to
index, and caches the table for later use. Given the index, it's
very easy to get the fingerprint.

In fact, you can get the index and do the record lookup yourself::

  >>> arena.get_index_by_id("48500164")
  8168
  >>> arena[8168]
  ('48500164', b'\x07\xde\x0c\x00\x00 .. rest omittted ..'')


Sorting search results
======================

In this section you'll learn how to sort the search results.

The k-nearest searches return the hits sorted from highest score to
lowest, and break ties arbitrarily. This is usually what you want, and
the extra cost to sort is small (k*log(k)) compared to the time needed
to maintain the internal heap (N*log(k)).

By comparison, the threshold searches return the hits in arbitrary
order. Sorting takes up to N*log(N) time, which is extra work for
those cases where you don't want sorted data. If you actually want it
sorted, then call :meth:`.SearchResult.reorder` method to sort the
hits in-place::

  >>> import chemfp
  >>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> query_fp = arena.get_fingerprint_by_id("48500164")
  >>> from chemfp import search
  >>> result = search.threshold_tanimoto_search_fp(query_fp, arena, threshold=0.90)
  >>> len(result)
  5
  >>> result.get_ids_and_scores()
  [('48530223', 0.9044585987261147), ('48533220', 0.9230769230769231),
  ('48533212', 0.9299363057324841), ('48500164', 1.0), ('48501504',
  0.906832298136646)]  
  >>> 
  >>> result.reorder("decreasing-score")
  >>> result.get_ids_and_scores()
  [('48500164', 1.0), ('48533212', 0.9299363057324841), ('48533220',
  0.9230769230769231), ('48501504', 0.906832298136646),
  ('48530223', 0.9044585987261147)]
  >>>
  >>> result.reorder("increasing-score")
  >>> result.get_ids_and_scores()
  [('48530223', 0.9044585987261147), ('48501504', 0.906832298136646),
  ('48533220', 0.9230769230769231), ('48533212', 0.9299363057324841),
  ('48500164', 1.0)]

There are currently eight different sort methods, all specified by
a name string. These are

  * ``increasing-score`` - sort by increasing score
  * ``increasing-score-plus`` - sort by increasing score, breaking ties on
    increasing index
  * ``decreasing-score`` - sort by decreasing score 
  * ``decreasing-score-plus`` - sort by decreasing score, breaking ties on
    increasing index
  * ``increasing-index`` - sort by increasing target index
  * ``decreasing-index`` - sort by decreasing target index
  * ``reverse`` - reverse the current ordering
  * ``move-closest-first`` - move the hit with the highest score to the first position

The first two sort by increasing score. The first one only considers
the scores when sorting, and preserves the original order of the
indices for hits with the same score. The second one sorts by
increasing score and, if multiple hits have the same score, ties are
broken by increasing index. This gives an absolute ordering to the
results, independent of history, but is a bit slower than the first
option.

The third and fourth options sort by decreasing score. The third one
only considers the scores when sorting, and preserves the original
order of the indices for hits with the same score. The fourth one
sorts by decreasing score and, if multiple hits have the same score,
ties are broken by increasing index. This gives an absolute ordering
to the results, independent of history, but is a bit slower than the
third option.

.. NOTE::
   Before chemfp 3.5 the ``increasing-score`` and ``decreasing-score``
   implementations gave an absolute order (the same as
   ``increasing-score-plus`` and ``decreasing-score-plus`` do
   now). This implementation detail was never documented. The behavior
   was changed to allow a faster sort implementation which does not
   break ties for the same score.

If you find something useful for ``increasing-index`` and
``decreasing-index``, let me know. They sort exactly like their names
describe.

The ``reverse`` method reverses the current ordering, and is mostly
useful if you want to reverse the sorted results from a k-nearest
search.

The ``move-closest-first`` option exists to improve the leader
algorithm stage used by the Taylor-Butina algorithm. The newly seen
compound is either in the same cluster as its nearest neighbor or it
is the new centroid. I felt it best to implement this as a special
reorder term, rather than one of the other possible options.

If you have suggestions for alternate orderings which might help
improve your clustering performance, let me know.

If you want to reorder all of the search results then you could use
the :meth:`.SearchResult.reorder` method on each result, but it's
easier to use :meth:`.SearchResults.reorder_all` and change everything
in a single call. It takes the same ordering names as ``reorder``::

  >>> from __future__ import print_function # Only for Python 2
  >>> similarity_matrix = search.threshold_tanimoto_search_symmetric(
  ...                         arena, threshold=0.8)
  >>> for query_id, row in zip(arena.ids, similarity_matrix):
  ...   print(query_id, "->", row.get_ids_and_scores()[:3])
  ...
  48942244 -> []
  48941399 -> []
  48940284 -> []
  48943050 -> []
  48656359 -> [('48656867', 0.9761904761904762), ('48656360', 0.9111111111111111), ('48650490', 0.851063829787234)]
  48656867 -> [('48656360', 0.8913043478260869), ('48650490', 0.8333333333333334), ('48521769', 0.8)]
  48839855 -> [('48839869', 0.9148936170212766), ('48839845', 0.8775510204081632), ('48839868', 0.8269230769230769)]
           ... lines deleted ....
  >>> 
  >>> similarity_matrix.reorder_all("increasing-score") 
  >>> for query_id, row in zip(arena.ids, similarity_matrix):
  ...   print(query_id, "->", row.get_ids_and_scores()[:3])
  ... 
  48942244 -> []
  48941399 -> []
  48940284 -> []
  48943050 -> []
  48656359 -> [('48680086', 0.803921568627451), ('48693263', 0.803921568627451), ('48693634', 0.803921568627451)]
  48656867 -> [('48521769', 0.8), ('48521768', 0.803921568627451), ('48653206', 0.803921568627451)]
  48839855 -> [('48839868', 0.8269230769230769), ('48839845', 0.8775510204081632), ('48839869', 0.9148936170212766)]
           ... lines deleted ....

For display purposes, I used ``[:3]`` to display only the first three
matches. In the first block the results are in arbitrary order, while
in the second the elements are sorted so the smallest score is first.


Working with raw scores and counts in a range
=============================================

In this section you'll learn how to get the hit counts and raw scores
for an interval.

The length of a :class:`.SearchResult` is the number of
hits it contains::

  >>> import chemfp
  >>> from chemfp import search
  >>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
  >>> fp = arena.get_fingerprint_by_id("48500164")
  >>> result = search.threshold_tanimoto_search_fp(fp, arena, threshold=0.2)
  >>> len(result)
  14888

This gives you the number of hits at or above a threshold of 0.2,
which you can also get by doing
:func:`chemfp.search.count_tanimoto_hits_fp`::

  >>> search.count_tanimoto_hits_fp(fp, arena, threshold=0.2)
  14888

The advantage to the first version is the result also stores the hits.
You can query the hit to get the number of hits which are within a
specified interval. Here are the counts of the number of hits at or
above 0.5, 0.80, and 0.95::

  >>> result.count(0.5)
  7785
  >>> result.count(0.8)
  42
  >>> result.count(0.95)
  1

The first parameter, *min_score*, specifies the minimum threshold. If
not specified it's -infinity. The second, *max_score*, specifies the
maximum, and is +infinity if not specified. Here's how to get the
number of hits with a score of at most 0.95 and 0.5::

  >>> result.count(max_score=0.95)
  14887
  >>> result.count(max_score=0.5)
  7209

If you double-check the math, and add the number above 0.5 (7785) and
the number below 0.5 (7209) you'll get 14994, even through there are
only 14888 records. The extra 106 is because by default the count
interval uses a closed range. There are 106 hits with a score of
exactly 0.5::

  >>> result.count(0.5, 0.5)
  106

The third parameter, *interval*, specifies the end conditions. The
default is "[]" which means that both ends are closed. The interval
"()" means that both ends are open, and "[)" and "(]" are the two
half-open/half-closed ranges. To get the number of hits below 0.5 and
the number of hits at or above 0.5 then you might use:

  >>> result.count(None, 0.5, "[)")
  7103
  >>> result.count(0.5, None, "[]")
  7785
  >>> 7103+7785
  14888

This total matches the expected count. (A min or max of `None` means
-infinity and +infinity, respectively.)

Cumulative search result counts and scores
==========================================

In this section you'll learn some more advanced ways to work with
SearchResults and SearchResult instances.

I wanted to title this section "Going to SEA", but decided to use a
more descriptive name. "SEA" refers to the "Similarity Ensemble
Approach" (SEA) work of Keiser, Roth, Armbruster, Ernsberger, and
Irwin. The paper is available online from http://sea.bkslab.org/ ,
though I won't actually implement it here. Why do I mention it?
Because these chemfp methods were added specifically to make it easier
to support a SEA implementation for one of the chemfp customers.

Suppose you have two sets of structures. How well do they compare to
each other? I can think of various ways to do it. One is to look at a
comparison profile. Find all NxM comparisons between the two sets. How
many of the hits have a threshold of 0.2? How many at 0.5? 0.95?

If there are "many", then the two sets are likely more similar than
not. If the answer is "few", then they are likely rather distinct.

I'll be more specific. I want to know if the coenzyme A-like
structures in ChEBI are more similar to the penicillin-like structures
than one would expect by comparing two randomly chosen subsets. To
quantify "similar", I'll use Tanimoto similarity of the
"chebi_maccs.fps" fingerprints, which are the 166 MACCS key-like
fingerprints from RDMACCS for the `ChEBI data set
<ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/ChEBI_lite.sdf.gz>`_.
See :ref:`chebi_fingerprints` for details about why I use the
:option:`--id-tag` options:

.. code-block:: bash
   
  # Use one of the following to create chebi_maccs.fps
  oe2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
  ob2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
  rdkit2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
  cdk2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps

I used oe2fps to create RDMACCS-OpenEye fingerprints.

The CHEBI id for coenzyme A is CHEBI:15346 and for penicillin is
CHEBI:17334. I'll define the "coenzyme A-like" structures as the 256
structures where the fingerprint is at least 0.95 similar to coenzyme
A, and "penicillin-like" as the 24 structures at least 0.85 similar to
penicillin. This gives 6144 total comparisons.

You know enough to do this, but there's a nice optimization I haven't
told you about. You can get the total count of all of the threshold
hits using the :meth:`chemfp.search.SearchResults.count_all` method
instead of looping over each SearchResult and calling
:meth:`chemfp.search.SearchResult.count`:

.. code-block:: python 
    
  from __future__ import print_function # Only for Python 2
  import chemfp
  from chemfp import search
  
  def get_neighbors_as_arena(arena, id, threshold):
      fp = arena.get_fingerprint_by_id(id)
      neighbor_results =  search.threshold_tanimoto_search_fp(fp, chebi, threshold=threshold)
      neighbor_arena = arena.copy(neighbor_results.get_indices())
      return neighbor_arena
  
  chebi = chemfp.load_fingerprints("chebi_maccs.fps")
  
  # Find the 256 neighbors of coenzyme A
  coA_arena = get_neighbors_as_arena(chebi, "CHEBI:15346", threshold=0.95)
  print(len(coA_arena), "coenzyme A-like structures")
  
  # Find the 24 neighbors of penicillin
  penicillin_arena = get_neighbors_as_arena(chebi, "CHEBI:17334", threshold=0.85)
  print(len(penicillin_arena), "penicillin-like structures")
  
  # I'll compute a profile at different thresholds
  thresholds = [0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
  
  # Compare the two sets. (For this case the speed difference between a threshold
  # of 0.25 and 0.0 is not noticible, but having it makes me feel better.)
  coA_against_penicillin_result = search.threshold_tanimoto_search_arena(
      coA_arena, penicillin_arena, threshold=min(thresholds))
  
  # Show a similarity profile
  print("Counts  coA/penicillin")
  for threshold in thresholds:
      print(" %.2f      %5d" % (threshold,
                                coA_against_penicillin_result.count_all(min_score=threshold)))

This gives a not very useful output:

.. code-block:: none 
    
  272 coenzyme A-like structures
  24 penicillin-like structures
  Counts  coA/penicillin
   0.30       6528
   0.35       6528
   0.40       6523
   0.45       4403
   0.50       1193
   0.55          0
   0.60          0
   0.70          0
   0.80          0
   0.90          0
   0.95          0

It's not useful because it's not possible to make any decisions from
this. Are the numbers high or low? It should be low, because these are
two quite different structure classes, but there's nothing to compare
it against.

I need some sort of background reference. What I'll do is construct
two randomly chosen sets, one with 256 fingerprints and the other with
24, and generate the same similarity profile with them. That isn't
quite fair, since randomly chosen sets will most likely be
diverse. Instead, I'll pick one fingerprint at random, then get its
256 or 24, respectively, nearest neighbors as the set members (place
the following code at the end of the file with the previous code):

.. code-block:: python 
    
  # Get background statistics for random similarity groups of the same size
  import random
    
  # Find a fingerprint at random, get its k neighbors, return them as a new arena
  def get_random_fp_and_its_k_neighbors(arena, k):
      fp = arena[random.randrange(len(arena))][1]
      similar_search = search.knearest_tanimoto_search_fp(fp, arena, k)
      return arena.copy(similar_search.get_indices())

I'll construct 1000 pairs of sets this way, accumulate the threshold
profile, and compare the CoA/penicillin profile to it:

.. code-block:: python 
    
  # Initialize the threshold counts to 0
  total_background_counts = dict.fromkeys(thresholds, 0)
    
  REPEAT = 1000
  for i in range(REPEAT):
      # Select background sets of the same size and accumulate the threshold count totals
      set1 = get_random_fp_and_its_k_neighbors(chebi, len(coA_arena))
      set2 = get_random_fp_and_its_k_neighbors(chebi, len(penicillin_arena))
      background_search = search.threshold_tanimoto_search_arena(set1, set2, threshold=min(thresholds))
      for threshold in thresholds:
          total_background_counts[threshold] += background_search.count_all(min_score=threshold)
    
  print("Counts  coA/penicillin  background")
  for threshold in thresholds:
      print(" %.2f      %5d          %5d" % (threshold,
                                             coA_against_penicillin_result.count_all(min_score=threshold),
                                             total_background_counts[threshold] / (REPEAT+0.0)))

Your output should now have something like this at the end:

.. code-block:: none 
    
  Counts  coA/penicillin  background
   0.30       6528           2798
   0.35       6528           2273
   0.40       6523           1789
   0.45       4403           1301
   0.50       1193            988
   0.55          0            656
   0.60          0            411
   0.70          0            160
   0.80          0             54
   0.90          0             15
   0.95          0              0

This is a bit hard to interpret. Clearly the coenzyme A and penicillin
sets are not closely similar, but for low Tanimoto scores the
similarity is higher than expected. That difficulty is okay for now
because I mostly wanted to show an example of how to use the chemfp
API. If you want to dive deeper into this sort of analysis then read a
three-part series I wrote at
http://www.dalkescientific.com/writings/diary/archive/2017/03/20/fingerprint_set_similarity.html
on using chemfp to build a target set association network using ChEMBL.


The SEA paper actually wants you to use the raw score, which is the
sum of the hit scores in a given range, and not just the number of
hits. No problem! Use :meth:`.SearchResult.cumulative_score` for the
cumulative scores for an individual result, or
:meth:`.SearchResults.cumulative_score_all` for the cumulative scores
across all of the results. The two functions compute *almost*
identical values for the whole data set::

  >>> sum(row.cumulative_score(min_score=0.5, max_score=0.9)
  ...             for row in coA_against_penicillin_result)
  605.5158868869943
  >>> coA_against_penicillin_result.cumulative_score_all(min_score=0.5, max_score=0.9)
  605.5158868869953

The cumulative methods, like the count method you learned about in the
previous section, also take the *interval* parameter for when you
don't want the default of "[]".

You may wonder why these two values aren't exactly the same. They
differ because floating point addition is not associative. The first
computes the sum for each result, then the sum of sums. The second
computes the sum by adding each score to the cumulative sum.

I get a different result if I sum up the values in reverse order::
  
  >>> sum(list(row.cumulative_score(min_score=0.5, max_score=0.9)
  ...                for row in coA_against_penicillin_result)[::-1])
  605.5158868869959

Which is the "right" score? The ``cumulative_score_all()`` method at
least matches the one you might write if you computed the sum
directly::

  >>> total_score = 0.0
  >>> for row_scores in coA_against_penicillin_result.iter_scores():
  ...   for score in row_scores:
  ...     if 0.5 <= score <= 0.9:
  ...       total_score += score
  ...
  >>> total_score
  605.5158868869953


Writing fingerprints with a fingerprint writer
==============================================

In this section you'll learn how to create a fingerprint file using
the chemfp fingerprint writer API.

You probably don't need this section. In most cases you can save the
contents of an FPS reader or fingerprint arena by using the
:meth:`.FingerprintReader.save` method, as in the following examples:

.. code-block:: python
   
  chemfp.open("pubchem_targets.fps").save("example.fps")
  chemfp.open("pubchem_targets.fps").save("example.fpb")
  chemfp.open("pubchem_targets.fpb").save("example.fps.gz")

The structure-based fingerprint readers also implement the ``save``
method so you could simply write:

.. code-block:: python
  
  import chemfp
  reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_099000001_099500000.sdf.gz")
  reader.save("example.fps")  # or "example.fpb"

However, if you generate the fingerprints yourself, or want more
fine-grained control over the writer parameters then read on!

(If you don't have RDKit installed then use "OpenBabel-MACCS" for Open
Babel's MACCS fingerprints, "OpenEye-MACCS166" for OpenEye's, or
"CDK-MACCS" for CDK's.)

Here's an example of the fingerprint writer API. I open the writer,
ask it to write a fingerprint id and the fingerprint, and then close
it.

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer("example.fps")
  >>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
  >>> writer.close()

I'll ask Python to read the file and print the contents::

  >>> from __future__ import print_function # Only for Python 2
  >>> print(open("example.fps").read())
  #FPS1
  0000000000030201      ABC123

Of course you don't need to use chemfp to write this file. It's simple
enough that you could get the same result in fewer lines of normal
Python code. The advantage starts to be useful when you want to
include metadata.

  >>> metadata = chemfp.Metadata(num_bits=64, type="Example-FP/0")
  >>> writer = chemfp.open_fingerprint_writer("example.fps", metadata)
  >>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
  >>> writer.close()
  >>> 
  >>> print(open("example.fps").read())
  #FPS1
  #num_bits=64
  #type=Example-FP/0
  0000000000030201	ABC123

Even then, native Python code is probably easier to use if you know
what the header lines will be, because it's a bit of a nuisance to
create the :class:`chemfp.Metadata` yourself.

On the other hand, if you have a chemfp fingerprint type you can just
ask it for the correct metadata instance::

  >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
  >>> metadata = fptype.get_metadata()
  >>> metadata
  Metadata(num_bits=166, num_bytes=21, type='RDKit-MACCS166/2',
  aromaticity=None, sources=[], software='RDKit/2019.09.1
  chemfp/3.4', date='2020-05-13T13:34:37')

Putting the two together, and switching to a 21 byte fingerprint
instead of an 8 byte fingerprint, gives:

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp 
  >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
  >>> writer = chemfp.open_fingerprint_writer("example.fps", fptype.get_metadata())
  >>> writer.write_fingerprint("ABC123", b"\0\1\2\3\4\5\6\7\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14")
  >>> writer.close()
  >>> 
  >>> print(open("example.fps").read())
  #FPS1
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #date=2020-05-13T13:35:23
  000102030405060708090a0b0c0d0e0f1011121314	ABC123

In real life that fingerprint comes from somewhere. The high-level
structure-based fingerprint reader has a handy ``metadata`` attribute::

  >>> filename = "Compound_099000001_099500000.sdf.gz"
  >>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
  >>> print(reader.metadata)
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #date=2020-05-13T13:36:11

By the way, note that this includes the source filename, which
:meth:`.FingerprintType.get_metadata` can't automatically do. (See
:ref:`merging_multiple_fingerprints` for an example of how to pass
that information to get_metadata().)

A structure-based fingerprint reader is just like any other reader, so
you can iterate over the (id, fingerprint) pairs::

  >>> from chemfp import bitops
  >>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
  >>> for count, (id, fp) in enumerate(reader):
  ...   print(id, "=>", bitops.hex_encode(fp))
  ...   if count == 5:
  ...     break
  ... 
  99000039 => 000004000000300001c0004e9361b051dce1676e1f
  99000230 => 000000800100649f0445a7fe2aeab1fb8f6bdfff1f
  99002251 => 00000000001132000088004985601150dce4e3fe1f
  99003537 => 00000000200020000156149a906994930c3159ae1f
  99003538 => 00000000200020000156149a906994930c3159ae1f
  99005028 => 00000000000000008000004e84683ca49100f7fa1f

You probably already see how to combine this with
:meth:`.FingerprintWriter.write_fingerprint` to generate the FPS
output. The key part would look like:

.. code-block:: python
  
  for id, fp in reader:
    writer.write_fingerprint(id, fp)

While that would work, there's a better way. The chemfp fingerprint
writer has a :meth:`.FingerprintWriter.write_fingerprints` method which
takes a list or iterator of (id, fingerprint) pairs. Here's a better
way to write the code:

.. code-block:: python
  
  import chemfp
  filename = "Compound_099000001_099500000.sdf.gz"
  reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
  writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
  writer.write_fingerprints(reader)
  writer.close()
  reader.close()
  # Note: See the next section for an even better  solution
  # which uses a context manager.

This produces output which starts:

.. code-block:: none
  
  #FPS1
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #date=2020-05-13T13:38:31
  000004000000300001c0004e9361b051dce1676e1f	99000039
  000000800100649f0445a7fe2aeab1fb8f6bdfff1f	99000230
  00000000001132000088004985601150dce4e3fe1f	99002251
  00000000200020000156149a906994930c3159ae1f	99003537


Why is ``write_fingerprints`` "better" than multiple calls to
``write_fingerprint``? I think it more directly describes the goal of
writing all of the fingerprints, rather than the mechanics of
unpacking and repacking the (id, fingerprint) pairs. I had hoped that
there would be performance improvement, because there's less Python
function call overhead, but my timings show no differences.

However, there's a still better way, which is to use a context manager
to close the files automatically, rather than calling ``close()``
explicitly. I'll leave that for the next section.


Fingerprint readers and writers are context managers
====================================================

In this section you'll learn how the fingerprint readers and writers
can be used as a `context manager <https://www.python.org/dev/peps/pep-0343/>`_.

The previous section ended with the following code:

.. code-block:: python
  
  import chemfp
  filename = "Compound_099000001_099500000.sdf.gz"
  reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
  writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
  writer.write_fingerprints(reader)
  writer.close()
  reader.close()

This reads a PubChem file with RDKit, generates MACCS fingerprints,
and saves the results to "example.fps".

The two :meth:`.FingerprintWriter.close` lines ensure that the reader
and writer files are closed. This isn't required for a simple script,
because Python will close the files automatically at the end of the
script, or when the garbage collector kicks in.

However, since the writer may buffer the output, you have to close the
file before you or another program can read it. It's good practice to
always close the file when you're done with it, as otherwise there are
ways to get really confused about why you don't have a complete file.

Even with the explicit ``close`` calls, if there's an exception in
:meth:`.FingerprintWriter.write_fingerprints` then the files will be
left open. In older-style Python this was handled with a try/finally
block, but that's verbose. Instead, chemfp's readers and writers
implement modern Python's context manager API, to make it easier to
close files automatically at just the right place. Here's what the
above looks like with a context manager:

.. code-block:: python
  
  import chemfp
  filename = "Compound_099000001_099500000.sdf.gz"
  with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    with chemfp.open_fingerprint_writer("example.fps", reader.metadata) as writer:
      writer.write_fingerprints(reader)


Isn't that nice and short? Just bear in mind that it's even more
succinctly written as:

.. code-block:: python
  
  import chemfp
  filename = "Compound_099000001_099500000.sdf.gz"
  with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    reader.save("example.fps")


Write fingerprints to stdout or a file-like object
==================================================

In this section you'll learn how to write fingerprints to stdout, and
how to write them to a BytesIO instance.

The previous section showed examples of passing a filename string to
:func:`chemfp.open_fingerprint_writer`. If the ``filename`` argument
is None then the writer will write to stdout in uncompressed FPS
format::

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer(None,
  ...     chemfp.Metadata(num_bits=16, type="Experiment/1"))
  #FPS1
  #num_bits=16
  #type=Experiment/1
  >>> writer.write_fingerprint("QWERTY", b"AA")
  4141    QWERTY
  >>> writer.write_fingerprint("SHRDLU", b"\0\1")
  0001    SHRDLU
  >>> writer.close()

The ``filename`` argument may also be a file-like object, which is
defined as any object which implements the method ``write(s)`` where
``s`` is a byte string. A `io.BytesIO
<https://docs.python.org/2/library/io.html#io.BytesIO>`_ instance is
one such file-like object. It gives access to the output as a byte
string::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> from io import BytesIO
  >>> f = BytesIO()
  >>> writer = chemfp.open_fingerprint_writer(f, chemfp.Metadata(num_bits=16, type="Experiment/1"))
  >>> print(f.getvalue().decode("utf8"))  # convert byte string to text
  #FPS1
  #num_bits=16
  #type=Experiment/1
  
  >>> writer.write_fingerprint("ETAOIN", b"00")
  >>> writer.close()
  >>> print(f.getvalue().decode("utf8")) # convert byte string to text
  #FPS1
  #num_bits=16
  #type=Experiment/1
  3030	ETAOIN

You can see that closing the fingerprint writer does not close the
underlying file-like object. (If it did then you couldn't get access
to the string content, which gets deleted when the StringIO is
closed.)

You can also write an FPB file to a file-like object, if it supports
``seek()`` and ``tell()`` and binary writes. This means that you
cannot write an FPB format to stdout, but you can write it to a
BytesIO instance.

  >>> import chemfp
  >>> from io import BytesIO
  >>> f = BytesIO()
  >>> writer = chemfp.open_fingerprint_writer(f, format="fpb")
  >>> writer.write_fingerprint("ID123", b"\x01\xfe")
  >>> writer.close()
  >>> len(f.getvalue())
  2269

Writing fingerprints to an FPB file
===================================

In this section you'll learn how to write an FPB file.

The FPS file is a text format which was designed to be easy to read
and write. The FPB file is a binary format which is designed to be
fast to load. Internally it stores the fingerprints in a way which can
be mapped directly to the arena data structure. However, writing this
format yourself is not easy.

Instead, let chemfp do it for you. With the
:func:`chemfp.open_fingerprint_writer` function, the difference
between writing an FPS file and an FPB file is a matter of changing
the extension. Here's a simple example::

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer("simple.fpb")
  >>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
  >>> writer.close()

Almost all you need to know is to use the ".fpb" extension instead of
".fps". The rest of this section goes into low-level details that
might be enlightening, but probably aren't that directly useful for
most people.

It's hard to show the content of the FPB file, because it is
binary. I'll do a character dump to show the first 96 bytes:

.. code-block:: console 
  
  % od -c simple.fpb
  0000000    F   P   B   1  \r  \n  \0  \0  \r  \0  \0  \0  \0  \0  \0  \0
  0000020    M   E   T   A   #   n   u   m   _   b   i   t   s   =   1   6
  0000040   \n   #  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
  0000060   \0  \b  \0  \0  \0 002  \0  \0 001 002  \0  \0  \0  \0  \0  \0
  0000100    Z   Z  \0  \0  \0  \0  \0  \0 377 377  \0  \0  \0  \0  \0  \0
  0000120    H  \0  \0  \0  \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0  
    ...

The first eight bytes are the file signature. Following that are a set
of blocks, with eight bytes for the length, a four byte block type
name, and then the block content. Here you can see the "META"data
block, followed by the "AREN"a block containing the fingerprint data,
followed by the start of the "POPC"ount block with the popcount index
information.

That's probably a bit too much detail for you. I'll use chemfp to read
the file and show the contents:

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> reader = chemfp.open("simple.fpb")
  >>> print(reader.metadata)
  #num_bits=16
  
  >>> from chemfp import bitops
  >>> for id, fp in reader:
  ...   print(id, "=>", bitops.hex_encode(fp))
  ... 
  third => 0102
  second => 5a5a
  first => ffff

Unlike the FPS format, the FPB format requires a *num_bits* in the
metadata. Since I didn't give the writer that information, it figured
it out from the number of bytes in the first written fingerprint.

You can see that record order is different than the input order. While
the FPS fingerprint writer preserves input order, the FPB writer will
reorder the records by population count, so the records with fewer
'on' bits come first. It then creates a popcount index, to mark the
start and end location of all of the fingerprints with a given
popcount. This is used to pre-compute the popcount for a fingerprint,
and quickly reject some of the similarity search space.

Use the *reorder* parameter to control if the fingerprints should be
reordered. The default is True, and False will preserve the input
order::

  >>> writer = chemfp.open_fingerprint_writer("simple.fpb", reorder=False)
  >>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
  >>> writer.close()
  >>> 
  >>> reader = chemfp.open("simple.fpb")
  >>> for id, fp in reader:
  ...   print(id, "=>", bitops.hex_encode(fp))
  ... 
  first => ffff
  second => 5a5a
  third => 0102

You might think it's a bit useless to preserve input order, because
the performance won't be as fast. It's actually proved useful for one
project, where the targets were broken up into clusters, and cluster
membership was done using a SEA analysis. Rather than have a few dozen
separate fingerprint files, I stored everything in the same file
(including duplicate fingerprints), and used a configuration file
which specified the cluster name and its range in the file. This made
it a lot easier to organize the data, and since there were only a few
thousand fingerprints sublinear search performance wasn't needed.

The FPB fingerprint writer also has an *alignment* option. If you look
very carefully at the character dump you can see that the fingerprints
are eight byte aligned:

.. code-block:: none 
  
  0000040   \n   #  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
  0000060   \0  \b  \0  \0  \0 002  \0  \0 001 002  \0  \0  \0  \0  \0  \0
  0000100    Z   Z  \0  \0  \0  \0  \0  \0 377 377  \0  \0  \0  \0  \0  \0
  0000120    H  \0  \0  \0  \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0

The "AREN" is the start of the arena block, the next four bytes ("002
\0 \0 \0 \0") are the number of bytes in a fingerprint, in this
case 2. The four bytes after that ("\b \0 \0 \0") are the number of
bytes allocated for each fingerprint; "\b" is the escape code for
backspace, or ASCII 8. Yes, 8 bytes are used even though the
fingerprints only have 2 bytes in them. This is because the FPB format
expects to be able to use the 8 byte "POPC" assembly instruction, if
available, because that has the fastest performance.

After the storage size field is a byte for the spacer length. The
"002" means two NUL spacer characters follow. This is used to put the
start of the first fingerprint on the eight byte boundary, so there
will be no alignment issues with using the POPC instruction. (This is
not that important for recent Intel processors, but Intel isn't the
only processor in the world.)

Finally you see the fingerprints; the first fingerprint is "001 002",
followed by six NUL characters to fill up the 8 bytes of storage, the
second is "Z Z" followed by six more NUL pad characters, etc.

If you are really working with a two byte fingerprint, then six NUL
characters is likely a waste of space. You can ask chemfp to use a two
byte alignment instead::

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer("simple.fpb", alignment=2)
  >>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
  >>> writer.close()

giving:

.. code-block:: console 
  
  % od -c simple.fpb
  0000000    F   P   B   1  \r  \n  \0  \0  \r  \0  \0  \0  \0  \0  \0  \0
  0000020    M   E   T   A   #   n   u   m   _   b   i   t   s   =   1   6
  0000040   \n 017  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
  0000060   \0 002  \0  \0  \0  \0 001 002   Z   Z 377 377   H  \0  \0  \0
  0000100   \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0  \0  \0  \0  \0

If you stare at it long enough you'll see that the storage size is now
two bytes, and that the fingerprints are arranged without any
padding. (Actually, since chemfp's two byte popcount uses character
pointers, you could even use 1 byte alignment without a performance
hit. But all this will do is save you at most one byte of spacer.)

Going in the other direction, it's possible to specify up to 256 bytes
of alignment. This is far beyond any conceivable use. Even the AVX
instructions need only 256 bits, or 32 byte alignment, and that's not
a requirement, only a performance optimization to avoid a cache line
split.

(If some future instruction set needs a larger alignment then the FPB
format acquire a new block type which provides the right alignment.)


Specify the output fingerprint format
=====================================

In this section you'll learn about the *format* option to the
fingerprint writer.

By default :func:`chemfp.open_fingerprint_writer` uses the destination
filename's extension to determine if it should write an FPS file
(".fps"), a gzip compressed FPS file (".fps.gz"), a zstandard
compressed FPS file (".fps.zst") or an FPB file (".fpb"). If it
doesn't recognize the extension, or if the filename is None (to write
to stdout) then it will assume the FPS format.

If the destination is a file-like object then things become a bit more
complicated. If the object has a ``name`` attribute, which is the case
with real file objects, then that will be examined for any known
extension. That's why the following writes the output in fps.gz
format::

  >>> import chemfp
  >>> f = open("example.fps.gz", "wb") # must be in binary mode!
  >>> writer = chemfp.open_fingerprint_writer(f)
  >>> writer.write_fingerprint("ABC", b"\0\0\0\0")
  >>> writer.close()
  >>> f.close()
  >>> open("example.fps.gz", "rb").read() # must be in binary mode!
  b"\x1f\x8b\x08\x08K\xfc\xbb^\x02\xffexample.fps\x00S ... mode deleted
  >>>
  >>> import gzip
  >>> print(gzip.open("example.fps.gz").read())
  b'#FPS1\n00000000\tABC\n'
  >>> print(gzip.open("example.fps.gz").read().decode("utf8"))
  #FPS1
  00000000        ABC

There's a large amount of magic behind the scenes to connect the
filename in the Python open() call to the chemfp output format.

The other solution is to just tell it which format to use, with the
*format* parameter. For example, if you want to send the output to
stdout in gzip compressed FPS format then do:

.. code-block:: python 
    
  writer = chemfp.open_fingerprint_writer(None, format="fps.gz")

If you want to save an FPB file to a BytesIO instance then do:

.. code-block:: python
    
  from io import BytesIO
  f = BytesIO()
  writer = chemfp.open_fingerprint_writer(f, format="fpb")

And if you really want to save to a file with an ".fpb" extension but
have it as an FPS file, then do:

.. code-block:: python
    
  writer = chemfp.open_fingerprint_writer("really_an_fps_file.fpb", format="fps")

But that would be silly.

.. _merging_multiple_fingerprints:

Merging multiple structure-based fingerprint sources
====================================================

In this section you'll learn how to merge multiple fingerprint scores
into a single file, and include the full list of source filenames.

The structure-based fingerprint readers include a source filename in
the metadata::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> filename = "Compound_099000001_099500000.sdf.gz"
  >>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
  >>> print(reader.metadata)
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #date=2020-05-13T13:57:58

If you have a single input file and a single output file then you can
save the reader to an FPS or FPB file directly::

  >>> reader.save("example.fpb")
  >>> reader.close()

Strictly speaking, the ``close()`` is rarely necessary as the garbage
collector will close the file during finalization. Still, it's good
practice to close file, and to use a context manager to ensure that
the file is always closed. Here's what that looks like::

  >>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
  ...     reader.save("example.fpb")
  
However you create it, the output file will have the original
metadata::

  >>> arena = chemfp.open("example.fpb")
  >>> print(arena.metadata)
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #date=2020-05-13T13:58:42

What happens if you want to want to merge multiple files? How does the
output fingerprint file get the correct metadata?

I'll demonstrate the problem by computing fingerprints from two
structure files. I'll get the fingerprint type and ask it to create a
metadata instance::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp 
  >>> filenames = ["Compound_099000001_099500000.sdf.gz", "Compound_048500001_049000000.sdf.gz"]
  >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
  >>> print(fptype.get_metadata())
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #date=2020-05-13T14:00:13

The problem is that I also want to include the filenames as source
fields in the metadata. The fingerprint type doesn't have this
information. Instead, I'll them in through the *sources* parameter,
which takes a string or a list of strings::

  >>> metadata = fptype.get_metadata(sources=filenames)
  >>> print(metadata)
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-13T14:00:34

What remains is to pass this metadata to the fingerprint writer, then
loop through the structure filenames to compute the fingerprints and
send them to the writer::

  >>> with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
  ...   for filename in filenames:
  ...     with fptype.read_molecule_fingerprints(filename) as reader:
  ...       writer.write_fingerprints(reader)
  ... 

Here's a quick check to see that the metadata was saved correctly::

  >>> print(chemfp.open("example.fpb").metadata)
  #num_bits=166
  #type=RDKit-MACCS166/2
  #software=RDKit/2019.09.1 chemfp/3.4
  #source=Compound_099000001_099500000.sdf.gz
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-13T14:00:34

If your toolkit can't parse one of the records then it will raise an
exception. You likely want it to ignore errors, which you can do with
the *errors* option to :func:`chemfp.read_molecule_fingerprints`. The
final code for this section looks like:

.. code-block:: python
  
  import chemfp
  
  filenames = ["Compound_099000001_099500000.sdf.gz", "Compound_048500001_049000000.sdf.gz"] 
  
  fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
  metadata = fptype.get_metadata(sources=filenames)
  
  with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
    for filename in filenames:
      with fptype.read_molecule_fingerprints(filename, errors="ignore") as reader:
        writer.write_fingerprints(reader)


Merging multiple fingerprint files
==================================

In this section you'll learn how to make a modified copy of a metadata
instance.

The previous section merged multiple structure-based fingerprints, and
used the fingerprint type to get the correct metadata instance.

What if you want to merge several existing fingerprint files, and
those use a fingerprint type that chemfp doesn't understand? In that
case there is no chemfp fingerprint type, and therefore no
:meth:`~.FingerprintType.get_metadata` method to call. Instead, you'll
need some other way to make a :class:`chemfp.Metadata` instance.

I'll work through a solution, and start by using :ref:`sdf2fps
<sdf2fps>` to extract the PubChem/CACTVS fingerprints from two PubChem
SD files:

.. code-block:: console 
  
  % sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o Compound_099000001_099500000.fps
  % sdf2fps --pubchem Compound_048500001_049000000.sdf.gz -o Compound_048500001_049000000.fps
  % head -7 Compound_099000001_099500000.fps | fold
  #FPS1
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_099000001_099500000.sdf.gz
  #date=2020-05-13T14:03:21
  07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
  0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
  69001393e338d1017100000000204000000000000010200000000000000000	99000039
  % head -7 Compound_048500001_049000000.fps | fold
  #FPS1
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-13T14:03:34
  07de05000000000000000000000000000080060000000c060000000000001a802f00007800080000
  00b01483f920cc0b6d9309001de0e44e2e004501b48548059099051d2e1911174503998d29041016
  69401313f40801007010000000000000040800000000000002000000000000	48500020

Of course you could just ignore the header data, which is what the
following does:

.. code-block:: python 
  
  import chemfp
  
  filenames = ["Compound_099000001_099500000.fps", "Compound_048500001_049000000.fps"] 
  
  with chemfp.open_fingerprint_writer("merged_pubchem.fps") as writer:
    for filename in filenames:
      with chemfp.open(filename) as reader:
        writer.write_fingerprints(reader)

but then you'll be left with no metadata in the FPS header:

.. code-block:: python
  
  % head -3 merged_pubchem.fps | fold
  #FPS1
  07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
  0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
  69001393e338d1017100000000204000000000000010200000000000000000	99000039
  07de1c000200000000000000000000000080040000003c0200000000000000800300007820080200
  00b034870b604ce0410320421100954a090e43100824040010119971301370664c21addce99c1427
  6b881995e1398a405000010000000000008000000000000000000000000000	99000230

While you could do that, the metadata keeps track of potentially
useful information, so it's better to add it. For that matter,
metadata usually isn't useful until some time after the fingerprints
are generated. People tend to put off writing code until it's needed,
but by then it's too late. I've tried to make chemfp's API easy, to
encourage people to add the right metadata from the start.

There are a couple of ways to add the right metadata. The classic way
is to make your own :class:`chemfp.Metadata` with the right values::

  >>> metadata = chemfp.Metadata(num_bits=881, type="CACTVS-E_SCREEN/1.0 extended=2",
  ...   software="CACTVS/unknown", sources=["Compound_099000001_099500000.sdf.gz",
  ...       "Compound_048500001_049000000.sdf.gz"])
  >>> print(metadata)
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_099000001_099500000.sdf.gz
  #source=Compound_048500001_049000000.sdf.gz

The downside is this requires knowing all of the fields
beforehand. Another option is to :meth:`copy <chemfp.Metadata.copy>`
the metadata from the first fingerprint file, and ask the ``copy()``
to use a new list of sources::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> reader = chemfp.open("Compound_099000001_099500000.fps")
  >>> metadata = reader.metadata.copy()
  >>> metadata.sources
  ['Compound_099000001_099500000.sdf.gz']
  >>> metadata = reader.metadata.copy(sources=[
  ...     u"Compound_099000001_099500000.sdf.gz",
  ...     u"Compound_048500001_049000000.sdf.gz"])
  >>> print(metadata)
  #num_bits=881
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_099000001_099500000.sdf.gz
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-13T14:03:21

Now to put the pieces together. I'll make one pass through the
fingerprint files to get the sources, and then another pass to
generate the output. If you only have a handful of files then this
works nicely::


  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> filenames = ["Compound_099000001_099500000.fps", "Compound_048500001_049000000.fps"]
  >>> sources = []
  >>> for filename in filenames:
  ...   with chemfp.open(filename) as reader:
  ...     sources.extend(source.metadata.sources)
  ...
  >>> sources
  ['Compound_048500001_049000000.sdf.gz', 'Compound_048500001_049000000.sdf.gz']
  >>> metadata = reader.metadata.copy(sources=sources) # use the last reader
  >>> print(metadata)
  #type=CACTVS-E_SCREEN/1.0 extended=2
  #software=CACTVS/unknown
  #source=Compound_048500001_049000000.sdf.gz
  #source=Compound_048500001_049000000.sdf.gz
  #date=2020-05-13T14:03:34
  
  >>> with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
  ...     for filename in filenames:
  ...         with chemfp.open(filename) as reader:
  ...             writer.write_fingerprints(reader)
  ...

This code assumes that the fingerprints are compatible, that is, that
the fingerprints are the same size, and the fingerprint types and
other metadata fields are compatible. The next section shows how to
detect if there are compatibility problems.


Check for metadata compatibility problems
=========================================

In this section you'll learn how to detect compatibility mismatches
between two metadata instances, and between a metadata and a
fingerprint.

In the previous section you learned how to merge multiple fingerprint
files, which all happened to have the same fingerprint type. What
happens if they are different types?

There are actually a few possible problems:

- the fingerprint lengths are different (very bad)
- the fingerprint types are different (probably bad)
- the software is from different versions (probably okay)


The :func:`chemfp.check_metadata_problems` function compares two
metadata objects and returns a list of possible problems::

  >>> from __future__ import print_function # Only for Python 2
  >>> import chemfp
  >>> rdkit_metadata = chemfp.get_fingerprint_type("RDKit-MACCS166").get_metadata()
  >>> openeye_metadata = chemfp.get_fingerprint_type("OpenEye-MACCS166").get_metadata()
  >>> problems = chemfp.check_metadata_problems(rdkit_metadata, openeye_metadata)
  >>> len(problems)
  2
  >>> for problem in problems:
  ...   print(problem)
  ... 
  WARNING: query has fingerprints of type 'RDKit-MACCS166/2' but
  target has fingerprints of type 'OpenEye-MACCS166/3'
  INFO: query comes from software 'RDKit/2020.03.1 chemfp/3.4' but
  target comes from software 'OEGraphSim/2.4.3 (20191016) chemfp/3.4'

In this case the fingerprint types are different, but since the
fingerprint lengths are the same it's not an error, only a
warning. The software field is also not identical, but as that's not
so significant it's listed as "info".

The returned problem objects are :func:`chemfp.ChemFPProblem`
instances, which have useful attributes::

  >>> for problem in problems:
  ...   print("Problem:")
  ...   print("  severity:", problem.severity)
  ...   print("  category:", problem.category)
  ...   print("  description:", problem.description)
  ... 
  Problem:
    severity: warning
    category: type mismatch
    description: query has fingerprints of type 'RDKit-MACCS166/2' but target has fingerprints of type 'OpenEye-MACCS166/3'
  Problem:
    severity: info
    category: software mismatch
    description: query comes from software 'RDKit/2020.03.1 chemfp/3.4' but target comes from software 'OEGraphSim/2.4.3 (20191016) chemfp/3.4'

The idea is that the ``category`` text won't change, so your code can
figure out what's going on, while the ``description`` is subject to
change and hopefully improvement. The severity is one of "info",
"warning" and "error".

  >>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
  >>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
  >>> problems = chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata)
  >>> for problem in problems:
  ...   print(problem)
  ... 
  ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints
  WARNING: query has fingerprints of type 'RDKit-Fingerprint/2 minPath=1
  maxPath=7 fpSize=512 nBitsPerHash=2 useHs=1' but target has
  fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024
  nBitsPerHash=2 useHs=1'
 
A :class:`chemfp.ChemFPProblem` is derived from ``Exception``, so you
can raise it directly if you want:

  >>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata):
  ...   if problem.severity == "error":
  ...     raise problem
  ... 
  Traceback (most recent call last):
    File "<stdin>", line 3, in <module>
  chemfp.ChemFPProblem: ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints

You might have noticed that the error message uses the words "query"
and "target". Chemfp is designed around similarity searches, so I
expect the default to compare query metadata to target metadata.

On the other hand, the previous section merged multiple fingerprint
files, where "query" and "target" don't make sense. Instead, you can
give alternative names via the *query_name* and *target_name*
parameters::

  >>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
  >>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
  >>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata,
  ...                                                "file #1", "file #14"):
  ...   if problem.severity == "error":
  ...     print(problem)
  ... 
  ERROR: file #1 has 512 bit fingerprints but file #14 has 1024 bit fingerprints

I'll use this to update the code from the previous section to raise an
exception on errors, print warnings to stderr, and do nothing about
"info" problems, and add a MACCS fingerprint file to the list of files
to process, so I can show what happens if there's a problem:

.. code-block:: python 

  from __future__ import print_function   # Only for Python 2
  import sys
  import chemfp
  
  filenames = ["Compound_099000001_099500000.fps",
               "Compound_048500001_049000000.fps",
	       "chebi_maccs.fps"]
  
  # Create the correct metadata with all of the sources from all of the files.
  metadata = None
  sources = []
  for filename in filenames:
      with chemfp.open(filename) as reader:
          if metadata is None:
              metadata = reader.metadata.copy()
              first_filename = filename
          else:
              # Check for compatibility problems
              for problem in chemfp.check_metadata_problems(metadata, reader.metadata,
                                                            repr(first_filename),
                                                            repr(filename)):
                  if problem.severity == "error":
                      raise problem
                  elif problem.severity == "warning":
                      sys.stderr.write(str(problem) + "\n")
                  
          sources.extend(reader.metadata.sources)

  if metadata is not None:
    metadata = metadata.copy(sources=sources)
  
  # Merge the files using the new metadata
  with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
      for filename in filenames:
          with chemfp.open(filename) as reader:
              writer.write_fingerprints(reader)
  
When I run that code with the mismatched fingerprint types, I get the
error message::

  Traceback (most recent call last):
    File "x.py", line 23, in <module>
      raise problem
  chemfp.ChemFPProblem: ERROR: 'Compound_099000001_099500000.fps' has 881 bit fingerprints but 'chebi_maccs.fps' has 166 bit fingerprints


I then removed the ``chebi_maccs.fps`` and manually changed the
fingerprint type in Compound_048500001_049000000.fps, so I could
demonstrate what a warning message looks like:

.. code-block:: none 
  
  WARNING: 'Compound_099000001_099500000.fps' has fingerprints of type
  'CACTVS-E_SCREEN/1.0 extended=2' but 'Compound_048500001_049000000.fps'
  has fingerprints of type 'CACTVS-E_SCREEN/1.0 extended=DIFFERENT_VALUE'
  Traceback (most recent call last):
    File "/Users/dalke/cvses/cfp-3x/docs/x.py", line 23, in <module>
      raise problem
  chemfp.ChemFPProblem: ERROR: 'Compound_099000001_099500000.fps' has
  881 bit fingerprints but 'chebi_maccs.fps' has 166 bit fingerprints

(In case you're wondering what the type string means, those are the
actual CACTVS parameters that PubChem uses, according to the CACTVS
author, Wolf-Dietrich Ihlenfeldt.)

Lastly, sometimes the query is a simple byte string. There's not
really much to compare, but you use
:func:`chemfp.check_fingerprint_problems` to see if the fingerprint
length is compatible with a metadata instance::

  >>> import chemfp
  >>> metadata = chemfp.get_fingerprint_type("RDKit-MACCS166").get_metadata()
  >>> chemfp.check_fingerprint_problems(b"\0\0\0\0", metadata)
  [ChemFPProblem('error', 'num_bytes mismatch', 'query contains 4
  bytes but target has 21 byte fingerprints')]

The :ref:`simsearch <simsearch>` command-line tool uses this function
to check if the query fingerprint, which is entered as hex as a
command-line parameter, is compatible with the target fingerprints.


How to write very large FPB files
=================================

In this section you'll learn how to write an FPB file even when
fingerprint data is so large that the intermediate data doesn't all
fit into memory at once.

By default the FPB format will reorder the fingerprints to be in
popcount order.  (Use ``reorder=False`` option to preserve the input
order.) This requires intermediate storage in order to sort all of the
records. By default the writer will use memory for this, but the
implementation may require about two to three times as much memory as
the raw fingerprint size.

That is, if you have 50 million fingerprints, with 1024 bits per
fingerprint, plus 10 bytes for the name, then the fingerprint arena
requires about 6 GiB of memory, plus 0.5 GiB for the ids, and another
~1 GiB for the id lookup table.

That calculation gives the minimum amount of memory needed. The actual
implementation may preallocate up to twice as much memory as the
current size, in order to handle growth gracefully, and there is some
additional overhead. You may be left with the case where you have 12
GiB of RAM, and where the final FPB file is only 8 GiB in size, but
where the intermediate storage requires 15 GiB of RAM.

Or you may want to build that data set on a machine with 6 GiB of RAM,
and copy the result over to the production machine with a lot more
memory.

If that happens, then use the *max_spool_size* option to specify the
maximum number of bytes to store in memory before switching to
temporary files for additional storage. This should be about 1/3 of
the available RAM because there can be two different temporary file
"spools", each of which can use up to *max_spool_size* bytes of RAM.

For example, the following will use at most about 4 GiB of RAM:

.. code-block:: python
    
  writer = chemfp.open_fingerprint_writer(
      "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024)
 
Note: do not make this too small. The merge step opens all of the
temporary files in order to make the final FPB output file. If you
specify a spool size of 50 MiB then you'll end up creating several
hundred files for PubChem, which may exceed the resource limits for
the number of open file descriptors for a process. When that happens
you'll get an exception like:

.. code-block:: none 
    
  IOError: [Errno 24] Too many open files

Where does the FPB writer store the temporary files? It uses Python's
`tempfile module
<https://docs.python.org/2/library/tempfile.html>`_
to create the temporary files in a directory. Quoting from that
documentation, "The default directory is chosen from a
platform-dependent list, but the user of the application can control
the directory location by setting the TMPDIR, TEMP or TMP environment
variables."

Environment variables give one way to specify an alternate
directory. Or you can specify it directly using the *tmpdir*
parameter, as in:

.. code-block:: python 
    
  writer = chemfp.open_fingerprint_writer(
      "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024,
      tmpdir = "/scratch")

This can be very important on some cluster machines with a small local
/tmp but a large networked scratch disk.


FPS fingerprint writer errors
=============================

In this section you'll learn how the FPS fingerprint writer handles
errors, and how to change the error handling behavior.

It's hard but not impossible to have the FPS writer raise an
exception::

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer(None)
  #FPS1
  >>> writer.write_fingerprint("Tab\tHere", b"\0")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "chemfp/fps_io.py", line 550, in write_fingerprint
      raise_tb(err[0], err[1])
    File "chemfp/fps_io.py", line 467, in _fps_writer_gen
      location)
    File "chemfp/io.py", line 87, in error
      _compat.raise_tb(ParseError(msg, location), None)
    File "<string>", line 1, in raise_tb
  chemfp.ParseError: Unable to write an identifier containing a tab: 'Tab\tHere', file '<stdout>', line 1, record #1

The FPS file format simply doesn't support tab characters in the
indentifier, nor newline characters, for that matter. It also doesn't
allow empty identifiers.

As you saw, the default error action is to raise an exception.

Sometimes it's okay to ignore errors. For example, you might process a
large number of structures, where you know that a few of them have
missing, or poorly formed, identifiers, and where it's okay to omit
those records.

The *errors* parameter can be used to change the error handler. The
value of "report" tells the parser to skip failing record and write an
error message written to stderr. The value of "ignore" simply skips
the record::

  >>> writer = chemfp.open_fingerprint_writer(None, errors="report")
  #FPS1
  >>> writer.write_fingerprint("", b"\0\0\0\0")
  ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 1, record #1. Skipping.
  >>> 
  >>> writer = chemfp.open_fingerprint_writer(None, errors="ignore")
  #FPS1
  >>> writer.write_fingerprint("", b"\0")
  >>> writer.write_fingerprint("Tab\tHere", b"\0")

Granted, this feature isn't so important for
:meth:`.FingerprintWriter.write_fingerprint` because catching the
exception isn't hard to do. It's a bit more useful for bulk
conversions with :meth:`.FingerprintWriter.write_fingerprints`, like:

.. code-block:: python
  
  import chemfp
  with chemfp.read_molecule_fingerprints("RDKit-MACCS166", "Compound_099000001_099500000.sdf.gz") as reader:
    with chemfp.open_fingerprint_writer("example.fps", reader.metadata, errors="report") as writer:
      writer.write_fingerprints(reader)

Note that the FPB writer ignores the *errors* parameter and treats all
errors as "strict".


FPS fingerprint writer location
===============================

In this section you'll learn how to get information like the number of
lines and number of records written to an FPS file.

I'll start by saying that this feature isn't all that useful. It
exists because of parallelism to the toolkit structure writers, and I
wanted to experiment to see if it could be useful in the future.

The :class:`FPS fingerprint writer <.FPSWriter>` has a
``location`` attribute. This can be used to get some information about
the state of the output writer. The most basic is the output
filename. If the output is None or an unnamed file object then a fake
filename will be used::

  >>> import chemfp
  >>> writer = chemfp.open_fingerprint_writer("example.fps")
  >>> writer.location.filename
  'example.fps'
  >>> writer = chemfp.open_fingerprint_writer(None)
  #FPS1
  >>> writer.location.filename
  '<stdout>'

At this point the signature line has been written, so the file is at
line 1, but no record have been written::

  >>> writer.location.lineno
  1
  >>> writer.location.recno
  0
  >>> writer.location.output_recno
  0

Each of these values is incremented by one after adding a valid record::

  >>> writer.write_fingerprint("FP001", b"\xA0\xFE")
  a0fe    FP001
  >>> writer.location.lineno
  2
  >>> writer.location.recno
  1
  >>> writer.location.output_recno
  1

If however the record is invalid then the ``recno`` will increase by one
because it's the number of records sent to the writer, but the other
values do not increase because they only change when a record is
written successfully::

  >>> writer.write_fingerprint("", b"\xA0\xFE")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "chemfp/fps_io.py", line 550, in write_fingerprint
      raise_tb(err[0], err[1])
    File "chemfp/fps_io.py", line 475, in _fps_writer_gen
      location)
    File "chemfp/io.py", line 87, in error
      _compat.raise_tb(ParseError(msg, location), None)
    File "<string>", line 1, in raise_tb
  chemfp.ParseError: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 2, record #2
  >>> writer.location.lineno
  2
  >>> writer.location.recno
  2
  >>> writer.location.output_recno
  1

This is perhaps more clearly shown if I try to write four records at
one, where two contain errors, and where I've asked the writer to
"report" errors rather than raise an exception::

  >>> metadata = chemfp.Metadata(type="Experiment/1", software="AndrewDalke/1")
  >>> writer = chemfp.open_fingerprint_writer(None, metadata=metadata, errors="report")
  #FPS1
  #type=Experiment/1
  #software=AndrewDalke/1
  >>> writer.location.lineno
  3
  >>> writer.location.recno
  0
  >>> writer.location.output_recno
  0
  >>> writer.write_fingerprints( [("A", b"\0\0"), ("\t", b"\0\1"), ("", b"\0\2"), ("B", b"\0\3")] )
  0000	A
  ERROR: Unable to write an identifier containing a tab: '\t', file '<stdout>', line 4, record #2. Skipping.
  ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 4, record #3. Skipping.
  0003	B
  >>> writer.location.recno
  4
  >>> writer.location.output_recno
  2
  >>> writer.location.lineno
  5

There are three lines in the header; the signature, the type line, and
the software line. I tried to write four fingerprints, but two were
invalid. It wrote the valid fingerprint "A" to stdout, report the two
invalid records to stderr, and write the valid fingerprint "B" to
stdout. Thus, two records were actually output, which is why
``output_recno`` is 2, while four records were sent to the writer,
which is why ``recno`` is 4. The three header lines and the two lines
of output give five lines of output, so the final ``lineno`` is 5.

In case you hadn't figured it out, the location information is used to
make the exception and error message. That explains why both of the
error reports say the error is on "line 4"; that's the line that would
have been output if there were no error.

Note that the :class:`FPB writer <~FPBWriter>` does not have a
location, and it ignores the ``location`` parameter.

.. _maccs_and_hydrogens:

MACCS dependency on hydrogens
=============================

In this section you'll learn how the RDKit MACCS fingerprints differ
if there are explicit or implicit hydrogens.

Note: A goal of this is to show that MACCS key generation isn't as
easy as you might think it is!

One of my long-term goals is to get a good cross-toolkit
implementation of the MACCS keys. It's very odd how the MACCS keys
are the de facto fingerprint for cheminformatics, but the toolkits
don't give the same answers. Over the years, I've found bugs or
incomplete definitions in all of the toolkits I've looked at, which
I've reported and have since been fixed.

If you use RDKit, Open Babel, or CDK then your toolkit implements
MACCS keys that were derived from the ones that Greg Landrum developed
for RDKit. The portable portion uses hand-translated SMARTS
definitions for most of the MACCS key definitions. A couple keys, like
key 125 ("at least two aromatic rings") cannot be represented as
SMARTS. RDKit had special code for these definitions, but Open Babel
does not.

Even with a portable SMARTS definition, I would expect to see some
differences between the toolkits, if only because they have different
aromaticity models. One toolkit might call something an aromatic ring,
while another says it's aliphatic.

Unfortunately, the SMARTS patterns used in those programs give
different results if you have explicit hydrogens or implicit
hydrogens. I'll demonstrate with using RDKit, because that has a
*reader_arg* to specify if I want to remove hydrogens from the input
structure or not. (Here "remove" means to make them implicit.)

I'll use RDKit twice to read the first molecule from a file and
compute the RDKit fingerprint; the first time I keep the hydrogens and
the second time I remove them::

  >>> import chemfp
  >>> from chemfp import bitops
  >>> filename = "Compound_099000001_099500000.sdf.gz"
  >>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
  >>> 
  >>> with_h_reader = fptype.read_molecule_fingerprints(filename,
  ...    reader_args={"removeHs": False})
  >>> with_h_id, with_h_fp = next(with_h_reader)
  >>> with_h_id, bitops.hex_encode(with_h_fp)
  ('99000039', '000004000000300001c4004e93e1b053dce16f6e1f')
  >>> 
  >>> without_h_reader = fptype.read_molecule_fingerprints(filename,
  ...    reader_args={"removeHs": True})
  >>> without_h_id, without_h_fp = next(without_h_reader)
  >>> without_h_id, bitops.hex_encode(without_h_fp)
  ('99000039', '000004000000300001c0004e9361b051dce1676e1f')

If you look closely you'll see that they have two different
fingerprints! I'll make it easier to see by reporting the bits which
are only in one or the other fingerprint:

  >>> with_h_bits = set(bitops.byte_to_bitlist(with_h_fp))
  >>> without_h_bits = set(bitops.byte_to_bitlist(without_h_fp))
  >>> sorted(with_h_bits - without_h_bits) # only with hydrogens
  [74, 111, 121, 147]
  >>> sorted(without_h_bits - with_h_bits) # only without hydrogens
  []

The molecule with explicit hydrogens sets four more bits than the one
with implicit hydrogens.

Why is that? The RDKit (and hence Open Babel and CDK) definitions
often use "\*" to match an atom, when the corresponding MACCS
definition is supposed to exclude hydrogens. A hydrogen-independent
version would use "[!#1]" instead. By default RDKit removes normal
explicit hydrogens, so this isn't usually a problem. As far as I can
tell, Open Babel always removes them from an SD file, so again this
isn't really a problem. (Well, except for hydrogens with an explicit
isotope number.)

The list [74, 111, 121, 147] are bit numbers. The corresponding keys
are 75, 112, 122, and 148. I looked at how key 122 is defined in
various sources:

.. code-block:: none 
    
  Definitions for key 112 (bit 111)
      MACCS: AA(A)(A)A
      RDKit:  *~*(~*)(~*)~*
      OpenBabel: *~*(~*)(~*)~*
      CDK: *~*(~*)(~*)~*
      chemfp's RDMACCS-*: [!#1]~*(~[!#1])(~[!#1])~[!#1]   
      O'Donnell: *~*(~*)(~*)~*

("O'Donnell" here comes from Table A.4 of TJ O'Donnell's `Design and
Use of Relational Databases in Chemistry <http://www.crcpress.com/product/isbn/9781420064421>`_.)

If you know SMARTS you can see how an explicit H will lead to a
different match than an implicit one, except for chemfp's own attempt
at making a cross-toolkit MACCS implementation. I'll test out
RDMACCS-RDKit, which is chemfp's implementation of the MACCS 166
fingerprint using RDKit::

  >>> chemfp_maccs = chemfp.get_fingerprint_type("RDMACCS-RDKit")
  >>> 
  >>> with_h_reader = chemfp_maccs.read_molecule_fingerprints(filename,
  ...     reader_args={"removeHs": False})
  >>> with_h_id, with_h_fp = next(with_h_reader)
  >>> with_h_id, bitops.hex_encode(with_h_fp)
  ('99000039', '000004000000300001c0004e9361b051dce1676e1f')
  >>> 
  >>> without_h_reader = chemfp_maccs.read_molecule_fingerprints(filename,
  ...     reader_args={"removeHs": True})
  >>> without_h_id, without_h_fp = next(without_h_reader)
  >>> without_h_id, bitops.hex_encode(without_h_fp)
  ('99000039', '000004000000300001c0004e9361b051dce1676e1f')
  >>> 
  >>> with_h_fp == without_h_fp
  True

What a relief that they are the same!

If you want to use the OEChem, Open Babel, or CDK-based RDMACSS
implemenations, the corresponding fingerprint type names are
"RDMACCS-OpenEye", "RDMACCS-OpenBabel", or "RDMACCS-CDK" respectively,
and the command-line option for :ref:`oe2fps <oe2fps>`, :ref:`ob2fps
<ob2fps>` and :ref:`cdk2fps <cdk2fps>` is :option:`--rdmaccs`.

WARNING: the RDMACCS fingerprints have not been fully validated!
Validation is hard. A chemfp goal is to make that easier.

To finish, I was curious about the differences in RDKit's native
MACCS166 implementation across all of the records in the file, so I
wrote some code. It's a direct evolution of the code you already
saw. (Note: for Python 2 I use itertools.izip() as a replacement for
the generator-based zip() in Python 3.)

.. code-block:: python 
    
  from __future__ import print_function # Only for Python 2
  import itertools 
  from collections import Counter 
  import chemfp
  from chemfp import bitops
  
  zip = getattr(itertools, "izip", zip) # Support Python2 and Python3
  
  filename = "Compound_099000001_099500000.sdf.gz"
  with_h_fingerprints = chemfp.read_molecule_fingerprints(
      "RDKit-MACCS166", filename, reader_args={"removeHs": False})
  without_h_fingerprints = chemfp.read_molecule_fingerprints(
      "RDKit-MACCS166", filename, reader_args={"removeHs": True})
  
  extra_with_h = Counter()
  extra_without_h = Counter()
  num_records = 0
  for (id1, with_h_fp), (id2, without_h_fp) in zip(with_h_fingerprints,
                                                    without_h_fingerprints):
      num_records += 1
      assert id1 == id2, (id1, id2)
      if with_h_fp != without_h_fp:
          with_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(with_h_fp))
          without_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(without_h_fp))
          only_with_h = sorted(with_h_keys - without_h_keys)
          only_without_h = sorted(without_h_keys - with_h_keys)
          print(id1, "with:", only_with_h, "without:", only_without_h)
          extra_with_h.update(only_with_h)
          extra_without_h.update(only_without_h)
  
  print("\nNumber of records:", num_records)
  print("\nCounts that were only with hydrogens:")
  for key, count in extra_with_h.most_common():
      print("  %d %d" % (key, count))
  print("\nCounts that were only without hydrogens:")
  for key, count in extra_without_h.most_common():
      print("  %d %d" % (key, count))

In case you were wondering, the report summary starts:

.. code-block:: none 
  
  Number of records: 10826
  
  Counts that were only with hydrogens:
    112 6851
    150 3345
    144 3209
    122 2807
    138 2767
    66 2763
    148 2372
    155 2311
    126 684
    76 682
    75 412
    81 407
    128 344
    118 173
    156 96
    107 24
    90 18
    108 15
    129 9
    132 2
  
Now you can see why I used key 112 in my elaboration - it's the one
that causes the most problems!


Create similarity search web service
=====================================

In this section you'll learn how to write a simple WSGI-based web
service which does a similarity search given an SDF record.

I found it a bit difficult to write this section because few people
will write a WSGI service directly. I think most people use Django,
but a Django example would require several different files to make it
work. There are other web frameworks I could use, like Flask, but I
eventually decided to limit myself to what's available in the standard
library, that is, the `wsgiref module
<https://docs.python.org/2/library/wsgiref.html>`_.

I'm going to write a WSGI server named "simple_server.py" which takes
an SDF record as input and returns the top 5 hits from a specified
database. If there's a GET request then the result is a simple
form. The form sends a POST request to the server, with the SDF record
in the query parameter *q*.

By the way, if the target fingerprint data set is large then you
should use an FPB file to get the best startup performance.

.. highlight:: python 

Let's get started. The first part is a comment about what the code
does and some imports::

  # This is a very simple fingerprint search server.
  # I call it 'simple_server.py'.
  #
  # Usage: simple_server <fingerprint_filename> [port]
  #
  # A GET to the server (default uses port 80) returns a simple form.
  # The form has a single text box, to paste the SDF query or queries.
  # The POST query variable 'q' contains the SDF contents.
  # The search finds the nearest 5 queries for each query record.
  # The result is a simple list of query ids and its matches.
  
  
  import argparse
  from wsgiref.simple_server import make_server
  import cgi
  
  import chemfp

The server will return an HTML form for a GET request::
  
  # Create a simple form.
  def query_form(environ, start_response):
      status = '200 OK' # HTTP Status
      headers = [('Content-type', 'text/html')] # HTTP Headers
      start_response(status, headers)
  
      # The returned object is going to be printed.
      # Must be a byte string for Python 3.
      return [b"""<html>
  <head>
   <title>Simple fingerprint search</title>
  </head>
  <body>
  <form method="POST">
  Paste in SDF records(s):<br />
  <textarea name="q" type="text" rows="20" cols="80"></textarea><br />
  <button type="submit">Search!</button>
  </form>
  </body>
  </html>
  """]

I'll use the argparse module to handle the command-line arguments::

  # Command-line parameters
  parser = argparse.ArgumentParser("simple_search",
                                   description="Simple fingerprint web server with SDF input")
  parser.add_argument("filename",
                      help="chemfp fingerprint filename")
  
  parser.add_argument("port", type=int, default=8080, nargs="?",
                      help="port to use (default is 8080)")

The heavy work is in the 'main' function. It starts with some setup to
load the fingerprints and make sure the fingerprint type is available::

  def main():
      args = parser.parse_args()
  
      # Load the arena, get the type, and make sure I can handle the type.
      arena = chemfp.load_fingerprints(args.filename)
      print("Loaded %s fingerprints from %r" % (len(arena), args.filename))
      
      type = arena.metadata.type
      if type is None:
          parser.error("File %r does not contain a fingerprint type" % (args.filename,))
  
      try:
          fptype = chemfp.get_fingerprint_type(type)
      except KeyError as err:
          parser.error(str(err))

It then defines the WSGI app, which returns the query_form() for a GET
request, or processes the form for a POST request. I think the
embedded comments explain things enough::

  # ... continue the 'main' function ...
      # This is the WSGI app, defined inside of main
  
      def fingerprint_search_app(environ, start_response):
          # Is this a GET or a POST? If a GET, return the query form
          if environ["REQUEST_METHOD"] != "POST":
              return query_form(environ, start_response)
          
          # Get the query data from the POST
          post_env = environ.copy()
          post = cgi.FieldStorage(
              fp=environ['wsgi.input'],
              environ=post_env,
              keep_blank_values=True,
          )
          q = post.getfirst("q", "")
          # The underlying toolkit code may require "\n" instead of "\r\n" strings.
          q = q.replace("\r\n", "\n")
          
          # For each input record, do a search, get the results, and build up the output lines.
          # Ignore any records that can't be parsed.
          
          output = ["Search against %r using k=5 and threshold=0.0\n\n" % (args.filename,)]
          
          # The next three lines use chemfp to convert the record into a
          # fingerprint, do the search for the top 5 hits, get the ids
          # and scores for the hits, and make the output text.
          
          for query_id, fp in fptype.read_molecule_fingerprints_from_string(q, "sdf", errors="ignore"):
              results = arena.knearest_tanimoto_search_fp(fp, k=5, threshold=0.0)
              text = " ".join( "%s (%.3f)" % (id, score) for (id, score) in results.get_ids_and_scores())
              output.append("%s => %s\n" % (query_id, text))
    
          # Return the results in plain text.
            
          status = '200 OK' # HTTP Status
          headers = [('Content-type', 'text/plain')] # HTTP Headers
          start_response(status, headers)
          
          # Python 3 requires bytes, not strings, so convert to UTF-8
          return [line.encode("utf8") for line in output]
  
The main function ends with some code to start the WSGI server using
the correct port::
        
      # ... end of the 'main' function ...
      # Make the server and run it. (Use ^C to kill it.)
      httpd = make_server('', args.port, fingerprint_search_app)
      print("Serving fingerprint search on port %s..." % (args.port,))
    
      httpd.serve_forever()
 

Finally, code to start things rolling::
  
  if __name__ == "__main__":
      main()

.. highlight:: pycon
      
I'll start the server using a ChEBI-derived data set:

.. code-block:: console
  
  % python simple_server.py rdkit_chebi.fps
  Loaded 106965 fingerprints from 'rdkit_chebi.fps'
  Serving fingerprint search on port 8080...
  
then direct the browser to http://127.0.0.1:8080/ . I pasted in the
first three records from ChEBI itself, pressed "Search!", and got the
result:

.. code-block:: none
   
  Search against 'rdkit_chebi.fps' using k=5 and threshold=0.0
  
   => CHEBI:90 (1.000) CHEBI:15600 (1.000) CHEBI:23053 (1.000) CHEBI:33992 (1.000) CHEBI:58994 (1.000)
   => CHEBI:165 (1.000) CHEBI:4999 (1.000) CHEBI:36612 (1.000) CHEBI:132827 (0.977) CHEBI:15994 (0.944)
   => CHEBI:598 (1.000) CHEBI:52595 (1.000) CHEBI:144315 (0.965) CHEBI:17389 (0.716) CHEBI:134138 (0.716)

I don't think I'll continue this WSGI example in future documentation
as that API is too low-level and seldom used by web developers. If you
think otherwise, let me know.