# Fingerprints and fingerprint search examples¶

The chemfp command-line programs use a Python library called chemfp. Portions of the API are in flux and subject to change. The stable portions of the API which are open for general use are documented in chemfp API.

The API includes:

• low-level Tanimoto, Tversky, and popcount operations
• Tanimoto and Tversky search algorithms based on threshold and/or k-nearest neighbors
• routines for reading and writing fingerprints
• a cross-toolkit molecule I/O API
• a cross-toolkit fingerprint type API

The following chapters give examples of how to use the API, starting with fingerprints, fingerprint I/O, and fingerprint search.

## Python 2 vs. Python 3¶

Python 2.7 support by the core Python developers ended at the start of 2020. The Python 2 series has reached its effective end-of-life. It’s time for you to migrate code to Python 3.

If you are writing new code which uses chemfp then you really should start using Python 3. OpenEye stopped shipping a Python 2.7 version of OEChem by the end of 2017, and Open Babel and RDKit stopped Python 2.7 support by 2019. Chemfp 3.5 is the last version of the commercial chemfp development track which will support Python 2.

If you have code which works under Python 2 and you want it to work under Python 3, then there are two main options. One option is to re-write all of the incompatible code, so the result works under Python 3 but not Python 2. However, that can be too big of a step.

Another option is to port your code to the subset of Python which works under both Python 2 and Python 3. While this is more work overall, the steps are smaller, and it’s possible to develop new features while gradually doing the port.

A goal of the chemfp 3 series is to help with that migration. It supports both Python 2.7 and Python 3.6 or later, with the same API.

This documentation is written with that second option in mind. The examples are shown in Python 3, but the same code will work under Python 2.7. The only differences are in the output, which I’ll detail in the next section.

## Unicode and byte strings¶

In chemfp 3.x, the record identifier is a Unicode string while the fingerprint is a byte string. Earlier versions of chemfp treated both identifiers and fingerprints as byte strings. To make things more confusing, Python 2 and Python 3 use different ways to input and denote Unicode and binary strings.

Under Python 2, normal strings are byte strings, while Unicode strings are represented with the u"" syntax:

>>> "This is a byte string"     # Python 2
'This is a byte string'
>>> u"This is a Unicode string"
u'This is a Unicode string'


Under Python 3, normal strings are Unicode strings, while byte strings are represented with the b"" syntax:

>>> b"This is a byte string"    # Python 3
b'This is a byte string'
>>> "This is a Unicode string"
'This is a Unicode string'


Python 2.7 understands the b"" notation, and Python 3 understands the u"" notation, so the portable way to represent a Unicode identifier and binary fingerprint is to be explicit about the string type:

>>> id = u"España"   # Works in Python 2.7 and Python 3
>>> fp = b"\x00A!\xff"


While the data types are the same, the output representations are different on the two versions of Python:

>>> (id, fp)    # Python 2.7
(u'Espa\xf1a', '\x00A!\xff')

>>> (id, fp)    #  Python 3
('España', b'\x00A!\xff')


The output in these examples will be from Python 3. Unless otherwise stated, the equivalent output in Python 2.7 differs only in the prefix.

## Hex representation of a binary fingerprint¶

In Python 2 it is easy to turn a byte string into a hex-encoded string:

>>> fp = b"\x00A!\xff"  # Python 2.7
>>> fp.encode("hex")
'004121ff'


The more direct (and faster) route is to use the binascii.hexlify function:

>>> import binascii   # Python 2.7
>>> binascii.hexlify(fp)
'004121ff'


In Python 3 it’s even easier to turn a byte string into a hex-encoded string:

>>> fp = b"\x00A!\xff"  # Python 3
>>> fp.hex()
'004121ff'


but that is not portable. Nor does fp.encode("hex") work, because in Python 3 byte strings do not have an encode() method:

>>> fp.encode("hex")  # Python 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'


If you want a byte string as output then the portable solution is to use hexlify:

>>> import binascii  # Python 3
>>> binascii.hexlify(fp)
b'004121ff'


However, on Python 2.7 I often want the hex-encoded version as a byte (“normal”) string, while on Python 3 I want it as a (“normal”) Unicode string, because I use hex strings for text output.

Python does not offer a portable solution, but chemfp does, in the chemfp.bitops module, with a function named hex_encode:

>>> from chemfp import bitops  # Python 2 and Python 3
>>> bitops.hex_encode(b"\x00A!\xff")
'004121ff'


The variant hex_encode_as_bytes returns a byte string, and I think it is easier to remember than binascii.hexlify:

>>> bitops.hex_encode_as_bytes(b"\x00A!\xff")
b'004121ff'
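
One way to model this behavior in pure Python is to hexlify and then decode the ASCII result. This is only an illustrative sketch of what such a helper could look like, not chemfp's actual implementation:

```python
import binascii

def hex_encode(fp):
    # Hex-encode the byte string, then decode the ASCII result so the
    # return value is a text string.
    return binascii.hexlify(fp).decode("ascii")

print(hex_encode(b"\x00A!\xff"))  # 004121ff
```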


## Byte and hex fingerprints¶

In this section you’ll learn how chemfp stores fingerprints and some of the low-level bit operations on those fingerprints.

chemfp stores fingerprints as byte strings. Here are two 8 bit fingerprints:

>>> fp1 = b"A"
>>> fp2 = b"B"


The chemfp.bitops module contains functions which work on byte fingerprints. Here’s the byte Tanimoto of those two fingerprints:

>>> from chemfp import bitops
>>> bitops.byte_tanimoto(fp1, fp2)
0.3333333333333333


To understand why, you have to know that the ASCII character “A” has the value 65, and “B” has the value 66. The bit representations are:

"A" = 01000001   and   "B" = 01000010


so their intersection has 1 bit and the union has 3, giving a Tanimoto of 1/3 or 0.3333333333333333 as stored in Python’s 64 bit floating point value.
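
That arithmetic can be modeled in a few lines of pure Python. The following is an illustrative sketch with a hypothetical helper name, not chemfp's implementation (which is optimized C):

```python
def tanimoto(fp1, fp2):
    # Popcount of the intersection (AND) divided by popcount of the
    # union (OR), computed byte by byte. Iterating over a Python 3
    # byte string yields integers.
    if len(fp1) != len(fp2):
        raise ValueError("byte fingerprints must have the same length")
    intersection = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return intersection / union if union else 0.0

print(tanimoto(b"A", b"B"))  # 0.3333333333333333
```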

You can compute the Tanimoto between any two byte strings with the same length, as in:

>>> bitops.byte_tanimoto(b"apples&", b"oranges")
0.5833333333333334


You’ll get a ValueError if they have different lengths:

>>> bitops.byte_tanimoto(b"ABC", b"A")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: byte fingerprints must have the same length


The Tversky index is also available. The default values for alpha and beta are 1.0, which makes it identical to the Tanimoto:

>>> bitops.byte_tversky(b"apples&", b"oranges")
0.5833333333333334
>>> bitops.byte_tversky(b"apples&", b"oranges", 1.0, 1.0)
0.5833333333333334


Using alpha = beta = 0.5 gives the Dice index:

>>> bitops.byte_tversky(b"apples&", b"oranges", 0.5, 0.5)
0.7368421052631579


In chemfp, the alpha and beta may be between 0.0 and 100.0, inclusive. Values outside that range will raise a ValueError:

>>> bitops.byte_tversky(b"A", b"B", 0.2, 101)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: beta must be between 0.0 and 100.0, inclusive
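
As a worked illustration of the formula behind these results, here is a pure-Python sketch of the Tversky index with the same parameter range check. The helper name is hypothetical and this is not chemfp's actual code:

```python
def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    # Tversky(A, B) = c / (alpha*(a-c) + beta*(b-c) + c), where a and b
    # are the popcounts of the two fingerprints and c is the popcount
    # of their intersection. alpha = beta = 1 gives the Tanimoto;
    # alpha = beta = 0.5 gives the Dice index.
    if not 0.0 <= alpha <= 100.0:
        raise ValueError("alpha must be between 0.0 and 100.0, inclusive")
    if not 0.0 <= beta <= 100.0:
        raise ValueError("beta must be between 0.0 and 100.0, inclusive")
    a = sum(bin(x).count("1") for x in fp1)
    b = sum(bin(x).count("1") for x in fp2)
    c = sum(bin(x & y).count("1") for x, y in zip(fp1, fp2))
    denominator = alpha * (a - c) + beta * (b - c) + c
    return c / denominator if denominator else 0.0

print(tversky(b"apples&", b"oranges", 0.5, 0.5))  # the Dice index, 14/19
```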


Most fingerprints are not as easy to read as the English ones I showed above. They tend to look more like:

P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11


which is hard to read. I usually show hex-encoded fingerprints. The above fingerprint in hex is:

503140844b1a4e000a01a610985c1011


which is simpler to read. I’ll use hex_encode as the portable way to convert a byte fingerprint to a string under Python 2 and Python 3:

>>> bitops.hex_encode(b"apples&")   # Portable (returns a native string)
'6170706c657326'
>>> bitops.hex_encode(b"oranges")
'6f72616e676573'
>>> bitops.hex_decode(b"416e64726577")  # (returns a byte string)
b'Andrew'


If you do not need to support Python 2.7 then it’s easier to use the Python 3 specific “.hex()” method and “bytes.fromhex()” classmethod:

>>> b"apples&".hex()   # Python 3 only!
'6170706c657326'
>>> b"oranges".hex()   # Python 3 only!
'6f72616e676573'
>>> bytes.fromhex("416e64726577")   # Python 3 only!
b'Andrew'


Most of the byte functions in the bitops module have an equivalent hex version, like bitops.hex_tanimoto() which is the hex equivalent for bitops.byte_tanimoto():

>>> bitops.hex_tanimoto("6170706c657326", "6f72616e676573")
0.5833333333333334
>>> bitops.hex_tanimoto(u"6170706c657326", u"6f72616e676573")
0.5833333333333334
>>> bitops.hex_tanimoto(b"6170706c657326", b"6f72616e676573")
0.5833333333333334


These functions accept both byte strings and Unicode strings.

Even though hex-encoded fingerprints are easier to read than raw bytes, it can still be hard to figure out which bit is set in the hex fingerprint “00001000” (which is the byte fingerprint “\x00\x00\x10\x00”). For what it’s worth, bit number 20 is set, where bit 0 is the first bit.

You can get the list of “on” bits with the bitops.byte_to_bitlist() function:

>>> bitops.byte_to_bitlist(b"P1@\x84K\x1aN\x00\n\x01\xa6\x10\x98\\\x10\x11")
[4, 6, 8, 12, 13, 22, 26, 31, 32, 33, 35, 38, 41, 43, 44, 49, 50,
51, 54, 65, 67, 72, 81, 82, 85, 87, 92, 99, 100, 103, 106, 107,
108, 110, 116, 120, 124]
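
The bit numbering can be modeled in pure Python. As the examples above show, bit 0 is the least-significant bit of the first byte; the following sketch (a hypothetical helper, not chemfp's C implementation) applies that convention:

```python
def to_bitlist(fp):
    # List every "on" bit position, where bit 0 is the least-significant
    # bit of the first byte.
    return [byteno * 8 + bitno
            for byteno, byte in enumerate(fp)
            for bitno in range(8)
            if byte >> bitno & 1]

print(to_bitlist(b"\x00\x00\x10\x00"))  # [20]
```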


That’s a lot of overhead if you only want to tell if, say, bit 41 is set. For that case use bitops.byte_contains_bit():

>>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 41)
True
>>> bitops.byte_contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 42)
False
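
The single-bit test is cheap because it only needs to inspect one byte. A pure-Python sketch of the idea (illustrative only, with a hypothetical name):

```python
def contains_bit(fp, bit):
    # Locate the byte holding the bit, then test that bit within it
    # (bit 0 is the least-significant bit of the first byte).
    return bool(fp[bit // 8] >> (bit % 8) & 1)

print(contains_bit(b"P1@\x84K\x1aN\x00\n\x01", 41))  # True
```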


The bitops.byte_from_bitlist() function creates a fingerprint given a list of ‘on’ bits. By default it generates a 1024 bit fingerprint, which is a bit too long for this documentation. I’ll use 64 bits instead:

>>> bitops.byte_from_bitlist([0], 64)
b'\x01\x00\x00\x00\x00\x00\x00\x00'


The bit positions are folded modulo the fingerprint size, so bit 65 is mapped to bit 1, as in the following:

>>> bitops.byte_from_bitlist([0, 65], 64)
b'\x03\x00\x00\x00\x00\x00\x00\x00'
>>> bitops.byte_to_bitlist(bitops.byte_from_bitlist([0, 65], 64))
[0, 1]
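
The folding rule can be sketched in pure Python. This is an illustration of the behavior described above, with a hypothetical helper name, not chemfp's implementation:

```python
def from_bitlist(bits, num_bits=1024):
    # Build a fingerprint with the given "on" bits, folding positions
    # >= num_bits back into range with a modulo.
    fp = bytearray((num_bits + 7) // 8)
    for bit in bits:
        bit %= num_bits
        fp[bit // 8] |= 1 << (bit % 8)
    return bytes(fp)

print(from_bitlist([0, 65], 64))  # b'\x03\x00\x00\x00\x00\x00\x00\x00'
```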


The bitops module includes other low-level functions which work on byte fingerprints, as well as corresponding functions which work on hex fingerprints; see the chemfp API documentation for the full list. (Hex-encoded fingerprints are decidedly second-class citizens in chemfp, but they are citizens.) There are also two functions which compare a byte fingerprint to a hex fingerprint. These are somewhat faster than the pure hex versions because they don’t need to verify that the query fingerprint contains only hex characters.

## Fingerprint collections and metadata¶

In this section you’ll learn the basics of the fingerprint reader classes and fingerprint metadata.

A fingerprint record is the fingerprint plus an identifier. In chemfp, a fingerprint reader is an object which supports iteration through fingerprint records. Some fingerprint readers, like the FingerprintArena, also support direct record lookup.

That’s rather abstract, so let’s work with a few real examples. You’ll need to create a copy of the “pubchem_targets.fps” file generated in Generate fingerprint files from PubChem SD tags in order to follow along.

Here’s how to open an FPS file:

>>> import chemfp
>>> reader = chemfp.open("pubchem_targets.fps")


Every fingerprint collection has a metadata attribute with details about the fingerprints. It comes from the header of the FPS file. You can view the metadata in Python repr format:

>>> reader.metadata
Metadata(num_bits=881, num_bytes=111, type='CACTVS-E_SCREEN/1.0 extended=2',
aromaticity=None, sources=['Compound_048500001_049000000.sdf.gz'],
software='CACTVS/unknown', date='2020-05-11T14:35:11')


In chemfp 3.x the type, software, date and the source filenames are Unicode strings. In earlier versions of chemfp these were byte strings.

I added a few newlines to make that easier to read, but I think it’s easier still to view it in string format, which matches the format of the FPS header:

>>> from __future__ import print_function  # Only needed in Python 2
>>> print(reader.metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-11T14:35:11


(The print statement in Python 2 was replaced with a print function in Python 3. The special future statement tells Python 2 to use the new print function syntax of Python 3.)
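
The header lines follow a simple “#key=value” pattern, so a few lines of Python can pull them apart. This is a hypothetical sketch for illustration; real FPS parsing (including the format version line) has more details:

```python
def parse_fps_header(lines):
    # Collect the "#key=value" header lines of an FPS file into a dict,
    # stopping at the first non-header line.
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break
        key, _, value = line[1:].rstrip("\n").partition("=")
        metadata[key] = value
    return metadata

header = ["#num_bits=881\n",
          "#software=CACTVS/unknown\n",
          "#date=2020-05-11T14:35:11\n"]
print(parse_fps_header(header)["num_bits"])  # 881
```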

All fingerprint collections support iteration. Each step of the iteration returns the fingerprint identifier and the fingerprint byte string. Since I know the eighth record has the id 48500199, I can write a simple loop which stops with that record:

>>> from chemfp import bitops
>>> for (id, fp) in reader:
...   print(id, "starts with", bitops.hex_encode(fp)[:20])
...   if id == u"48500199":
...     break
...
48500020 starts with 07de0500000000000000
48500053 starts with 07de0c00000000000000
48500091 starts with 07de8c00000000000000
48500092 starts with 07de0d00020000000000
48500110 starts with 075e0c00000000000000
48500164 starts with 07de0c00000000000000
48500177 starts with 03de0500000800000000
48500199 starts with 07de0c00000000000000


Fingerprint collections also support iterating via arenas, and several support Tanimoto search methods.

## Working with a FingerprintArena¶

In this section you’ll learn about the FingerprintArena fingerprint collection and how to iterate through subarenas in a collection.

Chemfp supports two format types. The FPS format is designed to be easy to read and write, but searching through it requires a linear scan of the disk, which can only be done once. If you want to do many queries then it’s best to load the FPS data into memory as a FingerprintArena.

Use chemfp.load_fingerprints() to load fingerprints into an arena:

>>> from __future__ import print_function # Only needed for Python 2
>>> import chemfp
>>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
>>> print(arena.metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-11T14:35:11


The fingerprints can come from an FPS file, as in this example, or from an FPB file. The FPB format is much more complex internally, but can be loaded directly and quickly into a FingerprintArena, also with the same function:

>>> arena = chemfp.load_fingerprints("pubchem_targets.fpb")


An arena implements the fingerprint collection API, so you can do things like iterate over an arena and get the id/fingerprint pairs:

>>> from chemfp import bitops
>>> for id, fp in arena:
...     print(id, "with popcount", bitops.byte_popcount(fp))
...     if id == u"48656867":
...         break
...
48942244 with popcount 33
48941399 with popcount 39
48940284 with popcount 40
48943050 with popcount 40
48656359 with popcount 41
48656867 with popcount 42


If you look closely you’ll notice that the fingerprint record order has changed from the previous section, and that the population counts are suspiciously non-decreasing. By default load_fingerprints() on an FPS file reorders the fingerprints into a data structure which is faster to search, though you can disable that with the reorder parameter if you want the fingerprints to be the same as the input order.
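
The reordering is by population count. Here is a pure-Python sketch of the idea using made-up records (the ids and fingerprints are hypothetical, and the real arena builds an optimized data structure rather than a sorted list):

```python
def byte_popcount(fp):
    # Total number of "on" bits in a byte fingerprint.
    return sum(bin(b).count("1") for b in fp)

# Hypothetical records, for illustration only.
records = [("ID1", b"\xff\xff"), ("ID2", b"\x01\x00"), ("ID3", b"\x0f\x00")]
# Sorting by popcount mimics the arena's default reordering.
records.sort(key=lambda rec: byte_popcount(rec[1]))
print([rec_id for rec_id, fp in records])  # ['ID2', 'ID3', 'ID1']
```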

The FingerprintArena has new capabilities. You can ask it how many fingerprints it contains, get the list of identifiers, and look up a fingerprint record given an index:

>>> len(arena)
14967
>>> list(arena.ids[:5])
['48942244', '48941399', '48940284', '48943050', '48656359']
>>> id, fp = arena[6]
>>> id
'48839855'
>>> arena[-1][0]  # the identifier of the last record in the arena
'48985180'
>>> bitops.byte_popcount(arena[-1][1])  # its fingerprint
253


An arena supports iterating through subarenas. This is like having a long list and being able to iterate over sublists. Here’s an example of iterating over the arena to get subarenas of size 2000 (excepting the last), and print information about each subarena:

>>> for subarena in arena.iter_arenas(2000):
...   print(subarena.ids[0], len(subarena))
...
48942244 2000
48629741 2000
48848217 2000
48873983 2000
48575094 2000
48531270 2000
48806978 2000
48584671 967
>>> arena[0][0]
'48942244'
>>> arena[2000][0]
'48629741'


To help demonstrate what’s going on, I showed the first id of each subarena along with the main arena ids for records 0 and 2000, so you can verify that they are the same.

Arenas are a core part of chemfp. Processing one fingerprint at a time is slow, so the main search routines expect to iterate over query arenas, rather than query fingerprints.

That’s why the FPSReaders – and all chemfp fingerprint collections – also support the chemfp.FingerprintReader.iter_arenas() method. Here’s an example of reading 25 records at a time from the targets file:

>>> queries = chemfp.open("pubchem_queries.fps")
>>> for arena in queries.iter_arenas(25):
...   print(len(arena))
...
25
25
25
25
# ... many lines omitted ...
1


Those add up to 10826, which you can verify is the number of structures in the original source file.

If you have a FingerprintArena then you can also use Python’s slice notation to make a subarena:

>>> queries = chemfp.load_fingerprints("pubchem_queries.fps")
>>> queries[10:15]
<chemfp.arena.FingerprintArena object at 0x552c10>
>>> queries[10:15].ids
['99110546', '99110547', '99123452', '99123453', '99133437']
>>> queries.ids[10:15]  # a different way to get the same list
['99110546', '99110547', '99123452', '99123453', '99133437']


The big restriction is that slices can only have a step size of 1. Slices like [10:20:2] and [::-1] aren’t supported. If you want something like that then you’ll need to make a new arena instead of using a subarena slice. (Hint: pass the list of indices to the arena's copy method.)

In case you were wondering, yes, you can use iter_arenas and the other FingerprintArena methods on a subarena:

>>> queries[10:15][1:3].ids
['99110547', '99123452']
>>> queries.ids[11:13]
['99110547', '99123452']


## Create an arena with user-specified fingerprints¶

In this section you’ll learn how to create an arena containing user-specified fingerprint data.

Most of the examples in this manual use fingerprints created by a cheminformatics toolkit or extracted from an SD file. Chemfp accepts any byte string as a fingerprint, which includes, for example, novel fingerprint types which you have created for your own research.

The first parameter of the load_fingerprints() function can be any iterator which yields (Unicode identifier, byte string fingerprint) pairs. For example, if you have three fingerprint records where each fingerprint contains 32 bits of data, like this:

>>> data = [(u"ID1", b"\xc4\xa7\xd2\x1e"),
...         (u"ID2", b"\x04\x82\xd6\x08"),
...         (u"ID3", b"\xc1\xa3\xd2\x1e")]


then you can pass the list directly to load_fingerprints, along with a Metadata instance to tell chemfp the fingerprint size and type:

>>> import chemfp
>>> arena = chemfp.load_fingerprints(data, chemfp.Metadata(num_bits=32))
>>> len(arena)
3


What if the fingerprint data comes from a file which isn’t in FPS format? The chemfp.bitops and chemfp.encodings modules contain functions which can help with the conversion. Suppose each line in the file contains an id followed by a list of bit indices for the on bits:

>>> lines = ["ID1 0 1 9 10 11 14 15 16 18 19 43\n",
...          "ID2 0 1 2 9 10 11 12 14 18 19 20 43\n"]


The following function reads the lines, parses the id and bit list, converts the bitlist into a 64-bit byte string, and yields the id/fingerprint pairs:

>>> def get_id_and_fp(lines):
...   for line in lines:
...     fields = line.split()
...     bitlist = [int(bit) for bit in fields[1:]]
...     yield fields[0], bitops.byte_from_bitlist(bitlist, 64)
...
>>> for id, fp in get_id_and_fp(lines):
...   print(id, repr(fp))
...
ID1 b'\x03\xce\r\x00\x00\x08\x00\x00'
ID2 b'\x07^\x1c\x00\x00\x08\x00\x00'


Here’s one way to use the function to create an arena:

>>> arena = chemfp.load_fingerprints(get_id_and_fp(lines),
...                                  chemfp.Metadata(num_bits=64))
>>> len(arena)
2
>>> arena.get_fingerprint(0)
b'\x03\xce\r\x00\x00\x08\x00\x00'


It’s a bit cumbersome to pass the metadata into load_fingerprints when the parser already knows that information, but there’s a better way. If no metadata is passed to the load_fingerprints function then the function will try to get it from the metadata attribute of its first argument. That’s why you get an exception if you omit the metadata:

>>> arena = chemfp.load_fingerprints(get_id_and_fp(lines))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/__init__.py", line 550, in load_fingerprints
alignment=alignment)
File "chemfp/arena.py", line 849, in fps_to_arena
AttributeError: 'generator' object has no attribute 'metadata'


Instead, wrap the metadata and id/fingerprint iterator inside of a FingerprintIterator utility class:

>>> def read_bitlist_format(lines):
...   return chemfp.FingerprintIterator(
...       chemfp.Metadata(num_bits=64),
...       get_id_and_fp(lines))
...


The result can be passed directly to load_fingerprints:

>>> arena = chemfp.load_fingerprints(read_bitlist_format(lines))
>>> len(arena)
2
>>> arena[1]
('ID2', b'\x07^\x1c\x00\x00\x08\x00\x00')


The FingerprintIterator also implements the FingerprintReader.save() method, which can be used to save the fingerprints to an FPS or FPB file. See the next section for more details.

## Save a fingerprint arena¶

In this section you’ll learn how to save an arena in FPS and FPB formats.

This is probably the easiest section. If you have an arena (or any FingerprintReader), like:

>>> import chemfp
>>> queries = chemfp.load_fingerprints("pubchem_queries.fps")


then you can save it to an FPS file using the FingerprintReader.save() method and a filename ending with “.fps”. (You’ll also get an FPS file if you specify an unknown extension.):

>>> queries.save("example.fps")


If the filename ends with “.fps.gz” then the file will be saved as a gzip-compressed FPS file, and if the filename ends with “.fps.zst” and the zstandard Python package is installed, then the file will be saved as a zstandard-compressed FPS file.

Finally, if the name ends with “.fpb”, as in:

>>> queries.save("example.fpb")


then the result will be in FPB format. The save() method can also save gzip- and zstandard-compressed FPB files.

The save method supports a second option, format, should you for some odd reason want the format to be different than what’s implied by the filename extension:

>>> queries.save("example.fpb", "fps")  # save in FPS format


The save method supports a third option, level, which specifies the compression level. This should be an integer appropriate for the compression library. The string aliases “min”, “default”, and “max” are mapped to the appropriate compression level for the given format: “min” is 1; “default” is 9 for gzip and 3 for zstandard; “max” is 9 for gzip and 19 for zstandard.

## How to use query fingerprints to search for similar target fingerprints¶

In this section you’ll learn how to do a Tanimoto search using the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.

It’s faster to search an arena, so I’ll load the target fingerprints:

>>> from __future__ import print_function   # Only for Python 2.7
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> len(targets)
14967


and open the queries as an FPSReader.

>>> queries = chemfp.open("pubchem_queries.fps")


I’ll use chemfp.threshold_tanimoto_search() to find, for each query, all hits which are at least 0.7 similar to the query.

>>> for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
...   print(query_id, len(hits), list(hits)[:2])
...
99000039 641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571)]
99000230 373 [(2747, 0.703030303030303), (3608, 0.7041420118343196)]
99002251 270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709)]
99003537 523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392)]
99003538 523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392)]
99005028 131 [(772, 0.7589285714285714), (796, 0.7522123893805309)]
99005031 131 [(772, 0.7589285714285714), (796, 0.7522123893805309)]
99006292 308 [(805, 0.7058823529411765), (808, 0.7)]
99006293 308 [(805, 0.7058823529411765), (808, 0.7)]
99006597 0 []
# ... many lines omitted ...


I’m only showing the first two hits for the sake of space. It seems rather pointless to show all 641 hits of query id 99000039.

However, there’s a subtle problem here. The “list(hits)” returns a list of (index, score) tuples when the targets are an arena, and (id, score) tuples when the targets are an FPS reader. (I’ll explain how that works in the next section.) It’s best to always specify how you want the results. In my case I always want the identifiers and the scores, so I’ll use hits.get_ids_and_scores(), like this:

from __future__ import print_function # Only for Python 2
import chemfp
queries = chemfp.open("pubchem_queries.fps")
targets = chemfp.load_fingerprints("pubchem_targets.fps")
for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.7):
    print(query_id, len(hits), hits.get_ids_and_scores()[:2])


which gives as output:

99000039 641 [('48528698', 0.7085714285714285), ('48529189', 0.7371428571428571)]
99000230 373 [('48737535', 0.703030303030303), ('48502523', 0.7041420118343196)]
99002251 270 [('48857943', 0.7006369426751592), ('48846196', 0.7088607594936709)]
99003537 523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392)]
99003538 523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392)]
99005028 131 [('48659090', 0.7589285714285714), ('48657042', 0.7522123893805309)]
99005031 131 [('48659090', 0.7589285714285714), ('48657042', 0.7522123893805309)]
99006292 308 [('48976796', 0.7058823529411765), ('48542022', 0.7)]
99006293 308 [('48976796', 0.7058823529411765), ('48542022', 0.7)]
99006597 0 []
# ... many lines omitted ...


What you don’t see in either case is that the implementation uses the chemfp.FingerprintReader.iter_arenas() interface on the queries so that it processes one subarena at a time. There’s a tradeoff between a large arena, which is faster because it returns to Python code less often, and a small arena, which uses less memory and is more responsive. You can change the tradeoff using the arena_size parameter.

If all you need is the count of the hits at or above a given threshold then use chemfp.count_tanimoto_hits():

>>> queries = chemfp.open("pubchem_queries.fps")
>>> for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.7):
...     print(query_id, count)
...
99000039 641
99000230 373
99002251 270
99003537 523
99003538 523
99005028 131
99005031 131
99006292 308
99006293 308
99006597 0
# ... many lines omitted ...


Or, if you only want the k=2 nearest neighbors to each target within that same threshold of 0.7 then use chemfp.knearest_tanimoto_search():

>>> queries = chemfp.open("pubchem_queries.fps")
>>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.7):
...   print(query_id, hits.get_ids_and_scores())
...
99000039 [('48503376', 0.8784530386740331), ('48503380', 0.8729281767955801)]
99000230 [('48563034', 0.8588235294117647), ('48731730', 0.8522727272727273)]
99002251 [('48798046', 0.8109756097560976), ('48625236', 0.8106508875739645)]
99003537 [('48997075', 0.9035532994923858), ('48997697', 0.8984771573604061)]
99003538 [('48997075', 0.9035532994923858), ('48997697', 0.8984771573604061)]
99005028 [('48651160', 0.8288288288288288), ('48848576', 0.8166666666666667)]
99005031 [('48651160', 0.8288288288288288), ('48848576', 0.8166666666666667)]
99006292 [('48945841', 0.9652173913043478), ('48737522', 0.8793103448275862)]
99006293 [('48945841', 0.9652173913043478), ('48737522', 0.8793103448275862)]
99006597 []
# ... many lines omitted ...


## How to search an FPS file¶

In this section you’ll learn how to search an FPS file directly, without loading it into a FingerprintArena. You’ll need the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.

The previous example loaded the fingerprints into a FingerprintArena. That’s the fastest way to do multiple searches. Sometimes you only want to do one or a couple of queries. It seems rather excessive to read the entire targets file into an in-memory data structure before doing the search when you could search while processing the file.

For that case, use an FPSReader as the targets file. Here I’ll get the first two records from the queries file and use it to search the targets file:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
>>> query_arena
<chemfp.arena.FingerprintArena object at 0x11039c850>
>>> len(query_arena)
2


That first line is complicated. It opens the file and iterates over its fingerprint records two at a time as arenas. The next() returns the first of these arenas, so that line is a way of saying “get the first two records as an arena”.

Here are the k=5 closest hits against the targets file:

>>> targets = chemfp.open("pubchem_targets.fps")
>>> for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
...   print("** Hits for", query_id, "**")
...   for hit in hits.get_ids_and_scores():
...     print("", hit)
...
** Hits for 99000039 **
('48503376', 0.8784530386740331)
('48503380', 0.8729281767955801)
('48732162', 0.8595505617977528)
('48520532', 0.8540540540540541)
('48985130', 0.8449197860962567)
** Hits for 99000230 **
('48563034', 0.8588235294117647)
('48731730', 0.8522727272727273)
('48583483', 0.8411764705882353)
('48563042', 0.8352941176470589)
('48935653', 0.8333333333333334)


To make it easier to see, here’s the code in a single chunk:

from __future__ import print_function # Only for Python 2
import chemfp
query_arena = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
targets = chemfp.open("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tanimoto_search(query_arena, targets, k=5, threshold=0.0):
    print("** Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)


Remember that the FPSReader reads an FPS file. Once you’ve done a search, the file has been consumed, and you can’t do another search. (Well, you can, but it will return empty results.) You’ll need to reopen the file, or seek the underlying file handle back to the start position and pass it to a new FPSReader.

Each search processes arena_size query fingerprints at a time. You will need to increase that value if you want to search more than that number of fingerprints with this method.

## How to do a Tversky search using the Dice weights¶

In this section you’ll learn how to search a set of fingerprints using the more general Tversky parameters. You’ll need the previously created PubChem fingerprint files for the queries and the targets from Generate fingerprint files from PubChem SD tags.

Chemfp-2.1 added support for Tversky searches. The Tversky index adds weights to the substructure and superstructure terms of the similarity. Some people like the Dice index, which is the Tversky index with alpha = beta = 0.5, so here are a couple of ways to search the targets based on the Dice index.

The previous two sections did a Tanimoto search by using chemfp.knearest_tanimoto_search(). The Tversky search uses chemfp.knearest_tversky_search(), which shouldn’t be much of a surprise. Just like the Tanimoto search code, it can take a fingerprint arena or an FPS reader as the targets.

The first example loads all of the targets into an arena, then searches using each of the queries:

from __future__ import print_function # Only for Python 2
import chemfp
queries = chemfp.open("pubchem_queries.fps")
targets = chemfp.load_fingerprints("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5,
                                                     threshold=0.0, alpha=0.5, beta=0.5):
    print("**Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)


The first two output records are:

**Hits for 99000039 **
('48503376', 0.9352941176470588)
('48503380', 0.9321533923303835)
('48732162', 0.9244712990936556)
('48520532', 0.9212827988338192)
('48985130', 0.9159420289855073)
**Hits for 99000230 **
('48563034', 0.9240506329113924)
('48731730', 0.9202453987730062)
('48583483', 0.9137380191693291)
('48563042', 0.9102564102564102)
('48935653', 0.9090909090909091)


On the other hand, the following reads the first two queries into an arena, then searches the targets as an FPS file, without loading all of the targets into memory at once:

import chemfp
queries = next(chemfp.open("pubchem_queries.fps").iter_arenas(2))
targets = chemfp.open("pubchem_targets.fps")
for query_id, hits in chemfp.knearest_tversky_search(queries, targets, k=5,
                              threshold=0.0, alpha=0.5, beta=0.5):
    print("** Hits for", query_id, "**")
    for hit in hits.get_ids_and_scores():
        print("", hit)


Not surprisingly, this gives the same output as before:

** Hits for 99000039 **
('48503376', 0.9352941176470588)
('48503380', 0.9321533923303835)
('48732162', 0.9244712990936556)
('48520532', 0.9212827988338192)
('48985130', 0.9159420289855073)
** Hits for 99000230 **
('48563034', 0.9240506329113924)
('48731730', 0.9202453987730062)
('48583483', 0.9137380191693291)
('48563042', 0.9102564102564102)
('48935653', 0.9090909090909091)


## FingerprintArena searches returning indices instead of ids¶

In this section you’ll learn how to search a FingerprintArena and use hits based on integer indices rather than string ids.

The previous sections used a high-level interface to the Tanimoto and Tversky search code. Those are designed for the common case where you just want the query id and the hits, where each hit includes the target id.

Working with strings is actually rather inefficient in both speed and memory. It’s usually better to work with indices if you can, and in the next section I’ll show how to make a distance matrix using this interface.

The index-based search functions are in the chemfp.search module. They can be categorized into three groups, with Tanimoto and Tversky versions for each group:

1. Count the number of hits
2. Find all hits at or above a given threshold, sorted arbitrarily
3. Find the k-nearest hits at or above a given threshold, sorted by decreasing similarity

The functions ending “_fp” take a query fingerprint and a target arena. The functions ending “_arena” take a query arena and a target arena. The functions ending “_symmetric” use the same arena as both the query and target.

In the following example, I’ll use the first 5 query fingerprints to search the target data set. To do this, I load the targets as an arena, read the first 5 query records as a small arena, and do the search.

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> from chemfp import search
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> queries = next(chemfp.open("pubchem_queries.fps").iter_arenas(5))
>>> results = search.threshold_tanimoto_search_arena(queries, targets, threshold=0.7)


The search.threshold_tanimoto_search_arena() call finds the target fingerprints which have a similarity score of at least 0.7 compared to the query.

You can iterate over the results (which is a SearchResults) to get the list of hits for each of the queries. The order of the results is the same as the order of the records in the query:

>>> for hits in results:
...   print(len(hits), hits.get_ids_and_scores()[:3])
...
641 [('48528698', 0.7085714285714285), ('48529189', 0.7371428571428571), ('48937990', 0.7039106145251397)]
373 [('48737535', 0.703030303030303), ('48502523', 0.7041420118343196), ('48560268', 0.7)]
270 [('48857943', 0.7006369426751592), ('48846196', 0.7088607594936709), ('48855282', 0.710691823899371)]
523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392), ('48529584', 0.705)]
523 [('48542237', 0.7230769230769231), ('48739065', 0.7085427135678392), ('48529584', 0.705)]


The results object doesn’t store the query ids. Instead, you have to know that the results are in the same order as the query arena, so you can match the query arena’s ids attribute, which contains the list of fingerprint identifiers, to each result:

>>> for query_id, hits in zip(queries.ids, results):
...   print("Hits for", query_id)
...   for hit in hits.get_ids_and_scores()[:3]:
...     print("", hit)
...
Hits for 99000039
('48528698', 0.7085714285714285)
('48529189', 0.7371428571428571)
('48937990', 0.7039106145251397)
Hits for 99000230
('48737535', 0.703030303030303)
('48502523', 0.7041420118343196)
('48560268', 0.7)
Hits for 99002251
('48857943', 0.7006369426751592)
('48846196', 0.7088607594936709)
('48855282', 0.710691823899371)
Hits for 99003537
('48542237', 0.7230769230769231)
('48739065', 0.7085427135678392)
('48529584', 0.705)
Hits for 99003538
('48542237', 0.7230769230769231)
('48739065', 0.7085427135678392)
('48529584', 0.705)


What I really want to show is that you can get the same data using the offset index of each target record instead of its id. The result of a Tanimoto search with a query arena is a SearchResults. Iterating over it gives a SearchResult object, with methods like SearchResult.get_indices_and_scores(), SearchResult.get_ids(), and SearchResult.get_scores():

>>> for hits in results:
...   print(len(hits), hits.get_indices_and_scores()[:3])
...
641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571), (4576, 0.7039106145251397)]
373 [(2747, 0.703030303030303), (3608, 0.7041420118343196), (3777, 0.7)]
270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709), (3185, 0.710691823899371)]
523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
>>>
>>> targets.ids[0]
'48942244'
>>> targets.ids[1]
'48941399'
>>> targets.ids[3619]
'48528698'
>>> targets.ids[4302]
'48529189'


I did a few id lookups in the target arena to show that the indices correspond to the identifiers from the previous code.

These examples iterated over each individual SearchResult to fetch the ids and scores, or indices and scores. Another possibility is to ask the SearchResults collection to iterate directly over the fields you want. SearchResults.iter_indices_and_scores(), for example, iterates over the get_indices_and_scores() of each SearchResult.

>>> for row in results.iter_indices_and_scores():
...   print(len(row), row[:3])
...
641 [(3619, 0.7085714285714285), (4302, 0.7371428571428571), (4576, 0.7039106145251397)]
373 [(2747, 0.703030303030303), (3608, 0.7041420118343196), (3777, 0.7)]
270 [(2512, 0.7006369426751592), (2873, 0.7088607594936709), (3185, 0.710691823899371)]
523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]
523 [(6697, 0.7230769230769231), (7478, 0.7085427135678392), (7554, 0.705)]


This was added to get a bit more performance out of chemfp and because the API is sometimes cleaner one way and sometimes cleaner the other. Yes, I know that the Zen of Python recommends that “there should be one– and preferably only one –obvious way to do it.” Oh well.

## Access the SearchResult scores and indices as a NumPy array¶

In this section you’ll learn how to access the scores and indices for the search result hits as a NumPy array. This returns a structured array view of the underlying search result hits. You will need to install NumPy for the following to work. See the next section if you want to use ctypes or Python’s buffer API.

The following reads the targets file created in Generate fingerprint files from PubChem SD tags, randomly selects a query fingerprint (using the seed 123), then finds the Tanimoto score between that query and all of the fingerprints in the targets file:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> query_fp = targets.fingerprints.random_choice(rng=123)
>>> result = targets.threshold_tanimoto_search_fp(query_fp, threshold=0.0)


The SearchResult.get_scores() method returns the result’s scores as a Python array object, which gives a compact way to store a list of doubles using a stock Python installation:

>>> result.get_scores()[:5]
array('d', [0.22772277227722773, 0.3, 0.31, 0.24761904761904763, 0.45054945054945056])


However, this array is a copy of the underlying scores. Making the copy doubles the amount of memory used, and reduces performance somewhat because chemfp must create the new array and copy the values into it.

Most people who use chemfp also use NumPy, which has its own array type, along with a large number of functions which work on NumPy arrays. The SearchResult.as_numpy_array() method returns a NumPy array view of the underlying search result hits. The term “view” means that the NumPy array doesn’t have its own storage space but uses the same storage space as the search result object.
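The view-vs-copy distinction is ordinary NumPy behavior. Here is a small plain-NumPy illustration (independent of chemfp) which builds a structured array using the same record layout the hits are reported with, a little-endian 32-bit index followed by a 64-bit score:

```python
import numpy as np

# The same record layout chemfp reports for its hits:
# a little-endian 32-bit index followed by a 64-bit score.
hit_dtype = np.dtype([("index", "<i4"), ("score", "<f8")])
print(hit_dtype.itemsize)        # 12 bytes per hit

storage = np.zeros(3, dtype=hit_dtype)
view = storage.view()            # shares storage; nothing is copied
storage["score"][0] = 0.75
print(view["score"][0])          # the view sees the change: 0.75

copy = storage["score"].copy()   # a copy has its own storage
storage["score"][0] = 0.25
print(copy[0])                   # the copy does not: still 0.75
```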

Warning

Do not use the NumPy view of a search result after the search result has been cleared or freed up as that will likely cause your program to segfault. IN ADDITION, do not modify the contents of the array as it may cause chemfp’s optimized sort algorithms to fail.

The view is implemented as a structured array with access to the index and score arrays:

>>> arr = result.as_numpy_array()
>>> arr
array([(    0, 0.22772277), (    1, 0.3       ), (    2, 0.31      ), ...,
       (14964, 0.30350195), (14965, 0.33333333), (14966, 0.30798479)],
      dtype=[('index', '<i4'), ('score', '<f8')])
>>> arr["score"][:5]
array([0.22772277, 0.3       , 0.31      , 0.24761905, 0.45054945])
>>> arr["index"][:5]
array([0, 1, 2, 3, 4], dtype=int32)


I’ll reorder by decreasing score and show that the NumPy array values also change:

>>> result.reorder("decreasing-score")
>>> arr["score"][:5]
array([1.        , 1.        , 0.97849462, 0.96808511, 0.96808511])


This can be further simplified using the short-cut method SearchResult.get_scores_as_numpy_array():

>>> result.get_scores_as_numpy_array()
array([1.        , 1.        , 0.97849462, ..., 0.13432836, 0.13235294,
       0.12977099])
>>> result.get_indices_as_numpy_array()
array([856, 857, 907, ..., 334, 371, 217], dtype=int32)


These scores can be passed directly to something like matplotlib for plotting, which in this case shows the distribution of similarity scores for the given query:

>>> from matplotlib import pyplot
>>> pyplot.plot(result.get_scores_as_numpy_array())
[<matplotlib.lines.Line2D object at 0x122c0b190>]
>>> pyplot.show()


## Access the SearchResult scores and indices as buffer or ctypes structure¶

In this section you’ll learn how to access the scores and indices for the search result hits as a ctypes structure. This returns an array-like view of the underlying search result hits using built-in Python methods, which may be useful if NumPy isn’t available or if NumPy’s startup overhead is too high. See the previous section if you want to use NumPy.

The following reads the targets file created in Generate fingerprint files from PubChem SD tags, randomly selects a query fingerprint (using the seed 123), then finds the Tanimoto score between that query and all of the fingerprints in the targets file:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> query_fp = targets.fingerprints.random_choice(rng=123)
>>> result = targets.threshold_tanimoto_search_fp(query_fp, threshold=0.0)


The previous section showed how to access the scores and identifiers as NumPy arrays. Chemfp does not depend on NumPy, but several parts of the chemfp API are sped up by having a way to get a low-overhead view of the underlying hits from Python. The SearchResult.as_ctypes() method returns a ctypes view of that data:

>>> hits = result.as_ctypes()
>>> hits
<chemfp.search.Hit_Array_14967 object at 0x10cf9fac0>
>>> len(hits)
14967
>>> hits[0]
Hit(index=856, score=1.000000)
>>> hits[len(hits)//4]
Hit(index=6071, score=0.557047)
>>> hits[len(hits)//4].score
0.5570469798657718
>>> result.reorder("increasing-score")
>>> hits[0]
Hit(index=217, score=0.129771)


The raw bytes of the search result hits are available through Python’s buffer protocol via the SearchResult.as_buffer() method. The following shows that each hit takes up 12 bytes of space:

>>> buf = result.as_buffer()
>>> buf
<memory at 0x123597f40>
>>> len(buf)
179604
>>> len(buf) / len(result)
12.0
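Those 12 bytes per hit match a packed index/score record: a little-endian 32-bit integer plus a 64-bit double. Assuming that layout (it matches the structured dtype shown in the previous section), such a buffer could be decoded with the stdlib struct module; the buffer here is hand-built for illustration rather than taken from as_buffer():

```python
import struct

# "<" = little-endian, no padding; "i" = int32 index, "d" = float64 score
print(struct.calcsize("<id"))    # 12, matching len(buf) / len(result)

# Build a two-hit buffer and decode it record by record
buf = struct.pack("<id", 856, 1.0) + struct.pack("<id", 6071, 0.557047)
for index, score in struct.iter_unpack("<id", buf):
    print(index, score)
```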


Warning

Do not use the ctypes or buffer views of a search result after the search result has been cleared or freed up as that will likely cause your program to segfault. IN ADDITION, do not modify the contents of those views as it may cause chemfp’s optimized sort algorithms to fail.

I’ll be surprised if anyone uses this function directly. If you do, let me know why!

## Access the fingerprint arena bytes as a NumPy array¶

In this section you’ll learn how to access the arena’s fingerprint data as a NumPy array. This returns a byte view of the underlying arena data structure. If you want the fingerprint bits as 0 or 1 values, see the next section. You will need to install NumPy for the following to work.

A FingerprintArena stores the fingerprints in a contiguous block of memory. Each fingerprint is stored as the first arena.num_bytes bytes of a field containing arena.storage_size bytes of memory. If num_bytes is smaller than storage_size then the field is 0-padded, that is, the remaining bytes are set to 0.
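With the default 64-bit (8-byte) alignment, the storage size is num_bytes rounded up to the next multiple of 8. A sketch of the arithmetic (the helper name is mine, not a chemfp function):

```python
def storage_size(num_bits, alignment=8):
    """Bytes per fingerprint after padding to the given byte alignment."""
    num_bytes = (num_bits + 7) // 8                       # bytes for the bits
    return ((num_bytes + alignment - 1) // alignment) * alignment

print(storage_size(166))    # 166-bit MACCS keys: 21 bytes, padded to 24
print(storage_size(881))    # 881-bit PubChem keys: 111 bytes, padded to 112
print(storage_size(1024))   # already a multiple of 8: 128 bytes
```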

If you work with Python code then you can use chemfp’s Python API to access the fingerprints. But what if you want to access the fingerprints from a C extension? More specifically, what if you want to access the fingerprints from NumPy, which contains a lot of optimized routines for analyzing matrix-like data?

The FingerprintArena.to_numpy_array() method returns a read-only view of the fingerprint data as a 2D NumPy array with len(arena) rows and arena.storage_size columns. Each element of the matrix is an unsigned 8 bit integer, that is, a byte.

The matrix is a “view” of the data, meaning that it uses the same contiguous block of memory that the arena uses.

Warning

Do not use the NumPy view of an arena from an FPB file after the file has been closed as that will likely cause your program to segfault.

Here is an example using MACCS fingerprints for ChEBI 187 generated by RDKit:

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chebi_maccs.fps")  # filename assumed
>>> arr = arena.to_numpy_array()
>>> arr
array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       ...,
       [  0,  16, 128, ...,   0,   0,   0],
       [  0,  16, 128, ...,   0,   0,   0],
       [  0,   0, 128, ...,   0,   0,   0]], dtype=uint8)
>>> arr.shape
(107207, 24)


While it isn’t chemically meaningful, I’ll sum the bytes down the rows:

>>> arr.sum(axis=0)
array([  490116,   204316,   601303,  1485108,   968167,  2407708,
        2464853,  2392025,  6600791,  5761640,  5880625, 10715664,
        8568501, 12248444, 11166730, 13371871, 12146087, 13559574,
       17746237, 20894627,  2761788,        0,        0,        0], dtype=uint64)


The last three values are 0 because of the 0-padding. By default chemfp uses 64-bit alignment, which means 192 bits or 24 bytes for the 166-bit MACCS key fingerprints, even though only 21 bytes are needed.

If the 0 padding is a problem then you can use NumPy slicing to select just the actual fingerprint bytes:

>>> unpadded_arr = arr[:, :arena.num_bytes]
>>> unpadded_arr
array([[  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0],
       ...,
       [  0,  16, 128, ..., 255, 255,  31],
       [  0,  16, 128, ..., 255, 255,  31],
       [  0,   0, 128, ..., 255, 255,  31]], dtype=uint8)
>>> unpadded_arr.shape
(107207, 21)
>>> unpadded_arr.sum(axis=0)
array([  490116,   204316,   601303,  1485108,   968167,  2407708,
        2464853,  2392025,  6600791,  5761640,  5880625, 10715664,
        8568501, 12248444, 11166730, 13371871, 12146087, 13559574,
       17746237, 20894627,  2761788], dtype=uint64)


## Access the fingerprint bits as a NumPy array¶

In this section you’ll learn how to access the arena’s fingerprint bit values as a NumPy array. This returns a new array containing the values 0 or 1. If you want a view of the underlying arena bytes, see the previous section. You will need to install NumPy for the following to work.

Some people use fingerprint bit values as descriptors for clustering or other machine learning algorithms. The FingerprintArena.to_numpy_bitarray() method returns a 2D array containing bit values. The array contains len(arena) rows. By default it returns one column for each fingerprint bit.
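Conceptually this is a byte-to-bit unpacking of the arena data from the previous section. Here is a plain-NumPy sketch of the equivalent operation, assuming bit i is stored in byte i//8 at bit position i%8, that is, least-significant bit first:

```python
import numpy as np

# Two 2-byte "fingerprints", one per row
byte_arr = np.array([[0b00000101, 0b10000000],
                     [0b11111111, 0b00000000]], dtype=np.uint8)

# bitorder="little" emits bit 0 of each byte first
bit_arr = np.unpackbits(byte_arr, axis=1, bitorder="little")
print(bit_arr[0])            # bits 0, 2, and 15 are set
print(bit_arr.sum(axis=1))   # per-fingerprint popcounts: [3 8]
```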

Here is an example using MACCS fingerprints for ChEBI 187 generated by RDKit:

>>> import chemfp
>>> arena = chemfp.load_fingerprints("chebi_maccs.fps")  # filename assumed
>>> bitarr = arena.to_numpy_bitarray()
>>> bitarr.shape
(107207, 166)


This is a normal NumPy array, so the usual NumPy methods work. For example, here are the bits for the fingerprint at index 1000:

>>> bitarr[1000]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0], dtype=uint8)


and here are the number of occurrences of each bit:

>>> bitarr.sum(axis=0)
array([     0,      2,    514,     31,     22,     53,    264,   3663,
          452,    254,   4309,    215,    405,    326,    147,   1235,
         2239,    256,   3632,    177,    407,   4063,   1112,   2929,
         2780,   3946,    423,   2597,  11284,   2798,    481,   8993,
         9241,   3179,    720,   6701,   9993,   8846,   2306,   2387,
         2426,   9607,  10337,  11172,   2605,   1655,   6263,  13749,
        18487,  12349,  10391,   7077,  29244,  28915,  11422,   1557,
        50525,  11758,  10016,  11804,  11876,  20436,   1786,   9572,
        29439,  16754,  12719,    843,  15222,   4294,   4281,  45510,
        13238,  20715,  36963,  14132,  24909,   5101,  26283,  25017,
        18533,  47630,  33626,  13009,  41392,  33512,  13809,  22733,
        56840,  50194,  58465,  50896,  36946,  20788,  57521,  38904,
        51137,  48868,  26627,  46736,  49780,  28319,   7660,  44893,
        51824,  41062,  16770,  41733,  59879,  54221,  56176,  42384,
        39082,  19636,  44832,  45619,  56784,  54437,   6965,  58186,
        50183,  46442,  45119,  36041,      0,  48938,  64600,  55153,
        58571,  28754,  62726,  71980,  41656,  18204,  31671,  61932,
        72978,  56210,  69568,  68414,  40483,  60826,  64600,  45469,
        62761,  81488,  55865,  56584,  57693,  69504,  57550,  78234,
        75151,  83824,  79623,  71747,  89966,  73571,  93265,  78099,
        77122,  66637,  83354, 100861,  88193,      0], dtype=uint64)


While the default returns the bits for each fingerprint, you can use the transpose to find which fingerprint indices have a given bit set.

For example, only 22 fingerprints have the fifth bit (bit index 4) set. Key 5 is defined as “Lanthanide” and implemented as the SMARTS pattern [La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]. Which fingerprints contain a lanthanide?

>>> bitarr.T[4].nonzero()
(array([  334,   335,   338,   339,   340,   444,   455,   553,   554,
         1135,  1169,  1739,  1863,  3194,  3263,  3264,  3595,  4257,
         6573,  6574, 42598, 45728]),)


To make that useful I need the compound ids, so I’ll use the indices to get the ids from the arena:

>>> [arena.ids[idx] for idx in bitarr.T[4].nonzero()[0]]
['CHEBI:33330', 'CHEBI:33331', 'CHEBI:33341', 'CHEBI:33342',
'CHEBI:33343', 'CHEBI:52622', 'CHEBI:52635', 'CHEBI:49962',
'CHEBI:49978', 'CHEBI:63020', 'CHEBI:134455', 'CHEBI:139502',
'CHEBI:32234', 'CHEBI:77566', 'CHEBI:134436', 'CHEBI:134440',
'CHEBI:53479', 'CHEBI:139496', 'CHEBI:50950', 'CHEBI:51000',
'CHEBI:59824', 'CHEBI:139501']


I picked out a few of these and checked their structures; at least I wasn’t able to find a false positive!

The above example created the entire bit array but only used the column for bit 4. If you only want that column then it’s faster to pass an explicit list of the bit columns you want to to_numpy_bitarray():

>>> arena.to_numpy_bitarray([4])
array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]], dtype=uint8)
>>> arena.to_numpy_bitarray([4]).sum()
22


You can ask for more than one bit column. The following computes the Pearson product-moment correlation coefficients between columns 163 and 158 (column 163 has the most often set bit, and 158 has the second most often):

>>> bitarr = arena.to_numpy_bitarray([163, 158])
>>> bitarr
array([[0, 0],
       [0, 0],
       [0, 0],
       ...,
       [1, 1],
       [1, 1],
       [1, 1]], dtype=uint8)
>>> import numpy
>>> numpy.corrcoef(bitarr, rowvar=0)
array([[ 1.        ,  0.64876171],
       [ 0.64876171,  1.        ]])


When this section was originally written, extracting 1 column with to_numpy_bitarray was about 20x faster than extracting all of the columns and selecting just the desired column. The break-even point for 166 bits was around 45 columns.

## Computing a distance matrix for clustering¶

In this section you’ll learn how to compute a distance matrix using the chemfp API. The next section shows an alternative way to get the similarity matrix.

chemfp does not do clustering. There’s a huge number of tools which already do that. A goal of chemfp in the future is to provide some core components which clustering algorithms can use.

That’s in the future, because I know little about how people want to cluster with chemfp. Right now you can use the following to build a distance matrix and pass that to one of those tools. (I’ll use a distance matrix of 1 - the similarity matrix.)

Since we’re using the same fingerprint arena for both queries and targets, we know the similarity matrix is symmetric and that its diagonal terms are 1.0. The chemfp.search.threshold_tanimoto_search_symmetric() function can take advantage of the symmetry for a factor of two performance gain. There’s also a way to limit it to just the upper triangle, which cuts the memory use in half.

Most of those tools use NumPy, which is a popular third-party package for numerical computing. You will need to have it installed for the following to work.

import numpy  # NumPy must be installed
from chemfp import search

# Compute distance[i][j] = 1 - Tanimoto(fp[i], fp[j])

def distance_matrix(arena):
    n = len(arena)

    # Start off a similarity matrix with 1.0s along the diagonal
    similarities = numpy.identity(n, "d")

    ## Compute the full similarity matrix.
    # The implementation computes the upper-triangle then copies
    # the upper-triangle into lower-triangle. It does not include
    # terms for the diagonal.
    results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.0)

    # Copy the results into the NumPy array.
    # NOTE: see below for an implementation which is much faster.
    for row_index, row in enumerate(results.iter_indices_and_scores()):
        for target_index, target_score in row:
            similarities[row_index, target_index] = target_score

    # Return the distance matrix using the similarity matrix
    return 1.0 - similarities


With the distance matrix in hand, it’s easy to cluster. The SciPy package contains many clustering algorithms, as well as an adapter to generate a matplotlib graph. I’ll use it to compute a single linkage clustering using 100 randomly selected fingerprints:

from __future__ import print_function  # Only for Python 2
import chemfp
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# ... insert the 'distance_matrix' function definition here ...

dataset = chemfp.load_fingerprints("pubchem_targets.fps")
dataset = dataset.sample(100)  # select 100
distances = distance_matrix(dataset)

# Single linkage clustering, displayed as a dendrogram
dendrogram(linkage(squareform(distances), "single"),
           orientation="right",
           labels=dataset.ids)

import pylab
pylab.show()


NOTE: The above code created an empty NumPy array then filled it in with the scores. This is slow because much of the work is in Python.

Another possibility is to convert the results into a SciPy compressed sparse row matrix (see the next section), then turn that sparse array into a NumPy array. The following distance_matrix version is about 5x faster than the earlier one, even though it makes an intermediate csr matrix, because more of the work is done at the C level:

def distance_matrix(arena):
    n = len(arena)

    ## Compute the full similarity matrix.
    # The implementation computes the upper-triangle then copies
    # the upper-triangle into lower-triangle. It does not include
    # terms for the diagonal.
    results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.0)

    # Extract the results as a SciPy compressed sparse row matrix
    csr = results.to_csr()
    # Convert it to a NumPy array
    similarities = csr.toarray()
    # Fill in the diagonal
    numpy.fill_diagonal(similarities, 1)

    # Return the distance matrix using the similarity matrix
    return 1.0 - similarities


## Convert SearchResults to a SciPy csr matrix¶

In this section you’ll learn how to convert a SearchResults object into a SciPy compressed sparse row matrix.

In the previous section you learned how to use the chemfp API to create a NumPy similarity matrix, and convert that into a distance matrix. The result is a dense matrix, and the amount of memory goes as the square of the number of structures.

If you have a reasonably high similarity threshold, like 0.7, then most of the similarity scores will be zero. Internally the SearchResults object only stores the non-zero values for each row, along with an index to specify the column. This is a common way to compress sparse data.
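As a sketch of the idea (mirroring, not using, SciPy's layout), a compressed sparse row matrix stores three arrays: the non-zero values, the column index of each value, and a row-pointer array marking where each row's values begin:

```python
# Dense similarity rows with mostly-zero scores
dense = [
    [0.0, 0.85, 0.0, 0.0 ],
    [0.0, 0.0,  0.0, 0.0 ],
    [0.9, 0.0,  0.0, 0.81],
]

data, indices, indptr = [], [], [0]
for row in dense:
    for col, score in enumerate(row):
        if score:
            data.append(score)
            indices.append(col)
    indptr.append(len(data))   # row i occupies data[indptr[i]:indptr[i+1]]

print(data)     # [0.85, 0.9, 0.81]
print(indices)  # [1, 0, 3]
print(indptr)   # [0, 1, 1, 3]

# Recover the non-zero entries of row 2 from the compressed form
start, end = indptr[2], indptr[3]
print(list(zip(indices[start:end], data[start:end])))  # [(0, 0.9), (3, 0.81)]
```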

SciPy has its own compressed sparse row (“csr”) matrix data type, which can be used as input to many of the scikit-learn clustering algorithms.

If you want to use those algorithms, call the SearchResults.to_csr() method to convert the SearchResults scores (and only the scores) into a csr matrix. The rows will be in the same order as the SearchResult (and the original queries), and the columns will be in the same order as the target arena, including its ids.

I don’t know enough about scikit-learn to give a useful example. (If you do, let me know!) Instead, I’ll start by doing an NxM search of two sets of fingerprints:

from __future__ import print_function # Only for Python 2
import chemfp
from chemfp import search

queries = chemfp.load_fingerprints("pubchem_queries.fps")
targets = chemfp.load_fingerprints("pubchem_targets.fps")
results = search.threshold_tanimoto_search_arena(queries, targets, threshold=0.8)


The SearchResults attribute shape describes the number of rows and columns:

>>> results.shape
(10826, 14967)
>>> len(queries)
10826
>>> len(targets)
14967
>>> results[426].get_indices_and_scores()
[(133, 0.85), (153, 0.8064516129032258)]


I’ll turn it into a SciPy csr:

>>> csr = results.to_csr()
>>> csr
<10826x14967 sparse matrix of type '<class 'numpy.float64'>'
        with 369471 stored elements in Compressed Sparse Row format>
>>> csr.shape
(10826, 14967)


and look at the same row to show it has the same indices and scores:

>>> csr[426]
<1x14967 sparse matrix of type '<class 'numpy.float64'>'
        with 2 stored elements in Compressed Sparse Row format>
>>> csr[426].indices
array([133, 153], dtype=int32)
>>> csr[426].data
array([ 0.85      ,  0.80645161])


## Taylor-Butina clustering¶

For the last clustering example, here’s my (non-validated) variation of the Butina algorithm from JCICS 1999, 39, 747-750. See also http://www.redbrick.dcu.ie/~noel/R_clustering.html . You might know it as Leader clustering.

First, for each fingerprint find all other fingerprints with a threshold of 0.8:

from __future__ import print_function # Only for Python 2
import chemfp
from chemfp import search

results = search.threshold_tanimoto_search_symmetric(arena, threshold = 0.8)


Sort the results so that fingerprints with more hits come first, since those are more likely to be cluster centroids. Break ties arbitrarily by the fingerprint index; because arena fingerprints are ordered by the number of bits, this likely makes larger structures appear first:

# Reorder so the centroid with the most hits comes first.
# (That's why I do a reverse sort.)
# Ignore the arbitrariness of breaking ties by fingerprint index
results = sorted( (  (len(indices), i, indices)
                         for (i, indices) in enumerate(results.iter_indices())  ),
                  reverse=True)


Apply the leader algorithm to determine the cluster centroids and the singletons:

# Determine the true/false singletons and the clusters
true_singletons = []
false_singletons = []
clusters = []

seen = set()
for (size, fp_idx, members) in results:
    if fp_idx in seen:
        # Can't use a centroid which is already assigned
        continue
    seen.add(fp_idx)

    # True singletons have no neighbors within the threshold
    if not members:
        true_singletons.append(fp_idx)
        continue

    # Figure out which ones haven't yet been assigned
    unassigned = set(members) - seen

    if not unassigned:
        false_singletons.append(fp_idx)
        continue

    # this is a new cluster
    clusters.append( (fp_idx, unassigned) )
    seen.update(unassigned)


Once done, report the results:

print(len(true_singletons), "true singletons")
print("=>", " ".join(sorted(arena.ids[idx] for idx in true_singletons)))
print()

print(len(false_singletons), "false singletons")
print("=>", " ".join(sorted(arena.ids[idx] for idx in false_singletons)))
print()

# Sort so the cluster with the most compounds comes first,
# then by alphabetically smallest id
def cluster_sort_key(cluster):
    centroid_idx, members = cluster
    return -len(members), arena.ids[centroid_idx]

clusters.sort(key=cluster_sort_key)

print(len(clusters), "clusters")
for centroid_idx, members in clusters:
    print(arena.ids[centroid_idx], "has", len(members), "other members")
    print("=>", " ".join(arena.ids[idx] for idx in members))


The algorithm is quick for this small data set.

Out of curiosity, I tried this on 100,000 compounds selected arbitrarily from PubChem. It took 35 seconds on my desktop (a 3.2 GHz Intel Core i3) with a threshold of 0.8. In the Butina paper, the same took 24 hours, although that was with a 1024-bit fingerprint instead of 881 bits. It’s hard to compare the absolute speed of a MIPS R4000 from 1998 against a desktop from 2011, but the hardware difference alone is certainly less than the factor of about 2000 seen here.

More relevant is the comparison between these numbers for the 1.1 release and the original numbers for the 1.0 release. On my old laptop, may it rest in peace, it took 7 minutes to compute the same benchmark. Where did the roughly 16-fold performance boost come from? Money. After 1.0 was released, Roche funded various optimizations, including taking advantage of the symmetry (2x) and using hardware POPCNT if available (4x). Roche and another company helped fund the OpenMP support, and when my desktop reran this benchmark it used 4 cores instead of 1.

The wary among you might notice that 2*4*4 = 32x faster, while I said the overall code was only 16x faster. Where’s the factor of 2x slowdown? It’s in the Python code! The chemfp.search.threshold_tanimoto_search_symmetric() step took only 13 seconds. The remaining 22 seconds was in the leader code written in Python. To make the analysis more complicated, improvements to the chemfp API sped up the clustering step by about 40%.

With chemfp 1.0, the clustering overhead was minor compared to the full similarity search, so I didn’t keep track of it. With chemfp 1.1, those roles have reversed!

The most recent version now is chemfp 3.4, which is about 20% faster than chemfp 1.4 for this benchmark. And of course the hardware is faster still.

## MaxMin diversity selection using RDKit¶

In this section you’ll learn how to do diversity selection using RDKit’s MaxMin picker. You will also learn how to convert chemfp fingerprints into RDKit fingerprints. You will need to install RDKit for the following to work. You will also need to download a dataset of benzodiazepine structures.

Diversity selection finds elements which are unlike each other. One way to implement diversity selection is to cluster all of the compounds then pick a compound from each cluster, but this requires quadratic time to compute the similarity/distance matrix.

Chemfp does not implement diversity selection, though it may be added in the future if there is enough demand. I recommend the optimized MaxMin picker from RDKit, which does diversity selection without needing to compute the full matrix.
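To show the core idea, here is a minimal pure-Python sketch of greedy MaxMin selection over fingerprints stored as Python integers. It computes every distance exactly, unlike RDKit's lazy picker, so it only illustrates the algorithm, not the optimization:

```python
def tanimoto(fp1, fp2):
    both = bin(fp1 & fp2).count("1")
    either = bin(fp1 | fp2).count("1")
    return both / either if either else 0.0

def maxmin_pick(fps, num_picks, first=0):
    """Greedy MaxMin: repeatedly pick the fingerprint whose nearest
    already-picked neighbor is as far away as possible."""
    picks = [first]
    # min_dist[i] = distance from fps[i] to its closest picked fingerprint
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picks) < num_picks:
        best = max(range(len(fps)), key=min_dist.__getitem__)
        picks.append(best)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[best]))
    return picks

fps = [0b1111, 0b1110, 0b0001, 0b1000]
print(maxmin_pick(fps, 2))   # the seed plus the fingerprint farthest from it
```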

While it is possible to have RDKit’s MaxMinPicker use native chemfp fingerprints, there is a huge performance overhead (about 100x!) because every fingerprint distance requires a Python function call. It is far faster to convert chemfp fingerprints to RDKit fingerprints so that all of the processing can be done in C.

I’ll start with an example of selecting 100 diverse fingerprints from the benzodiazepine data set. The first step is to generate fingerprints. I’ll use rdkit2fps to generate RDKit Morgan fingerprints.

% rdkit2fps --morgan benzodiazepine.sdf.gz -o benzodiazepine_morgan2.fps.gz


and then use the chemfp Python API to load the fingerprints. I’ll use reorder=False so the arena fingerprints are in the same order as the input file. (The order isn’t important for this case, but may be important if you, say, merge two data sets together where you know you want to keep the first data set and select diverse compounds from the second.)

>>> import chemfp
>>> arena = chemfp.load_fingerprints("benzodiazepine_morgan2.fps.gz",
...      reorder=False)


The next step is to read the FPS file and convert the chemfp fingerprints into RDKit fingerprints. This is easy because RDKit function CreateFromBinaryText converts a chemfp fingerprint, which is just a byte string, into the equivalent ExplicitBitVect fingerprint.

>>> from rdkit import DataStructs
>>> rdkit_fps = [DataStructs.CreateFromBinaryText(fp) for fp in arena.fingerprints]


The fingerprints attribute was added in chemfp 3.4. For older chemfp versions use:

>>> rdkit_fps = [DataStructs.CreateFromBinaryText(fp) for id, fp in arena]


Finally, use RDKit to pick 100 diverse record indices:

>>> from rdkit import SimDivFilters
>>> picker = SimDivFilters.MaxMinPicker()
>>> picks = picker.LazyBitVectorPick(rdkit_fps, len(rdkit_fps), 100)
>>> len(picks)
100
>>> list(picks)
[10879, 8375, 2390, 4683, 3549, 6257, 9194, 9953, 96, 6860, 8016,
6034, 3197, 4213, 5762, 2323, 7531, 9894, 12279, 3398, 4607, 4827,
2874, 1608, 3234, 6128, 8710, 7691, 3006, 4898, 4372, 11609, 11401,
10614, 3861, 1295, 6936, 6192, 7121, 11577, 5092, 2523, 4926, 4614,
4956, 8762, 2261, 9184, 11666, 2828, 7767, 12027, 5000, 6126, 6266,
6097, 7966, 9208, 8064, 1327, 6241, 3392, 5730, 7744, 8485, 9299,
358, 5332, 4434, 2935, 8405, 5480, 4648, 1665, 5848, 9053, 5735,
6583, 8407, 1706, 5347, 11779, 12022, 2598, 8378, 3565, 7394, 4888,
10454, 6611, 11472, 2146, 6101, 295, 6632, 6717, 2442, 5638, 5372,
8279]


The indices match the arena order, so you can use arena.ids to get the corresponding id for each index; in this case, PubChem ids:

>>> arena.ids[10879]
'22984485'


The RDKit MaxMinPicker also lets you initialize the pick list with a set of indices. This is useful if you have an in-house compound data set X and want to select N diverse fingerprints from a vendor data set Y. That algorithm might look like:

import chemfp
from rdkit import DataStructs, SimDivFilters

# have_arena and want_arena are assumed to have been loaded
# earlier with chemfp.load_fingerprints().

# Merge the two fingerprint sets together, but keep track
# of which came from X.
fps = [DataStructs.CreateFromBinaryText(fp) for fp in have_arena.fingerprints]
num_have = len(fps)
fps.extend(DataStructs.CreateFromBinaryText(fp) for fp in want_arena.fingerprints)

# Do the picking
num_to_pick = 100
picker = SimDivFilters.MaxMinPicker()
have_ids = list(range(num_have))
picks = picker.LazyBitVectorPick(fps, len(fps), num_have+num_to_pick, have_ids)
newly_picked = picks[-num_to_pick:]
want_indices = [idx-num_have for idx in newly_picked]

# Report the picked compounds
print("Compounds to evaluate:")
for idx in want_indices:
    print(want_arena.ids[idx])


To learn more about the RDKit MaxMin picker and how to use it, see Roger Sayle’s slides from the 2017 RDKit User Group meeting and Tim Dudgeon’s commentary.

## OpenMP threads¶

In this section you’ll learn about chemfp and OpenMP threads, including how to set the number of threads to use.

OpenMP is an API for shared memory multiprocessing programming. Chemfp uses it to parallelize the similarity search algorithms. Support for OpenMP is a compile-time option for chemfp, and can be disabled with --without-openmp in setup.py. Versions 4.2 of gcc (released in 2007) and later support it, as do other compilers, though chemfp has only been tested with gcc.

Chemfp uses one thread per query fingerprint. This means that single fingerprint queries are not parallelized. There is no performance gain even if four cores are available.

(A note about nomenclature: a CPU can have one core, or it can have several cores. A single-processor computer has one CPU while a multiprocessor computer has several CPUs. Some cores can also run multiple hardware threads. So it’s possible to have many more hardware threads than CPUs.)
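
If you want to know how many logical CPUs your machine reports, the Python standard library can tell you. Note this is plain Python, not a chemfp API, and it counts hardware threads, not physical cores:

```python
import os

# os.cpu_count() reports the number of logical CPUs.  On a machine
# with one 4-core CPU and 2 hardware threads per core this is 8.
print(os.cpu_count())
```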

Chemfp uses multiple threads when there are many queries, which occurs when using a query arena against a target arena. These search methods include the high-level API in the top-level chemfp module (like ‘knearest_tanimoto_search’), and the arena search function in chemfp.search.

By default, OpenMP and therefore chemfp will use four threads:

>>> import chemfp
>>> chemfp.get_num_threads()
4


You can change this through the standard OpenMP environment variable OMP_NUM_THREADS in the shell:

% env OMP_NUM_THREADS=2 python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
>>> import chemfp
>>> chemfp.get_num_threads()
2


You can also change the number of threads from within Python using chemfp.set_num_threads():

>>> chemfp.set_num_threads(3)
>>> chemfp.get_num_threads()
3


If you specify 0 or 1 thread then chemfp will not use OpenMP at all and stick with a single-threaded implementation. (You probably want to disable OpenMP in multi-threaded programs like web servers. See the next section for details.)

Throwing more threads at a task doesn’t always make it faster. My old desktop has one CPU with two cores, so it’s pointless to have more than two OpenMP threads running, as you can see from some timings:

threshold_tanimoto_search_symmetric (threshold=0.8) (desktop)
1       22.6
2       13.1
3       12.3
4       12.9
5       12.6


On the other hand, my old laptop has 1 CPU with four cores, and while my desktop beats my laptop in single-threaded performance, once I have three cores going, my laptop is faster:

threshold_tanimoto_search_symmetric (threshold=0.8) (laptop)
1       27.4
2       14.6
3       10.3
4        8.2
5        9.0


How many cores/hardware threads are available? That’s a really good question. Chemfp implements chemfp.get_max_threads(), but that doesn’t seem to do what I want. So don’t use it, and I’ll figure out a real solution in a future release.

## OpenMP and multi-threaded code¶

In this section you’ll learn some of the problems of mixing OpenMP and multi-threaded code.

Do not use OpenMP and POSIX threads together on a Mac. It will crash. This includes multi-threaded web servers, such as those used to run Django applications. In multi-threaded code on a Mac you must either tell chemfp to be single-threaded, using:

chemfp.set_num_threads(1)


or figure out some way to put the chemfp search code into its own process space, which is a much harder solution.

Other OSes will let you mix POSIX and OpenMP threads, but life gets confusing. Might your web server handle three search requests at the same time? If so, should all of those get four OpenMP threads, so that 12 threads are running in total? Can your hardware handle that many threads?

It may be better to have chemfp not use OpenMP threads when under a multi-threaded system, or have some way to limit the number of chemfp search tasks running at the same time. Figuring out the right solution will depend on your hardware and requirements.
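
One way to limit the number of concurrent chemfp searches is a counting semaphore around the search call. Here’s a sketch using only the standard library; the helper names and the limit of 2 are my own choices, not part of chemfp:

```python
import threading

# Allow at most 2 searches to run at the same time, no matter
# how many server threads ask for one.
search_slots = threading.BoundedSemaphore(2)
results = []

def run_search(task_id):
    # Block until a search slot is free, then do the work.
    with search_slots:
        # ... the chemfp search call would go here ...
        results.append(task_id)

threads = [threading.Thread(target=run_search, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

With a real workload, at most two threads would be inside the chemfp search at any moment; the other requests wait their turn instead of oversubscribing the cores.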

## Fingerprint Substructure Screening (experimental)¶

In this section you’ll learn how to find target fingerprints which contain the query fingerprint bit patterns as a subset. Bear in mind that this is an experimental API.

Substructure search often uses a screening step to remove obvious mismatches before doing the subgraph isomorphism. One way is to generate a binary fingerprint such that if a query molecule is a substructure of a target molecule then the corresponding query fingerprint is completely contained in the target fingerprint, that is, the target fingerprint must have ‘on’ bits at every position where the query fingerprint has an ‘on’ bit.
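
The containment test itself is simple bit arithmetic. Here’s a pure-Python 3 sketch (chemfp’s bitops.byte_contains() does the equivalent in C); the function name is my own:

```python
def fp_contains(query_fp, target_fp):
    # The query is "contained" in the target when every 'on' bit in
    # the query is also 'on' in the target, i.e. for each byte,
    # query AND target == query.  Both byte strings must be the
    # same length.  (Python 3: iterating bytes yields integers.)
    return all((q & t) == q for q, t in zip(query_fp, target_fp))

print(fp_contains(b"\x40", b"\x44"))  # True:  0b01000000 in 0b01000100
print(fp_contains(b"\x40", b"\x95"))  # False: that bit is off in 0b10010101
```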

I’ll start by loading a fingerprint arena with four fingerprints, where the identifiers are Unicode strings and the fingerprint are byte strings of length 1, with the binary form shown to the right:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> from chemfp import bitops
...    (u"A1", b"\x44"),   # 0b01000100
...    (u"B2", b"\x6c"),   # 0b01101100
...    (u"C3", b"\x95"),   # 0b10010101
...    (u"D4", b"\xea"),   # 0b11101010
>>> for id, fp in arena:
...   print(bitops.hex_encode(fp), id)
...
44 A1
6c B2
95 C3
ea D4


I could use bitops.byte_contains() to search for fingerprints in a loop, in this case with a query fingerprint which requires that the 7th bit be set (they must fit the pattern 0b*1******):

>>> query_fingerprint = b"\x40"   # 0b01000000
>>> bitops.hex_encode(query_fingerprint)
'40'
>>> for id, target_fingerprint in arena:
...   if bitops.byte_contains(query_fingerprint, target_fingerprint):
...     print(id)
...
A1
B2
D4


This is slow because it uses Python to do almost all of the work. Instead, use contains_fp() from the chemfp.search module, which is faster because it’s all implemented in C:

>>> from chemfp import search
>>> result = search.contains_fp(query_fingerprint, arena)
>>> result
<chemfp.search.SearchResult object at 0x10195e090>
>>> print(result.get_ids())
['A1', 'B2', 'D4']


This is the same SearchResult instance that the similarity search code returns, though the scores are all 0.0:

>>> result.get_ids_and_scores()
[('A1', 0.0), ('B2', 0.0), ('D4', 0.0)]


This API is experimental and likely to change. Please provide feedback. While I don’t think the current call parameters will change, I might have it return the Tanimoto score (or Hamming distance?) instead of 0.0. Or I might have a way to compute new scores given a SearchResult.
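
In the meantime, if you want similarity scores for the “contains” matches, you can compute them yourself. Here’s a pure-Python 3 sketch of the Tanimoto calculation (chemfp’s bitops.byte_tanimoto() does this in C), applied to the three matches from the example above:

```python
def byte_tanimoto(fp1, fp2):
    # Tanimoto(A, B) = popcount(A AND B) / popcount(A OR B)
    a_and_b = sum(bin(x & y).count("1") for x, y in zip(fp1, fp2))
    a_or_b = sum(bin(x | y).count("1") for x, y in zip(fp1, fp2))
    return a_and_b / a_or_b if a_or_b else 0.0

query_fp = b"\x40"
for id, target_fp in [("A1", b"\x44"), ("B2", b"\x6c"), ("D4", b"\xea")]:
    print(id, byte_tanimoto(query_fp, target_fp))
```

This prints 0.5 for A1, 0.25 for B2, and 0.2 for D4: the fewer extra bits a matching target has, the higher its score.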

I also plan to support start/end parameters, to search only a subset of the arena.

There’s also a search.contains_arena() function which takes a query arena instead of only a query fingerprint as the query, and returns a SearchResults:

>>> results = search.contains_arena(arena, arena)
>>> results
<chemfp.search.SearchResults object at 0x10195c2b8>
>>> for result in results:
...   print(result.get_ids_and_scores())
...
[('A1', 0.0), ('B2', 0.0)]
[('B2', 0.0)]
[('C3', 0.0)]
[('D4', 0.0)]


I don’t think the NxN version of the “contains” search is all that useful, so there’s no function for that case.

The implementation doesn’t yet support OpenMP, so contains_arena() is only slightly faster than multiple calls to contains_fp().

## Substructure screening with RDKit¶

In this section you’ll learn how to use RDKit’s pattern fingerprint for substructure screening.

RDKit has a fingerprint tuned for substructure search, though it’s marked as ‘experimental’ and subject to change. This is the “pattern” fingerprint.

I’ll use it to make a screen for one of the PubChem files. Normally you would start with something like:

% rdkit2fps --pattern Compound_048500001_049000000.sdf.gz -o pubchem_screen.fpb


but that only gives me the identifiers and fingerprints. I want to show some of the structure as well, so I’ll do a bit of a cheat - I’ll use an augmented identifier which is the PubChem id, followed by a space, followed by the SMILES string.

I can do this because chemfp supports almost anything as the “identifier”, except newline, tab, and the NUL character, and because I don’t need to support id lookup.
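
That constraint is easy to check. A small sketch; the helper name is my own, not part of chemfp:

```python
def is_valid_chemfp_id(id):
    # FPS records are one per line with tab-separated fields, so an
    # identifier may not contain a tab, a newline, or the NUL character.
    return not any(c in id for c in "\t\n\0")

print(is_valid_chemfp_id("48650571 CCCOCC(=O)NCc1ccccc1"))  # True
print(is_valid_chemfp_id("bad\tid"))                        # False
```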

However, I have to write Python code to generate the augmented identifiers:

import chemfp

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
T = fptype.toolkit

with chemfp.open_fingerprint_writer("pubchem_screen.fpb",
                                    metadata=fptype.get_metadata()) as writer:
    with T.read_ids_and_molecules("Compound_048500001_049000000.sdf.gz") as reader:
        for id, mol in reader:
            smiles = T.create_string(mol, "smistring")  # use the isomeric SMILES string
            fp = fptype.compute_fingerprint(mol)
            # Create an "identifier" of the form:
            #    PubChem id + " " + canonical SMILES string
            writer.write_fingerprint(id + " " + smiles, fp)


Now that I have the screen, I’ll write some code to actually do the screening. I’ll make this an interactive prompt, which asks for the query SMILES string (or “quit” or “exit” to quit), parses the SMILES to a molecule, generates the fingerprint, does the screen, and displays the first 10 results:

from __future__ import print_function # Only for Python 2
import itertools
import chemfp
import chemfp.search

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")
T = fptype.toolkit

screen = chemfp.load_fingerprints("pubchem_screen.fpb")
print("Loaded", len(screen), "screen fingerprints")

while 1:
    # Ask for the query SMILES string
    query = input("Query? ")  # use "raw_input()" for Python 2.7
    if query in ("quit", "exit"):
        break

    # See if it's a valid SMILES
    mol = T.parse_molecule(query, "smistring", errors="ignore")
    if mol is None:
        print("Could not parse query")
        continue

    # Compute the fingerprint and do the substructure screening
    fp = fptype.compute_fingerprint(mol)
    result = chemfp.search.contains_fp(fp, screen)

    # Print the results, up to 10.
    n = len(result)
    if n > 10:
        print(len(result), "matches. First 10 displayed")
        n = 10
    else:
        print(len(result), "matches.")

    for augmented_id in itertools.islice(result.iter_ids(), 0, n):
        id, smiles = augmented_id.split()
        print(id, "=>", smiles)
    print()


(In case you haven’t seen it before, the “itertools.islice()” gives me an easy way to get up to the first N items from an iterator.)
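
For example, islice works even on an unbounded iterator, pulling only the items it needs:

```python
import itertools

def count_up():
    # An endless generator; we only ever want a few items from it.
    n = 0
    while True:
        yield n
        n += 1

print(list(itertools.islice(count_up(), 0, 5)))  # [0, 1, 2, 3, 4]
```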

I’ll try out the above code:

Loaded 5208 screen fingerprints
Query? c1ccccc1
12376 matches. First 10 displayed
48650571 => CCCOCC(=O)NCc1ccccc1
48672998 => CCCOCC(=O)NOCc1ccccc1
48845178 => C=C(Br)CNC(=S)Nc1ccccc1
48548090 => CCNC(=O)N/N=C/c1ccc(C)cc1
48654127 => CCCOCC(=O)NCCSc1ccccc1
48548029 => CCNC(=O)N/N=C/c1cccc(C)c1
48685277 => COCC(C)CNC(=O)c1ccccc1
48915892 => CNC(=O)NCCc1ccccc1Br
48653583 => CCCOCC(=O)N(C)c1ccccc1
48650670 => CCCOCC(=O)Nc1cccc(C)c1

Query? c1ccccc1O
4946 matches. First 10 displayed
48548137 => CCNC(=O)N/N=C/c1cccc(OC)c1
48651969 => CCCOCC(=O)NCc1cccc(OC)c1
48980706 => CCCCNC(=O)CCCc1ccc(OC)cc1
48661290 => CCCOCC(=O)Nc1cccc(OCC)c1
48653813 => CCCOCC(=O)NCCOc1cccc(C)c1
48651499 => CCCOCC(=O)NCc1ccccc1OC
48981063 => COc1ccc(CCCC(=O)NCC(C)C)cc1
48659995 => CCCOCC(=O)Nc1cccc(OCC#N)c1
48916672 => CCCCCCOc1cccc(/C=N/NC(N)=O)c1
48653272 => CCCOCC(=O)NCCc1ccccc1OC

Query? c1ccccc1I
10 matches.
48731386 => Cc1cc(CNC(=O)c2ccc(I)cc2)on1
48671550 => NC(=O)Cc1ccc(OCC(=O)Nc2ccc(I)cc2)cc1
48731482 => Cc1cc(CNC(=O)c2cccc(I)c2)on1
48731331 => Cc1cc(CNC(=O)c2ccccc2I)on1
48741344 => CN(C)C(=O)c1cccc(NC(=O)Nc2ccc(I)cc2)c1
48584231 => O=C(Nc1cccc(COCC2CC2)c1)c1ccccc1I
48688164 => CC1CN(C(=O)c2ccc(I)cc2)CC(C)(C)O1
48688205 => CC1CN(C(=O)c2cccc(I)c2)CC(C)(C)O1
48946427 => N#CC1CCN(S(=O)(=O)c2ccc(I)cc2)CC1
48522115 => CC1(C)COCCN1C(=O)c1ccc(F)cc1I

Query? Fc1c(F)c(F)c(F)c(F)c1F
3 matches.
48759600 => O=C(Nc1cccnc1)Nc1c(F)c(F)c(F)c(F)c1F
48980959 => Cc1cccc2cc(C(=O)Nc3ccc(F)cc3F)oc12
48981022 => Cc1cccc2cc(C(=O)Nc3c(F)cccc3F)oc12

Query? quit


Looks reasonable.

It’s not hard to add full substructure matching, but it requires toolkit-specific code. Chemfp doesn’t try to abstract that detail, and I’m not sure it should be part of chemfp. Instead, I’ll write some RDKit-specific code. Chemfp uses native toolkit molecules, so there’s actually only a single line of RDKit code.

I’ll also completely rewrite the code so it takes the query string on the command-line, reports all of the screening results, identifies the true positives, and then does a brute-force verification that the screen results are correct. Oh, and report statistics:

# This program is called 'search.py'
from __future__ import print_function # Only for Python 2
import sys
import chemfp
import chemfp.search
from chemfp import rdkit_toolkit as T # Will only work with RDKit
import time

fptype = chemfp.get_fingerprint_type("RDKit-Pattern fpSize=1024")

screen = chemfp.load_fingerprints("pubchem_screen.fpb")

if len(sys.argv) != 2:
    raise SystemExit("Usage: %s <smiles>" % (sys.argv[0],))

query_smiles = sys.argv[1]

start_time = time.time()
try:
    query_mol = T.parse_molecule(query_smiles, "smistring")
except ValueError as err:
    raise SystemExit(str(err))

# Compute the fingerprint and do the substructure screening
fp = fptype.compute_fingerprint(query_mol)
result = chemfp.search.contains_fp(fp, screen)
search_time = time.time()

num_matches = 0

for augmented_id in result.get_ids():
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):   # RDKit specific!
        print(id, "matches", smiles)
        num_matches += 1
    else:
        print(id, "       ", smiles)
report_time = time.time()

# Report the results
print()
print("= Screen search =")
print("num targets:", len(screen))
print("screen size:", len(result))
print("num matches:", num_matches)
print("screenout: %.1f%%" % (100.0 * (len(screen)-len(result)) / len(screen),))
if len(result) == 0:
    precision = 100.0
else:
    precision = (100.0*num_matches) / len(result)
print("precision: %.1f%%" % (precision,))
print("screen time: %.2f" % (search_time - start_time,))
print("atom-by-atom-search and report time: %.2f" % (report_time - search_time,))
print("total time: %.2f" % (report_time - start_time,))

# Redo the computations without any screening
num_actual = 0
actual_start_time = time.time()
for augmented_id in screen.ids:
    id, smiles = augmented_id.split()
    target_mol = T.parse_molecule(smiles, "smistring")
    if target_mol.HasSubstructMatch(query_mol):   # RDKit specific!
        num_actual += 1
actual_end_time = time.time()

print()
print("= Brute force search =")
print("num matches:", num_actual)
print("time to test all molecules: %.2f" % (actual_end_time - actual_start_time,))
print("screening speedup: %.1f" % ((actual_end_time - actual_start_time) / (report_time - start_time),))


Here’s the output with ‘c1ccccc1O’ on the command-line:

% python search.py c1ccccc1O
48548137 matches CCNC(=O)N/N=C/c1cccc(OC)c1
48651969 matches CCCOCC(=O)NCc1cccc(OC)c1
48980706 matches CCCCNC(=O)CCCc1ccc(OC)cc1
48661290 matches CCCOCC(=O)Nc1cccc(OCC)c1
48653813 matches CCCOCC(=O)NCCOc1cccc(C)c1
... many lines omitted ...
48930672 matches CS(=O)(=O)c1ccc(Oc2nc(C3CC3)nc3sc4c(c23)CCCC4)cc1
48673774 matches COc1ccc(C)cc1-n1ccc(C(=O)N2CCc3[nH]c4ccccc4c3C2)n1
48551088 matches Cc1cc(C)c(CN2C(=O)NC3(CCOc4ccccc43)C2=O)c(C)c1
48944841 matches CC(C)(CNS(=O)(=O)CC12CCC(CC1=O)C2(C)C)c1ccc2c(c1)OCO2
48729925 matches O=C(Cn1c(-c2ccccc2)noc1=O)Nc1ccc2c(c1)OC1(CCCC1)O2

= Screen search =
num targets: 14967
screen size: 4946
num matches: 4943
screenout: 67.0%
precision: 99.9%
screen time: 0.00
atom-by-atom-search and report time: 2.99
total time: 3.00

= Brute force search =
num matches: 4943
time to test all molecules: 5.00
screening speedup: 1.7


It’s a relief to see that the versions with and without the screen give the same number of matches!
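
The reported percentages are simple ratios, which can be checked against the numbers above:

```python
# Numbers from the 'c1ccccc1O' run above
num_targets = 14967   # fingerprints in the screen
screen_size = 4946    # passed the fingerprint screen
num_matches = 4943    # confirmed by atom-by-atom search

# Fraction of targets rejected by the screen, and fraction of
# screen survivors which were true matches.
screenout = 100.0 * (num_targets - screen_size) / num_targets
precision = 100.0 * num_matches / screen_size
print("screenout: %.1f%%" % screenout)   # screenout: 67.0%
print("precision: %.1f%%" % precision)   # precision: 99.9%
```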

Next, ‘c1ccccc1I’ (that’s iodobenzene):

% python search.py 'c1ccccc1I'
48731386 matches Cc1cc(CNC(=O)c2ccc(I)cc2)on1
48671550 matches NC(=O)Cc1ccc(OCC(=O)Nc2ccc(I)cc2)cc1
48731482 matches Cc1cc(CNC(=O)c2cccc(I)c2)on1
48731331 matches Cc1cc(CNC(=O)c2ccccc2I)on1
48741344 matches CN(C)C(=O)c1cccc(NC(=O)Nc2ccc(I)cc2)c1
48584231 matches O=C(Nc1cccc(COCC2CC2)c1)c1ccccc1I
48688164 matches CC1CN(C(=O)c2ccc(I)cc2)CC(C)(C)O1
48688205 matches CC1CN(C(=O)c2cccc(I)c2)CC(C)(C)O1
48946427 matches N#CC1CCN(S(=O)(=O)c2ccc(I)cc2)CC1
48522115 matches CC1(C)COCCN1C(=O)c1ccc(F)cc1I

= Screen search =
num targets: 14967
screen size: 10
num matches: 10
screenout: 99.9%
precision: 100.0%
screen time: 0.01
atom-by-atom-search and report time: 0.01
total time: 0.02

= Brute force search =
num matches: 10
time to test all molecules: 5.17
screening speedup: 281.4


Now for some bad news. Try ‘[Pu]’. This doesn’t screen out many structures yet has no matches. I’ll report the search statistics:

= Screen search =
num targets: 14967
screen size: 14967
num matches: 0
screenout: 0.0%
precision: 0.0%
screen time: 0.00
atom-by-atom-search and report time: 8.40
total time: 8.40

= Brute force search =
num matches: 0
time to test all molecules: 5.24
screening speedup: 0.6


That’s horrible! It’s slower! What happened is that ‘[Pu]’ generates a fingerprint with only two bits set:

% echo '[Pu] plutonium' | rdkit2fps --pattern --fpSize 1024
#FPS1
#num_bits=1024
#type=RDKit-Pattern/4 fpSize=1024
#software=RDKit/2019.09.1 chemfp/3.4
#date=2020-05-13T12:12:48
00000000200000000000000000000000000000000000000000000000000000000000000000000000
00008000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000      plutonium


You know, that’s really hard to see. I’ll use a bit of perl to replace the zeros with “.”s:

% echo '[Pu] plutonium' | rdkit2fps --pattern --fpSize 1024 \
      | perl -pe 's/0/./g'
#FPS1
#num_bits=1.24
#type=RDKit-Pattern/4 fpSize=1.24
#software=RDKit/2.19..9.1 chemfp/3.4
#date=2.2.-.5-13T12:15:19
........2.......................................................................
....8...........................................................................
................................................................................
................      plutonium


Ha! And it converted zeros in the header lines to “.” (and it would have converted any zeros in the identifier). I’ll just omit the header lines in the following.
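
If that bothers you, a few lines of Python can restrict the substitution to the fingerprint field, leaving the header lines and identifiers alone. A sketch; the dotify() helper is my own, not part of chemfp:

```python
import sys

def dotify(line):
    # Replace '0' with '.' in the hex fingerprint field only.
    # Header lines start with '#'; in a data line the fingerprint
    # is the first tab-separated field.
    if line.startswith("#"):
        return line
    fields = line.split("\t")
    fields[0] = fields[0].replace("0", ".")
    return "\t".join(fields)

for line in ["#num_bits=1024\n", "0040a0\tplutonium\n"]:
    sys.stdout.write(dotify(line))
```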

Unfortunately, so many other structures also set those two bits that it isn’t an effective screen for plutonium.

## Reading structure fingerprints using a toolkit¶

In this section you’ll learn how to use a chemistry toolkit to compute fingerprints from a given structure file.

What happens if you’re given a structure file and you want to find the two nearest matches in an FPS file? You’ll have to generate the fingerprints for the structures in the structure file, then do the comparison.

For this section you’ll need to have a chemistry toolkit. I’ll use the “chebi_maccs.fps” file generated in Using a toolkit to process the ChEBI dataset as the targets, and the PubChem file Compound_099000001_099500000.sdf.gz as the source of query structures:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> from chemfp import search
>>> targets = chemfp.load_fingerprints("chebi_maccs.fps")
>>> queries = chemfp.read_molecule_fingerprints(
...     targets.metadata, "Compound_099000001_099500000.sdf.gz")
>>> for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=2, threshold=0.0):
...   print(query_id, "=>", end=" ")
...   for (target_id, score) in hits.get_ids_and_scores():
...     print("%s %.3f" % (target_id, score), end=" ")
...   print()
...
99000039 => CHEBI:116650 0.870 CHEBI:105034 0.812
99000230 => CHEBI:120636 0.840 CHEBI:127468 0.839
99002251 => CHEBI:92604 0.756 CHEBI:92191 0.733
99003537 => CHEBI:112376 0.745 CHEBI:32193 0.696
99003538 => CHEBI:112376 0.745 CHEBI:32193 0.696
... many, many lines omitted ...


That’s it! Pretty simple, wasn’t it? I didn’t even need to explicitly specify which toolkit I wanted to use, because read_molecule_fingerprints() got that information from the arena’s Metadata.

The new function is chemfp.read_molecule_fingerprints(), which reads a structure file and generates the appropriate fingerprint for each record. Its first parameter is the metadata used to configure the reader. In my case it’s:

>>> print(targets.metadata)
#num_bits=166
#type=OpenBabel-MACCS/2
#software=OpenBabel/3.0.0 chemfp/3.4
#source=ChEBI_lite.sdf.gz
#date=2020-05-12T10:09:35


The metadata’s “type” told chemfp which toolkit to use to read molecules, and how to generate fingerprints from those molecules.

You can pass in your own metadata as the first parameter to read_molecule_fingerprints, and as a shortcut, if you pass in a string then it will be used as the fingerprint type.

For example, if you have OpenBabel installed then you can do:

>>> import chemfp
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints(
...     "OpenBabel-MACCS166", "Compound_099000001_099500000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
...   print(id, bitops.hex_encode(fp))
...   if i == 3:
...     break
...
99000039 000004000000300001c0004e9361b041dce1676e1f
99000230 000000800100649f0445a7fe2aeab1eb8f6bdfff1f
99002251 00000000001132000088004985601140dce4e3fe1f
99003537 00000000200020000156149a906994830c3159ae1f


If you have OEChem and OEGraphSim installed and licensed then you can do:

>>> import chemfp
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints(
...     "OpenEye-MACCS166", "Compound_099000001_099500000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
...   print(id, bitops.hex_encode(fp))
...   if i == 3:
...     break
...
99000039 000004000000300001c0404e93e19053dca06b6e1b
99000230 000000880100648f0445a7fe2aeab1738f2a5b7e1b
99002251 00000000001132000088404985e01152dca46b7e1b
99003537 00000000200020000156149a90e994938c30592e1b


If you have RDKit installed then you can do:

>>> import chemfp
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints(
...     "RDKit-MACCS166", "Compound_099000001_099500000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
...   print(id, bitops.hex_encode(fp))
...   if i == 3:
...     break
...
99000039 000004000000300001c0004e9361b051dce1676e1f
99000230 000000800100649f0445a7fe2aeab1fb8f6bdfff1f
99002251 00000000001132000088004985601150dce4e3fe1f
99003537 00000000200020000156149a906994930c3159ae1f


And if you have the CDK JAR file on your CLASSPATH and the JPype Python/Java adapter installed then you can do:

>>> import chemfp
>>> from chemfp import bitops
>>> reader = chemfp.read_molecule_fingerprints(
...     "CDK-MACCS", "Compound_099000001_099500000.sdf.gz")
>>> for i, (id, fp) in enumerate(reader):
...   print(id, bitops.hex_encode(fp))
...   if i == 3:
...     break
...
99000039 000004000000300001c4004e83e1b053dce16f6e1f
99000230 000000800100649f0645a7fe2aeab1eb8febffff1f
99002251 00000000001132000088004985e01172dce4ebfe1f
99003537 00000000200020000356149a80e994930cb179ae1f


## Select a fingerprint subset using a list of indices¶

In this section you’ll learn how to make a new arena given a list of indices for the fingerprints to select from an old arena.

For this section, one example uses a list of indices which is a randomly selected subset of the indices in the arena. If that’s your goal, see the next section, Sample N fingerprints at random, to learn how to use FingerprintArena.sample(). If you want to split the arena into a training set and a test set, see the section after that, Split into training and test sets, which shows how to use FingerprintArena.train_test_split().

A FingerprintArena slice creates a subarena. Technically speaking, this is a “view” of the original data. The subarena doesn’t actually copy its fingerprint data from the original arena. Instead, it uses the same fingerprint data, but keeps track of the start and end position of the range it needs. This is why it’s not possible to slice with a step size other than +1.

This also means that memory for a large arena won’t be freed until all of its subarenas are also removed.
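
The behavior is much like Python’s memoryview, where a slice shares the underlying buffer instead of copying it. A rough analogy:

```python
# A bytearray and a memoryview slice of it share the same buffer,
# much like an arena and its subarena share fingerprint data.
data = bytearray(b"ABCDEFGH")
view = memoryview(data)[2:6]   # a "view": no bytes are copied

data[2] = ord("x")             # changing the original...
print(bytes(view))             # ...shows up in the view: b'xDEF'
```

And just as with a subarena, the full bytearray cannot be garbage-collected while the view still references it.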

You can see some evidence for this because a FingerprintArena stores the entire fingerprint data as a byte string in an attribute named arena:

>>> import chemfp
>>> targets = chemfp.load_fingerprints("pubchem_targets.fps")
>>> subset = targets[10:20]
>>> targets.arena is subset.arena
True


This shows that the targets and subset share the same raw data set. At least it does to me, the person who wrote the code.

You can ask an arena or subarena to make a copy. This allocates new memory for the new arena and copies all of its fingerprints there.

>>> new_subset = subset.copy()
>>> len(new_subset) == len(subset)
True
>>> new_subset.arena is subset.arena
False
>>> subset[7][0]
'48637548'
>>> new_subset[7][0]
'48637548'


The FingerprintArena.copy() method can do more than just copy the arena. You can give it a list of indices (or an iterable) and it will only copy those fingerprints:

>>> three_targets = targets.copy([3112, 0, 1234])
>>> three_targets.ids
['48942244', '48568841', '48628197']
>>> [targets.ids[3112], targets.ids[0], targets.ids[1234]]
['48628197', '48942244', '48568841']


Are you confused about why the identifiers aren’t in the same order? That’s because when you specify indices, the copy automatically reorders them by popcount and stores the popcount information. This requires a bit of extra overhead to sort, but makes future searches faster. Use reorder=False to leave the order unchanged:

>>> my_ordering = targets.copy([3112, 0, 1234], reorder=False)
>>> my_ordering.ids
['48628197', '48942244', '48568841']


Suppose you want to partition the data set into two parts; one containing the fingerprints at positions 0, 2, 4, … and the other containing the fingerprints at positions 1, 3, 5, …. The range() function returns an iterator of the right length, and you can have it start from either 0 or 1 and count by twos, like this:

>>> list(range(0, 10, 2))
[0, 2, 4, 6, 8]
>>> list(range(1, 10, 2))
[1, 3, 5, 7, 9]


so the following will create the correct indices and from that the correct arena subsets:

>>> evens = targets.copy(range(0, len(targets), 2))
>>> odds = targets.copy(range(1, len(targets), 2))
>>> len(evens)
7484
>>> len(odds)
7483


(Use FingerprintArena.train_test_split() if you want to select two disjoint subsets selected at random without replacement.)

What about getting a random subset of the data? I want to select m records at random, without replacement, to make a new data set. (See the next section for a better way to do this using FingerprintArena.sample().)

You can see this just means making a list with m different index values. Python’s built-in random.sample function makes this easy:

>>> import random
>>> random.sample("abcdefgh", 3)
['b', 'h', 'f']
>>> random.sample("abcdefgh", 2)
['d', 'a']
>>> random.sample([5, 6, 7, 8, 9], 2)
[7, 9]
>>> help(random.sample)
Help on method sample in module random:

sample(population, k) method of random.Random instance
    Chooses k unique random elements from a population sequence or set.
    ...
    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(range(10000000), 60)


The last line of the help points out what to do next:

>>> random.sample(range(len(targets)), 5)
[610, 2850, 705, 1402, 2635]
>>> random.sample(range(len(targets)), 5)
[1683, 2320, 1385, 2705, 1850]


(Note: on Python 2.7 you’ll need to use “xrange()” not “range()”.)

Putting it all together, and here’s how to get a new arena containing 100 randomly selected fingerprints, without replacement, from the targets arena:

>>> sample_indices = random.sample(range(len(targets)), 100)
>>> sample = targets.copy(indices=sample_indices)
>>> len(sample)
100


But really, see the next section for an easier way to do this.

## Sample N fingerprints at random¶

In this section you’ll learn how to select a random subset of the fingerprints in an arena.

The previous section showed how to use the FingerprintArena.copy() method to create a new arena containing a randomly selected subset of the fingerprints in an arena. This required writing some code to generate the randomly sampled indices.

Chemfp 3.4.1 added the method FingerprintArena.sample() which lets you make a random sample using a single call:

>>> import chemfp
>>> sample_arena = targets.sample(10000)
>>> len(sample_arena)
10000
>>> sample_arena.ids[:5]
['48941399', '48940284', '48943050', '48656867', '48839855']


If you do the sample a few times you’ll see that many of the elements occur often:

>>> targets.sample(10000).ids[:5]
['48942244', '48656867', '48966209', '48946425', '48946734']
>>> targets.sample(10000).ids[:5]
['48942244', '48940284', '48656359', '48839855', '48946668']
>>> targets.sample(10000).ids[:5]
['48942244', '48940284', '48656359', '48656867', '48839855']
>>> targets.sample(10000).ids[:5]
['48940284', '48656359', '48656867', '48839855', '48946668']


This is for two reasons. First, the sample size is about 2/3rds of the size of the data set:

>>> len(targets)
14967


which means there’s a roughly 2/3rds chance that a given record will be in the sample. Second, by default the sampled fingerprints are reordered by popcount when making the arena, which means many of the first few identifiers are the same.
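
You can check that fraction directly:

```python
# Probability that a specific record is included when sampling
# 10,000 of 14,967 records without replacement.
p = 10000 / 14967
print(round(p, 3))  # 0.668
```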

Set reorder to False to keep the fingerprints in random sample order:

>>> targets.sample(10000, reorder=False).ids[:5]
['48830242', '48946583', '48559359', '48836764', '48692192']
>>> targets.sample(10000, reorder=False).ids[:5]
['48868183', '48703234', '48577832', '48913224', '48659805']
>>> targets.sample(10000, reorder=False).ids[:5]
['48965603', '48596355', '48691077', '48688289', '48940955']
>>> targets.sample(10000, reorder=False).ids[:5]
['48560433', '48933559', '48662000', '48958077', '48675138']


Remember that similarity search performance is better if the fingerprints are sorted by popcount.

The above examples used num_samples=10000. If num_samples is an integer, then it’s used as the number of samples to make. (Chemfp raises a ValueError if the size is negative or too large.) If num_samples is a float between 0.0 and 1.0 inclusive then it’s used as the fraction of the dataset to sample. For example, the following samples 10% of the arena, rounded down:

>>> len(targets.sample(0.1))
1496
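The int-vs-float handling can be sketched in plain Python. This is a simplified stand-in for what sample() does, not chemfp's actual implementation, and the helper name is my own:

```python
def resolve_num_samples(num_samples, population_size):
    """Interpret num_samples the way the text describes: a float in
    [0.0, 1.0] is a fraction (rounded down), an int is a count."""
    if isinstance(num_samples, float):
        if not (0.0 <= num_samples <= 1.0):
            raise ValueError("fraction must be between 0.0 and 1.0")
        return int(num_samples * population_size)
    if num_samples < 0 or num_samples > population_size:
        raise ValueError("size is negative or too large")
    return num_samples

print(resolve_num_samples(10000, 14967))  # 10000
print(resolve_num_samples(0.1, 14967))    # 1496, matching len(targets.sample(0.1))
```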


If no rng is given then the underlying implementation uses Python’s random.sample function. That in turn uses a random number generator (RNG) which was initialized with a hard-to-guess seed.

If you need a reproducible sample, you can pass in an integer rng value. This is used to seed a new RNG for the sampling. In the following example, using the same seed always returns the same fingerprints:

>>> targets.sample(2, rng=123).ids
['48651340', '48778262']
>>> targets.sample(2, rng=123).ids
['48651340', '48778262']
>>> targets.sample(2, rng=789).ids
['48693989', '48507089']
>>> targets.sample(2, rng=789).ids
['48693989', '48507089']


Another option is to pass in a random.Random() instance, which will be used directly as the RNG:

>>> import random
>>> my_rng = random.Random(123)
>>> targets.sample(2, rng=my_rng).ids
['48651340', '48778262']
>>> targets.sample(2, rng=my_rng).ids
['48730072', '48908385']
>>> targets.sample(2, rng=my_rng).ids
['48690445', '48502715']


This may be useful if you need to make several random samples, want reproducibility, and only want to specify one RNG seed. (Be aware that Python’s RNG may be subject to change across different versions of Python.)
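The difference between passing a seed and passing an RNG instance can be seen with random.sample directly (pure Python, made-up ids, no chemfp):

```python
import random

ids = [str(n) for n in range(100)]  # hypothetical identifiers

# A fresh seed each call: the same seed always gives the same sample.
assert random.Random(123).sample(ids, 2) == random.Random(123).sample(ids, 2)

# A shared RNG instance advances its internal state, so repeated calls
# continue the same stream and (almost always) give different samples.
my_rng = random.Random(123)
first = my_rng.sample(ids, 2)
second = my_rng.sample(ids, 2)
print(first, second)
```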

## Split into training and test sets¶

In this section you’ll learn how to split an arena into two disjoint arenas, which can be then be used as a training set and a test set.

The previous section, Sample N fingerprints at random showed how to use chemfp to select N fingerprints at random from an arena. Sometimes you need two randomly selected subsets, with no overlap between the two. For example, one might be used as a training set and the other as a test set.

Chemfp 3.4.1 added the method FingerprintArena.train_test_split() which does that. You give it the number of fingerprints you want in the training set and/or the test set, and it returns two arenas; the first is the training set and the second is the test set:

>>> import chemfp
>>> len(targets)
14967
>>> train_arena, test_arena = targets.train_test_split(train_size=10, test_size=5)
>>> len(train_arena)
10
>>> len(test_arena)
5


This function is modeled on the scikit-learn function train_test_split(), which allows the sizes to be specified as an integer count or a floating point fraction.

If a specified size is an integer, it is interpreted as the number of fingerprints to have in the corresponding set. If a specified size is a float between 0.0 and 1.0 inclusive then it’s interpreted as the fraction of fingerprints to select. For example, the following puts 10% of the fingerprints into the training arena and 20 fingerprints into the test arena:

>>> train_arena, test_arena = targets.train_test_split(train_size=0.1, test_size=20)
>>> len(train_arena), len(test_arena)
(1496, 20)


If you specify neither size then the training set gets 75% of the fingerprints and the test set gets the rest:

>>> train_arena, test_arena = targets.train_test_split()
>>> len(train_arena), len(test_arena)
(11226, 3741)


If only one of train_size or test_size is specified then the other value is interpreted as the complement size, so the entire arena is split into the two sets. In the following, 75% of the fingerprints are put into the training arena and 25% into the test arena:

>>> train_arena, test_arena = targets.train_test_split(train_size=0.75)
>>> len(train_arena), len(test_arena)
(11225, 3742)


It is better to let chemfp figure out the complement size than to specify both sizes as floats, because integer rounding may cause a fingerprint to be left out (the test arena size is 3741 in the following when it should be 3742):

>>> train_arena, test_arena = targets.train_test_split(train_size=0.75, test_size=0.25)
>>> len(train_arena), len(test_arena)
(11225, 3741)
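The lost fingerprint is simple integer arithmetic; here is the rounding worked through in plain Python with the sizes from the text:

```python
n = 14967

# Both sizes given as floats: each fraction is rounded down independently.
train_size = int(0.75 * n)    # 11225
test_size = int(0.25 * n)     # 3741
print(train_size + test_size)  # 14966: one fingerprint is lost to rounding

# Specifying only train_size lets the complement pick up the remainder.
complement = n - train_size    # 3742
print(train_size + complement == n)  # True
```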


By default, after the random sampling the fingerprints in each set are reordered by population count and indexed for fast similarity search.

>>> from chemfp import bitops
>>> train_arena, test_arena = targets.train_test_split(10, 10)
>>> [bitops.byte_popcount(train_arena.get_fingerprint(i)) for i in range(10)]
[71, 118, 119, 145, 146, 159, 162, 167, 176, 196]
>>> [bitops.byte_popcount(test_arena.get_fingerprint(i)) for i in range(10)]
[87, 116, 117, 121, 129, 131, 139, 183, 184, 193]


To keep the fingerprints in random sample order, specify reorder=False:

>>> train_arena, test_arena = targets.train_test_split(10, 10, reorder=False)
>>> [bitops.byte_popcount(train_arena.get_fingerprint(i)) for i in range(10)]
[118, 53, 170, 110, 138, 169, 129, 125, 129, 151]
>>> [bitops.byte_popcount(test_arena.get_fingerprint(i)) for i in range(10)]
[172, 167, 123, 152, 147, 162, 156, 197, 45, 151]


The rng parameter affects how the fingerprints are sampled. By default (if rng=None), Python’s default RNG is used. If rng is an integer then it’s used as the seed for a new random.Random() instance. Otherwise it’s assumed to be an RNG object and its sample() method is used to make the sample.

The rng parameter here works the same as in FingerprintArena.sample() so for examples see the previous section, Sample N fingerprints at random.
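The core idea of a disjoint split is short enough to sketch in plain Python over index positions. This is a toy illustration of the concept, not chemfp's implementation; the function name is my own:

```python
import random

def train_test_split_indices(n, train_size, test_size, rng):
    # Sample train_size + test_size distinct indices in one draw,
    # then split them: the two halves cannot overlap.
    picked = rng.sample(range(n), train_size + test_size)
    return picked[:train_size], picked[train_size:]

rng = random.Random(123)
train_idx, test_idx = train_test_split_indices(14967, 10, 5, rng)
print(len(train_idx), len(test_idx))   # 10 5
print(set(train_idx) & set(test_idx))  # set(): the two sets are disjoint
```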

## Don’t reorder an arena by popcount¶

In this section you’ll learn about why you might want to store your fingerprints in specific order, rather than being ordered by population count.

The previous section showed how to make an arena where the fingerprints are in a user-specified order:

>>> import chemfp
>>> [targets.ids[i] for i in [3112, 0, 1234]]
['48628197', '48942244', '48568841']
>>> targets.copy([3112, 0, 1234], reorder=False).ids
['48628197', '48942244', '48568841']
>>> targets.copy([3112, 0, 1234], reorder=True).ids
['48942244', '48568841', '48628197']


If the reorder option is not specified, the fingerprints in the new arena will be in popcount order. Similarity search is faster when the arena is in popcount order because it lets chemfp make an index of the different regions, based on popcount, and use that for sublinear search.

Why would someone want search to be slower?

Sometimes data organization is more important. For one client I developed a SEA implementation, where I compared a set of query fingerprints against about 50 sets of target fingerprints. The largest set had only a few thousand fingerprints, so the overall search was fast even without a popcount index.

I could have stored each target data set as its own file, but that would have resulted in about 50 data files to manage, in addition to the original fingerprint file and the configuration file containing the information about which identifiers are in which set.

Instead, I stored all of the target data sets in a single FPB file, where the fingerprints for the first set came first, then the fingerprints for the second set, and so on. I also made a range file to store the set name and the start/end range of that set in the FPB file. This reduced 50 files down to two, which was much easier to manage.

It’s a bit fiddly to go through the details of how this works, because it requires set membership information which is a bit complicated to extract and which won’t be used for the rest of this documentation. Instead of walking through an example here, I’ll refer you to my essay ChEMBL target sets association network.

You can use the subranges directly as an arena slice, like arena[54:91] as the target. This will work, but as I said earlier, the search time will be slower because the sublinear algorithm requires a popcount index.

If you need that search performance then during load time make a copy of the slice, as in arena[54:91].copy(reorder=True), and use that as the target.

A few paragraphs ago I wrote that “I stored all of the target data sets in a single FPB file.” When you load an FPB format, the fingerprint order will be exactly as given in the file. However, if you load fingerprints from an FPS file, the fingerprints are by default reordered. For example, given this data set:

% cat unordered_example.fps
#FPS1
0001  Record1
ffee  Record2
00f0  Record3


I’ll load it into chemfp and show that by default the records are in the order 1, 3, 2:

>>> import chemfp
>>> chemfp.load_fingerprints("unordered_example.fps").ids
['Record1', 'Record3', 'Record2']


On the other hand, if I ask it to not reorder then the records are in the input order, which is 1, 2, 3:

>>> chemfp.load_fingerprints("unordered_example.fps", reorder=False).ids
['Record1', 'Record2', 'Record3']


In short, if you want to preserve the fingerprint order as given in the input file then use the reorder=False argument in chemfp.load_fingerprints().

## Look up a fingerprint with a given id¶

In this section you’ll learn how to get a fingerprint record with a given id. You will need the “pubchem_targets.fps” file generated in Generate fingerprint files from PubChem SD tags in order to do this yourself.

All fingerprint records have an identifier and a fingerprint. Identifiers should be unique. (Duplicates are allowed, and if they exist then the lookup code described in this section will arbitrarily decide which record to return. Once made, the choice will not change.)

Let’s find the fingerprint for the record in “pubchem_targets.fps” which has the identifier “48500164”. One solution is to iterate over all of the records in a file, using the FPS reader:

>>> import chemfp
>>> for id, fp in chemfp.open("pubchem_targets.fps"):
...   if id == "48500164":
...     break
... else:
...     raise KeyError("48500164 not found")
...
>>> id, fp[:5]
('48500164', b'\x07\xde\x0c\x00\x00')


I used the somewhat obscure else clause of the for loop. If the loop finishes without breaking, which would happen if the identifier weren’t present, then the else clause raises an exception saying that it couldn’t find the given identifier.

If the fingerprint records are already in a FingerprintArena then there’s a better solution. Use the FingerprintArena.get_fingerprint_by_id() method to get the fingerprint byte string, or None if the identifier doesn’t exist:

>>> arena = chemfp.load_fingerprints("pubchem_targets.fps")
>>> fp = arena.get_fingerprint_by_id("48500164")
>>> fp[:5]
b'\x07\xde\x0c\x00\x00'
>>> missing_fp = arena.get_fingerprint_by_id("does-not-exist")
>>> missing_fp
>>> missing_fp is None
True


Internally this does about what you think it would. It uses the arena’s id list to make a lookup table mapping identifier to index, and caches the table for later use. Given the index, it’s very easy to get the fingerprint.
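That caching strategy is easy to mimic in plain Python. The ids and fingerprints below are made up for illustration; this is a sketch of the idea, not chemfp's actual code:

```python
ids = ["48500164", "48530223", "48533220"]
fingerprints = [b"\x07\xde", b"\x01\x02", b"\xff\x00"]

# Build the id -> index table once, then reuse it for every lookup.
# (With duplicate ids, the dict keeps one of them, arbitrarily but stably,
# which matches the behavior described earlier.)
id_to_index = {id: i for i, id in enumerate(ids)}

def get_fingerprint_by_id(id):
    i = id_to_index.get(id)
    return None if i is None else fingerprints[i]

print(get_fingerprint_by_id("48530223"))        # b'\x01\x02'
print(get_fingerprint_by_id("does-not-exist"))  # None
```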

In fact, you can get the index and do the record lookup yourself:

>>> arena.get_index_by_id("48500164")
8168
>>> arena[8168]
('48500164', b'\x07\xde\x0c\x00\x00 ... rest omitted ...')


## Sorting search results¶

In this section you’ll learn how to sort the search results.

The k-nearest searches return the hits sorted from highest score to lowest, and break ties arbitrarily. This is usually what you want, and the extra cost to sort is small (k*log(k)) compared to the time needed to maintain the internal heap (N*log(k)).
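That heap-then-sort pattern can be sketched with Python's heapq module. The scores are made up for illustration; chemfp's internal implementation is in C, but the cost structure is the same:

```python
import heapq

scores = [0.31, 0.92, 0.55, 0.87, 0.12, 0.92, 0.70]
k = 3

# nlargest maintains a k-element heap while scanning all N scores
# (N*log(k) work), then sorts the k survivors from highest to lowest
# (an extra k*log(k), small by comparison).
top_k = heapq.nlargest(k, scores)
print(top_k)  # [0.92, 0.92, 0.87]
```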

By comparison, the threshold searches return the hits in arbitrary order. Sorting takes up to N*log(N) time, which is extra work for those cases where you don’t want sorted data. If you actually want it sorted, then call SearchResult.reorder() method to sort the hits in-place:

>>> import chemfp
>>> query_fp = arena.get_fingerprint_by_id("48500164")
>>> from chemfp import search
>>> result = search.threshold_tanimoto_search_fp(query_fp, arena, threshold=0.90)
>>> len(result)
5
>>> result.get_ids_and_scores()
[('48530223', 0.9044585987261147), ('48533220', 0.9230769230769231),
('48533212', 0.9299363057324841), ('48500164', 1.0), ('48501504',
0.906832298136646)]
>>>
>>> result.reorder("decreasing-score")
>>> result.get_ids_and_scores()
[('48500164', 1.0), ('48533212', 0.9299363057324841), ('48533220',
0.9230769230769231), ('48501504', 0.906832298136646),
('48530223', 0.9044585987261147)]
>>>
>>> result.reorder("increasing-score")
>>> result.get_ids_and_scores()
[('48530223', 0.9044585987261147), ('48501504', 0.906832298136646),
('48533220', 0.9230769230769231), ('48533212', 0.9299363057324841),
('48500164', 1.0)]


There are currently eight different sort methods, all specified by a name string. These are

• increasing-score - sort by increasing score
• increasing-score-plus - sort by increasing score, breaking ties on increasing index
• decreasing-score - sort by decreasing score
• decreasing-score-plus - sort by decreasing score, breaking ties on increasing index
• increasing-index - sort by increasing target index
• decreasing-index - sort by decreasing target index
• reverse - reverse the current ordering
• move-closest-first - move the hit with the highest score to the first position

The first two sort by increasing score. The first one only considers the scores when sorting, and preserves the original order of the indices for hits with the same score. The second one sorts by increasing score and, if multiple hits have the same score, ties are broken by increasing index. This gives an absolute ordering to the results, independent of history, but is a bit slower than the first option.

The third and fourth options sort by decreasing score. The third one only considers the scores when sorting, and preserves the original order of the indices for hits with the same score. The fourth one sorts by decreasing score and, if multiple hits have the same score, ties are broken by increasing index. This gives an absolute ordering to the results, independent of history, but is a bit slower than the third option.
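The "-plus" tie-breaking can be expressed as a compound sort key. The hits below are made-up (target index, score) pairs; the variable names mirror the ordering strings but this is only an illustration of the sort semantics:

```python
hits = [(0, 0.80), (1, 0.92), (2, 0.80), (3, 0.95)]  # (target index, score)

# decreasing-score-plus: highest score first, ties broken by increasing index.
decreasing_plus = sorted(hits, key=lambda h: (-h[1], h[0]))
print(decreasing_plus)  # [(3, 0.95), (1, 0.92), (0, 0.8), (2, 0.8)]

# increasing-score: Python's sort is stable, so sorting on score alone
# preserves the original relative order of tied hits.
increasing = sorted(hits, key=lambda h: h[1])
print(increasing)  # [(0, 0.8), (2, 0.8), (1, 0.92), (3, 0.95)]
```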

Note

Before chemfp 3.5 the increasing-score and decreasing-score implementations gave an absolute order (the same as increasing-score-plus and decreasing-score-plus do now). This implementation detail was never documented. The behavior was changed to allow a faster sort implementation which does not break ties for the same score.

If you find something useful for increasing-index and decreasing-index, let me know. They sort exactly like their names describe.

The reverse method reverses the current ordering, and is mostly useful if you want to reverse the sorted results from a k-nearest search.

The move-closest-first option exists to improve the leader algorithm stage used by the Taylor-Butina algorithm. The newly seen compound is either in the same cluster as its nearest neighbor or it is the new centroid. I felt it best to implement this as a special reorder term, rather than one of the other possible options.
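The reason move-closest-first can be cheaper than a full sort is that it only needs one scan and one swap. A pure-Python sketch over made-up (target index, score) hits:

```python
hits = [(5, 0.41), (9, 0.88), (2, 0.63)]  # (target index, score)

# Find the position of the highest-scoring hit and swap it to the front;
# the rest of the list stays in its current (arbitrary) order.
best = max(range(len(hits)), key=lambda i: hits[i][1])
hits[0], hits[best] = hits[best], hits[0]
print(hits)  # [(9, 0.88), (5, 0.41), (2, 0.63)]
```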

If you have suggestions for alternate orderings which might help improve your clustering performance, let me know.

If you want to reorder all of the search results then you could use the SearchResult.reorder() method on each result, but it’s easier to use SearchResults.reorder_all() and change everything in a single call. It takes the same ordering names as reorder:

>>> from __future__ import print_function # Only for Python 2
>>> similarity_matrix = search.threshold_tanimoto_search_symmetric(
...                         arena, threshold=0.8)
>>> for query_id, row in zip(arena.ids, similarity_matrix):
...   print(query_id, "->", row.get_ids_and_scores()[:3])
...
48942244 -> []
48941399 -> []
48940284 -> []
48943050 -> []
48656359 -> [('48656867', 0.9761904761904762), ('48656360', 0.9111111111111111), ('48650490', 0.851063829787234)]
48656867 -> [('48656360', 0.8913043478260869), ('48650490', 0.8333333333333334), ('48521769', 0.8)]
48839855 -> [('48839869', 0.9148936170212766), ('48839845', 0.8775510204081632), ('48839868', 0.8269230769230769)]
... lines deleted ....
>>>
>>> similarity_matrix.reorder_all("increasing-score")
>>> for query_id, row in zip(arena.ids, similarity_matrix):
...   print(query_id, "->", row.get_ids_and_scores()[:3])
...
48942244 -> []
48941399 -> []
48940284 -> []
48943050 -> []
48656359 -> [('48680086', 0.803921568627451), ('48693263', 0.803921568627451), ('48693634', 0.803921568627451)]
48656867 -> [('48521769', 0.8), ('48521768', 0.803921568627451), ('48653206', 0.803921568627451)]
48839855 -> [('48839868', 0.8269230769230769), ('48839845', 0.8775510204081632), ('48839869', 0.9148936170212766)]
... lines deleted ....


For display purposes, I used [:3] to display only the first three matches. In the first block the results are in arbitrary order, while in the second the elements are sorted so the smallest score is first.

## Working with raw scores and counts in a range¶

In this section you’ll learn how to get the hit counts and raw scores for an interval.

The length of a SearchResult is the number of hits it contains:

>>> import chemfp
>>> from chemfp import search
>>> fp = arena.get_fingerprint_by_id("48500164")
>>> result = search.threshold_tanimoto_search_fp(fp, arena, threshold=0.2)
>>> len(result)
14888


This gives you the number of hits at or above a threshold of 0.2, which you can also get by doing chemfp.search.count_tanimoto_hits_fp():

>>> search.count_tanimoto_hits_fp(fp, arena, threshold=0.2)
14888


The advantage to the first version is that the result also stores the hits. You can query the result to get the number of hits within a specified interval. Here are the counts of the number of hits at or above 0.5, 0.80, and 0.95:

>>> result.count(0.5)
7785
>>> result.count(0.8)
42
>>> result.count(0.95)
1


The first parameter, min_score, specifies the minimum threshold. If not specified it’s -infinity. The second, max_score, specifies the maximum, and is +infinity if not specified. Here’s how to get the number of hits with a score of at most 0.95 and 0.5:

>>> result.count(max_score=0.95)
14887
>>> result.count(max_score=0.5)
7209


If you double-check the math, and add the number above 0.5 (7785) and the number below 0.5 (7209) you’ll get 14994, even though there are only 14888 records. The extra 106 is because by default the count interval uses a closed range. There are 106 hits with a score of exactly 0.5:

>>> result.count(0.5, 0.5)
106


The third parameter, interval, specifies the end conditions. The default is “[]” which means that both ends are closed. The interval “()” means that both ends are open, and “[)” and “(]” are the two half-open/half-closed ranges. To get the number of hits below 0.5 and the number of hits at or above 0.5 then you might use:

>>> result.count(None, 0.5, "[)")
7103
>>> result.count(0.5, None, "[]")
7785
>>> 7103+7785
14888


This total matches the expected count. (A min or max of None means -infinity and +infinity, respectively.)
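The interval handling can be modelled with a small helper. The scores below are made up and the function is a sketch of the described semantics, not chemfp's implementation; only the interval strings ("[]", "()", "[)", "(]") come from the text:

```python
def count_in_range(scores, min_score=None, max_score=None, interval="[]"):
    # None means -infinity / +infinity, matching the text.
    lo = float("-inf") if min_score is None else min_score
    hi = float("inf") if max_score is None else max_score
    # "[" / "]" are closed (inclusive) ends; "(" / ")" are open (exclusive).
    low_ok = (lambda s: s >= lo) if interval[0] == "[" else (lambda s: s > lo)
    high_ok = (lambda s: s <= hi) if interval[1] == "]" else (lambda s: s < hi)
    return sum(1 for s in scores if low_ok(s) and high_ok(s))

scores = [0.2, 0.5, 0.5, 0.7, 0.9]
print(count_in_range(scores, 0.5))              # 4: closed range includes 0.5
print(count_in_range(scores, None, 0.5, "[)"))  # 1: strictly below 0.5
# The half-open split covers every score exactly once:
print(count_in_range(scores, 0.5) + count_in_range(scores, None, 0.5, "[)"))  # 5
```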

## Cumulative search result counts and scores¶

In this section you’ll learn some more advanced ways to work with SearchResults and SearchResult instances.

I wanted to title this section “Going to SEA”, but decided to use a more descriptive name. “SEA” refers to the “Similarity Ensemble Approach” (SEA) work of Keiser, Roth, Armbruster, Ernsberger, and Irwin. The paper is available online from http://sea.bkslab.org/ , though I won’t actually implement it here. Why do I mention it? Because these chemfp methods were added specifically to make it easier to support a SEA implementation for one of the chemfp customers.

Suppose you have two sets of structures. How well do they compare to each other? I can think of various ways to do it. One is to look at a comparison profile. Find all NxM comparisons between the two sets. How many of the comparisons have a score of at least 0.2? How many at least 0.5? 0.95?

If there are “many”, then the two sets are likely more similar than not. If the answer is “few”, then they are likely rather distinct.
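A brute-force version of such a profile fits in a few lines of plain Python. The toy 8-bit fingerprints below are made up; chemfp's arena search does the same NxM comparisons far faster, but the counting logic is the same:

```python
def tanimoto(fp1, fp2):
    # Tanimoto = popcount(a AND b) / popcount(a OR b), on integer bitsets.
    both = bin(fp1 & fp2).count("1")
    either = bin(fp1 | fp2).count("1")
    return both / either if either else 0.0

set_a = [0b11110000, 0b11001100]  # hypothetical 8-bit fingerprints
set_b = [0b11100000, 0b00001111]

# All N*M pairwise scores, then the count at or above each threshold.
pair_scores = [tanimoto(a, b) for a in set_a for b in set_b]
profile = {t: sum(1 for s in pair_scores if s >= t) for t in (0.2, 0.5, 0.95)}
print(profile)  # {0.2: 3, 0.5: 1, 0.95: 0}
```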

I’ll be more specific. I want to know if the coenzyme A-like structures in ChEBI are more similar to the penicillin-like structures than one would expect by comparing two randomly chosen subsets. To quantify “similar”, I’ll use Tanimoto similarity of the “chebi_maccs.fps” fingerprints, which are the 166 MACCS key-like fingerprints from RDMACCS for the ChEBI data set. See Using a toolkit to process the ChEBI dataset for details about why I use the --id-tag options:

# Use one of the following to create chebi_maccs.fps
oe2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
ob2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
rdkit2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps
cdk2fps --id-tag "ChEBI ID" --rdmaccs ChEBI_lite.sdf.gz -o chebi_maccs.fps


I used oe2fps to create RDMACCS-OpenEye fingerprints.

The ChEBI id for coenzyme A is CHEBI:15346 and for penicillin is CHEBI:17334. I’ll define the “coenzyme A-like” structures as the 272 structures where the fingerprint is at least 0.95 similar to coenzyme A, and “penicillin-like” as the 24 structures at least 0.85 similar to penicillin. This gives 6528 total comparisons.

You know enough to do this, but there’s a nice optimization I haven’t told you about. You can get the total count of all of the threshold hits using the chemfp.search.SearchResults.count_all() method instead of looping over each SearchResult and calling chemfp.search.SearchResult.count():

from __future__ import print_function # Only for Python 2
import chemfp
from chemfp import search

chebi = chemfp.load_fingerprints("chebi_maccs.fps")

def get_neighbors_as_arena(arena, id, threshold):
    fp = arena.get_fingerprint_by_id(id)
    neighbor_results = search.threshold_tanimoto_search_fp(fp, arena, threshold=threshold)
    neighbor_arena = arena.copy(neighbor_results.get_indices())
    return neighbor_arena

# Find the 272 neighbors of coenzyme A
coA_arena = get_neighbors_as_arena(chebi, "CHEBI:15346", threshold=0.95)
print(len(coA_arena), "coenzyme A-like structures")

# Find the 24 neighbors of penicillin
penicillin_arena = get_neighbors_as_arena(chebi, "CHEBI:17334", threshold=0.85)
print(len(penicillin_arena), "penicillin-like structures")

# I'll compute a profile at different thresholds
thresholds = [0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]

# Compare the two sets. (For this case the speed difference between a threshold
# of 0.3 and 0.0 is not noticeable, but having it makes me feel better.)
coA_against_penicillin_result = search.threshold_tanimoto_search_arena(
    coA_arena, penicillin_arena, threshold=min(thresholds))

# Show a similarity profile
print("Counts  coA/penicillin")
for threshold in thresholds:
    print(" %.2f      %5d" % (threshold,
        coA_against_penicillin_result.count_all(min_score=threshold)))


This gives a not very useful output:

272 coenzyme A-like structures
24 penicillin-like structures
Counts  coA/penicillin
0.30       6528
0.35       6528
0.40       6523
0.45       4403
0.50       1193
0.55          0
0.60          0
0.70          0
0.80          0
0.90          0
0.95          0


It’s not useful because it’s not possible to make any decisions from this. Are the numbers high or low? It should be low, because these are two quite different structure classes, but there’s nothing to compare it against.

I need some sort of background reference. What I’ll do is construct two randomly chosen sets, one with 272 fingerprints and the other with 24, and generate the same similarity profile with them. That isn’t quite fair, since randomly chosen sets will most likely be diverse. Instead, I’ll pick one fingerprint at random, then get its 272 or 24, respectively, nearest neighbors as the set members (place the following code at the end of the file with the previous code):

# Get background statistics for random similarity groups of the same size
import random

# Find a fingerprint at random, get its k neighbors, return them as a new arena
def get_random_fp_and_its_k_neighbors(arena, k):
    fp = arena[random.randrange(len(arena))][1]
    similar_search = search.knearest_tanimoto_search_fp(fp, arena, k)
    return arena.copy(similar_search.get_indices())


I’ll construct 1000 pairs of sets this way, accumulate the threshold profile, and compare the CoA/penicillin profile to it:

# Initialize the threshold counts to 0
total_background_counts = dict.fromkeys(thresholds, 0)

REPEAT = 1000
for i in range(REPEAT):
    # Select background sets of the same size and accumulate the threshold count totals
    set1 = get_random_fp_and_its_k_neighbors(chebi, len(coA_arena))
    set2 = get_random_fp_and_its_k_neighbors(chebi, len(penicillin_arena))
    background_search = search.threshold_tanimoto_search_arena(set1, set2, threshold=min(thresholds))
    for threshold in thresholds:
        total_background_counts[threshold] += background_search.count_all(min_score=threshold)

print("Counts  coA/penicillin  background")
for threshold in thresholds:
    print(" %.2f      %5d          %5d" % (threshold,
        coA_against_penicillin_result.count_all(min_score=threshold),
        total_background_counts[threshold] / (REPEAT+0.0)))


Your output should now have something like this at the end:

Counts  coA/penicillin  background
0.30       6528           2798
0.35       6528           2273
0.40       6523           1789
0.45       4403           1301
0.50       1193            988
0.55          0            656
0.60          0            411
0.70          0            160
0.80          0             54
0.90          0             15
0.95          0              0


This is a bit hard to interpret. Clearly the coenzyme A and penicillin sets are not closely similar, but for low Tanimoto scores the similarity is higher than expected. That difficulty is okay for now because I mostly wanted to show an example of how to use the chemfp API. If you want to dive deeper into this sort of analysis then read a three-part series I wrote at http://www.dalkescientific.com/writings/diary/archive/2017/03/20/fingerprint_set_similarity.html on using chemfp to build a target set association network using ChEMBL.

The SEA paper actually wants you to use the raw score, which is the sum of the hit scores in a given range, and not just the number of hits. No problem! Use SearchResult.cumulative_score() for the cumulative scores for an individual result, or SearchResults.cumulative_score_all() for the cumulative scores across all of the results. The two functions compute almost identical values for the whole data set:

>>> sum(row.cumulative_score(min_score=0.5, max_score=0.9)
...             for row in coA_against_penicillin_result)
605.5158868869943
>>> coA_against_penicillin_result.cumulative_score_all(min_score=0.5, max_score=0.9)
605.5158868869953


The cumulative methods, like the count method you learned about in the previous section, also take the interval parameter for when you don’t want the default of “[]”.

You may wonder why these two values aren’t exactly the same. They differ because floating point addition is not associative. The first computes the sum for each result, then the sum of sums. The second computes the sum by adding each score to the cumulative sum.
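Floating point addition order really does change the sum. Here is a tiny self-contained demonstration (the values are contrived and unrelated to the scores above, chosen so the effect is visible rather than a tiny last-digit wobble):

```python
import math

values = [1e16, 1.0, 1.0, -1e16]

# Left-to-right: each 1.0 is absorbed when added to 1e16 (its last bit is
# below 1e16's precision), so both are lost before -1e16 cancels.
print(sum(values))                    # 0.0

# Adding the small values first preserves them.
print(sum([1.0, 1.0, 1e16, -1e16]))  # 2.0

# math.fsum tracks the rounding error and returns the exact result.
print(math.fsum(values))             # 2.0
```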

I get a different result if I sum up the values in reverse order:

>>> sum(list(row.cumulative_score(min_score=0.5, max_score=0.9)
...                for row in coA_against_penicillin_result)[::-1])
605.5158868869959


Which is the “right” score? The cumulative_score_all() method at least matches the one you might write if you computed the sum directly:

>>> total_score = 0.0
>>> for row_scores in coA_against_penicillin_result.iter_scores():
...   for score in row_scores:
...     if 0.5 <= score <= 0.9:
...       total_score += score
...
>>> total_score
605.5158868869953


## Writing fingerprints with a fingerprint writer¶

In this section you’ll learn how to create a fingerprint file using the chemfp fingerprint writer API.

You probably don’t need this section. In most cases you can save the contents of an FPS reader or fingerprint arena by using the FingerprintReader.save() method, as in the following examples:

chemfp.open("pubchem_targets.fps").save("example.fps")
chemfp.open("pubchem_targets.fps").save("example.fpb")
chemfp.open("pubchem_targets.fpb").save("example.fps.gz")


The structure-based fingerprint readers also implement the save method so you could simply write:

import chemfp

chemfp.read_molecule_fingerprints("RDKit-MACCS166",
    "Compound_099000001_099500000.sdf.gz").save("example.fps")


However, if you generate the fingerprints yourself, or want more fine-grained control over the writer parameters then read on!

(If you don’t have RDKit installed then use “OpenBabel-MACCS” for Open Babel’s MACCS fingerprints, “OpenEye-MACCS166” for OpenEye’s, or “CDK-MACCS” for CDK’s.)

Here’s an example of the fingerprint writer API. I open the writer, ask it to write a fingerprint id and the fingerprint, and then close it.

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("example.fps")
>>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
>>> writer.close()


>>> from __future__ import print_function # Only for Python 2
>>> print(open("example.fps").read(), end="")
#FPS1
0000000000030201      ABC123


Of course you don’t need to use chemfp to write this file. It’s simple enough that you could get the same result in fewer lines of normal Python code. The advantage starts to be useful when you want to include metadata.

>>> metadata = chemfp.Metadata(num_bits=64, type="Example-FP/0")
>>> writer = chemfp.open_fingerprint_writer("example.fps", metadata)
>>> writer.write_fingerprint("ABC123", b"\0\0\0\0\0\3\2\1")
>>> writer.close()
>>> print(open("example.fps").read(), end="")
#FPS1
#num_bits=64
#type=Example-FP/0
0000000000030201      ABC123


Even then, native Python code is probably easier to use if you know what the header lines will be, because it’s a bit of a nuisance to create the chemfp.Metadata yourself.

On the other hand, if you have a chemfp fingerprint type you can just ask it for the correct metadata instance:

>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> fptype.get_metadata()
Metadata(num_bits=166, num_bytes=21, type='RDKit-MACCS166/2',
aromaticity=None, sources=[], software='RDKit/2019.09.1
chemfp/3.4', date='2020-05-13T13:34:37')


Putting the two together, and switching to a 21 byte fingerprint instead of an 8 byte fingerprint, gives:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> writer = chemfp.open_fingerprint_writer("example.fps", fptype.get_metadata())
>>> writer.write_fingerprint("ABC123", b"\0\1\2\3\4\5\6\7\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14")
>>> writer.close()
>>> print(open("example.fps").read(), end="")
#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#date=2020-05-13T13:35:23
000102030405060708090a0b0c0d0e0f1011121314    ABC123


In real life that fingerprint comes from somewhere. The high-level structure-based fingerprint reader has a handy metadata attribute:

>>> filename = "Compound_099000001_099500000.sdf.gz"
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> print(reader.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#date=2020-05-13T13:36:11


By the way, note that this includes the source filename, which FingerprintType.get_metadata() can’t automatically do. (See Merging multiple structure-based fingerprint sources for an example of how to pass that information to get_metadata().)

A structure-based fingerprint reader is just like any other reader, so you can iterate over the (id, fingerprint) pairs:

>>> from chemfp import bitops
>>> for count, (id, fp) in enumerate(reader):
...   print(id, "=>", bitops.hex_encode(fp))
...   if count == 5:
...     break
...
99000039 => 000004000000300001c0004e9361b051dce1676e1f
99000230 => 000000800100649f0445a7fe2aeab1fb8f6bdfff1f
99002251 => 00000000001132000088004985601150dce4e3fe1f
99003537 => 00000000200020000156149a906994930c3159ae1f
99003538 => 00000000200020000156149a906994930c3159ae1f
99005028 => 00000000000000008000004e84683ca49100f7fa1f


You probably already see how to combine this with FingerprintWriter.write_fingerprint() to generate the FPS output. The key part would look like:

for id, fp in reader:
    writer.write_fingerprint(id, fp)


While that would work, there’s a better way. The chemfp fingerprint writer has a FingerprintWriter.write_fingerprints() method which takes a list or iterator of (id, fingerprint) pairs. Here’s a better way to write the code:

import chemfp

filename = "Compound_099000001_099500000.sdf.gz"

reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
writer.write_fingerprints(reader)
reader.close()
writer.close()
# Note: See the next section for an even better solution
# which uses a context manager.


This produces output which starts:

#FPS1
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#date=2020-05-13T13:38:31
000004000000300001c0004e9361b051dce1676e1f    99000039
000000800100649f0445a7fe2aeab1fb8f6bdfff1f    99000230
00000000001132000088004985601150dce4e3fe1f    99002251
00000000200020000156149a906994930c3159ae1f    99003537

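Each record line in that output is simply the hex-encoded fingerprint, a tab, and the identifier. As a sketch of the record layout (this illustrates the format, it is not chemfp's writer code):

```python
def fps_record(id_, fp):
    # One FPS record: hex fingerprint, a tab, the identifier, a newline.
    return fp.hex() + "\t" + id_ + "\n"

print(fps_record("QWERTY", b"AA"), end="")  # prints "4141\tQWERTY"
```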

Why is write_fingerprints “better” than multiple calls to write_fingerprint? I think it more directly describes the goal of writing all of the fingerprints, rather than the mechanics of unpacking and repacking the (id, fingerprint) pairs. I had hoped that there would be a performance improvement, because there’s less Python function call overhead, but my timings show no difference.

However, there’s a still better way, which is to use a context manager to close the files automatically, rather than calling close() explicitly. I’ll leave that for the next section.

## Fingerprint readers and writers are context managers¶

In this section you’ll learn how the fingerprint readers and writers can be used as context managers.

The previous section ended with the following code:

import chemfp
filename = "Compound_099000001_099500000.sdf.gz"
reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
writer = chemfp.open_fingerprint_writer("example.fps", reader.metadata)
writer.write_fingerprints(reader)
reader.close()
writer.close()


This reads a PubChem file with RDKit, generates MACCS fingerprints, and saves the results to “example.fps”.

The two FingerprintWriter.close() lines ensure that the reader and writer files are closed. This isn’t required for a simple script, because Python will close the files automatically at the end of the script, or when the garbage collector kicks in.

However, since the writer may buffer the output, you have to close the file before you or another program can read it. It’s good practice to always close the file when you’re done with it, as otherwise there are ways to get really confused about why you don’t have a complete file.

Even with the explicit close calls, if there’s an exception in FingerprintWriter.write_fingerprints() then the files will be left open. In older-style Python this was handled with a try/finally block, but that’s verbose. Instead, chemfp’s readers and writers implement modern Python’s context manager API, to make it easier to close files automatically at just the right place. Here’s what the above looks like with a context manager:

import chemfp
filename = "Compound_099000001_099500000.sdf.gz"

with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    with chemfp.open_fingerprint_writer("example.fps", reader.metadata) as writer:
        writer.write_fingerprints(reader)

Isn’t that nice and short? Just bear in mind that it’s even more succinctly written as:

import chemfp
filename = "Compound_099000001_099500000.sdf.gz"

with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
    reader.save("example.fps")


## Write fingerprints to stdout or a file-like object¶

In this section you’ll learn how to write fingerprints to stdout, and how to write them to a BytesIO instance.

The previous section showed examples of passing a filename string to chemfp.open_fingerprint_writer(). If the filename argument is None then the writer will write to stdout in uncompressed FPS format:

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer(None,
...     chemfp.Metadata(num_bits=16, type="Experiment/1"))
#FPS1
#num_bits=16
#type=Experiment/1
>>> writer.write_fingerprint("QWERTY", b"AA")
4141    QWERTY
>>> writer.write_fingerprint("SHRDLU", b"\0\1")
0001    SHRDLU
>>> writer.close()


The filename argument may also be a file-like object, which is defined as any object which implements the method write(s) where s is a byte string. An io.BytesIO instance is one such file-like object. It gives access to the output as a byte string:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> from io import BytesIO
>>> f = BytesIO()
>>> writer = chemfp.open_fingerprint_writer(f, chemfp.Metadata(num_bits=16, type="Experiment/1"))
>>> print(f.getvalue().decode("utf8"))  # convert byte string to text
#FPS1
#num_bits=16
#type=Experiment/1

>>> writer.write_fingerprint("ETAOIN", b"00")
>>> writer.close()
>>> print(f.getvalue().decode("utf8")) # convert byte string to text
#FPS1
#num_bits=16
#type=Experiment/1
3030  ETAOIN


You can see that closing the fingerprint writer does not close the underlying file-like object. (If it did then you couldn’t get access to the content, which is discarded when the BytesIO is closed.)

You can also write an FPB file to a file-like object, if it supports seek() and tell() and binary writes. This means that you cannot write an FPB format to stdout, but you can write it to a BytesIO instance.

>>> import chemfp
>>> from io import BytesIO
>>> f = BytesIO()
>>> writer = chemfp.open_fingerprint_writer(f, format="fpb")
>>> writer.write_fingerprint("ID123", b"\x01\xfe")
>>> writer.close()
>>> len(f.getvalue())
2269
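Why do FPB writes need seek() and tell()? A binary writer typically reserves room for a length field, writes content whose size isn’t known up front, then seeks back to patch in the final value. That back-patching is impossible on a one-way stream like stdout. A minimal illustration with BytesIO:

```python
from io import BytesIO

f = BytesIO()
f.write(b"\x00" * 8)            # reserve room for an 8-byte length field
start = f.tell()
f.write(b"fingerprint data")    # payload whose size isn't known up front
end = f.tell()
f.seek(0)                       # go back and fill in the real length
f.write((end - start).to_bytes(8, "little"))
print(len(f.getvalue()))        # 24 bytes: 8-byte header + 16-byte payload
```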


## Writing fingerprints to an FPB file¶

In this section you’ll learn how to write an FPB file.

The FPS file is a text format which was designed to be easy to read and write. The FPB file is a binary format which is designed to be fast to load. Internally it stores the fingerprints in a way which can be mapped directly to the arena data structure. However, writing this format yourself is not easy.

Instead, let chemfp do it for you. With the chemfp.open_fingerprint_writer() function, the difference between writing an FPS file and an FPB file is a matter of changing the extension. Here’s a simple example:

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("simple.fpb")
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()


Almost all you need to know is to use the “.fpb” extension instead of “.fps”. The rest of this section goes into low-level details that might be enlightening, but probably aren’t that directly useful for most people.

It’s hard to show the content of the FPB file, because it is binary. I’ll do a character dump to show the first 96 bytes:

% od -c simple.fpb
0000000    F   P   B   1  \r  \n  \0  \0  \r  \0  \0  \0  \0  \0  \0  \0
0000020    M   E   T   A   #   n   u   m   _   b   i   t   s   =   1   6
0000040   \n   #  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
0000060   \0  \b  \0  \0  \0 002  \0  \0 001 002  \0  \0  \0  \0  \0  \0
0000100    Z   Z  \0  \0  \0  \0  \0  \0 377 377  \0  \0  \0  \0  \0  \0
0000120    H  \0  \0  \0  \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0
...


The first eight bytes are the file signature. Following that are a set of blocks, with eight bytes for the length, a four byte block type name, and then the block content. Here you can see the “META”data block, followed by the “AREN”a block containing the fingerprint data, followed by the start of the “POPC”ount block with the popcount index information.
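Based on that description, a simple block scanner might look like the following. This is a sketch built on the layout visible in the dump (an 8-byte little-endian length which counts the content but not the 4-byte type name); it is not a full FPB parser:

```python
import struct

def iter_blocks(data):
    # FPB signature is the 8 bytes "FPB1\r\n\0\0", then a sequence of
    # blocks: 8-byte little-endian content length, 4-byte type name,
    # and `length` bytes of content.
    assert data[:8] == b"FPB1\r\n\x00\x00"
    offset = 8
    while offset < len(data):
        (length,) = struct.unpack_from("<Q", data, offset)
        name = data[offset+8:offset+12].decode("ascii")
        yield name, data[offset+12:offset+12+length]
        offset += 12 + length

# A synthetic two-piece file: signature plus a META block.
data = (b"FPB1\r\n\x00\x00"
        + struct.pack("<Q", 13) + b"META" + b"#num_bits=16\n")
print(list(iter_blocks(data)))
```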

That’s probably a bit too much detail for you. I’ll use chemfp to read the file and show the contents:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> reader = chemfp.open("simple.fpb")
>>> print(reader.metadata)
#num_bits=16

>>> from chemfp import bitops
>>> for id, fp in reader:
...   print(id, "=>", bitops.hex_encode(fp))
...
third => 0102
second => 5a5a
first => ffff


Unlike the FPS format, the FPB format requires a num_bits in the metadata. Since I didn’t give the writer that information, it figured it out from the number of bytes in the first written fingerprint.

You can see that record order is different than the input order. While the FPS fingerprint writer preserves input order, the FPB writer will reorder the records by population count, so the records with fewer ‘on’ bits come first. It then creates a popcount index, to mark the start and end location of all of the fingerprints with a given popcount. This is used to pre-compute the popcount for a fingerprint, and quickly reject some of the similarity search space.
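The popcount ordering can be reproduced with a plain sort. This is only an illustration of the principle, not the FPB writer’s implementation:

```python
def popcount(fp):
    # number of 'on' bits in the fingerprint
    return sum(bin(b).count("1") for b in fp)

records = [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\x01\x02")]
ordered = sorted(records, key=lambda rec: popcount(rec[1]))
print([rec[0] for rec in ordered])  # ['third', 'second', 'first']
```

The sorted order matches the FPB reader output above: "third" has 2 bits set, "second" has 8, and "first" has 16.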

Use the reorder parameter to control whether the fingerprints should be reordered. The default is True; use reorder=False to preserve the input order:

>>> writer = chemfp.open_fingerprint_writer("simple.fpb", reorder=False)
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()
>>>
>>> reader = chemfp.open("simple.fpb")
>>> for id, fp in reader:
...   print(id, "=>", bitops.hex_encode(fp))
...
first => ffff
second => 5a5a
third => 0102


You might think it’s a bit useless to preserve input order, because the performance won’t be as fast. It actually proved useful for one project, where the targets were broken up into clusters, and cluster membership was determined using a SEA analysis. Rather than have a few dozen separate fingerprint files, I stored everything in the same file (including duplicate fingerprints), and used a configuration file which specified each cluster name and its range in the file. This made it a lot easier to organize the data, and since there were only a few thousand fingerprints, sublinear search performance wasn’t needed.

The FPB fingerprint writer also has an alignment option. If you look very carefully at the character dump you can see that the fingerprints are eight byte aligned:

0000040   \n   #  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
0000060   \0  \b  \0  \0  \0 002  \0  \0 001 002  \0  \0  \0  \0  \0  \0
0000100    Z   Z  \0  \0  \0  \0  \0  \0 377 377  \0  \0  \0  \0  \0  \0
0000120    H  \0  \0  \0  \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0


The “AREN” is the start of the arena block. The next four bytes (“002 \0 \0 \0”) are the number of bytes in a fingerprint, in this case 2. The four bytes after that (“\b \0 \0 \0”) are the number of bytes allocated for each fingerprint; “\b” is the escape code for backspace, or ASCII 8. Yes, 8 bytes are used even though the fingerprints only have 2 bytes in them. This is because the FPB format expects to be able to use the 8 byte “POPCNT” assembly instruction, if available, because that has the fastest performance.

After the storage size field is a byte for the spacer length. The “002” means two NUL spacer characters follow. This is used to put the start of the first fingerprint on the eight byte boundary, so there will be no alignment issues with using the POPC instruction. (This is not that important for recent Intel processors, but Intel isn’t the only processor in the world.)

Finally you see the fingerprints; the first fingerprint is “001 002”, followed by six NUL characters to fill up the 8 bytes of storage, the second is “Z Z” followed by six more NUL pad characters, etc.

If you are really working with a two byte fingerprint, then six NUL characters is likely a waste of space. You can ask chemfp to use a two byte alignment instead:

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("simple.fpb", alignment=2)
>>> writer.write_fingerprints( [("first", b"\xff\xff"), ("second", b"ZZ"), ("third", b"\1\2")] )
>>> writer.close()


giving:

% od -c simple.fpb
0000000    F   P   B   1  \r  \n  \0  \0  \r  \0  \0  \0  \0  \0  \0  \0
0000020    M   E   T   A   #   n   u   m   _   b   i   t   s   =   1   6
0000040   \n 017  \0  \0  \0  \0  \0  \0  \0   A   R   E   N 002  \0  \0
0000060   \0 002  \0  \0  \0  \0 001 002   Z   Z 377 377   H  \0  \0  \0
0000100   \0  \0  \0  \0   P   O   P   C  \0  \0  \0  \0  \0  \0  \0  \0


If you stare at it long enough you’ll see that the storage size is now two bytes, and that the fingerprints are arranged without any padding. (Actually, since chemfp’s two byte popcount uses character pointers, you could even use 1 byte alignment without a performance hit. But all this will do is save you at most one byte of spacer.)
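The per-fingerprint storage size is just the fingerprint size rounded up to the next multiple of the alignment. A quick sketch of that arithmetic (the helper is hypothetical, but the numbers match the dumps above):

```python
def storage_size(num_bytes, alignment):
    # Round the fingerprint size up to the next multiple of the alignment.
    return -(-num_bytes // alignment) * alignment

# 2-byte fingerprints: 8 bytes at the default alignment, 2 with alignment=2.
print(storage_size(2, 8), storage_size(2, 2), storage_size(21, 8))  # 8 2 24
```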

Going in the other direction, it’s possible to specify up to 256 bytes of alignment. This is far beyond any conceivable use. Even the AVX instructions need only 256 bits, or 32 byte alignment, and that’s not a requirement, only a performance optimization to avoid a cache line split.

(If some future instruction set needs a larger alignment then the FPB format can acquire a new block type which provides the right alignment.)

## Specify the output fingerprint format¶

In this section you’ll learn about the format option to the fingerprint writer.

By default chemfp.open_fingerprint_writer() uses the destination filename’s extension to determine if it should write an FPS file (“.fps”), a gzip compressed FPS file (“.fps.gz”), a zstandard compressed FPS file (“.fps.zst”) or an FPB file (“.fpb”). If it doesn’t recognize the extension, or if the filename is None (to write to stdout) then it will assume the FPS format.
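That extension dispatch might be sketched like this (a hypothetical simplification; chemfp’s actual detection logic has more cases):

```python
def guess_format(filename):
    # Hypothetical sketch of extension-based format detection;
    # not chemfp's actual dispatch code.
    if filename is None:
        return "fps"            # stdout defaults to uncompressed FPS
    for ext, fmt in ((".fps.gz", "fps.gz"), (".fps.zst", "fps.zst"),
                     (".fpb", "fpb"), (".fps", "fps")):
        if filename.endswith(ext):
            return fmt
    return "fps"                # unknown extensions fall back to FPS

print(guess_format("example.fps.gz"), guess_format("simple.fpb"),
      guess_format(None))       # fps.gz fpb fps
```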

If the destination is a file-like object then things become a bit more complicated. If the object has a name attribute, which is the case with real file objects, then that will be examined for any known extension. That’s why the following writes the output in fps.gz format:

>>> import chemfp
>>> f = open("example.fps.gz", "wb") # must be in binary mode!
>>> writer = chemfp.open_fingerprint_writer(f)
>>> writer.write_fingerprint("ABC", b"\0\0\0\0")
>>> writer.close()
>>> f.close()
>>> open("example.fps.gz", "rb").read() # must be in binary mode!
b"\x1f\x8b\x08\x08K\xfc\xbb^\x02\xffexample.fps\x00S ... mode deleted
>>>
>>> import gzip
>>> gzip.open("example.fps.gz").read()
b'#FPS1\n00000000\tABC\n'
>>> print(gzip.open("example.fps.gz").read().decode("ascii"))
#FPS1
00000000        ABC


There’s a large amount of magic behind the scenes to connect the filename in the Python open() call to the chemfp output format.

The other solution is to just tell it which format to use, with the format parameter. For example, if you want to send the output to stdout in gzip compressed FPS format then do:

writer = chemfp.open_fingerprint_writer(None, format="fps.gz")


If you want to save an FPB file to a BytesIO instance then do:

from io import BytesIO
f = BytesIO()
writer = chemfp.open_fingerprint_writer(f, format="fpb")


And if you really want to save to a file with an “.fpb” extension but have it as an FPS file, then do:

writer = chemfp.open_fingerprint_writer("really_an_fps_file.fpb", format="fps")


But that would be silly.

## Merging multiple structure-based fingerprint sources¶

In this section you’ll learn how to merge multiple fingerprint sources into a single file, and include the full list of source filenames.

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> filename = "Compound_099000001_099500000.sdf.gz"
>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> print(reader.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#date=2020-05-13T13:57:58


If you have a single input file and a single output file then you can save the reader to an FPS or FPB file directly:

>>> reader = chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename)
>>> reader.save("example.fpb")
>>> reader.close()


Strictly speaking, the close() is rarely necessary as the garbage collector will close the file during finalization. Still, it’s good practice to close the file, and to use a context manager to ensure that the file is always closed. Here’s what that looks like:

>>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
...   reader.save("example.fpb")
...


However you create it, the output file will have the original metadata:

>>> arena = chemfp.open("example.fpb")
>>> print(arena.metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#date=2020-05-13T13:58:42


What happens if you want to merge multiple files? How does the output fingerprint file get the correct metadata?

I’ll demonstrate the problem by computing fingerprints from two structure files. I’ll get the fingerprint type and ask it to create a metadata instance:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> filenames = ["Compound_099000001_099500000.sdf.gz", "Compound_048500001_049000000.sdf.gz"]
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> print(fptype.get_metadata())
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#date=2020-05-13T14:00:13


The problem is that I also want to include the filenames as source fields in the metadata. The fingerprint type doesn’t have this information. Instead, I’ll pass them in through the sources parameter, which takes a string or a list of strings:

>>> metadata = fptype.get_metadata(sources=filenames)
>>> print(metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-13T14:00:34


What remains is to pass this metadata to the fingerprint writer, then loop through the structure filenames to compute the fingerprints and send them to the writer:

>>> with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
...   for filename in filenames:
...     with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename) as reader:
...       writer.write_fingerprints(reader)
...


Here’s a quick check to see that the metadata was saved correctly:

>>> print(chemfp.open("example.fpb").metadata)
#num_bits=166
#type=RDKit-MACCS166/2
#software=RDKit/2019.09.1 chemfp/3.4
#source=Compound_099000001_099500000.sdf.gz
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-13T14:00:34


If your toolkit can’t parse one of the records then it will raise an exception. You likely want it to ignore errors, which you can do with the errors option to chemfp.read_molecule_fingerprints(). The final code for this section looks like:

import chemfp

filenames = ["Compound_099000001_099500000.sdf.gz", "Compound_048500001_049000000.sdf.gz"]

fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
metadata = fptype.get_metadata(sources=filenames)

with chemfp.open_fingerprint_writer("example.fpb", metadata=metadata) as writer:
    for filename in filenames:
        with chemfp.read_molecule_fingerprints("RDKit-MACCS166", filename,
                                               errors="ignore") as reader:
            writer.write_fingerprints(reader)

## Merging multiple fingerprint files¶

In this section you’ll learn how to make a modified copy of a metadata instance.

The previous section merged multiple structure-based fingerprints, and used the fingerprint type to get the correct metadata instance.

What if you want to merge several existing fingerprint files, and those use a fingerprint type that chemfp doesn’t understand? In that case there is no chemfp fingerprint type, and therefore no get_metadata() method to call. Instead, you’ll need some other way to make a chemfp.Metadata instance.

I’ll work through a solution, and start by using sdf2fps to extract the PubChem/CACTVS fingerprints from two PubChem SD files:

% sdf2fps --pubchem Compound_099000001_099500000.sdf.gz -o Compound_099000001_099500000.fps
% sdf2fps --pubchem Compound_048500001_049000000.sdf.gz -o Compound_048500001_049000000.fps
% head -7 Compound_099000001_099500000.fps | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#date=2020-05-13T14:03:21
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000        99000039
% head -7 Compound_048500001_049000000.fps | fold
#FPS1
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-13T14:03:34
07de05000000000000000000000000000080060000000c060000000000001a802f00007800080000
00b01483f920cc0b6d9309001de0e44e2e004501b48548059099051d2e1911174503998d29041016
69401313f40801007010000000000000040800000000000002000000000000        48500020


Of course you could just ignore the header data, which is what the following does:

import chemfp

filenames = ["Compound_099000001_099500000.fps", "Compound_048500001_049000000.fps"]

with chemfp.open_fingerprint_writer("merged_pubchem.fps") as writer:
    for filename in filenames:
        with chemfp.open(filename) as reader:
            writer.write_fingerprints(reader)


but then you’ll be left with no metadata in the FPS header:

% head -3 merged_pubchem.fps | fold
#FPS1
07de0d000000000000000000000000000000000000003c060100a0010000008d2f00007800080000
0030148379203c034f13080015c0acee2a00410104ac4004101b851d261b10065f03ab8f29a41106
69001393e338d1017100000000204000000000000010200000000000000000        99000039
07de1c000200000000000000000000000080040000003c0200000000000000800300007820080200
6b881995e1398a405000010000000000008000000000000000000000000000        99000230


While you could do that, the metadata keeps track of potentially useful information, so it’s better to add it. For that matter, metadata usually isn’t useful until some time after the fingerprints are generated. People tend to put off writing code until it’s needed, but by then it’s too late. I’ve tried to make chemfp’s API easy, to encourage people to add the right metadata from the start.

There are a couple of ways to add the right metadata. The classic way is to make your own chemfp.Metadata with the right values:

>>> metadata = chemfp.Metadata(num_bits=881, type="CACTVS-E_SCREEN/1.0 extended=2",
...   software="CACTVS/unknown", sources=["Compound_099000001_099500000.sdf.gz",
...       "Compound_048500001_049000000.sdf.gz"])
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#source=Compound_048500001_049000000.sdf.gz


The downside is this requires knowing all of the fields beforehand. Another option is to copy the metadata from the first fingerprint file, and ask the copy() to use a new list of sources:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> reader = chemfp.open("Compound_099000001_099500000.fps")
>>> reader.metadata.sources
['Compound_099000001_099500000.sdf.gz']
>>> metadata = reader.metadata.copy(sources=[
...     u"Compound_099000001_099500000.sdf.gz",
...     u"Compound_048500001_049000000.sdf.gz"])
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-13T14:03:21


Now to put the pieces together. I’ll make one pass through the fingerprint files to get the sources, and then another pass to generate the output. If you only have a handful of files then this works nicely:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> filenames = ["Compound_099000001_099500000.fps", "Compound_048500001_049000000.fps"]
>>> sources = []
>>> for filename in filenames:
...   sources.extend(chemfp.open(filename).metadata.sources)
...
>>> sources
['Compound_099000001_099500000.sdf.gz', 'Compound_048500001_049000000.sdf.gz']
>>> metadata = chemfp.open(filenames[0]).metadata.copy(sources=sources)
>>> print(metadata)
#num_bits=881
#type=CACTVS-E_SCREEN/1.0 extended=2
#software=CACTVS/unknown
#source=Compound_099000001_099500000.sdf.gz
#source=Compound_048500001_049000000.sdf.gz
#date=2020-05-13T14:03:34
>>> with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=metadata) as writer:
...     for filename in filenames:
...         with chemfp.open(filename) as reader:
...             writer.write_fingerprints(reader)
...


This code assumes that the fingerprints are compatible, that is, that the fingerprints are the same size, and the fingerprint types and other metadata fields are compatible. The next section shows how to detect if there are compatibility problems.

## Check for metadata compatibility problems¶

In this section you’ll learn how to detect compatibility mismatches between two metadata instances, and between a metadata and a fingerprint.

In the previous section you learned how to merge multiple fingerprint files, which all happened to have the same fingerprint type. What happens if they are different types?

There are actually a few possible problems:

• the fingerprint lengths are different (very bad)
• the fingerprint types are different (probably bad)
• the software is from different versions (probably okay)

The chemfp.check_metadata_problems() function compares two metadata objects and returns a list of possible problems:

>>> from __future__ import print_function # Only for Python 2
>>> import chemfp
>>> rdkit_metadata = chemfp.get_fingerprint_type("RDKit-MACCS166").get_metadata()
>>> openeye_metadata = chemfp.get_fingerprint_type("OpenEye-MACCS166").get_metadata()
>>> problems = chemfp.check_metadata_problems(rdkit_metadata, openeye_metadata)
>>> len(problems)
2
>>> for problem in problems:
...   print(problem)
...
WARNING: query has fingerprints of type 'RDKit-MACCS166/2' but
target has fingerprints of type 'OpenEye-MACCS166/3'
INFO: query comes from software 'RDKit/2020.03.1 chemfp/3.4' but
target comes from software 'OEGraphSim/2.4.3 (20191016) chemfp/3.4'


In this case the fingerprint types are different, but since the fingerprint lengths are the same it’s not an error, only a warning. The software field is also not identical, but as that’s not so significant it’s listed as “info”.
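That severity logic can be roughed out as follows. This is a hypothetical simplification using plain dicts, not chemfp’s implementation; the real function compares Metadata instances and builds ChemFPProblem objects:

```python
def classify_problems(query, target):
    # Mismatched lengths are fatal; mismatched types are only a warning
    # when the lengths agree; mismatched software is informational.
    problems = []
    if query["num_bits"] != target["num_bits"]:
        problems.append(("error", "num_bits mismatch"))
    elif query["type"] != target["type"]:
        problems.append(("warning", "type mismatch"))
    if query["software"] != target["software"]:
        problems.append(("info", "software mismatch"))
    return problems

query = {"num_bits": 166, "type": "RDKit-MACCS166/2", "software": "RDKit"}
target = {"num_bits": 166, "type": "OpenEye-MACCS166/3", "software": "OEGraphSim"}
print(classify_problems(query, target))
```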

The returned problem objects are chemfp.ChemFPProblem() instances, which have useful attributes:

>>> for problem in problems:
...   print("Problem:")
...   print("  severity:", problem.severity)
...   print("  category:", problem.category)
...   print("  description:", problem.description)
...
Problem:
  severity: warning
  category: type mismatch
  description: query has fingerprints of type 'RDKit-MACCS166/2' but target has fingerprints of type 'OpenEye-MACCS166/3'
Problem:
  severity: info
  category: software mismatch
  description: query comes from software 'RDKit/2020.03.1 chemfp/3.4' but target comes from software 'OEGraphSim/2.4.3 (20191016) chemfp/3.4'


The idea is that the category text won’t change, so your code can figure out what’s going on, while the description is subject to change and hopefully improvement. The severity is one of “info”, “warning” and “error”.

>>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
>>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
>>> problems = chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata)
>>> for problem in problems:
...   print(problem)
...
ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints
WARNING: query has fingerprints of type 'RDKit-Fingerprint/2 minPath=1
maxPath=7 fpSize=512 nBitsPerHash=2 useHs=1' but target has
fingerprints of type 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024
nBitsPerHash=2 useHs=1'


A chemfp.ChemFPProblem is derived from Exception, so you can raise it directly if you want:

>>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata):
...   if problem.severity == "error":
...     raise problem
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
chemfp.ChemFPProblem: ERROR: query has 512 bit fingerprints but target has 1024 bit fingerprints


You might have noticed that the error message uses the words “query” and “target”. Chemfp is designed around similarity searches, so I expect the default to compare query metadata to target metadata.

On the other hand, the previous section merged multiple fingerprint files, where “query” and “target” don’t make sense. Instead, you can give alternative names via the query_name and target_name parameters:

>>> rdkit1_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=512").get_metadata()
>>> rdkit2_metadata = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=1024").get_metadata()
>>> for problem in chemfp.check_metadata_problems(rdkit1_metadata, rdkit2_metadata,
...                                                "file #1", "file #14"):
...   if problem.severity == "error":
...     print(problem)
...
ERROR: file #1 has 512 bit fingerprints but file #14 has 1024 bit fingerprints


I’ll use this to update the code from the previous section so that it raises an exception on errors, prints warnings to stderr, and ignores “info” problems. I’ll also add a MACCS fingerprint file to the list of files to process, so I can show what happens when there’s a problem:

from __future__ import print_function   # Only for Python 2
import sys
import chemfp

filenames = ["Compound_099000001_099500000.fps",
             "Compound_048500001_049000000.fps",
             "chebi_maccs.fps"]

# Create the correct metadata with all of the sources from all of the files.
sources = []
first_metadata = first_filename = None
for filename in filenames:
    metadata = chemfp.open(filename).metadata
    sources.extend(metadata.sources)
    if first_metadata is None:
        first_metadata = metadata
        first_filename = filename
    else:
        # Check for compatibility problems
        for problem in chemfp.check_metadata_problems(first_metadata, metadata,
                                                      repr(first_filename),
                                                      repr(filename)):
            if problem.severity == "error":
                raise problem
            elif problem.severity == "warning":
                sys.stderr.write(str(problem) + "\n")

new_metadata = first_metadata.copy(sources=sources)

# Merge the files using the new metadata
with chemfp.open_fingerprint_writer("merged_pubchem.fps", metadata=new_metadata) as writer:
    for filename in filenames:
        with chemfp.open(filename) as reader:
            writer.write_fingerprints(reader)


When I run that code with the mismatched fingerprint types, I get the error message:

Traceback (most recent call last):
  File "x.py", line 23, in <module>
    raise problem
chemfp.ChemFPProblem: ERROR: 'Compound_099000001_099500000.fps' has 881 bit fingerprints but 'chebi_maccs.fps' has 166 bit fingerprints


I then removed the chebi_maccs.fps and manually changed the fingerprint type in Compound_048500001_049000000.fps, so I could demonstrate what a warning message looks like:

WARNING: 'Compound_099000001_099500000.fps' has fingerprints of type
'CACTVS-E_SCREEN/1.0 extended=2' but 'Compound_048500001_049000000.fps'
has fingerprints of type 'CACTVS-E_SCREEN/1.0 extended=DIFFERENT_VALUE'
Traceback (most recent call last):
  File "/Users/dalke/cvses/cfp-3x/docs/x.py", line 23, in <module>
    raise problem
chemfp.ChemFPProblem: ERROR: 'Compound_099000001_099500000.fps' has
881 bit fingerprints but 'chebi_maccs.fps' has 166 bit fingerprints


(In case you’re wondering what the type string means, those are the actual CACTVS parameters that PubChem uses, according to the CACTVS author, Wolf-Dietrich Ihlenfeldt.)

Lastly, sometimes the query is a simple byte string. There’s not really much to compare, but you can use chemfp.check_fingerprint_problems() to see if the fingerprint length is compatible with a metadata instance:

>>> import chemfp
>>> chemfp.check_fingerprint_problems(b"\x00\x00\x00\x00", chemfp.Metadata(num_bytes=21))
[ChemFPProblem('error', 'num_bytes mismatch', 'query contains 4
bytes but target has 21 byte fingerprints')]


The simsearch command-line tool uses this function to check if the query fingerprint, which is entered as hex as a command-line parameter, is compatible with the target fingerprints.

## How to write very large FPB files¶

In this section you’ll learn how to write an FPB file even when fingerprint data is so large that the intermediate data doesn’t all fit into memory at once.

By default the FPB format will reorder the fingerprints to be in popcount order. (Use the reorder=False option to preserve the input order.) This requires intermediate storage in order to sort all of the records. By default the writer will use memory for this, though the implementation may require about two to three times as much memory as the raw fingerprint size.

That is, if you have 50 million fingerprints, with 1024 bits per fingerprint, plus 10 bytes for the name, then the fingerprint arena requires about 6 GiB of memory, plus 0.5 GiB for the ids, and another ~1 GiB for the id lookup table.

That calculation gives the minimum amount of memory needed. The actual implementation may preallocate up to twice as much memory as the current size, in order to handle growth gracefully, and there is some additional overhead. You may be left with the case where you have 12 GiB of RAM, and where the final FPB file is only 8 GiB in size, but where the intermediate storage requires 15 GiB of RAM.
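Those size estimates are straightforward arithmetic:

```python
GiB = 1024 ** 3
num_fps = 50_000_000
fp_bytes = 1024 // 8            # 128 bytes for a 1024-bit fingerprint

arena = num_fps * fp_bytes      # raw fingerprint arena
ids = num_fps * 10              # ~10 bytes per identifier

print(round(arena / GiB, 1), round(ids / GiB, 1))  # 6.0 0.5
```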

Or you may want to build that data set on a machine with 6 GiB of RAM, and copy the result over to the production machine with a lot more memory.

If that happens, then use the max_spool_size option to specify the maximum number of bytes to store in memory before switching to temporary files for additional storage. This should be about 1/3 of the available RAM because there can be two different temporary file “spools”, each of which can use up to max_spool_size bytes of RAM.

For example, the following will use at most about 4 GiB of RAM:

writer = chemfp.open_fingerprint_writer(
    "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024)


Note: do not make this too small. The merge step opens all of the temporary files in order to make the final FPB output file. If you specify a spool size of 50 MiB then you’ll end up creating several hundred files for PubChem, which may exceed the resource limits for the number of open file descriptors for a process. When that happens you’ll get an exception like:

IOError: [Errno 24] Too many open files


Where does the FPB writer store the temporary files? It uses Python’s tempfile module to create the temporary files in a directory. Quoting from that documentation, “The default directory is chosen from a platform-dependent list, but the user of the application can control the directory location by setting the TMPDIR, TEMP or TMP environment variables.”

Environment variables give one way to specify an alternate directory. Or you can specify it directly using the tmpdir parameter, as in:

writer = chemfp.open_fingerprint_writer(
    "pubchem.fpb", max_spool_size = 2 * 1024 * 1024 * 1024,
    tmpdir = "/scratch")


This can be very important on some cluster machines with a small local /tmp but a large networked scratch disk.

## FPS fingerprint writer errors¶

In this section you’ll learn how the FPS fingerprint writer handles errors, and how to change the error handling behavior.

It’s hard but not impossible to have the FPS writer raise an exception:

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.write_fingerprint("Tab\tHere", b"\0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "chemfp/fps_io.py", line 550, in write_fingerprint
    raise_tb(err[0], err[1])
  File "chemfp/fps_io.py", line 467, in _fps_writer_gen
    location)
  File "chemfp/io.py", line 87, in error
    _compat.raise_tb(ParseError(msg, location), None)
  File "<string>", line 1, in raise_tb
chemfp.ParseError: Unable to write an identifier containing a tab: 'Tab\tHere', file '<stdout>', line 1, record #1


The FPS file format simply doesn’t support tab characters in the identifier, nor newline characters for that matter. It also doesn’t allow empty identifiers.
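A sketch of those identifier rules (illustrative only; chemfp’s writer does its own validation and raises chemfp.ParseError instead of ValueError):

```python
def check_identifier(id_):
    # FPS identifiers may not be empty, and may not contain a tab
    # or newline, since those would break the line-oriented format.
    if not id_:
        raise ValueError("empty identifier")
    for forbidden in "\t\n\r":
        if forbidden in id_:
            raise ValueError("identifier contains a forbidden character")
    return id_

try:
    check_identifier("Tab\tHere")
except ValueError as err:
    print(err)  # identifier contains a forbidden character
```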

As you saw, the default error action is to raise an exception.

Sometimes it’s okay to ignore errors. For example, you might process a large number of structures where you know that a few of them have missing or poorly formed identifiers, and where it’s okay to omit those records.

The errors parameter can be used to change the error handler. The value “report” tells the writer to skip the failing record and write an error message to stderr. The value “ignore” simply skips the record:

>>> writer = chemfp.open_fingerprint_writer(None, errors="report")
#FPS1
>>> writer.write_fingerprint("", b"\0\0\0\0")
ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 1, record #1. Skipping.
>>>
>>> writer = chemfp.open_fingerprint_writer(None, errors="ignore")
#FPS1
>>> writer.write_fingerprint("", b"\0")
>>> writer.write_fingerprint("Tab\tHere", b"\0")
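
The three policies can be pictured with a small stand-alone sketch. This is an illustration of the behavior, not chemfp's actual implementation:

```python
import sys

def handle_write_error(msg, errors="strict"):
    # Mimics the three error policies: "strict" raises an exception,
    # "report" writes a message to stderr and skips the record,
    # and "ignore" just skips the record.
    if errors == "strict":
        raise ValueError(msg)
    elif errors == "report":
        sys.stderr.write("ERROR: %s. Skipping.\n" % (msg,))
    elif errors != "ignore":
        raise ValueError("Unsupported errors value: %r" % (errors,))
```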


Granted, this feature isn’t so important for FingerprintWriter.write_fingerprint() because catching the exception isn’t hard to do. It’s a bit more useful for bulk conversions with FingerprintWriter.write_fingerprints(), along these lines (a sketch; the filenames and fingerprint type are placeholders):

import chemfp

fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
writer = chemfp.open_fingerprint_writer("example.fps", errors="report")
writer.write_fingerprints(fptype.read_molecule_fingerprints("example.sdf"))
writer.close()


Note that the FPB writer ignores the errors parameter and treats all errors as “strict”.

## FPS fingerprint writer location¶

In this section you’ll learn how to get information like the number of lines and number of records written to an FPS file.

I’ll start by saying that this feature isn’t all that useful. It exists because of parallelism to the toolkit structure writers, and I wanted to experiment to see if it could be useful in the future.

The FPS fingerprint writer has a location attribute. This can be used to get some information about the state of the output writer. The most basic is the output filename. If the output is None or an unnamed file object then a fake filename will be used:

>>> import chemfp
>>> writer = chemfp.open_fingerprint_writer("example.fps")
>>> writer.location.filename
'example.fps'
>>> writer = chemfp.open_fingerprint_writer(None)
#FPS1
>>> writer.location.filename
'<stdout>'


At this point the signature line has been written, so the file is at line 1, but no records have been written:

>>> writer.location.lineno
1
>>> writer.location.recno
0
>>> writer.location.output_recno
0


Each of these values is incremented by one after adding a valid record:

>>> writer.write_fingerprint("FP001", b"\xA0\xFE")
a0fe    FP001
>>> writer.location.lineno
2
>>> writer.location.recno
1
>>> writer.location.output_recno
1


If, however, the record is invalid, then recno will still increase by one, because it counts the records sent to the writer, but the other values do not increase, because they only change when a record is written successfully:

>>> writer.write_fingerprint("", b"\xA0\xFE")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "chemfp/fps_io.py", line 550, in write_fingerprint
raise_tb(err[0], err[1])
File "chemfp/fps_io.py", line 475, in _fps_writer_gen
location)
File "chemfp/io.py", line 87, in error
_compat.raise_tb(ParseError(msg, location), None)
File "<string>", line 1, in raise_tb
chemfp.ParseError: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 2, record #2
>>> writer.location.lineno
2
>>> writer.location.recno
2
>>> writer.location.output_recno
1


This is perhaps more clearly shown if I try to write four records at once, where two contain errors, and where I’ve asked the writer to “report” errors rather than raise an exception:

>>> metadata = chemfp.Metadata(type="Experiment/1", software="AndrewDalke/1")
>>> writer = chemfp.open_fingerprint_writer(None, metadata=metadata, errors="report")
#FPS1
#type=Experiment/1
#software=AndrewDalke/1
>>> writer.location.lineno
3
>>> writer.location.recno
0
>>> writer.location.output_recno
0
>>> writer.write_fingerprints( [("A", b"\0\0"), ("\t", b"\0\1"), ("", b"\0\2"), ("B", b"\0\3")] )
0000  A
ERROR: Unable to write an identifier containing a tab: '\t', file '<stdout>', line 4, record #2. Skipping.
ERROR: Unable to write a fingerprint with an empty identifier, file '<stdout>', line 4, record #3. Skipping.
0003  B
>>> writer.location.recno
4
>>> writer.location.output_recno
2
>>> writer.location.lineno
5


There are three lines in the header: the signature, the type line, and the software line. I tried to write four fingerprints, but two were invalid. The writer wrote the valid fingerprint “A” to stdout, reported the two invalid records to stderr, and wrote the valid fingerprint “B” to stdout. Thus, two records were actually output, which is why output_recno is 2, while four records were sent to the writer, which is why recno is 4. The three header lines and the two lines of output give five lines of output, so the final lineno is 5.
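
The counter logic is easy to model. Here's a stand-alone sketch (not chemfp's implementation) that reproduces the numbers from the session above, with header_lines=3 matching the three header lines:

```python
class SketchLocation:
    def __init__(self, header_lines=0):
        self.lineno = header_lines   # lines written so far
        self.recno = 0               # records sent to the writer
        self.output_recno = 0        # records successfully written

def write_records(location, records, is_valid):
    for record in records:
        location.recno += 1
        if is_valid(record):
            location.output_recno += 1
            location.lineno += 1     # each valid record is one output line

loc = SketchLocation(header_lines=3)
write_records(loc, ["A", "\t", "", "B"],
              lambda ident: bool(ident) and "\t" not in ident)
print(loc.lineno, loc.recno, loc.output_recno)  # → 5 4 2
```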

In case you hadn’t figured it out, the location information is used to make the exception and error message. That explains why both of the error reports say the error is on “line 4”; that’s the line that would have been output if there were no error.

Note that the FPB writer does not have a location, and it ignores the location parameter.

## MACCS dependency on hydrogens¶

In this section you’ll learn how the RDKit MACCS fingerprints differ if there are explicit or implicit hydrogens.

Note: A goal of this is to show that MACCS key generation isn’t as easy as you might think it is!

One of my long-term goals is to get a good cross-toolkit implementation of the MACCS keys. It’s very odd how the MACCS keys are the de facto fingerprint for cheminformatics, but the toolkits don’t give the same answers. Over the years, I’ve found bugs or incomplete definitions in all of the toolkits I’ve looked at, which I’ve reported and have since been fixed.

If you use RDKit, Open Babel, or CDK then your toolkit implements MACCS keys that were derived from the ones that Greg Landrum developed for RDKit. The portable portion uses hand-translated SMARTS definitions for most of the MACCS key definitions. A couple of keys, like key 125 (“at least two aromatic rings”), cannot be represented as SMARTS. RDKit has special code for these definitions, but Open Babel does not.

Even with a portable SMARTS definition, I would expect to see some differences between the toolkits, if only because they have different aromaticity models. One toolkit might call something an aromatic ring, while another says it’s aliphatic.

Unfortunately, the SMARTS patterns used in those programs give different results depending on whether the structure has explicit or implicit hydrogens. I’ll demonstrate using RDKit, because it has a reader_arg to specify whether to remove hydrogens from the input structure or not. (Here “remove” means to make them implicit.)

I’ll use RDKit twice to read the first molecule from a file and compute the RDKit fingerprint; the first time I keep the hydrogens and the second time I remove them:

>>> import chemfp
>>> from chemfp import bitops
>>> filename = "Compound_099000001_099500000.sdf.gz"
>>> fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
>>> with_h_id, with_h_fp = next(fptype.read_molecule_fingerprints(
...     filename, reader_args={"removeHs": False}))
>>> with_h_id, bitops.hex_encode(with_h_fp)
('99000039', '000004000000300001c4004e93e1b053dce16f6e1f')
>>> without_h_id, without_h_fp = next(fptype.read_molecule_fingerprints(
...     filename, reader_args={"removeHs": True}))
>>> without_h_id, bitops.hex_encode(without_h_fp)
('99000039', '000004000000300001c0004e9361b051dce1676e1f')


If you look closely you’ll see that they have two different fingerprints! I’ll make it easier to see by reporting the bits which are only in one or the other fingerprint:

>>> with_h_bits = set(bitops.byte_to_bitlist(with_h_fp))
>>> without_h_bits = set(bitops.byte_to_bitlist(without_h_fp))
>>> sorted(with_h_bits - without_h_bits) # only with hydrogens
[74, 111, 121, 147]
>>> sorted(without_h_bits - with_h_bits) # only without hydrogens
[]


The molecule with explicit hydrogens sets four more bits than the one with implicit hydrogens.
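
If you're curious how bitops.byte_to_bitlist assigns bit numbers, it behaves like this pure-Python sketch (an illustration, not chemfp's implementation), which reproduces the difference using the two hex-encoded fingerprints from above:

```python
def byte_to_bitlist(fp):
    # chemfp numbers bits from the least-significant bit of the
    # first byte: bit k of byte i is bit number i*8 + k.
    return [i * 8 + k
            for i, byte in enumerate(fp)
            for k in range(8)
            if byte >> k & 1]

with_h_fp = bytes.fromhex("000004000000300001c4004e93e1b053dce16f6e1f")
without_h_fp = bytes.fromhex("000004000000300001c0004e9361b051dce1676e1f")
print(sorted(set(byte_to_bitlist(with_h_fp)) - set(byte_to_bitlist(without_h_fp))))
# → [74, 111, 121, 147]
```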

Why is that? The RDKit (and hence Open Babel and CDK) definitions often use “*” to match an atom, when the corresponding MACCS definition is supposed to exclude hydrogens. A hydrogen-independent version would use “[!#1]” instead. By default RDKit removes normal explicit hydrogens, so this isn’t usually a problem. As far as I can tell, Open Babel always removes them from an SD file, so again this isn’t really a problem. (Well, except for hydrogens with an explicit isotope number.)

The list [74, 111, 121, 147] gives bit numbers. The corresponding keys are 75, 112, 122, and 148. I looked at how key 112 is defined in various sources:

Definitions for key 112 (bit 111)
MACCS: AA(A)(A)A
RDKit:  *~*(~*)(~*)~*
OpenBabel: *~*(~*)(~*)~*
CDK: *~*(~*)(~*)~*
chemfp's RDMACCS-*: [!#1]~*(~[!#1])(~[!#1])~[!#1]
O'Donnell: *~*(~*)(~*)~*


(“O’Donnell” here comes from Table A.4 of TJ O’Donnell’s Design and Use of Relational Databases in Chemistry.)
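
Since chemfp numbers bits from 0 while the MACCS keys are numbered from 1, converting between them is just an off-by-one shift:

```python
bitnos = [74, 111, 121, 147]              # bit numbers reported by chemfp
keynos = [bitno + 1 for bitno in bitnos]  # corresponding MACCS key numbers
print(keynos)  # → [75, 112, 122, 148]
```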

If you know SMARTS you can see how an explicit H will lead to a different match than an implicit one, except for chemfp’s own attempt at making a cross-toolkit MACCS implementation. I’ll test out RDMACCS-RDKit, which is chemfp’s implementation of the MACCS 166 fingerprint using RDKit:

>>> chemfp_maccs = chemfp.get_fingerprint_type("RDMACCS-RDKit")
>>> with_h_id, with_h_fp = next(chemfp_maccs.read_molecule_fingerprints(
...     filename, reader_args={"removeHs": False}))
>>> with_h_id, bitops.hex_encode(with_h_fp)
('99000039', '000004000000300001c0004e9361b051dce1676e1f')
>>> without_h_id, without_h_fp = next(chemfp_maccs.read_molecule_fingerprints(
...     filename, reader_args={"removeHs": True}))
>>> without_h_id, bitops.hex_encode(without_h_fp)
('99000039', '000004000000300001c0004e9361b051dce1676e1f')
>>> with_h_fp == without_h_fp
True


What a relief that they are the same!

If you want to use the OEChem, Open Babel, or CDK-based RDMACCS implementations, the corresponding fingerprint type names are “RDMACCS-OpenEye”, “RDMACCS-OpenBabel”, and “RDMACCS-CDK”, respectively, and the command-line option for oe2fps, ob2fps, and cdk2fps is --rdmaccs.

WARNING: the RDMACCS fingerprints have not been fully validated! Validation is hard. A chemfp goal is to make that easier.

To finish, I was curious about the differences in RDKit’s native MACCS166 implementation across all of the records in the file, so I wrote some code. It’s a direct evolution of the code you already saw. (Note: for Python 2 I use itertools.izip() as a replacement for Python 3’s iterator-based zip().)

from __future__ import print_function # Only for Python 2
import itertools
from collections import Counter
import chemfp
from chemfp import bitops

zip = getattr(itertools, "izip", zip) # Support Python2 and Python3

filename = "Compound_099000001_099500000.sdf.gz"

fptype = chemfp.get_fingerprint_type("RDKit-MACCS166")
with_h_fingerprints = fptype.read_molecule_fingerprints(
    filename, reader_args={"removeHs": False})
without_h_fingerprints = fptype.read_molecule_fingerprints(
    filename, reader_args={"removeHs": True})

extra_with_h = Counter()
extra_without_h = Counter()
num_records = 0
for (id1, with_h_fp), (id2, without_h_fp) in zip(with_h_fingerprints,
                                                 without_h_fingerprints):
    num_records += 1
    assert id1 == id2, (id1, id2)
    if with_h_fp != without_h_fp:
        with_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(with_h_fp))
        without_h_keys = set(bitno+1 for bitno in bitops.byte_to_bitlist(without_h_fp))
        only_with_h = sorted(with_h_keys - without_h_keys)
        only_without_h = sorted(without_h_keys - with_h_keys)
        print(id1, "with:", only_with_h, "without:", only_without_h)
        extra_with_h.update(only_with_h)
        extra_without_h.update(only_without_h)

print("\nNumber of records:", num_records)
print("\nCounts that were only with hydrogens:")
for key, count in extra_with_h.most_common():
    print("  %d %d" % (key, count))
print("\nCounts that were only without hydrogens:")
for key, count in extra_without_h.most_common():
    print("  %d %d" % (key, count))


In case you were wondering, the report summary starts:

Number of records: 10826

Counts that were only with hydrogens:
112 6851
150 3345
144 3209
122 2807
138 2767
66 2763
148 2372
155 2311
126 684
76 682
75 412
81 407
128 344
118 173
156 96
107 24
90 18
108 15
129 9
132 2


Now you can see why I used key 112 in my elaboration - it’s the one that causes the most problems!

## Create similarity search web service¶

In this section you’ll learn how to write a simple WSGI-based web service which does a similarity search given an SDF record.

I found it a bit difficult to write this section because few people will write a WSGI service directly. I think most people use Django, but a Django example would require several different files to make it work. There are other web frameworks I could use, like Flask, but I eventually decided to limit myself to what’s available in the standard library, that is, the wsgiref module.

I’m going to write a WSGI server named “simple_server.py” which takes an SDF record as input and returns the top 5 hits from a specified database. If there’s a GET request then the result is a simple form. The form sends a POST request to the server, with the SDF record in the query parameter q.

By the way, if the target fingerprint data set is large then you should use an FPB file to get the best startup performance.

Let’s get started. The first part is a comment about what the code does and some imports:

# This is a very simple fingerprint search server.
# I call it 'simple_server.py'.
#
# Usage: simple_server <fingerprint_filename> [port]
#
# A GET to the server (default port is 8080) returns a simple form.
# The form has a single text box, to paste the SDF query or queries.
# The POST query variable 'q' contains the SDF contents.
# The search finds the nearest 5 targets for each query record.
# The result is a simple list of query ids and their matches.

import argparse
from wsgiref.simple_server import make_server
import cgi

import chemfp


The server will return an HTML form for a GET request:

# Create a simple form.
def query_form(environ, start_response):
    status = '200 OK' # HTTP Status
    headers = [('Content-Type', 'text/html; charset=utf-8')]
    start_response(status, headers)

    # The returned object is going to be printed.
    # Must be a byte string for Python 3.
    return [b"""<html>
<title>Simple fingerprint search</title>
<body>
<form method="POST">
Paste in SDF record(s):<br />
<textarea name="q" type="text" rows="20" cols="80"></textarea><br />
<button type="submit">Search!</button>
</form>
</body>
</html>
"""]


I’ll use the argparse module to handle the command-line arguments:

# Command-line parameters
parser = argparse.ArgumentParser("simple_search",
    description="Simple fingerprint web server with SDF input")
parser.add_argument("filename",
    help="chemfp fingerprint filename")
parser.add_argument("port", type=int, default=8080, nargs="?",
    help="port to use (default is 8080)")


The heavy work is in the ‘main’ function. It starts with some setup to load the fingerprints and make sure the fingerprint type is available:

def main():
    args = parser.parse_args()

    # Load the arena, get the type, and make sure I can handle the type.
    arena = chemfp.load_fingerprints(args.filename)
    print("Loaded %s fingerprints from %r" % (len(arena), args.filename))

    type = arena.metadata.type
    if type is None:
        parser.error("File %r does not contain a fingerprint type" % (args.filename,))

    try:
        fptype = chemfp.get_fingerprint_type(type)
    except KeyError as err:
        parser.error(str(err))


It then defines the WSGI app, which returns the query_form() for a GET request, or processes the form for a POST request. I think the embedded comments explain things enough:

    # ... continue the 'main' function ...
    # This is the WSGI app, defined inside of main

    def fingerprint_search_app(environ, start_response):
        # Is this a GET or a POST? If a GET, return the query form
        if environ["REQUEST_METHOD"] != "POST":
            return query_form(environ, start_response)

        # Get the query data from the POST
        post_env = environ.copy()
        post = cgi.FieldStorage(
            fp=environ['wsgi.input'],
            environ=post_env,
            keep_blank_values=True,
        )
        q = post.getfirst("q", "")
        # The underlying toolkit code may require "\n" instead of "\r\n" strings.
        q = q.replace("\r\n", "\n")

        # For each input record, do a search, get the results, and build up the output lines.
        # Ignore any records that can't be parsed.

        output = ["Search against %r using k=5 and threshold=0.0\n\n" % (args.filename,)]

        # The next few lines use chemfp to convert each record into a
        # fingerprint, do the search for the top 5 hits, get the ids
        # and scores for the hits, and make the output text.

        for query_id, fp in fptype.read_molecule_fingerprints_from_string(q, "sdf", errors="ignore"):
            results = arena.knearest_tanimoto_search_fp(fp, k=5, threshold=0.0)
            text = " ".join("%s (%.3f)" % (id, score) for (id, score) in results.get_ids_and_scores())
            output.append("%s => %s\n" % (query_id, text))

        # Return the results in plain text.
        status = '200 OK' # HTTP Status
        headers = [('Content-Type', 'text/plain; charset=utf-8')]
        start_response(status, headers)

        # Python 3 requires bytes, not strings, so convert to UTF-8
        return [line.encode("utf8") for line in output]


The main function ends with some code to start the WSGI server using the correct port:

    # ... end of the 'main' function ...
    # Make the server and run it. (Use ^C to kill it.)
    httpd = make_server('', args.port, fingerprint_search_app)
    print("Serving fingerprint search on port %s..." % (args.port,))

    httpd.serve_forever()


Finally, code to start things rolling:

if __name__ == "__main__":
    main()
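
To sanity-check a WSGI app like this without starting a server, you can call it directly with a synthetic environ. This sketch uses a trivial stand-in app (demo_app is hypothetical, since the real app needs a fingerprint file):

```python
import io
from wsgiref.util import setup_testing_defaults

def demo_app(environ, start_response):
    # Stand-in for fingerprint_search_app: a GET returns a form,
    # a POST echoes the request body back.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    if environ["REQUEST_METHOD"] != "POST":
        return [b"form"]
    size = int(environ.get("CONTENT_LENGTH") or 0)
    return [b"query: " + environ["wsgi.input"].read(size)]

def call(app, method="GET", body=b""):
    # Build a minimal WSGI environ and invoke the app directly.
    environ = {"REQUEST_METHOD": method,
               "CONTENT_LENGTH": str(len(body)),
               "wsgi.input": io.BytesIO(body)}
    setup_testing_defaults(environ)  # fill in the remaining required keys
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)

print(call(demo_app))                      # → ('200 OK', b'form')
print(call(demo_app, "POST", b"hello"))    # → ('200 OK', b'query: hello')
```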


I’ll start the server using a ChEBI-derived data set:

% python simple_server.py rdkit_chebi.fps

Search against 'rdkit_chebi.fps' using k=5 and threshold=0.0