chemfp.encodings module

Decode different fingerprint representations into a chemfp byte string.

Chemfp fingerprints are stored as byte strings, with the bytes in least-significant byte order (bit #0 is stored in the first/left-most byte) and with the bits of each byte in most-significant bit order (bit #0 is stored in the right-most, least-significant bit of the first byte).
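The layout can be illustrated with a minimal sketch. The helper name get_bit is hypothetical and not part of chemfp; it only shows that bit i lives in byte i // 8, at position i % 8 counted from the least-significant bit of that byte.

```python
def get_bit(fp, i):
    """Return True if bit i is on in the byte string fp (hypothetical helper)."""
    byte_index, bit_index = divmod(i, 8)
    return bool((fp[byte_index] >> bit_index) & 1)


fp = b"\x12\x02\x00\x80"
on_bits = [i for i in range(len(fp) * 8) if get_bit(fp, i)]
print(on_bits)  # [1, 4, 9, 31]
```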

Other systems use different encodings. These include:
  • the ‘0’ and ‘1’ characters, as in ‘00111101’
  • hex encoding, like ‘3d’
  • base64 encoding, like ‘SGVsbG8h’
  • CACTVS’s variation of base64 encoding

plus variations of different LSB and MSB orders.
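To make the differences concrete, the following sketch writes one single-byte fingerprint in several of these encodings using only the Python standard library. It does not use chemfp, and the variable names are illustrative.

```python
import base64
import binascii

fp = b"\x3d"  # bits 0, 2, 3, 4 and 5 are on

# '0'/'1' characters with bit 0 on the right (MSB order): last byte first,
# each byte written most-significant bit first.
msb_text = "".join(format(byte, "08b") for byte in reversed(fp))
# The same characters with bit 0 on the left (LSB order).
lsb_text = msb_text[::-1]

hex_text = binascii.hexlify(fp).decode("ascii")
b64_text = base64.b64encode(fp).decode("ascii")

print(msb_text, lsb_text, hex_text, b64_text)  # 00111101 10111100 3d PQ==
```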

This module decodes most of the fingerprint encodings I have come across. The fingerprint decoders return a 2-tuple of the bit length and the chemfp fingerprint. The bit length is None unless it is known exactly, which is currently the case only for the binary and CACTVS fingerprints, and for from_on_bit_positions(), which reports the num_bits it was given. (The hex and other decoders must round the fingerprints up to a multiple of 8 bits.)

chemfp.encodings.from_base64(text)

Decode a base64 encoded fingerprint string

The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.
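Because the bytes are already in chemfp order, decoding amounts to a plain base64 decode. The sketch below uses the standard library; the function name is hypothetical and this is not necessarily how chemfp implements it. The bit length is None because base64 carries whole bytes and no bit count.

```python
import base64


def decode_base64_sketch(text):
    """Decode standard base64 into (None, bytes); illustrative only."""
    return None, base64.b64decode(text)


print(decode_base64_sketch("SGk="))  # (None, b'Hi')
```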

>>> from_base64("SGk=")
(None, b'Hi')
>>> from binascii import hexlify
>>> hexlify(from_base64("SGk=")[1])
b'4869'
>>> 
chemfp.encodings.from_binary_lsb(text)

Convert a string like ‘00010101’ (bit 0 here is off) into ‘\xa8’

The encoding characters ‘0’ and ‘1’ are in LSB order, so bit 0 is the left-most field. The result is a 2-tuple of the fingerprint length and the decoded chemfp fingerprint.
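A minimal sketch of this decoding, assuming text input and a hypothetical function name; it is not chemfp's implementation. Character i of the input sets bit i % 8 of byte i // 8, and the output is padded up to a whole number of bytes.

```python
def decode_binary_lsb_sketch(text):
    """Decode '0'/'1' text with bit 0 on the left into (num_bits, bytes)."""
    num_bits = len(text)
    out = bytearray((num_bits + 7) // 8)
    for i, ch in enumerate(text):
        if ch == "1":
            out[i // 8] |= 1 << (i % 8)
        elif ch != "0":
            raise ValueError("expected only '0' and '1' characters")
    return num_bits, bytes(out)


print(decode_binary_lsb_sketch("00010101"))  # (8, b'\xa8')
print(decode_binary_lsb_sketch("11101"))     # (5, b'\x17')
```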

>>> from_binary_lsb('00010101')
(8, b'\xa8')
>>> from_binary_lsb('11101')
(5, b'\x17')
>>> from_binary_lsb('00000000000000010000000000000')
(29, b'\x00\x80\x00\x00')
>>>
chemfp.encodings.from_binary_msb(text)

Convert a string like ‘10101000’ (bit 0 here is off) into ‘\xa8’

The encoding characters ‘0’ and ‘1’ are in MSB order, so bit 0 is the right-most field.
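With bit 0 on the right, the text is an ordinary base-2 number, so one way to sketch the decoding is to parse it as an integer and write that integer out in little-endian byte order. The function name is hypothetical and this is not chemfp's implementation; note that int() is more lenient than a strict decoder would be (it tolerates surrounding whitespace, for example).

```python
def decode_binary_msb_sketch(text):
    """Decode '0'/'1' text with bit 0 on the right into (num_bits, bytes)."""
    num_bits = len(text)
    value = int(text, 2)  # raises ValueError for characters other than 0 and 1
    return num_bits, value.to_bytes((num_bits + 7) // 8, "little")


print(decode_binary_msb_sketch(b"10101000"))  # (8, b'\xa8')
print(decode_binary_msb_sketch(b"00111"))     # (5, b'\x07')
```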

>>> from_binary_msb(b'10101000')
(8, b'\xa8')
>>> from_binary_msb(b'00010101')
(8, b'\x15')
>>> from_binary_msb(b'00111')
(5, b'\x07')
>>> from_binary_msb(b'00000000000001000000000000000')
(29, b'\x00\x80\x00\x00')
>>>
chemfp.encodings.from_cactvs(text)

Decode an 881-bit CACTVS-encoded fingerprint used by PubChem

>>> from_cactvs(b"AAADceB7sQAEAAAAAAAAAAAAAAAAAWAAAAAwAAAAAAAAAAABwAAAHwIYAAAADA" +
...             b"rBniwygJJqAACqAyVyVACSBAAhhwIa+CC4ZtgIYCLB0/CUpAhgmADIyYcAgAAO" +
...             b"AAAAAAABAAAAAAAAAAIAAAAAAAAAAA==")
(881, b'\x07\xde\x8d\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x06\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x80\x03\x00\x00\xf8@\x18\x00\x00\x000P\x83y4L\x01IV\x00\x00U\xc0\xa4N*\x00I \x00\x84\xe1@X\x1f\x04\x1df\x1b\x10\x06D\x83\xcb\x0f)%\x10\x06\x19\x00\x13\x93\xe1\x00\x01\x00p\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00\x00')
>>>
For format details, see
ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
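Judging from the example above, the base64-decoded data starts with a 4-byte big-endian bit count (0x00000371 is 881), and each following byte holds its bits in the reverse of chemfp's order. The sketch below is built on that reading of the example, not on the specification or on chemfp's code, and the function names are hypothetical.

```python
import base64


def reverse_bits(byte):
    """Reverse the 8 bits of an integer in the range 0..255."""
    return int(format(byte, "08b")[::-1], 2)


def decode_cactvs_sketch(text):
    """Decode a length-prefixed, bit-reversed base64 fingerprint (assumed layout)."""
    raw = base64.b64decode(text)
    num_bits = int.from_bytes(raw[:4], "big")  # assumed 4-byte length header
    num_bytes = (num_bits + 7) // 8
    body = raw[4:4 + num_bytes]
    return num_bits, bytes(reverse_bits(b) for b in body)


# Synthetic 10-bit input, not real PubChem data: header plus two data bytes.
example = base64.b64encode((10).to_bytes(4, "big") + bytes([0b10000000, 0b01000000]))
print(decode_cactvs_sketch(example))  # (10, b'\x01\x02')
```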
chemfp.encodings.from_daylight(text)

Decode a Daylight ASCII fingerprint

>>> from_daylight(b"I5Z2MLZgOKRcR...1")
(None, b'PyDaylight')

See the implementation for format details.
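The example above is consistent with the following layout, which is an assumption inferred from that example and not taken from chemfp's source: a 64-character alphabet of ‘.’, ‘,’, the digits, the upper-case letters, then the lower-case letters; four characters per three bytes; and a final digit giving the number of valid bytes in the last group. The function name is hypothetical.

```python
# Assumed alphabet: '.', ',', 0-9, A-Z, a-z (64 characters, 6 bits each).
DAYLIGHT_ALPHABET = ".,0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def decode_daylight_sketch(text):
    """Decode a Daylight-style ASCII fingerprint into (None, bytes)."""
    if isinstance(text, bytes):
        text = text.decode("ascii")
    if len(text) % 4 != 1 or text[-1] not in "123":
        raise ValueError("unexpected fingerprint length or final character")
    body, valid = text[:-1], int(text[-1])
    out = bytearray()
    for i in range(0, len(body), 4):
        value = 0
        for ch in body[i:i + 4]:
            # str.index raises ValueError for characters outside the alphabet
            value = (value << 6) | DAYLIGHT_ALPHABET.index(ch)
        out += value.to_bytes(3, "big")
    # Drop the padding bytes of the final 3-byte group.
    out = out[:len(out) - (3 - valid)]
    return None, bytes(out)


print(decode_daylight_sketch("I5Z2MLZgOKRcR...1"))  # (None, b'PyDaylight')
```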

chemfp.encodings.from_hex(text)

Decode a hex encoded fingerprint string

The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order.

>>> from_hex(b'10f2')
(None, b'\x10\xf2')
>>>

Raises a ValueError if the length of the hex string is not a multiple of 2 or if it contains a non-hex character.
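The same behavior can be sketched with the standard library, where binascii.unhexlify rejects odd-length and non-hex input with binascii.Error, a subclass of ValueError. The function name is hypothetical and this is not necessarily how chemfp implements the decoder.

```python
import binascii


def decode_hex_sketch(text):
    """Decode hex text already in chemfp byte and bit order into (None, bytes)."""
    # binascii.Error is a subclass of ValueError, so callers can catch ValueError.
    return None, binascii.unhexlify(text)


print(decode_hex_sketch(b"10f2"))  # (None, b'\x10\xf2')

for bad in (b"10f", b"10fz"):
    try:
        decode_hex_sketch(bad)
    except ValueError as err:
        print("rejected", bad, "-", err)
```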

chemfp.encodings.from_hex_lsb(text)

Decode a hex encoded fingerprint string where the bits and bytes are in LSB order

>>> from_hex_lsb(b'102f')
(None, b'\x08\xf4')
>>> 

Raises a ValueError if the length of the hex string is not a multiple of 2 or if it contains a non-hex character.
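The example above matches a plain hex decode followed by reversing the eight bits of every byte (0x10 becomes 0x08 and 0x2f becomes 0xf4). A sketch under that reading, with a hypothetical function name and a precomputed lookup table:

```python
import binascii

# Table mapping each byte value to the value with its 8 bits reversed.
REVERSED_BITS = bytes(int(format(i, "08b")[::-1], 2) for i in range(256))


def decode_hex_lsb_sketch(text):
    """Decode hex text whose bits and bytes are in LSB order into (None, bytes)."""
    return None, binascii.unhexlify(text).translate(REVERSED_BITS)


print(decode_hex_lsb_sketch(b"102f"))  # (None, b'\x08\xf4')
```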

chemfp.encodings.from_hex_msb(text)

Decode a hex encoded fingerprint string where the bits and bytes are in MSB order

>>> from_hex_msb(b'10f2')
(None, b'\xf2\x10')
>>>

Raises a ValueError if the length of the hex string is not a multiple of 2 or if it contains a non-hex character.
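Here the bits within each byte already match chemfp's order, so the example above is reproduced by decoding the hex and reversing the byte order. A sketch with a hypothetical function name:

```python
import binascii


def decode_hex_msb_sketch(text):
    """Decode hex text whose bits and bytes are in MSB order into (None, bytes)."""
    return None, binascii.unhexlify(text)[::-1]


print(decode_hex_msb_sketch(b"10f2"))  # (None, b'\xf2\x10')
```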

chemfp.encodings.from_on_bit_positions(text, num_bits=1024, separator=' ')

Decode from a list of integers describing the location of the on bits

>>> from_on_bit_positions("1 4 9 63", num_bits=32)
(32, b'\x12\x02\x00\x80')
>>> from_on_bit_positions("1,4,9,63", num_bits=64, separator=",")
(64, b'\x12\x02\x00\x00\x00\x00\x00\x80')

The text contains a sequence of non-negative integer values separated by the separator text. Bit positions are folded modulo num_bits.

This is often used to convert sparse fingerprints into a dense fingerprint.
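The folding described above can be sketched as follows. The function name is hypothetical, and treating an empty string as "no bits set" is an assumption of this sketch, not documented chemfp behavior.

```python
def decode_on_bit_positions_sketch(text, num_bits=1024, separator=" "):
    """Fold a separated list of on-bit positions into (num_bits, bytes)."""
    if num_bits <= 0:
        raise ValueError("num_bits must be positive")
    out = bytearray((num_bits + 7) // 8)
    fields = text.split(separator) if text else []  # assumption: empty means no bits
    for field in fields:
        position = int(field)
        if position < 0:
            raise ValueError("bit positions must be non-negative")
        position %= num_bits  # fold positions beyond the fingerprint length
        out[position // 8] |= 1 << (position % 8)
    return num_bits, bytes(out)


print(decode_on_bit_positions_sketch("1 4 9 63", num_bits=32))
# (32, b'\x12\x02\x00\x80')
```

In the 32-bit case position 63 folds to 63 % 32 = 31, which is the top bit of the last byte.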

Note: if you have a list of bit positions as integer values then you probably want to use chemfp.bitops.byte_from_bitlist().

chemfp.encodings.get_decoder(decoder)

Return a decoder function given a decoder name or function.

If decoder is a string then return the named decoder function, or raise a KeyError if the named function does not exist.

If decoder is a callable then return it.

Otherwise, raise a TypeError.
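The three cases can be sketched as below. The registry and its single "hex" entry are hypothetical stand-ins; chemfp's actual table of decoder names is not shown in this document.

```python
import binascii

# Hypothetical registry of decoder names, for illustration only.
_DECODERS = {
    "hex": lambda text: (None, binascii.unhexlify(text)),
}


def get_decoder_sketch(decoder):
    """Return a decoder function given a registered name or a callable."""
    if isinstance(decoder, str):
        return _DECODERS[decoder]  # raises KeyError for an unknown name
    if callable(decoder):
        return decoder
    raise TypeError("decoder must be a name or a callable, not %r" % (decoder,))


print(get_decoder_sketch("hex")(b"10f2"))  # (None, b'\x10\xf2')
```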

chemfp.encodings.get_decoder_names()

Return a sorted list of registered decoder names

chemfp.encodings.import_decoder(path)

Find a decoder function given its full name, as in ‘chemfp.encodings.from_cactvs’

This function imports any intermediate modules, which may be a security concern.
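The security concern comes from how a lookup like this typically works: importing a module runs its top-level code. The sketch below shows the general technique with importlib and a hypothetical function name; it is not chemfp's implementation, and it is demonstrated on a standard-library function.

```python
import importlib


def import_by_full_name_sketch(path):
    """Resolve a dotted name like 'package.module.function' to an object."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError("expected a dotted name like 'package.module.function'")
    # Importing executes the module's top-level code, so pass only trusted names.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


unhexlify = import_by_full_name_sketch("binascii.unhexlify")
print(unhexlify(b"10f2"))  # b'\x10\xf2'
```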