.. py:module:: chemfp

.. _chemfp-api:

####################
chemfp API
####################

This chapter contains the docstrings for the public portion of the
chemfp API.

.. _chemfp-toplevel-api:

chemfp top-level API
====================

The following functions and classes are in the top-level chemfp module.

is_licensed
-----------

.. py:function:: is_licensed()

    Return True if the chemfp license is valid, otherwise return False.

    :returns: True or False

    New in chemfp 3.2.1.

get_license_date
----------------

.. py:function:: get_license_date()

    Return the expiration date as a 3-element tuple in the form (year, month, day).

    If the license key is not found or does not pass the security check then the function returns None. If this version of chemfp does not need a license key then it returns (9999, 12, 25).

    :returns: a 3-element tuple or None

    New in chemfp 3.2.1.

open
----

.. py:function:: open(source, format=None, location=None)

    Read fingerprints from a fingerprint file

    Read fingerprints from *source*, using the given format. If *source* is a string then it is treated as a filename. If *source* is None then fingerprints are read from stdin. Otherwise, *source* must be a Python file object supporting the ``read`` and ``readline`` methods.

    If *format* is None then the fingerprint file format and compression type are derived from the source filename, or from the ``name`` attribute of the source file object. If the source is None then stdin is assumed to be uncompressed data in "fps" format.

    The supported format strings are:

    * "fps", "fps.gz", or "fps.zst" for fingerprints in FPS format
    * "fpb", "fpb.gz" or "fpb.zst" for fingerprints in FPB format

    The optional *location* is a :class:`chemfp.io.Location` instance. It will only be used if the source is in FPS format.

    If the source is in FPS format then ``open`` will return a :class:`chemfp.fps_io.FPSReader`, which will use the *location* if specified.
    If the source is in FPB format then ``open`` will return a :class:`chemfp.arena.FingerprintArena` and the *location* will not be used.

    Here's an example of printing the contents of the file::

        import chemfp
        from chemfp.bitops import hex_encode

        reader = chemfp.open("example.fps.gz")
        for id, fp in reader:
            print(id, hex_encode(fp))

    :param source: The fingerprint source.
    :type source: A filename string, a file object, or None
    :param format: The file format and optional compression.
    :type format: string, or None
    :returns: a :class:`chemfp.fps_io.FPSReader` or :class:`chemfp.arena.FingerprintArena`

load_fingerprints
-----------------

.. py:function:: load_fingerprints(reader, metadata=None, reorder=True, alignment=None, format=None)

    Load all of the fingerprints into an in-memory FingerprintArena data structure

    The function reads all of the fingerprints and identifiers from *reader* and stores them into an in-memory :class:`chemfp.arena.FingerprintArena` data structure which supports fast similarity searches.

    If *reader* is a string or has a ``read`` attribute then it will be passed to the :func:`chemfp.open` function and the result used as the reader. If that returns a FingerprintArena then the *reorder* and *alignment* parameters are ignored and the arena is returned.

    If *reader* is a FingerprintArena then the *reorder* and *alignment* parameters are ignored. If *metadata* is None then the input reader is returned without modifications, otherwise a new FingerprintArena is created, whose metadata attribute is *metadata*.

    Otherwise the *reader* or the result of opening the file must be an iterator which returns (id, fingerprint) pairs. These will be used to create a new arena.

    *metadata* specifies the metadata for all returned arenas. If not given the default comes from the source file or from ``reader.metadata``.

    The loader may reorder the fingerprints for better search performance. To prevent reordering, use ``reorder=False``.
    The *reorder* parameter is ignored if the reader is an arena or FPB file.

    The *alignment* option specifies the data alignment and padding size for each fingerprint. A value of 8 means that each fingerprint will start on an 8 byte alignment, and use storage space which is a multiple of 8 bytes long. The default value of None will determine the best alignment based on the fingerprint size and available popcount methods. This parameter is ignored if the reader is an arena or FPB file.

    :param reader: An iterator over (id, fingerprint) pairs
    :type reader: a string, file object, or (id, fingerprint) iterator
    :param metadata: The metadata for the arena, if other than reader.metadata
    :type metadata: Metadata
    :param reorder: Specify if fingerprints should be reordered for better performance
    :type reorder: True or False
    :param alignment: Alignment size in bytes (both data alignment and padding); None autoselects the best alignment.
    :type alignment: a positive integer, or None
    :param format: The file format name if the reader is a string
    :type format: None, "fps", "fps.gz", "fps.zst", "fpb", "fpb.gz" or "fpb.zst"
    :returns: :class:`chemfp.arena.FingerprintArena`

read_molecule_fingerprints
--------------------------

.. py:function:: read_molecule_fingerprints(type, source=None, format=None, id_tag=None, reader_args=None, errors="strict")

    Read structures from *source* and return the corresponding ids and fingerprints

    This returns an :class:`chemfp.fps_io.FPSReader` which can be iterated over to get the id and fingerprint for each read structure record. The fingerprint generated depends on the value of *type*. Structures are read from *source*, which can either be the structure filename, or None to read from stdin.

    *type* contains the information about how to turn a structure into a fingerprint. It can be a string or a Metadata instance.
    String values look like ``OpenBabel-FP2/1``, ``OpenEye-Path``, and ``OpenEye-Path/1 min_bonds=0 max_bonds=5 atype=DefaultAtom btype=DefaultBond``. Default values are used for unspecified parameters. Use a Metadata instance with *type* and *aromaticity* values set in order to pass aromaticity information to OpenEye.

    If *format* is None then the structure file format and compression are determined by the filename's extension(s), defaulting to uncompressed SMILES if that is not possible. Otherwise *format* may be "smi" or "sdf" optionally followed by ".gz" or ".bz2" to indicate compression. The OpenBabel and OpenEye toolkits also support additional formats.

    If *id_tag* is None, then the record id is based on the title field for the given format. If the input format is "sdf" then *id_tag* specifies the tag field containing the identifier. (Only the first line is used for multi-line values.) For example, ChEBI omits the title from the SD files and stores the id after the ``> <ChEBI ID>`` line. In that case, use ``id_tag = "ChEBI ID"``.

    The *reader_args* is a dictionary with additional structure reader parameters. The parameters depend on the toolkit and the format. Unknown parameters are ignored.

    *errors* specifies how to handle errors. The value "strict" raises an exception if there are any detected errors. The value "report" sends an error message to stderr and skips to the next record. The value "ignore" skips to the next record.

    Here is an example of using fingerprints generated from a structure file::

        import chemfp
        from chemfp.bitops import hex_encode

        fp_reader = chemfp.read_molecule_fingerprints("OpenBabel-FP4/1", "example.sdf.gz")
        print("Each fingerprint has", fp_reader.metadata.num_bits, "bits")
        for (id, fp) in fp_reader:
            print(id, hex_encode(fp))

    See also :func:`chemfp.read_molecule_fingerprints_from_string`.

    :param type: information about how to convert the input structure into a fingerprint
    :type type: string or Metadata
    :param source: The structure data source.
    :type source: A filename (as a string), a file object, or None to read from stdin
    :param format: The file format and optional compression. Examples: "smi" and "sdf.gz"
    :type format: string, or None to autodetect based on the source
    :param id_tag: The tag containing the record id. Example: "ChEBI ID". Only valid for SD files.
    :type id_tag: string, or None to use the default title for the given format
    :param reader_args: additional parameters for the structure reader
    :type reader_args: dict, or None to use the default arguments
    :param errors: specify how to handle parse errors
    :type errors: one of "strict", "report", or "ignore"
    :returns: a :class:`chemfp.FingerprintReader`

read_molecule_fingerprints_from_string
--------------------------------------

.. py:function:: read_molecule_fingerprints_from_string(type, content, format, id_tag=None, reader_args=None, errors="strict")

    Read structures from the content string and return the corresponding ids and fingerprints

    The parameters are identical to :func:`chemfp.read_molecule_fingerprints` except that the entire content is passed through as a *content* string, rather than as a *source* filename. See that function for details.

    You must specify the format! As there is no *source* filename, it's not possible to guess the format based on the extension, and there is no support for auto-detecting the format by looking at the string content.

    :param type: information about how to convert the input structure into a fingerprint
    :type type: string or Metadata
    :param content: The structure data as a string.
    :type content: string
    :param format: The file format and optional compression. Examples: "smi" and "sdf.gz"
    :type format: string
    :param id_tag: The tag containing the record id. Example: "ChEBI ID". Only valid for SD files.
    :type id_tag: string, or None to use the default title for the given format
    :param reader_args: additional parameters for the structure reader
    :type reader_args: dict, or None to use the default arguments
    :param errors: specify how to handle parse errors
    :type errors: one of "strict" (raise exception), "report" (send a message to stderr and continue processing), or "ignore" (continue processing)
    :returns: a :class:`chemfp.FingerprintReader`

open_fingerprint_writer
-----------------------

.. py:function:: open_fingerprint_writer(destination, metadata=None, format=None, alignment=8, reorder=True, level=None, tmpdir=None, max_spool_size=None, errors="strict", location=None)

    Create a fingerprint writer for the given destination

    The fingerprint writer is an object with methods to write fingerprints to the given *destination*. The output format is based on the *format*. If that's None then the format depends on the *destination*, or is "fps" if the attempts at format detection fail.

    The *metadata*, if given, is a :class:`Metadata` instance, and used to fill the header of an FPS file or META block of an FPB file.

    If the output format is "fps", "fps.gz", or "fps.zst" then *destination* may be a filename, a file object, or None for stdout. If the output format is "fpb" then *destination* must be a filename or seekable file object. A fingerprint writer with compressed FPB output is not supported; use arena.save() instead, or post-process the file.

    Use *level* to change the compression level. The default is 9 for gzip and 3 for zstd. Use "min", "default", or "max" as aliases for the minimum, default, and maximum values for each range.

    Some options only apply to FPB output. The *alignment* specifies the arena byte alignment. By default the fingerprints are reordered by popcount, which enables sublinear similarity search. Set *reorder* to ``False`` to preserve the input fingerprint order.
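
One way to see why ordering by popcount can help a similarity search is the standard bound that a Tanimoto score can never exceed ``min(a, b) / max(a, b)``, where ``a`` and ``b`` are the two popcounts. The following is a minimal pure-Python sketch under that assumption. It is not chemfp's implementation, and the helper names (``popcount``, ``tanimoto``, ``best_possible_score``, ``threshold_search``) are hypothetical.

```python
def popcount(fp: bytes) -> int:
    """Number of bits set to 1 in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def best_possible_score(a: int, b: int) -> float:
    """Upper bound on the Tanimoto score, given only the two popcounts."""
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def threshold_search(query, sorted_targets, threshold):
    """Return (hits, number of full comparisons) for popcount-sorted targets."""
    query_popcount = popcount(query)
    hits, compared = [], 0
    for target_id, target_fp in sorted_targets:
        # A real arena would store precomputed popcounts instead of recounting.
        target_popcount = popcount(target_fp)
        if best_possible_score(query_popcount, target_popcount) < threshold:
            if target_popcount > query_popcount:
                break  # every later target has an even larger popcount
            continue  # too few bits set; later targets may still qualify
        compared += 1
        score = tanimoto(query, target_fp)
        if score >= threshold:
            hits.append((target_id, score))
    return hits, compared


# Hypothetical 16-bit fingerprints, sorted by popcount before searching.
targets = [("t4", b"\xff\xff"), ("t2", b"\x0f\x00"),
           ("t1", b"\x01\x00"), ("t3", b"\x0f\x01")]
targets.sort(key=lambda pair: popcount(pair[1]))
hits, compared = threshold_search(b"\x0f\x03", targets, 0.6)
```

In this toy data set only two of the four targets need a full comparison. The other two are rejected from their popcounts alone, which is the kind of pruning a popcount-sorted layout makes possible.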

    The default FPB writer stores everything into memory before writing the file, which may cause performance problems if there isn't enough available free memory. In that case, set *max_spool_size* to the number of bytes of memory to use before spooling intermediate data to a file. (Note: there are two independent spools so this may use up to roughly twice as much memory as specified.)

    Use *tmpdir* to specify where to write the temporary spool files if you don't want to use the operating system default. You may also set the TMPDIR, TEMP or TMP environment variables.

    Some options only apply to FPS output. *errors* specifies how to handle recoverable write errors. The value "strict" raises an exception if there are any detected errors. The value "report" sends an error message to stderr and skips to the next record. The value "ignore" skips to the next record.

    The *location* is a :class:`Location` instance. It lets the caller access state information such as the number of records that have been written.

    :param destination: the output destination
    :type destination: a filename, file object, or None
    :param metadata: the fingerprint metadata
    :type metadata: a Metadata instance, or None
    :param format: the output format
    :type format: None, "fps", "fps.gz", "fps.zst", or "fpb"
    :param alignment: arena byte alignment for FPB files
    :type alignment: positive integer
    :param reorder: True reorders the fingerprints by popcount, False leaves them in input order
    :type reorder: True or False
    :param level: compression level when writing .gz or .zst files
    :type level: an integer, the strings "min", "default" or "max", or None for default
    :param tmpdir: the directory to use for temporary files, when max_spool_size is specified
    :type tmpdir: string or None
    :param max_spool_size: number of bytes to store in memory before using a temporary file. If None, use memory for everything.
    :type max_spool_size: integer, or None
    :param location: a location object used to access output state information
    :type location: a Location instance, or None
    :returns: a :class:`chemfp.FingerprintWriter`

ChemFPError
-----------

.. py:class:: ChemFPError

    Base class for all of the chemfp exceptions

ParseError
----------

.. py:class:: ParseError

    Exception raised by the molecule and fingerprint parsers and writers

    The public attributes are:

    .. py:attribute:: msg

        a string describing the exception

    .. py:attribute:: location

        a :class:`chemfp.io.Location` instance, or None

Metadata
--------

.. py:class:: Metadata

    Store information about a set of fingerprints

    The public attributes are:

    .. py:attribute:: num_bits

        the number of bits in the fingerprint

    .. py:attribute:: num_bytes

        the number of bytes in the fingerprint

    .. py:attribute:: type

        the fingerprint type string

    .. py:attribute:: aromaticity

        aromaticity model (only used with OEChem, and now deprecated)

    .. py:attribute:: software

        software used to make the fingerprints

    .. py:attribute:: sources

        list of sources used to make the fingerprint

    .. py:attribute:: date

        a ``datetime`` timestamp of when the fingerprints were made

    .. py:method:: __repr__()

        Return a string like ``Metadata(num_bits=1024, num_bytes=128, type='OpenBabel/FP2', ....)``

    .. py:method:: __str__()

        Show the metadata in FPS header format

    .. py:method:: copy(num_bits=None, num_bytes=None, type=None, aromaticity=None, software=None, sources=None, date=None)

        Return a new Metadata instance based on the current attributes and optional new values

        When called with no parameters, make a new Metadata instance with the same attributes as the current instance.

        If a given call parameter is not None then it will be used instead of the current value. If you want to change a current value to None then you will have to modify the new Metadata after you create it.

        :param num_bits: the number of bits in the fingerprint
        :type num_bits: an integer, or None
        :param num_bytes: the number of bytes in the fingerprint
        :type num_bytes: an integer, or None
        :param type: the fingerprint type description
        :type type: string or None
        :param aromaticity: obsolete
        :type aromaticity: None
        :param software: a description of the software
        :type software: string or None
        :param sources: source filenames
        :type sources: list of strings, a string (interpreted as a list with one string), or None
        :param date: creation or processing date for the contents
        :type date: a datetime instance, or None
        :returns: a new Metadata instance

FingerprintReader
-----------------

.. py:class:: FingerprintReader

    Base class for all chemfp objects holding fingerprint records

    All FingerprintReader instances have a ``metadata`` attribute containing a Metadata and can be iterated over to get the (id, fingerprint) pair for each record.

    .. py:method:: __iter__()

        iterate over the (id, fingerprint) pairs

    .. py:method:: iter_arenas(arena_size=1000)

        iterate through *arena_size* fingerprints at a time, as subarenas

        Iterate through *arena_size* fingerprints at a time, returned as :class:`chemfp.arena.FingerprintArena` instances. The arenas are in input order and not reordered by popcount.

        This method helps trade off between performance and memory use. Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer.

        If *arena_size* is None then this makes an iterator which returns a single arena containing all of the fingerprints.

        :param arena_size: The number of fingerprints to put into each arena.
        :type arena_size: positive integer, or None
        :returns: an iterator of :class:`chemfp.arena.FingerprintArena` instances

    .. py:method:: save(destination, format=None, level=None)

        Save the fingerprints to a given destination and format

        The output format is based on the *format*. If the format is None then the format depends on the *destination* file extension. If the extension isn't recognized then the fingerprints will be saved in "fps" format.

        If the output format is "fps", "fps.gz", or "fps.zst" then *destination* may be a filename, a file object, or None; None writes to stdout.

        If the output format is "fpb" then *destination* must be a filename or seekable file object. Chemfp cannot save to compressed FPB files.

        :param destination: the output destination
        :type destination: a filename, file object, or None
        :param format: the output format
        :type format: None, "fps", "fps.gz", "fps.zst", or "fpb"
        :param level: compression level when writing .gz or .zst files
        :type level: an integer, or "min", "default", or "max" for compressor-specific values
        :returns: None

    .. py:method:: get_fingerprint_type()

        Get the fingerprint type object based on the metadata's type field

        This uses ``self.metadata.type`` to get the fingerprint type string then calls :func:`chemfp.get_fingerprint_type` to get and return a :class:`chemfp.types.FingerprintType` instance.

        This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn't available.

        :returns: a :class:`chemfp.types.FingerprintType`

FingerprintIterator
-------------------

.. py:class:: FingerprintIterator

    A :class:`chemfp.FingerprintReader` for an iterator of (id, fingerprint) pairs

    This is often used as an adapter container to hold the metadata and (id, fingerprint) iterator. It supports an optional location, and can call a close function when the iterator has completed.

    A FingerprintIterator is a context manager which will close the underlying iterator if it's given a close handler.

    Like all iterators you can use next() to get the next (id, fingerprint) pair.

    .. py:method:: __init__(metadata, id_fp_iterator, location=None, close=None)

        Initialize with a Metadata instance and the (id, fingerprint) iterator

        The *metadata* is a :class:`Metadata` instance. The *id_fp_iterator* is an iterator which returns (id, fingerprint) pairs.

        The optional *location* is a :class:`chemfp.io.Location`. The optional *close* callable is called (as ``close()``) whenever ``self.close()`` is called and when the context manager exits.

    .. py:method:: __iter__()

        Iterate over the (id, fingerprint) pairs

    .. py:method:: close()

        Close the iterator

        The call will be forwarded to the ``close`` callable passed to the constructor. If that ``close`` is None then this does nothing.

Fingerprints
------------

.. py:class:: Fingerprints

    A :class:`chemfp.FingerprintReader` containing a metadata and a list of (id, fingerprint) pairs.

    This is typically used as an adapter when you have a list of (id, fingerprint) pairs and you want to pass it (and the metadata) to the rest of the chemfp API.

    This implements a simple list-like collection of fingerprints. It supports:

    - ``for (id, fingerprint) in fingerprints: ...``
    - ``id, fingerprint = fingerprints[1]``
    - ``len(fingerprints)``

    More features, like slicing, will be added as needed or when requested.

    .. py:method:: __init__(metadata, id_fp_pairs)

        Initialize with a Metadata instance and the (id, fingerprint) pair list

        The *metadata* is a :class:`Metadata` instance. The *id_fp_pairs* is a list of (id, fingerprint) pairs.

FingerprintWriter
-----------------

.. py:class:: FingerprintWriter

    Base class for the fingerprint writers

    The three fingerprint writer classes are:

    * :class:`chemfp.fps_io.FPSWriter` - write an FPS file
    * :class:`chemfp.fpb_io.OrderedFPBWriter` - write an FPB file, sorted by popcount
    * :class:`chemfp.fpb_io.InputOrderFPBWriter` - write an FPB file, preserving input order

    If the chemfp_converters package is available then its FlushFingerprintWriter will be used to write fingerprints in flush format.

    Use :func:`chemfp.open_fingerprint_writer` to create a fingerprint writer; do not create one directly.

    All classes have the following attributes:

    * metadata - a :class:`chemfp.Metadata` instance
    * format - a string describing the base format type (without compression); either 'fps' or 'fpb'
    * closed - False when the file is open, else True

    Fingerprint writers are also their own context manager, and close the writer on context exit.

    .. py:method:: write_fingerprint(id, fp)

        Write a single fingerprint record with the given id and fp to the destination

        :param string id: the record identifier
        :param fp: the fingerprint
        :type fp: byte string

    .. py:method:: write_fingerprints(id_fp_pairs)

        Write a sequence of (id, fingerprint) pairs to the destination

        :param id_fp_pairs: An iterable of (id, fingerprint) pairs. *id* is a string and *fingerprint* is a byte string.

    .. py:method:: close()

        Close the writer

        This will set self.closed to True.

ChemFPProblem
-------------

.. py:class:: ChemFPProblem

    Information about a compatibility problem between a query and target.

    Instances are generated by :func:`chemfp.check_fingerprint_problems` and :func:`chemfp.check_metadata_problems`.

    The public attributes are:

    .. py:attribute:: severity

        one of "info", "warning", or "error"

    .. py:attribute:: error_level

        5 for "info", 10 for "warning", and 20 for "error"

    .. py:attribute:: category

        a string used as a category name. This string will not change over time.

    .. py:attribute:: description

        a more detailed description of the error, including details of the mismatch. The description depends on *query_name* and *target_name* and may change over time.

    The current category names are:

    * "num_bits mismatch" (error)
    * "num_bytes mismatch" (error)
    * "type mismatch" (warning)
    * "aromaticity mismatch" (info)
    * "software mismatch" (info)

check_fingerprint_problems
--------------------------

.. py:function:: check_fingerprint_problems(query_fp, target_metadata, query_name="query", target_name="target")

    Return a list of compatibility problems between a fingerprint and a metadata

    If there are no problems then this returns an empty list. If there is a bit length or byte length mismatch between the *query_fp* byte string and the *target_metadata* then it will return a list containing a :class:`ChemFPProblem` instance, with a severity level of "error" and category "num_bytes mismatch".

    This function is usually used to check if a query fingerprint is compatible with the target fingerprints. In case of a problem, the default message looks like::

        >>> import chemfp
        >>> problems = chemfp.check_fingerprint_problems("A"*64, chemfp.Metadata(num_bytes=128))
        >>> problems[0].description
        'query contains 64 bytes but target has 128 byte fingerprints'

    You can change the error message with the *query_name* and *target_name* parameters::

        >>> import chemfp
        >>> problems = chemfp.check_fingerprint_problems("z"*64, chemfp.Metadata(num_bytes=128),
        ...     query_name="input", target_name="database")
        >>> problems[0].description
        'input contains 64 bytes but database has 128 byte fingerprints'

    :param query_fp: a fingerprint (usually the query fingerprint)
    :type query_fp: byte string
    :param target_metadata: the metadata to check against (usually the target metadata)
    :type target_metadata: Metadata instance
    :param query_name: the text used to describe the fingerprint, in case of problem
    :type query_name: string
    :param target_name: the text used to describe the metadata, in case of problem
    :type target_name: string
    :return: a list of :class:`ChemFPProblem` instances

check_metadata_problems
-----------------------

.. py:function:: check_metadata_problems(query_metadata, target_metadata, query_name="query", target_name="target")

    Return a list of compatibility problems between two metadata instances.

    If there are no problems then this returns an empty list.
    Otherwise it returns a list of :class:`ChemFPProblem` instances, with a severity level ranging from "info" to "error".

    Bit length and byte length mismatches produce an "error". Fingerprint type and aromaticity mismatches produce a "warning". Software version mismatches produce an "info".

    This is usually used to check if the query metadata is incompatible with the target metadata. In case of a problem the messages look like::

        >>> import chemfp
        >>> m1 = chemfp.Metadata(num_bytes=128, type="Example/1")
        >>> m2 = chemfp.Metadata(num_bytes=256, type="Counter-Example/1")
        >>> problems = chemfp.check_metadata_problems(m1, m2)
        >>> len(problems)
        2
        >>> print(problems[1].description)
        query has fingerprints of type 'Example/1' but target has fingerprints of type 'Counter-Example/1'

    You can change the error message with the *query_name* and *target_name* parameters::

        >>> problems = chemfp.check_metadata_problems(m1, m2, query_name="input", target_name="database")
        >>> print(problems[1].description)
        input has fingerprints of type 'Example/1' but database has fingerprints of type 'Counter-Example/1'

    :param query_metadata: the metadata to check (usually the query metadata)
    :type query_metadata: Metadata instance
    :param target_metadata: the metadata to check against (usually the target metadata)
    :type target_metadata: Metadata instance
    :param query_name: the text used to describe the query metadata, in case of problem
    :type query_name: string
    :param target_name: the text used to describe the target metadata, in case of problem
    :type target_name: string
    :return: a list of :class:`ChemFPProblem` instances

count_tanimoto_hits
-------------------

.. py:function:: count_tanimoto_hits(queries, targets, threshold=0.7, arena_size=100)

    Count the number of targets within *threshold* of each query term

    For each query in *queries*, count the number of targets in *targets* which are at least *threshold* similar to the query. This function returns an iterator containing the (query_id, count) pairs.
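
The counting rule for ``count_tanimoto_hits`` can be sketched in a few lines of plain Python. This is only an illustration of the semantics described above (one count per query, using a "greater than or equal to threshold" test); it does not use chemfp, and the ``tanimoto`` and ``count_hits`` helpers are hypothetical names.

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def count_hits(queries, targets, threshold=0.7):
    """Yield one (query_id, count) pair per query, in query order."""
    for query_id, query_fp in queries:
        # `targets` must be re-iterable (for example a list), because it is
        # scanned once for every query.
        count = sum(1 for _, target_fp in targets
                    if tanimoto(query_fp, target_fp) >= threshold)
        yield query_id, count


# Hypothetical 8-bit fingerprints.
queries = [("q1", b"\x0f"), ("q2", b"\xf0")]
targets = [("t1", b"\x0f"), ("t2", b"\x07"), ("t3", b"\xf1")]
counts = list(count_hits(queries, targets, threshold=0.7))
```

Here ``q1`` matches ``t1`` (score 1.0) and ``t2`` (score 0.75), while ``q2`` matches only ``t3`` (score 0.8).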

    Example::

        import chemfp

        queries = chemfp.open("queries.fps")
        targets = chemfp.load_fingerprints("targets.fps.gz")
        for (query_id, count) in chemfp.count_tanimoto_hits(queries, targets, threshold=0.9):
            print(query_id, "has", count, "neighbors with at least 0.9 similarity")

    Internally, queries are processed in batches with *arena_size* elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch.

    Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch. It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than a direct search using an FPSReader.

    If you know the targets are in an arena then you may want to use :func:`chemfp.search.count_tanimoto_hits_fp` or :func:`chemfp.search.count_tanimoto_hits_arena`.

    :param queries: The query fingerprints.
    :type queries: any fingerprint container
    :param targets: The target fingerprints.
    :type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader`
    :param threshold: The minimum score threshold.
    :type threshold: float between 0.0 and 1.0, inclusive
    :param arena_size: The number of queries to process in a batch
    :type arena_size: a positive integer, or None
    :returns: iterator of the (query_id, count) pairs, one for each query

count_tanimoto_hits_symmetric
-----------------------------

.. py:function:: count_tanimoto_hits_symmetric(fingerprints, threshold=0.7)

    Find the number of other fingerprints within *threshold* of each fingerprint

    For each fingerprint in the *fingerprints* arena, find the number of other fingerprints in the same arena which are at least *threshold* similar to it.
    The arena must have pre-computed popcounts. A fingerprint never matches itself.

    This function returns an iterator of (fingerprint_id, count) pairs.

    Example::

        import chemfp

        arena = chemfp.load_fingerprints("targets.fps.gz")
        for (fp_id, count) in chemfp.count_tanimoto_hits_symmetric(arena, threshold=0.6):
            print(fp_id, "has", count, "neighbors with at least 0.6 similarity")

    You may also be interested in :func:`chemfp.search.count_tanimoto_hits_symmetric`.

    :param fingerprints: The arena containing the fingerprints.
    :type fingerprints: a FingerprintArena with precomputed popcount_indices
    :param threshold: The minimum score threshold.
    :type threshold: float between 0.0 and 1.0, inclusive
    :returns: An iterator of (fp_id, count) pairs, one for each fingerprint

threshold_tanimoto_search
-------------------------

.. py:function:: threshold_tanimoto_search(queries, targets, threshold=0.7, arena_size=100)

    Find all targets within *threshold* of each query term

    For each query in *queries*, find all the targets in *targets* which are at least *threshold* similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs.

    Example::

        import chemfp

        queries = chemfp.open("queries.fps")
        targets = chemfp.load_fingerprints("targets.fps.gz")
        for (query_id, hits) in chemfp.threshold_tanimoto_search(queries, targets, threshold=0.8):
            print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity")
            non_identical = [target_id for (target_id, score) in hits if score != 1.0]
            print(" The non-identical hits are:", non_identical)

    Internally, queries are processed in batches with *arena_size* elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch.
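
The batching behavior controlled by *arena_size* can be pictured with a small generic helper. This is a sketch of the general idea only, assuming that batches are filled in input order and that ``None`` means "everything in one batch". The ``iter_batches`` function is hypothetical and is not part of chemfp.

```python
from itertools import islice


def iter_batches(records, arena_size=100):
    """Yield lists of up to `arena_size` records, preserving input order.

    With arena_size=None, all of the records are returned as one batch.
    """
    iterator = iter(records)
    if arena_size is None:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        # islice consumes at most `arena_size` items from the shared iterator.
        batch = list(islice(iterator, arena_size))
        if not batch:
            return
        yield batch


small_batches = list(iter_batches(range(7), arena_size=3))
single_batch = list(iter_batches(range(7), arena_size=None))
```

With small batches the first results are available after only three records have been read, which is the latency benefit mentioned above. A single large batch delays the first result but leaves more work to be done per batch.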

    Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch. It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than a direct search using an FPSReader.

    If you know the targets are in an arena then you may want to use :func:`chemfp.search.threshold_tanimoto_search_fp` or :func:`chemfp.search.threshold_tanimoto_search_arena`.

    :param queries: The query fingerprints.
    :type queries: any fingerprint container
    :param targets: The target fingerprints.
    :type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader`
    :param threshold: The minimum score threshold.
    :type threshold: float between 0.0 and 1.0, inclusive
    :param arena_size: The number of queries to process in a batch
    :type arena_size: positive integer, or None
    :returns: An iterator containing (query_id, hits) pairs, one for each query. 'hits' contains a list of (target_id, score) pairs.

threshold_tanimoto_search_symmetric
-----------------------------------

.. py:function:: threshold_tanimoto_search_symmetric(fingerprints, threshold=0.7)

    Find the other fingerprints within *threshold* of each fingerprint

    For each fingerprint in the *fingerprints* arena, find the other fingerprints in the same arena which are at least *threshold* similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself.

    This function returns an iterator of (fingerprint id, SearchResult) pairs. The :class:`chemfp.search.SearchResult` hit order is arbitrary.
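
The "symmetric" searches compare a set of fingerprints against itself. A minimal pure-Python sketch of that idea follows. It assumes unique ids, computes each pair only once because the Tanimoto score is symmetric, and never pairs a record with itself. The ``tanimoto`` and ``symmetric_threshold_search`` helpers are hypothetical and say nothing about how chemfp implements the search.

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity of two equal-length byte-string fingerprints."""
    a = int.from_bytes(fp1, "big")
    b = int.from_bytes(fp2, "big")
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def symmetric_threshold_search(fingerprints, threshold=0.7):
    """Map each id to its list of (other_id, score) hits; ids must be unique."""
    results = {fp_id: [] for fp_id, _ in fingerprints}
    for i, (id_i, fp_i) in enumerate(fingerprints):
        # Only look at later records: earlier pairs were already scored,
        # and skipping index i means a record never matches itself.
        for id_j, fp_j in fingerprints[i + 1:]:
            score = tanimoto(fp_i, fp_j)
            if score >= threshold:
                results[id_i].append((id_j, score))
                results[id_j].append((id_i, score))
    return results


# Hypothetical 8-bit fingerprints; "a" and "d" have identical bits.
records = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0"), ("d", b"\x0f")]
neighbors = symmetric_threshold_search(records, threshold=0.75)
```

Records ``a`` and ``d`` have the same fingerprint and are reported as hits of each other, because the self-match rule applies to the record, not to the bit pattern.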
Example:: arena = chemfp.load_fingerprints("targets.fps.gz") for (fp_id, hits) in chemfp.threshold_tanimoto_search_symmetric(arena, threshold=0.75): print(fp_id, "has", len(hits), "neighbors:") for (other_id, score) in hits.get_ids_and_scores(): print(" %s %.2f" % (other_id, score)) You may also be interested in the :func:`chemfp.search.threshold_tanimoto_search_symmetric` function. :param fingerprints: The arena containing the fingerprints. :type fingerprints: a FingerprintArena with precomputed popcount_indices :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint knearest_tanimoto_search ------------------------ .. py:function:: knearest_tanimoto_search(queries, targets, k=3, threshold=0.7, arena_size=100) Find the *k*-nearest targets within *threshold* of each query term For each query in *queries*, find the *k*-nearest of all the targets in *targets* which are at least *threshold* similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted. This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary. Example:: # Use the first 5 fingerprints as the queries queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5)) targets = chemfp.load_fingerprints("pubchem_subset.fps") # Find the 3 nearest hits with a similarity of at least 0.8 for (query_id, hits) in chemfp.knearest_tanimoto_search(queries, targets, k=3, threshold=0.8): print(query_id, "has", len(hits), "neighbors with at least 0.8 similarity") if hits: target_id, score = hits[-1] print(" The least similar is", target_id, "with score", score) Internally, queries are processed in batches with *arena_size* elements.
A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch. Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch. It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader. If you know the targets are in an arena then you may want to use :func:`chemfp.search.knearest_tanimoto_search_fp` or :func:`chemfp.search.knearest_tanimoto_search_arena`. :param queries: The query fingerprints. :type queries: any fingerprint container :param targets: The target fingerprints. :type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader` :param k: The maximum number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param arena_size: The number of queries to process in a batch :type arena_size: positive integer, or None :returns: An iterator containing (query_id, hits) pairs, one for each query. The *hits* are a list of (target_id, score) pairs, sorted by score. knearest_tanimoto_search_symmetric ---------------------------------- .. py:function:: knearest_tanimoto_search_symmetric(fingerprints, k=3, threshold=0.7) Find the *k*-nearest fingerprints within *threshold* of each fingerprint For each fingerprint in the *fingerprints* arena, find the nearest *k* fingerprints in the same arena which are at least *threshold* similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself. This function returns an iterator of (fingerprint_id, SearchResult) pairs.
The :class:`chemfp.search.SearchResult` hits are ordered from highest score to lowest, with ties broken arbitrarily. Example:: arena = chemfp.load_fingerprints("targets.fps.gz") for (fp_id, hits) in chemfp.knearest_tanimoto_search_symmetric(arena, k=5, threshold=0.5): print(fp_id, "has", len(hits), "neighbors, with scores", end="") print(", ".join("%.2f" % x for x in hits.get_scores())) You may also be interested in the :func:`chemfp.search.knearest_tanimoto_search_symmetric` function. :param fingerprints: The arena containing the fingerprints. :type fingerprints: a FingerprintArena with precomputed popcount_indices :param k: The maximum number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint count_tversky_hits ------------------ .. py:function:: count_tversky_hits(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100) Count the number of targets within *threshold* of each query term For each query in *queries*, count the number of targets in *targets* which are at least *threshold* similar to the query. This function returns an iterator containing the (query_id, count) pairs. Example:: queries = chemfp.open("queries.fps") targets = chemfp.load_fingerprints("targets.fps.gz") for (query_id, count) in chemfp.count_tversky_hits( queries, targets, threshold=0.9, alpha=0.5, beta=0.5): print(query_id, "has", count, "neighbors with at least 0.9 Dice similarity") Internally, queries are processed in batches with *arena_size* elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch. Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch.
It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader. If you know the targets are in an arena then you may want to use :func:`chemfp.search.count_tversky_hits_fp` or :func:`chemfp.search.count_tversky_hits_arena`. :param queries: The query fingerprints. :type queries: any fingerprint container :param targets: The target fingerprints. :type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param arena_size: The number of queries to process in a batch :type arena_size: a positive integer, or None :returns: An iterator of (query_id, count) pairs, one for each query count_tversky_hits_symmetric ---------------------------- .. py:function:: count_tversky_hits_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0) Find the number of other fingerprints within *threshold* of each fingerprint For each fingerprint in the *fingerprints* arena, find the number of other fingerprints in the same arena which are at least *threshold* similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself. This function returns an iterator of (fingerprint_id, count) pairs. Example:: arena = chemfp.load_fingerprints("targets.fps.gz") for (fp_id, count) in chemfp.count_tversky_hits_symmetric( arena, threshold=0.6, alpha=0.5, beta=0.5): print(fp_id, "has", count, "neighbors with at least 0.6 Dice similarity") You may also be interested in :func:`chemfp.search.count_tversky_hits_symmetric`. :param fingerprints: The arena containing the fingerprints. :type fingerprints: a FingerprintArena with precomputed popcount_indices :param threshold: The minimum score threshold.
:type threshold: float between 0.0 and 1.0, inclusive :returns: An iterator of (fp_id, count) pairs, one for each fingerprint threshold_tversky_search ------------------------ .. py:function:: threshold_tversky_search(queries, targets, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100) Find all targets within *threshold* of each query term For each query in *queries*, find all the targets in *targets* which are at least *threshold* similar to the query. This function returns an iterator containing the (query_id, hits) pairs. The hits are stored as a list of (target_id, score) pairs. Example:: queries = chemfp.open("queries.fps") targets = chemfp.load_fingerprints("targets.fps.gz") for (query_id, hits) in chemfp.threshold_tversky_search( queries, targets, threshold=0.8, alpha=0.5, beta=0.5): print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity") non_identical = [target_id for (target_id, score) in hits if score != 1.0] print(" The non-identical hits are:", non_identical) Internally, queries are processed in batches with *arena_size* elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch. Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch. It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader. If you know the targets are in an arena then you may want to use :func:`chemfp.search.threshold_tversky_search_fp` or :func:`chemfp.search.threshold_tversky_search_arena`. :param queries: The query fingerprints. :type queries: any fingerprint container :param targets: The target fingerprints.
:type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param arena_size: The number of queries to process in a batch :type arena_size: positive integer, or None :returns: An iterator containing (query_id, hits) pairs, one for each query. 'hits' contains a list of (target_id, score) pairs. threshold_tversky_search_symmetric ---------------------------------- .. py:function:: threshold_tversky_search_symmetric(fingerprints, threshold=0.7, alpha=1.0, beta=1.0) Find the other fingerprints within *threshold* of each fingerprint For each fingerprint in the *fingerprints* arena, find the other fingerprints in the same arena which are at least *threshold* similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself. This function returns an iterator of (fingerprint_id, SearchResult) pairs. The :class:`chemfp.search.SearchResult` hit order is arbitrary. Example:: arena = chemfp.load_fingerprints("targets.fps.gz") for (fp_id, hits) in chemfp.threshold_tversky_search_symmetric( arena, threshold=0.75, alpha=0.5, beta=0.5): print(fp_id, "has", len(hits), "Dice neighbors:") for (other_id, score) in hits.get_ids_and_scores(): print(" %s %.2f" % (other_id, score)) You may also be interested in the :func:`chemfp.search.threshold_tversky_search_symmetric` function. :param fingerprints: The arena containing the fingerprints. :type fingerprints: a FingerprintArena with precomputed popcount_indices :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint knearest_tversky_search ----------------------- ..
py:function:: knearest_tversky_search(queries, targets, k=3, threshold=0.7, alpha=1.0, beta=1.0, arena_size=100) Find the *k*-nearest targets within *threshold* of each query term For each query in *queries*, find the *k*-nearest of all the targets in *targets* which are at least *threshold* similar to the query. Ties are broken arbitrarily and hits with scores equal to the smallest value may have been omitted. This function returns an iterator containing the (query_id, hits) pairs, where hits is a list of (target_id, score) pairs, sorted so that the highest scores are first. The order of ties is arbitrary. Example:: # Use the first 5 fingerprints as the queries queries = next(chemfp.open("pubchem_subset.fps").iter_arenas(5)) targets = chemfp.load_fingerprints("pubchem_subset.fps") # Find the 3 nearest hits with a similarity of at least 0.8 for (query_id, hits) in chemfp.knearest_tversky_search( queries, targets, k=3, threshold=0.8, alpha=0.5, beta=0.5): print(query_id, "has", len(hits), "neighbors with at least 0.8 Dice similarity") if hits: target_id, score = hits[-1] print(" The least similar is", target_id, "with score", score) Internally, queries are processed in batches with *arena_size* elements. A small batch size uses less overall memory and has lower processing latency, while a large batch size has better overall performance. Use ``arena_size=None`` to process the input as a single batch. Note: an :class:`chemfp.fps_io.FPSReader` may be used as a target but it will only process one batch and not reset for the next batch. It's faster to search a :class:`chemfp.arena.FingerprintArena`, but if you have an FPS file then that takes extra time to load. At times, if there is a small number of queries, loading the arena from an FPS file may take longer than searching the file directly with an FPSReader.
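To make the scoring and selection in this function concrete, here is a pure-Python sketch of a k-nearest Tversky search over byte-string fingerprints. It is an illustration only, not how chemfp computes its results. The helper names ``tversky`` and ``knearest`` are hypothetical, and the choice to score two empty fingerprints as 0.0 is an assumption. The score is the standard Tversky index: the number of shared bits divided by (shared bits + *alpha* × query-only bits + *beta* × target-only bits).

```python
import heapq


def popcount(data):
    """Count the on-bits in a byte string."""
    return sum(bin(byte).count("1") for byte in bytearray(data))


def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte-string fingerprints.

    Hypothetical helper for illustration; it is not part of chemfp.
    """
    shared = sum(bin(q & t).count("1")
                 for q, t in zip(bytearray(query_fp), bytearray(target_fp)))
    query_only = popcount(query_fp) - shared
    target_only = popcount(target_fp) - shared
    denominator = shared + alpha * query_only + beta * target_only
    if denominator == 0:
        return 0.0  # assumption: no bits set anywhere scores 0.0
    return shared / float(denominator)


def knearest(query_fp, targets, k=3, threshold=0.7, alpha=1.0, beta=1.0):
    """Return up to k (target_id, score) pairs, highest score first."""
    scored = ((target_id, tversky(query_fp, fp, alpha, beta))
              for target_id, fp in targets)
    hits = [hit for hit in scored if hit[1] >= threshold]
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


# Made-up 8-bit fingerprints.
targets = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0")]
dice_hits = knearest(b"\x0f", targets, k=2, threshold=0.8, alpha=0.5, beta=0.5)
tanimoto_score = tversky(b"\x0f", b"\x07")  # alpha = beta = 1.0
```

With ``alpha=beta=1.0`` the index reduces to the Tanimoto similarity, and with ``alpha=beta=0.5`` it reduces to the Dice similarity, which is why the examples in this section describe those settings as Dice searches.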
If you know the targets are in an arena then you may want to use :func:`chemfp.search.knearest_tversky_search_fp` or :func:`chemfp.search.knearest_tversky_search_arena`. :param queries: The query fingerprints. :type queries: any fingerprint container :param targets: The target fingerprints. :type targets: :class:`chemfp.arena.FingerprintArena` or the slower :class:`chemfp.fps_io.FPSReader` :param k: The maximum number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param arena_size: The number of queries to process in a batch :type arena_size: positive integer, or None :returns: An iterator containing (query_id, hits) pairs, one for each query. The *hits* are a list of (target_id, score) pairs, sorted by score. knearest_tversky_search_symmetric --------------------------------- .. py:function:: knearest_tversky_search_symmetric(fingerprints, k=3, threshold=0.7, alpha=1.0, beta=1.0) Find the *k*-nearest fingerprints within *threshold* of each fingerprint For each fingerprint in the *fingerprints* arena, find the nearest *k* fingerprints in the same arena which are at least *threshold* similar to it. The arena must have pre-computed popcounts. A fingerprint never matches itself. This function returns an iterator of (fingerprint_id, SearchResult) pairs. The :class:`chemfp.search.SearchResult` hits are ordered from highest score to lowest, with ties broken arbitrarily. Example:: arena = chemfp.load_fingerprints("targets.fps.gz") for (fp_id, hits) in chemfp.knearest_tversky_search_symmetric( arena, k=5, threshold=0.5, alpha=0.5, beta=0.5): print(fp_id, "has", len(hits), "neighbors, with Dice scores", end="") print(", ".join("%.2f" % x for x in hits.get_scores())) You may also be interested in the :func:`chemfp.search.knearest_tversky_search_symmetric` function. :param fingerprints: The arena containing the fingerprints.
:type fingerprints: a FingerprintArena with precomputed popcount_indices :param k: The maximum number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: An iterator of (fp_id, SearchResult) pairs, one for each fingerprint get_fingerprint_families ------------------------ .. py:function:: get_fingerprint_families(toolkit_name=None) Return a list of available fingerprint families :param string toolkit_name: restrict fingerprints to the named toolkit :returns: a list of :class:`chemfp.types.FingerprintFamily` instances get_fingerprint_family ---------------------- .. py:function:: get_fingerprint_family(family_name) Return the named fingerprint family, or raise a ValueError if not available Given a *family_name* like ``OpenBabel-FP2`` or ``OpenEye-MACCS166`` return the corresponding :class:`chemfp.types.FingerprintFamily`. :param string family_name: the family name :returns: a :class:`chemfp.types.FingerprintFamily` instance get_fingerprint_family_names ---------------------------- .. py:function:: get_fingerprint_family_names(include_unavailable=False, toolkit_name=None) Return a set of fingerprint family name strings The function tries to load each known fingerprint family. The names of the families which could be loaded are returned as a set of strings. If *include_unavailable* is True then this will return a set of all of the fingerprint family names, including those which could not be loaded. The set contains both the versioned and unversioned family names, so both ``OpenBabel-FP2/1`` and ``OpenBabel-FP2`` may be returned. :param include_unavailable: Should unavailable family names be included in the result set? :type include_unavailable: True or False :returns: a set of strings get_fingerprint_type -------------------- .. 
py:function:: get_fingerprint_type(type, fingerprint_kwargs=None) Get the fingerprint type based on its type string and optional keyword arguments Given a fingerprint *type* string like ``OpenBabel-FP2``, or ``RDKit-Fingerprint/1 fpSize=1024``, return the corresponding :class:`chemfp.types.FingerprintType`. The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the *fingerprint_kwargs* dictionary, where the dictionary values are native Python values. If the same parameter is specified in the type string and the kwargs dictionary then the *fingerprint_kwargs* takes precedence. For example: >>> fptype = get_fingerprint_type("RDKit-Fingerprint fpSize=1024 minPath=3", {"fpSize": 4096}) >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1' Use :func:`get_fingerprint_type_from_text_settings` if your fingerprint parameter values are all string-encoded, eg, from the command-line or a configuration file. :param string type: a fingerprint type string :param fingerprint_kwargs: fingerprint type parameters :type fingerprint_kwargs: a dictionary of string names and Python types for values :returns: a :class:`chemfp.types.FingerprintType` get_fingerprint_type_from_text_settings --------------------------------------- .. py:function:: get_fingerprint_type_from_text_settings(type, settings=None) Get the fingerprint type based on its type string and optional settings arguments Given a fingerprint *type* string like ``OpenBabel-FP2``, or ``RDKit-Fingerprint/1 fpSize=1024``, return the corresponding :class:`chemfp.types.FingerprintType`. The fingerprint type string may include fingerprint parameters. Parameters can also be specified through the *settings* dictionary, where the dictionary values are string-encoded values. If the same parameter is specified in the *type* string and the *settings* dictionary then the *settings* take precedence. 
For example: >>> fptype = get_fingerprint_type_from_text_settings("RDKit-Fingerprint fpSize=1024 minPath=3", ... {"fpSize": "4096"}) >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=3 maxPath=7 fpSize=4096 nBitsPerHash=2 useHs=1' This function is for string settings from a configuration file or command-line. Use :func:`get_fingerprint_type` if your fingerprint parameters are Python values. :param type: a fingerprint type string :type type: string :param settings: the fingerprint text settings :type settings: a dictionary of string names and string-encoded values :returns: a :class:`chemfp.types.FingerprintType` has_fingerprint_family ---------------------- .. py:function:: has_fingerprint_family(family_name) Test if the fingerprint family is available Return True if the fingerprint *family_name* is available, otherwise False. The *family_name* may be versioned or unversioned, like "OpenBabel-FP2/1" or "OpenEye-MACCS166". :param string family_name: the family name :returns: True or False get_max_threads --------------- .. py:function:: get_max_threads() Return the maximum number of threads available. WARNING: this likely doesn't do what you think it does. Do not use! If OpenMP is not available then this will return 1. Otherwise it returns the maximum number of threads available, as reported by omp_get_num_threads(). get_num_threads --------------- .. py:function:: get_num_threads() Return the number of OpenMP threads to use in searches Initially this is the value returned by omp_get_max_threads(), which is generally 4 unless you set the environment variable OMP_NUM_THREADS to some other value. It may be any value in the range 1 to get_max_threads(), inclusive. :returns: the current number of OpenMP threads to use set_num_threads --------------- ..
py:function:: set_num_threads(num_threads) Set the number of OpenMP threads to use in searches If *num_threads* is less than one then it is treated as one, and a value greater than get_max_threads() is treated as get_max_threads(). :param int num_threads: the new number of OpenMP threads to use get_toolkit ----------- .. py:function:: get_toolkit(toolkit_name) Return the named toolkit, if available, or raise a ValueError If *toolkit_name* is one of "openbabel", "openeye", or "rdkit" and the named toolkit is available, then it will return :mod:`chemfp.openbabel_toolkit`, :mod:`chemfp.openeye_toolkit`, or :mod:`chemfp.rdkit_toolkit`, respectively.:: >>> import chemfp >>> chemfp.get_toolkit("openeye") >>> chemfp.get_toolkit("rdkit") Traceback (most recent call last): ... ValueError: Unable to get toolkit 'rdkit': No module named rdkit :param toolkit_name: the toolkit name :type toolkit_name: string :returns: the chemfp toolkit :raises: ValueError if *toolkit_name* is unknown or the toolkit does not exist get_toolkit_names ----------------- .. py:function:: get_toolkit_names() Return a set of available toolkit names The function checks if each supported toolkit is available by trying to import its corresponding module. It returns a set of toolkit names:: >>> import chemfp >>> chemfp.get_toolkit_names() set(['openeye', 'rdkit', 'openbabel']) :returns: a set of toolkit names, as strings has_toolkit ----------- .. py:function:: has_toolkit(toolkit_name) Return True if the named toolkit is available, otherwise False If *toolkit_name* is one of "openbabel", "openeye", or "rdkit" then this function will test to see if the given toolkit is available, and if so return True. Otherwise it returns False. >>> import chemfp >>> chemfp.has_toolkit("openeye") True >>> chemfp.has_toolkit("openbabel") False The initial test for a toolkit can be slow, especially if the underlying toolkit loads a lot of shared libraries. The test is only done once, and cached. 
:param toolkit_name: the toolkit name :type toolkit_name: string :returns: True or False .. py:module:: chemfp.types chemfp.types - fingerprint families and types ============================================= A "fingerprint type" is an object which knows how to convert a molecule into a fingerprint. A "fingerprint family" is an object which uses a set of parameters to make a specific fingerprint type. :: >>> import chemfp >>> fpfamily = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> fpfamily.get_defaults() {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1} >>> >>> fptype = fpfamily() # create the default fingerprint type >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1' >>> >>> fptype = fpfamily(fpSize=1024) # use a non-default value >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1' >>> mol = fptype.toolkit.parse_molecule("c1ccccc1O", "smistring") >>> fptype.compute_fingerprint(mol) '\x04\x00\x00\x00\x00\x00\x10\x00\x00\x00 ... x00\x00\x00\x00\x00' FingerprintFamily ----------------- .. py:class:: FingerprintFamily A FingerprintFamily is used to create a FingerprintType or get information about its parameters Two reasons to use a FingerprintFamily (instead of using :func:`chemfp.get_fingerprint_type` or :func:`chemfp.get_fingerprint_type_from_text_settings`) are: * figure out the default arguments; * given text settings or a parameter dictionary, use the keys of the default arguments to remove other parameters before creating a FingerprintType (otherwise the creation function will raise an exception) All fingerprint families have the following attributes: * name - the type name, including version * toolkit - the toolkit API for the underlying chemistry toolkit, or None .. py:method:: __repr__() Return a string like 'FingerprintFamily()' .. py:attribute:: FingerprintFamily.name Read-only attribute.
The full fingerprint name, including the version .. py:attribute:: FingerprintFamily.base_name Read-only attribute. The base fingerprint name, without the version .. py:attribute:: FingerprintFamily.version Read-only attribute. The fingerprint version .. py:attribute:: FingerprintFamily.toolkit Read-only attribute. The toolkit used to implement this fingerprint, or None .. py:method:: __call__(**fingerprint_kwargs) Create a fingerprint type; keyword arguments can override the defaults The argument values are native Python values, not string-encoded values:: >>> import chemfp >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> fptype = family() >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1' >>> fptype = family(fpSize=1024) >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1' The function will raise an exception for unknown arguments. :param fingerprint_kwargs: the fingerprint parameters :returns: an object implementing the :class:`chemfp.types.FingerprintType` API .. py:method:: from_kwargs(fingerprint_kwargs=None) Create a fingerprint type; items in the *fingerprint_kwargs* dictionary can override the defaults The dictionary values are native Python values, not string-encoded values:: >>> import chemfp >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> fptype = family() >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1' >>> fptype = family.from_kwargs({"fpSize": 1024}) >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1' The function will raise an exception for unknown arguments. :param fingerprint_kwargs: the fingerprint parameters :type fingerprint_kwargs: a dictionary where the values are Python objects :returns: an object implementing the :class:`chemfp.types.FingerprintType` API .. 
py:method:: from_text_settings(settings=None) Create a fingerprint type; *settings* is a dictionary with string-encoded values that can override the defaults The dictionary values are string-encoded values, not native Python values. This function exists to help handle command-line arguments and setting files.:: >>> import chemfp >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> fptype = family.from_text_settings() >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=2 useHs=1' >>> fptype = family.from_text_settings({"fpSize": "1024"}) >>> fptype.get_type() 'RDKit-Fingerprint/2 minPath=1 maxPath=7 fpSize=1024 nBitsPerHash=2 useHs=1' The function will raise an exception for unknown arguments. :param settings: the fingerprint text settings :type settings: a dictionary where the values are string-encoded :returns: an object implementing the :class:`chemfp.types.FingerprintType` API .. py:method:: get_kwargs_from_text_settings(settings=None) Convert a dictionary of string-encoded fingerprint parameters into native Python values String-encoded values ("text settings") can come from the command-line, a configuration file, a web request, or other text sources. The fingerprint types need actual Python values. This method converts the first to the second:: >>> import chemfp >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> family.get_kwargs_from_text_settings() {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1} >>> family.get_kwargs_from_text_settings({"fpSize": "128", "maxPath": "5"}) {'maxPath': 5, 'fpSize': 128, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1} :param settings: the fingerprint text settings :type settings: a dictionary where the values are string-encoded :returns: a dictionary of (decoded) fingerprint parameters ..
py:method:: get_defaults() Return the default parameters as a dictionary The dictionary values are native Python objects:: >>> import chemfp >>> family = chemfp.get_fingerprint_family("RDKit-Fingerprint") >>> family.get_defaults() {'maxPath': 7, 'fpSize': 2048, 'nBitsPerHash': 2, 'minPath': 1, 'useHs': 1} :returns: a dictionary of fingerprint parameters .. :py:module:: chemfp.types FingerprintType --------------- .. py:class:: FingerprintType The base class for all fingerprint types A fingerprint type has the following public attributes: .. py:attribute:: name the fingerprint name, including the version .. py:attribute:: base_name the fingerprint name, without the version .. py:attribute:: version the fingerprint version .. py:attribute:: toolkit the toolkit API for the underlying chemistry toolkit, or None .. py:attribute:: software a string which characterizes the toolkit, including version information .. py:attribute:: num_bits the number of bits in this fingerprint type .. py:attribute:: fingerprint_kwargs a dictionary of the fingerprint arguments The built-in fingerprint types are: * :class:`chemfp.openbabel_types.OpenBabelFP2FingerprintType_v1` - ``OpenBabel-FP2/1`` - Open Babel FP2 * :class:`chemfp.openbabel_types.OpenBabelFP3FingerprintType_v1` - ``OpenBabel-FP3/1`` - Open Babel FP3 * :class:`chemfp.openbabel_types.OpenBabelFP4FingerprintType_v1` - ``OpenBabel-FP4/1`` - Open Babel FP4 * :class:`chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v1` - ``OpenBabel-MACCS/1`` - Open Babel 166 MACCS keys * :class:`chemfp.openbabel_types.OpenBabelMACCSFingerprintType_v2` - ``OpenBabel-MACCS/2`` - Open Babel 166 MACCS keys * :class:`chemfp.openbabel_patterns.SubstructOpenBabelFingerprinter_v1` - ``ChemFP-Substruct-OpenBabel/1`` - chemfp's 881 CACTVS/PubChem-like keys implemented with Open Babel * :class:`chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v1` - ``RDMACCS-OpenBabel/1`` - chemfp's own 166 MACCS keys implemented with Open Babel (does not include key
44) * :class:`chemfp.openbabel_patterns.RDMACCSOpenBabelFingerprinter_v2` - ``RDMACCS-OpenBabel/2`` - chemfp's own 166 MACCS keys implemented with Open Babel * :class:`chemfp.openeye_types.OpenEyeCircularFingerprintType_v2` - ``OpenEye-Circular/2`` - OEGraphSim circular fingerprints * :class:`chemfp.openeye_types.OpenEyeMACCSFingerprintType_v2` - ``OpenEye-MACCS166/2`` - OEGraphSim 166 MACCS keys * :class:`chemfp.openeye_types.OpenEyePathFingerprintType_v2` - ``OpenEye-Path/2`` - OEGraphSim path fingerprints * :class:`chemfp.openeye_types.OpenEyeTreeFingerprintType_v2` - ``OpenEye-Tree/2`` - OEGraphSim tree fingerprints * :class:`chemfp.openeye_patterns.SubstructOpenEyeFingerprinter_v1` - ``ChemFP-Substruct-OpenEye/1`` - chemfp's 881 CACTVS/PubChem-like keys implemented with OEChem * :class:`chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v1` - ``RDMACCS-OpenEye/1`` - chemfp's own 166 MACCS keys implemented with OEChem (does not include key 44) * :class:`chemfp.openeye_patterns.RDMACCSOpenEyeFingerprinter_v2` - ``RDMACCS-OpenEye/2`` - chemfp's own 166 MACCS keys implemented with OEChem * :class:`chemfp.rdkit_types.RDKitFingerprintType_v1` - ``RDKit-Fingerprint/1`` - RDKit path and tree fingerprint * :class:`chemfp.rdkit_types.RDKitFingerprintType_v2` - ``RDKit-Fingerprint/2`` - RDKit path and tree fingerprint * :class:`chemfp.rdkit_types.RDKitMACCSFingerprintType_v1` - ``RDKit-MACCS/1`` - RDKit 166 MACCS keys (does not include key 44) * :class:`chemfp.rdkit_types.RDKitMACCSFingerprintType_v2` - ``RDKit-MACCS/2`` - RDKit 166 MACCS keys * :class:`chemfp.rdkit_types.RDKitMorganFingerprintType_v1` - ``RDKit-Morgan/1`` - RDKit circular fingerprints * :class:`chemfp.rdkit_types.RDKitAtomPairFingerprint_v1` - ``RDKit-AtomPair/1`` - RDKit atom pair fingerprints * :class:`chemfp.rdkit_types.RDKitAtomPairFingerprint_v2` - ``RDKit-AtomPair/2`` - RDKit atom pair fingerprints * :class:`chemfp.rdkit_types.RDKitTorsionFingerprintType_v1` - ``RDKit-Torsion/1`` - RDKit torsion
fingerprints * :class:`chemfp.rdkit_types.RDKitTorsionFingerprintType_v2` - ``RDKit-Torsion/2`` - RDKit torsion fingerprints * :class:`chemfp.rdkit_types.RDKitTorsionFingerprintType_v3` - ``RDKit-Torsion/3`` - RDKit torsion fingerprints * :class:`chemfp.rdkit_patterns.SubstructRDKitFingerprintType_v1` - ``ChemFP-Substruct-RDKit/1`` - chemfp's 881 CACTVS/PubChem-like keys implemented with RDKit * :class:`chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v1` - ``RDMACCS-RDKit/1`` - chemfp's own 166 MACCS keys implemented with RDKit (does not include key 44) * :class:`chemfp.rdkit_patterns.RDMACCSRDKitFingerprinter_v2` - ``RDMACCS-RDKit/2`` - chemfp's own 166 MACCS keys implemented with RDKit .. py:method:: get_type() Get the full type string (name and parameters) for this fingerprint type :returns: a canonical fingerprint type string, including its parameters .. py:method:: get_metadata(sources=None) Return a Metadata appropriate for the given fingerprint type. This is most commonly used to make a :class:`chemfp.Metadata` that can be passed into a :class:`chemfp.FingerprintWriter`. If *sources* is a string or a list of strings then it will be passed to the newly created Metadata instance. It should contain filenames or other descriptions of the fingerprint sources. :param sources: fingerprint source filenames or other description :type sources: None, a string, or list of strings :returns: a :class:`chemfp.Metadata` .. py:method:: make_fingerprinter() Make a 'fingerprinter'; a callable which takes a molecule and returns a fingerprint :returns: a function object which takes a molecule and returns a fingerprint .. py:method:: read_molecule_fingerprints(source, format=None, id_tag=None, reader_args=None, errors="strict", location=None) Read fingerprints from a structure source as a FingerprintIterator Iterate through the *format* structure records in *source*. If *format* is None then auto-detect the format based on the *source*.
Use the fingerprint type to compute the fingerprint. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. The *reader_args* dictionary parameters depend on the toolkit and format. For details see the docstring for ``self.toolkit.read_molecules``. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a Location instance. If None then a default Location will be created. :param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a Location object, or None :returns: a :class:`chemfp.FingerprintIterator` which iterates over the (id, fingerprint) pair .. py:method:: read_molecule_fingerprints_from_string(content, format=None, id_tag=None, reader_args=None, errors="strict", location=None) Read fingerprints from structure records in a string, as a FingerprintIterator Iterate through the *format* structure records in *content*. Use the fingerprint type to compute the fingerprint. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. The *reader_args* dictionary parameters depend on the toolkit and format. For details see the docstring for ``self.toolkit.read_molecules``. The *errors* parameter specifies how to handle errors. 
"strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a Location instance. If None then a default Location will be created. :param content: the string containing structure records :type source: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a Location object, or None :returns: a :class:`chemfp.FingerprintIterator` which iterates over the (id, fingerprint) pair .. py:method:: parse_molecule_fingerprint(content, format, reader_args=None, errors="strict") Parse the first molecule record of the content then compute and return the fingerprint Read the first molecule from *content*, which contains records in the given *format*. Compute and return its fingerprint. The *reader_args* dictionary parameters depend on the toolkit and format. For details see the docstring for ``self.toolkit.read_molecules``. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and return None for the fingerprint, and "ignore" returns None for the fingerprint without any extra message. 
:param content: the string containing at least one structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: the fingerprint as a byte string .. py:method:: parse_id_and_molecule_fingerprint(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first molecule record of the content then compute and return the id and fingerprint Read the first molecule from *content*, which contains records in the given *format*. Compute its fingerprint and get the molecule id. For an SD record use *id_tag* to get the record id from the given SD tag instead of from the title line. Return the id and fingerprint as the (id, fingerprint) pair. The *reader_args* dictionary parameters depend on the toolkit and format. For details see the docstring for ``self.toolkit.read_molecules``. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and returns None for values it cannot compute, and "ignore" is like "report" but without the error message. For "report" and "ignore", if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).
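The three *errors* modes described above can be illustrated with a small, toolkit-free sketch. Everything in it is hypothetical: ``toy_parse``, ``toy_fingerprint`` and ``parse_id_and_fingerprint`` are stand-ins invented for illustration and are not part of chemfp. Only the return conventions, (None, None) for an unparsable record and (id, None) for a failed fingerprint, come from the text.

```python
import sys

def toy_parse(record):
    # Hypothetical "parser": a record is "<id> <payload>".
    # A blank record cannot be parsed.
    if not record.strip():
        raise ValueError("cannot parse record")
    id, _, payload = record.partition(" ")
    return id, payload

def toy_fingerprint(payload):
    # Hypothetical "fingerprinter": fails on an empty payload,
    # otherwise returns a one-byte string.
    if not payload:
        raise ValueError("cannot compute fingerprint")
    return bytes([len(payload) % 256])

def parse_id_and_fingerprint(record, errors="strict"):
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    try:
        id, payload = toy_parse(record)
    except ValueError as err:
        if errors == "strict":
            raise
        if errors == "report":
            print("ERROR: %s" % (err,), file=sys.stderr)
        return None, None      # the molecule could not be parsed
    try:
        fp = toy_fingerprint(payload)
    except ValueError as err:
        if errors == "strict":
            raise
        if errors == "report":
            print("ERROR: %s" % (err,), file=sys.stderr)
        return id, None        # the id is known but there is no fingerprint
    return id, fp
```

With this sketch, ``parse_id_and_fingerprint("", errors="ignore")`` gives (None, None) silently, while the same call with ``errors="strict"`` raises the exception.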
:param content: the string containing at least one structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a pair of (id string, fingerprint byte string) .. py:method:: make_id_and_molecule_fingerprint_parser(format, id_tag=None, reader_args=None, errors="strict") Make a function which parses a molecule from a record and returns the id and computed fingerprint This is a very specialized function, designed for performance, but it doesn't appear to give any advantage. You likely don't need it. Return a function which parses a content string containing structure records in the given *format* to get a molecule. Use the molecule to compute the fingerprint and get its id. For an SD record use *id_tag* to get the record id from the given SD tag instead of from the title line. The new function will return the (id, fingerprint) pair. The *reader_args* dictionary parameters depend on the toolkit and format. For details see the docstring for ``self.toolkit.read_molecules``. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and returns None for values it cannot compute, and "ignore" is like "report" but without the error message. For "report" and "ignore", if the molecule cannot be parsed then the result will be (None, None). If the fingerprint cannot be computed then the result will be (id, None).
:param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a function which takes a content string and returns an (id, fingerprint) pair .. py:method:: compute_fingerprint(mol) Compute and return the fingerprint byte string for the toolkit molecule :param mol: a toolkit molecule :returns: the fingerprint as a byte string .. py:method:: compute_fingerprints(mols) Compute and return the fingerprint for each toolkit molecule in an iterator This function is a slightly optimized version of:: for mol in mols: yield self.compute_fingerprint(mol) :param mols: an iterable of toolkit molecules :returns: a generator of fingerprints, one per molecule .. py:method:: get_fingerprint_family() Return the fingerprint family for this fingerprint type :returns: a :class:`FingerprintFamily` Open Babel fingerprints ----------------------- Open Babel implements four fingerprint families and chemfp implements two fingerprint families using the Open Babel toolkit. These are: * OpenBabel-FP2 - Indexes linear fragments up to 7 atoms. * OpenBabel-FP3 - SMARTS patterns specified in the file patterns.txt * OpenBabel-FP4 - SMARTS patterns specified in the file SMARTS_InteLigand.txt * OpenBabel-MACCS - SMARTS patterns specified in the file MACCS.txt, which implements nearly all of the 166 MACCS keys * RDMACCS-OpenBabel - a chemfp implementation of nearly all of the MACCS keys * ChemFP-Substruct-OpenBabel - an experimental chemfp implementation of the PubChem keys Most people use FP2 and MACCS. Note: chemfp-2.0 implements both RDMACCS-OpenBabel/1 and RDMACCS-OpenBabel/2. Version 1 did not have a definition for key 44. ..
py:module:: chemfp.openbabel_types OpenBabelFP2FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelFP2FingerprintType_v1 OpenBabel FP2 fingerprint based on path enumeration See http://openbabel.org/wiki/FP2 This is a Daylight-like path enumeration fingerprint with 1021 bits. The OpenBabel-FP2/1 :class:`.FingerprintType` has no parameters. OpenBabelFP3FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelFP3FingerprintType_v1 OpenBabel FP3 fingerprint See http://openbabel.org/wiki/FP3 55 bit fingerprints based on a set of SMARTS patterns defining functional groups. The OpenBabel-FP3/1 :class:`.FingerprintType` has no parameters. OpenBabelFP4FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelFP4FingerprintType_v1 OpenBabel FP4 fingerprint http://openbabel.org/wiki/FP4 307 bit fingerprints based on a set of SMARTS patterns defining functional groups. The OpenBabel-FP4/1 :class:`.FingerprintType` has no parameters. OpenBabelMACCSFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelMACCSFingerprintType_v1 Open Babel's implementation of the 166 MACCS keys WARNING: This implementation contains serious bugs! All of the ring sizes are wrong. See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt . The OpenBabel-MACCS/1 :class:`.FingerprintType` has no parameters. Note: this version is only available in older (pre-2012) versions of Open Babel. OpenBabelMACCSFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelMACCSFingerprintType_v2 Open Babel's implementation of the 166 MACCS keys See http://openbabel.org/wiki/Tutorial:Fingerprints and https://github.com/openbabel/openbabel/blob/master/data/MACCS.txt . Note: Open Babel added support for key 44 on 20 October 2014. This should have been version 3. However, I didn't notice until 1 May 2017 that there was no chemfp test for it. 
Since everyone has been using it as v2, and very few people used the older version, I won't change the version number. The OpenBabel-MACCS/2 :class:`.FingerprintType` has no parameters. OpenBabelECFP0FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP0FingerprintType_v1 Open Babel's implementation of the ECFP0 fingerprint This is a circular fingerprint of diameter 0. The OpenBabel-ECFP0/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) OpenBabelECFP2FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP2FingerprintType_v1 Open Babel's implementation of the ECFP2 fingerprint This is a circular fingerprint of diameter 2. The OpenBabel-ECFP2/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) OpenBabelECFP4FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP4FingerprintType_v1 Open Babel's implementation of the ECFP4 fingerprint This is a circular fingerprint of diameter 4. The OpenBabel-ECFP4/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) OpenBabelECFP6FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP6FingerprintType_v1 Open Babel's implementation of the ECFP6 fingerprint This is a circular fingerprint of diameter 6. The OpenBabel-ECFP6/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) OpenBabelECFP8FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP8FingerprintType_v1 Open Babel's implementation of the ECFP8 fingerprint This is a circular fingerprint of diameter 8. 
The OpenBabel-ECFP8/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) OpenBabelECFP10FingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenBabelECFP10FingerprintType_v1 Open Babel's implementation of the ECFP10 fingerprint This is a circular fingerprint of diameter 10. The OpenBabel-ECFP10/1 :class:`.FingerprintType` parameter is: * nBits - the number of bits in the fingerprint (default: 4096 and must be a power of 2) .. py:module:: chemfp.openbabel_patterns SubstructOpenBabelFingerprinter_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: SubstructOpenBabelFingerprinter_v1 chemfp's Substruct fingerprint implementation for Open Babel, version 1 WARNING: these fingerprints have not been validated. The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits. The ChemFP-Substruct-OpenBabel/1 :class:`.FingerprintType` has no parameters. RDMACCSOpenBabelFingerprinter_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSOpenBabelFingerprinter_v1 chemfp's RDMACCS fingerprint implementation for Open Babel, version 1 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version does not define key 44. The RDMACCS-OpenBabel/1 :class:`.FingerprintType` has no parameters. RDMACCSOpenBabelFingerprinter_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSOpenBabelFingerprinter_v2 chemfp's RDMACCS fingerprint implementation for Open Babel, version 2 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version defines key 44. The RDMACCS-OpenBabel/2 :class:`.FingerprintType` has no parameters.
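The Open Babel ECFP types above take a single ``nBits`` parameter which must be a power of 2. As a minimal sketch, assuming a type string of the form ``name/version key=value ...`` (an assumption made here for illustration; use ``get_type()`` to get the canonical string), the following toolkit-free helpers split such a string and check that constraint. ``split_type_string`` and ``is_power_of_two`` are hypothetical names, not chemfp functions.

```python
def is_power_of_two(n):
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0

def split_type_string(type_string):
    # Assumed layout: "<base_name>/<version> key=value key=value ..."
    terms = type_string.split()
    if not terms:
        raise ValueError("empty type string")
    base_name, _, version = terms[0].partition("/")
    params = {}
    for term in terms[1:]:
        key, eq, value = term.partition("=")
        if not eq:
            raise ValueError("expected key=value, got %r" % (term,))
        params[key] = value   # values are left as strings
    return base_name, (version or None), params

base_name, version, params = split_type_string("OpenBabel-ECFP4/1 nBits=4096")
nbits_ok = is_power_of_two(int(params["nBits"]))
```

The bit-trick test is only a local sanity check; the toolkit itself remains the authority on which values it accepts.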
OpenEye fingerprints -------------------- OpenEye's OEGraphSim library implements four bitstring-based fingerprint families, and chemfp implements two fingerprint families based on OEChem. These are: * OpenEye-Path - exhaustive enumeration of all linear fragments up to a given size * OpenEye-Circular - exhaustive enumeration of all circular fragments grown radially from each heavy atom up to a given radius * OpenEye-Tree - exhaustive enumeration of all trees up to a given size * OpenEye-MACCS166 - an implementation of the 166 MACCS keys * RDMACCS-OpenEye - a chemfp implementation of the 166 MACCS keys * ChemFP-Substruct-OpenEye - an experimental chemfp implementation of the PubChem keys Note: chemfp-2.0 implements both RDMACCS-OpenEye/1 and RDMACCS-OpenEye/2. Version 1 did not have a definition for key 44. .. py:module:: chemfp.openeye_types OpenEyeCircularFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenEyeCircularFingerprintType_v2 OEGraphSim fingerprint based on circular fingerprints around heavy atoms, version 2 See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-circular The OpenEye-Circular/2 :class:`.FingerprintType` parameters are: * numbits - the number of bits in the fingerprint (default: 4096) * minradius - the minimum radius (default: 0) * maxradius - the maximum radius (default: 5) * atype - the atom type (default: "Default") * btype - the bond type (default: "Default") The atype is either 0 or a '|' separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic, The btype is either 0 or a '|' separated string containing one or more of the following: BondOrder, Chiral, InRing. OpenEyeMACCSFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. 
py:class:: OpenEyeMACCSFingerprintType_v2 OEGraphSim implementation of the 166 MACCS keys, version 2 See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs . The OpenEye-MACCS166/2 :class:`.FingerprintType` has no parameters. This corresponds to GraphSim version '2.0.0'. OpenEyeMACCSFingerprintType_v3 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenEyeMACCSFingerprintType_v3 OEGraphSim implementation of the 166 MACCS keys, version 3 See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#maccs . The OpenEye-MACCS166/3 :class:`.FingerprintType` has no parameters. This corresponds to GraphSim version '2.2.0', with fixes for bits 91 and 92. OpenEyePathFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenEyePathFingerprintType_v2 OEGraphSim fingerprint based on path-based enumeration, version 2 See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-path The OpenEye-Path/2 :class:`.FingerprintType` parameters are: * numbits - the number of bits in the fingerprint (default: 4096) * minbonds - the minimum number of bonds (default: 0) * maxbonds - the maximum number of bonds (default: 5) * atype - the atom type (default: "Default") * btype - the bond type (default: "Default") The atype is either 0 or a '|' separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic, The btype is either 0 or a '|' separated string containing one or more of the following: BondOrder, Chiral, InRing. OpenEyeTreeFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. 
py:class:: OpenEyeTreeFingerprintType_v2 OEGraphSim fingerprint based on tree fingerprints, version 2 See https://docs.eyesopen.com/toolkits/cpp/graphsimtk/fingerprint.html#section-fingerprint-tree The OpenEye-Tree/2 :class:`.FingerprintType` parameters are: * numbits - the number of bits in the fingerprint (default: 4096) * minbonds - minimum number of bonds in the tree * maxbonds - maximum number of bonds in the tree * atype - the atom type (default: "Default") * btype - the bond type (default: "Default") The atype is either 0 or a '|' separated string containing one or more of the following: Aromaticity, AtomicNumber, Chiral, EqHBondAcceptor, EqHBondDonor, EqHalogen, FormalCharge, HCount, HvyDegree, Hybridization, InRing, EqAromatic. The btype is either 0 or a '|' separated string containing one or more of the following: BondOrder, Chiral, InRing. OpenEyeMoleculeScreenFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenEyeMoleculeScreenFingerprintType_v1 OEChem molecule screen using OESubSearchScreenType::Molecule See https://docs.eyesopen.com/toolkits/cpp/oechemtk/OEChemClasses/OESubSearchScreen.html This OpenEyeMoleculeScreenFingerprintType_v1 :class:`.FingerprintType` takes no parameters. Calling the fingerprinter with a QMol returns the query screen; calling with an OEMol returns a target screen. OpenEyeSMARTSScreenFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: OpenEyeSMARTSScreenFingerprintType_v1 OEChem SMARTS screen using OESubSearchScreenType::SMARTS See https://docs.eyesopen.com/toolkits/cpp/oechemtk/OEChemClasses/OESubSearchScreen.html This OpenEyeSMARTSScreenFingerprintType_v1 :class:`.FingerprintType` takes no parameters. Calling the fingerprinter with a QMol returns the query screen; calling with an OEMol returns a target screen. OpenEyeMDLScreenFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..
py:class:: OpenEyeMDLScreenFingerprintType_v1 OEChem MDL screen using OESubSearchScreenType::MDL See https://docs.eyesopen.com/toolkits/cpp/oechemtk/OEChemClasses/OESubSearchScreen.html This OpenEyeMDLScreenFingerprintType_v1 :class:`.FingerprintType` takes no parameters. Calling the fingerprinter with a QMol returns the query screen; calling with an OEMol returns a target screen. .. py:module:: chemfp.openeye_patterns SubstructOpenEyeFingerprinter_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: SubstructOpenEyeFingerprinter_v1 chemfp's Substruct fingerprint implementation for OEChem, version 1 WARNING: these fingerprints have not been validated. The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits. The ChemFP-Substruct-OpenEye/1 :class:`.FingerprintType` has no parameters. RDMACCSOpenEyeFingerprinter_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSOpenEyeFingerprinter_v1 chemfp's RDMACCS fingerprint implementation for OEChem, version 1 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version does not define key 44. The RDMACCS-OpenEye/1 :class:`.FingerprintType` has no parameters. RDMACCSOpenEyeFingerprinter_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSOpenEyeFingerprinter_v2 chemfp's RDMACCS fingerprint implementation for OEChem, version 2 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version defines key 44. The RDMACCS-OpenEye/2 :class:`.FingerprintType` has no parameters. RDKit fingerprints ------------------ RDKit implements six fingerprint families, and chemfp implements two fingerprint families based on RDKit.
These are: * RDKit-Fingerprint - exhaustive enumeration of linear and branched trees * RDKit-MACCS166 - The RDKit implementation of the MACCS keys * RDKit-Morgan - ECFP-like circular fingerprints * RDKit-AtomPair - atom pair fingerprints * RDKit-Torsion - topological-torsion fingerprints * RDKit-Pattern - substructure screen fingerprint * RDMACCS-RDKit - a chemfp implementation of the 166 MACCS keys * ChemFP-Substruct-RDKit - an experimental chemfp implementation of the PubChem keys Note: chemfp-2.0 implements both RDMACCS-RDKit/1 and RDMACCS-RDKit/2. Version 1 did not have a definition for key 44. .. py:module:: chemfp.rdkit_types RDKitFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitFingerprintType_v1 RDKit's Daylight-like fingerprint based on linear path and branched tree enumeration, version 1 See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint The RDKit-Fingerprint/1 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * minPath - minimum number of bonds (default: 1) * maxPath - maximum number of bonds (default: 7) * nBitsPerHash - number of bits to set for each path hash (default: 2) * useHs - include information about the number of hydrogens on each atom? (default: True) Note: this version is only available in older (pre-2014) versions of RDKit. RDKitFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^ ..
py:class:: RDKitFingerprintType_v2 RDKit's Daylight-like fingerprint based on linear path and branched tree enumeration, version 2 See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#RDKFingerprint The RDKit-Fingerprint/2 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * minPath - minimum number of bonds (default: 1) * maxPath - maximum number of bonds (default: 7) * nBitsPerHash - number of bits to set for each path hash (default: 2) * useHs - include information about the number of hydrogens on each atom? (default: True) * branchedPaths - include both branched and unbranched paths (default: True) * useBondOrder - use both bond orders in the path hashes (default: True) * fromAtoms - a comma-separated list of atom indices which must be part of the path enumeration RDKitMACCSFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitMACCSFingerprintType_v1 RDKit's implementation of the 166 MACCS keys, version 1 See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint The RDKit-MACCS166/1 fingerprints have no parameters. This version of RDKit does not support MACCS key 44 ("OTHER"). RDKitMACCSFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitMACCSFingerprintType_v2 RDKit's implementation of the 166 MACCS keys, version 2 See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMACCSKeysFingerprint The RDKit-MACCS166/2 fingerprints have no parameters. RDKit added this version in late 2014. RDKitMorganFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..
py:class:: RDKitMorganFingerprintType_v1 RDKit Morgan (ECFP-like) fingerprints, version 1 See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprintAsBitVect The RDKit-Morgan/1 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * radius - radius for the Morgan algorithm (default: 2) * useFeatures - use chemical-feature invariants (default: 0) * useChirality - use chirality information (default: 0) * useBondTypes - include bond type information (default: 1) * includeRedundantEnvironments - if set, the check for redundant atom environments will not be done (added in RDKit 2020-3) (default: 0) * fromAtoms - a comma-separated list of atom indices to use as centers RDKitAtomPairFingerprint_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitAtomPairFingerprint_v1 RDKit atom pair fingerprints, version 1 See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetHashedAtomPairFingerprintAsBitVect The RDKit-AtomPair/1 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * minLength - minimum bond count for a pair (default: 1) * maxLength - maximum bond count for a pair (default: 30) Note: this version is only available in older (pre-2012) versions of RDKit. RDKitAtomPairFingerprint_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^ ..
py:class:: RDKitAtomPairFingerprint_v2 RDKit atom pair fingerprints, version 2 See http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetHashedAtomPairFingerprintAsBitVect The RDKit-AtomPair/2 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * minLength - minimum bond count for a pair (default: 1) * maxLength - maximum bond count for a pair (default: 30, max: 63) * nBitsPerEntry - number of bits to use in simulating counts (default: 4) * includeChirality - if set, chirality will be used in the atom invariants (default: 0) * use2D - if 1, use a 2D distance matrix, if 0 use the 3D matrix from the first set of conformers, or return an empty fingerprint if no conformers (default: 1) * fromAtoms - a comma-separated list of atom indices which must be in the pair RDKitTorsionFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitTorsionFingerprintType_v1 RDKit torsion fingerprints, version 1 See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R. Venkataraghavan; "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" JCICS 27, 82-85 (1987). The RDKit-Torsion/1 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * targetSize - number of bonds per torsion (default: 4) Note: this version is only available in older (pre-2014) versions of RDKit. RDKitTorsionFingerprintType_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitTorsionFingerprintType_v2 RDKit torsion fingerprints, version 2 See http://www.rdkit.org/Python_Docs/rdkit.Chem.AtomPairs.Torsions-module.html An implementation of Topological-torsion fingerprints, as described in: R. Nilakantan, N. Bauman, J. S. Dixon, R.
Venkataraghavan; "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" JCICS 27, 82-85 (1987). The RDKit-Torsion/2 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * targetSize - number of bonds per torsion (default: 4) * nBitsPerEntry - number of bits to set per entry (default: 4) * includeChirality - include chirality information (default: 0) * fromAtoms - a comma-separated list of atom indices which must be part of the torsion RDKitPatternFingerprint_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitPatternFingerprint_v1 RDKit's experimental substructure screen fingerprint, version 1 See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint The RDKit-Pattern/1 fingerprint has no parameters. RDKitPatternFingerprint_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitPatternFingerprint_v2 RDKit's experimental substructure screen fingerprint, version 2 See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint The RDKit-Pattern/2 fingerprint has no parameters. RDKitPatternFingerprint_v3 ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitPatternFingerprint_v3 RDKit's experimental substructure screen fingerprint, version 3 See http://www.rdkit.org/Python_Docs/rdkit.Chem.rdmolops-module.html#PatternFingerprint The RDKit-Pattern/3 fingerprint has no parameters. This version was released 2017.03.1. RDKitSECFPFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitSECFPFingerprintType_v1 SECFP fingerprints The SMILES Extended Connectivity Fingerprint, as described in: Probst, D., Reymond, J. A probabilistic molecular fingerprint for big data settings. J Cheminform 10, 66 (2018). 
https://doi.org/10.1186/s13321-018-0321-8 https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0321-8 These are circular fingerprints which encode the circular region as a fragment SMILES, which is then hashed to produce the fingerprint bits. The RDKit-SECFP/1 :class:`.FingerprintType` parameters are: * fpSize - number of bits in the fingerprint (default: 2048) * radius - analogous to the radius for the Morgan algorithm (default: 3) * rings - include ring membership (default: 1) * isomeric - use isomeric SMILES (default: 0) * kekulize - Kekulize the molecule and use Kekule SMILES (default: 1) * min_radius - minimum radius for the Morgan algorithm (default: 1) RDKitAvalonFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDKitAvalonFingerprintType_v1 Avalon fingerprints The Avalon Cheminformatics toolkit is available from https://sourceforge.net/projects/avalontoolkit/ . It is not part of the core RDKit distribution. Instead, RDKit has a compile-time option to download and include it as part of the build process. The Avalon fingerprint are described in the supplemental information for "QSAR - How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets", Peter Gedeck, Bernhard Rohde, and Christian Bartels, J. Chem. Inf. Model., 2006, 46 (5), pp 1924-1936, DOI: 10.1021/ci050413p. The supplemental information is available from http://pubs.acs.org/doi/suppl/10.1021/ci050413p It uses a set of feature classes which "have been fine-tuned to provide good screen-out for the set of substructure queries encounted at Novartis while limiting redundancy." The classes are ATOM_COUNT, ATOM_SYMBOL_PATH, AUGMENTED_ATOM, AUGMENTED_BOND, HCOUNT_PAIR, HCOUNT_PATH, RING_PATH, BOND_PATH, HCOUNT_CLASS_PATH, ATOM_CLASS_PATH, RING_PATTERN, RING_SIZE_COUNTS, DEGREE_PATHS, CLASS_SPIDERS, FEATURE_PAIRS and ALL_PATTERNS. .. py:module:: chemfp.rdkit_patterns SubstructRDKitFingerprintType_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. 
py:class:: SubstructRDKitFingerprintType_v1 chemfp's Substruct fingerprint implementation for RDKit, version 1 WARNING: these fingerprints have not been validated. The Substruct fingerprints are CACTVS/PubChem-like fingerprints designed for use across multiple toolkits. The ChemFP-Substruct-RDKit/1 :class:`.FingerprintType` has no parameters. RDMACCSRDKitFingerprinter_v1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSRDKitFingerprinter_v1 chemfp's RDMACCS fingerprint implementation for RDKit, version 1 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version does not define key 44. The RDMACCS-RDKit/1 :class:`.FingerprintType` has no parameters. RDMACCSRDKitFingerprinter_v2 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. py:class:: RDMACCSRDKitFingerprinter_v2 chemfp's RDMACCS fingerprint implementation for RDKit, version 2 The RDMACCS keys are MACCS-166-like fingerprints based on RDKit's MACCS166 definition, but designed to be (slightly) more portable across multiple chemistry toolkits. This version defines key 44. The RDMACCS-RDKit/2 :class:`.FingerprintType` has no parameters. chemfp.arena module =================== There should be no reason for you to import this module yourself. It contains the :class:`.FingerprintArena` implementation. FingerprintArena instances are returned as part of the public API but should not be constructed directly. Instead, use :func:`chemfp.load_fingerprints` to create an arena. .. py:module:: chemfp.arena FingerprintArena ---------------- .. py:class:: FingerprintArena Store fingerprints in a contiguous block of memory for fast searches A fingerprint arena implements the :class:`chemfp.FingerprintReader` API. A fingerprint arena stores all of the fingerprints in a contiguous block of memory, so the per-molecule overhead is very low.
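As a rough illustration of the contiguous layout, here is a pure-Python sketch which does not use chemfp. The helper names ``pack_fingerprints`` and ``fingerprint_at`` are hypothetical, and the record layout (each fingerprint zero-padded to a fixed ``storage_size``) is an assumption made for the sketch, not a description of chemfp's internal format.

```python
def pack_fingerprints(fingerprints, storage_size):
    """Concatenate equal-length fingerprints into one block of bytes.

    Each record is zero-padded on the right to storage_size bytes
    (hypothetical layout, chosen only for illustration).
    """
    block = bytearray()
    for fp in fingerprints:
        if len(fp) > storage_size:
            raise ValueError("fingerprint is larger than storage_size")
        block += fp.ljust(storage_size, b"\x00")
    return bytes(block)


def fingerprint_at(block, index, num_bytes, storage_size):
    """Return the num_bytes fingerprint stored in record number index."""
    start = index * storage_size
    return block[start:start + num_bytes]


fps = [b"\x01\x02\x03", b"\xff\x00\x10"]
block = pack_fingerprints(fps, storage_size=8)
second = fingerprint_at(block, 1, num_bytes=3, storage_size=8)
```

Because every record has the same size, a lookup is a single offset calculation and there are no per-fingerprint Python objects to keep around.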
The fingerprints can be sorted by popcount, so the fingerprints with no bits set come first, followed by those with 1 bit, etc. If ``self.popcount_indices`` is a non-empty string then the string contains information about the start and end offsets for all the fingerprints with a given popcount. This information is used for the sublinear search methods. The public attributes are: .. py:attribute:: metadata :class:`chemfp.Metadata` about the fingerprints .. py:attribute:: ids list of identifiers, in index order .. py:attribute:: fingerprints *Added in version 3.3.* a :class:`.FingerprintList` list-like view of the fingerprints, in index order Other attributes, which might be subject to change, and which I won't fully explain, are: * arena - a contiguous block of memory, which contains the fingerprints * start_padding - number of bytes to the first fingerprint in the block * end_padding - number of bytes after the last fingerprint in the block * storage_size - number of bytes used to store a fingerprint * num_bytes - number of bytes in each fingerprint (must be <= storage_size) * num_bits - number of bits in each fingerprint * alignment - the fingerprint alignment * start - the index for the first fingerprint in the arena/subarena * end - the index for the last fingerprint in the arena/subarena * arena_ids - all of the identifiers for the parent arena The FingerprintArena is its own context manager, but it does nothing on context exit. The derived FPBFingerprintArena may use a memory-mapped FPB file, which will be closed by the context manager or by an explicit call to close(). .. py:method:: __len__() Number of fingerprint records in the FingerprintArena .. py:method:: __getitem__(i) Return the (id, fingerprint) pair at index i .. py:method:: __iter__() Iterate over the (id, fingerprint) contents of the arena .. 
py:method:: get_fingerprint_type() Get the fingerprint type object based on the metadata's type field This uses ``self.metadata.type`` to get the fingerprint type string then calls :func:`chemfp.get_fingerprint_type` to get and return a :class:`chemfp.types.FingerprintType` instance. This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn't available. :returns: a :class:`chemfp.types.FingerprintType` .. py:method:: get_fingerprint(i) Return the fingerprint at index *i* Raises an IndexError if index *i* is out of range. .. py:method:: get_by_id(id) Given the record identifier, return the (id, fingerprint) pair. If the *id* is not present then return None. .. py:method:: get_index_by_id(id) Given the record identifier, return the record index If the *id* is not present then return None. .. py:method:: get_fingerprint_by_id(id) Given the record identifier, return its fingerprint If the *id* is not present then return None. .. py:method:: save(destination, format=None, level=None) Save the fingerprints to a given destination and format The output format is based on the *format*. If the format is None then the format depends on the *destination* file extension. If the extension isn't recognized then the fingerprints will be saved in "fps" format. If the output format is "fps", "fps.gz", or "fps.zst" then *destination* may be a filename, a file object, or None; None writes to stdout. If the output format is "fpb" then *destination* must be a filename or seekable file object. Chemfp cannot save to compressed FPB files. :param destination: the output destination :type destination: a filename, file object, or None :param format: the output format :type format: None, "fps", "fps.gz", "fps.zst", or "fpb" :param level: compression level when writing .gz or .zst files :type level: an integer, or "min", "default", or "max" for compressor-specific values :returns: None ..
py:method:: iter_arenas(arena_size=1000) Base class for all chemfp objects holding fingerprint records All FingerprintReader instances have a ``metadata`` attribute containing a Metadata and can be iterated over to get the (id, fingerprint) for each record. .. py:method:: copy(indices=None, reorder=None) Create a new arena using either all or some of the fingerprints in this arena By default this creates a new arena. The fingerprint data block and ids may be shared with the original arena, which makes this a shallow copy. If the original arena is a slice, or "sub-arena" of an arena, then the copy will allocate new space to store just the fingerprints in the slice and use its own list for the ids. The *indices* parameter, if not None, is an iterable which contains the indices of the fingerprint records to copy. Duplicates are allowed, though discouraged. If *indices* are specified then the default *reorder* value of None, or the value True, will reorder the fingerprints for the new arena by popcount. This improves overall search performance. If *reorder* is False then the new arena will preserve the order given by the indices. If *indices* are not specified, then the default is to preserve the order type of the original arena. Use ``reorder=True`` to always reorder the fingerprints in the new arena by popcount, and ``reorder=False`` to always leave them in the current ordering.
>>> import chemfp >>> arena = chemfp.load_fingerprints("pubchem_queries.fps") >>> arena.ids[1], arena.ids[5], arena.ids[10], arena.ids[18] (b'9425031', b'9425015', b'9425040', b'9425033') >>> len(arena) 19 >>> new_arena = arena.copy(indices=[1, 5, 10, 18]) >>> len(new_arena) 4 >>> new_arena.ids [b'9425031', b'9425015', b'9425040', b'9425033'] >>> new_arena = arena.copy(indices=[18, 10, 5, 1], reorder=False) >>> new_arena.ids [b'9425033', b'9425040', b'9425015', b'9425031'] :param indices: indicies of the records to copy into the new arena :type indices: iterable containing integers, or None :param reorder: describes how to order the fingerprints :type reorder: True to reorder, False to leave in input order, None for default action .. py:method:: to_numpy_array() *Added in version 3.4.* Get the fingerprint bytes in a chemfp arena as NumPy uint8 array. A chemfp arena stores fingerprints in a contiguous byte string. This function returns a 2D NumPy array which is a view of that string. The array has `len(arena)` rows and `arena.storage_size` columns. The storage size may be larger than the minimum number of bytes in the fingerprint because of zero padding used to improve performance. For example, the 166-bit MACCS keys uses 24 bytes of storage when only 21 bytes are needed, because then chemfp can use the fast POPCNT instruction when computing the Tanimoto. To remove extra padding bytes, use NumPy indexing to copy the fingerprint bytes to a new array:: arr[:,0:arena.num_bytes] The last column of this new array may contain padding bits if the number of bits in a fingerprint is not a multiple of 8. .. WARNING:: Do not attempt to access the contents of a NumPy view of a FPBFingerprintArena (the arena from an FPB file) after the FPB file has been closed as that will likely cause a segmentation fault or other severe failure. :returns: a NumPy array of type uint8 .. 
py:method:: to_numpy_bitarray(bitlist=None) *Added in version 3.4.* Get the fingerprint bits in a chemfp arena as a NumPy uint8 array. This function returns a 2D NumPy array with len(arena) rows and one column for each bit. The default returns `arena.num_bits` columns, where column 0 is the first bit, etc. Use `bitlist` to specify the indices of the columns to return. Negative indices are supported; -1 is the last bit, -2 is the second to last. Out of range indices raise an IndexError. :param bitlist: bit column indices to use (default: all bits) :type bitlist: iterable of integers :returns: a NumPy array of type uint8 .. py:method:: count_tanimoto_hits_fp(query_fp, threshold=0.7) Count the fingerprints which are sufficiently similar to the query fingerprint Return the number of fingerprints in the arena which are at least *threshold* similar to the query fingerprint *query_fp*. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: integer count .. py:method:: threshold_tanimoto_search_fp(query_fp, threshold=0.7) Find the fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this arena which are at least *threshold* similar to the query fingerprint *query_fp*. The hits are returned as a :class:`.SearchResult`, in arbitrary order. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` .. py:method:: knearest_tanimoto_search_fp(query_fp, k=3, threshold=0.7) Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this arena which are at least *threshold* similar to the query fingerprint, and of those, select the top *k* hits.
The hits are returned as a :class:`.SearchResult`, sorted from highest score to lowest. :param query_fp: query fingerprint :type query_fp: byte string :param k: maximum number of hits to return (default: 3) :type k: integer :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` .. py:method:: count_tversky_hits_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0) Count the fingerprints which are sufficiently similar to the query fingerprint Return the number of fingerprints in the arena which are at least *threshold* similar to the query fingerprint *query_fp*. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: integer count .. py:method:: threshold_tversky_search_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0) Find the fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this arena which are at least *threshold* similar to the query fingerprint *query_fp*. The hits are returned as a :class:`.SearchResult`, in arbitrary order. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` .. py:method:: knearest_tversky_search_fp(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0) Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this arena which are at least *threshold* similar to the query fingerprint, and of those, select the top *k* hits. The hits are returned as a :class:`.SearchResult`, sorted from highest score to lowest.
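The threshold-then-top-*k* selection can be sketched in pure Python, without chemfp. This is only an illustration under stated assumptions: the helper names ``tversky`` and ``knearest_tversky`` are hypothetical, the score uses the textbook Tversky index, and returning 0.0 for a zero denominator is a choice made for the sketch rather than a statement about chemfp's behavior.

```python
import heapq


def popcount(fp):
    """Number of on-bits in a byte string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Textbook Tversky index; alpha=beta=1.0 gives the Tanimoto score."""
    both = sum(bin(q & t).count("1") for q, t in zip(query_fp, target_fp))
    only_query = popcount(query_fp) - both
    only_target = popcount(target_fp) - both
    denominator = alpha * only_query + beta * only_target + both
    # Assumption for this sketch: an undefined score is reported as 0.0.
    return both / denominator if denominator else 0.0


def knearest_tversky(query_fp, targets, k=3, threshold=0.7, alpha=1.0, beta=1.0):
    """Return up to k (index, score) pairs, highest score first."""
    hits = []
    for index, target_fp in enumerate(targets):
        score = tversky(query_fp, target_fp, alpha, beta)
        if score >= threshold:  # apply the threshold before selecting the top k
            hits.append((index, score))
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [b"\x0f", b"\x07", b"\xf0", b"\x0f"]
nearest = knearest_tversky(b"\x0f", targets, k=3, threshold=0.7)
```

With *alpha* = 1.0 and *beta* = 0.0 the extra bits in the target are ignored, so the score measures how much of the query is contained in the target.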
:param query_fp: query fingerprint :type query_fp: byte string :param k: maximum number of hits to return (default: 3) :type k: integer :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` FingerprintList --------------- .. py:class:: FingerprintList *Added in version 3.3.* A read-only list-like view of the arena fingerprints This implements the standard Python list API, including indexing and iteration. Note: fingerprint searches like "fp in fingerprint_list" and "fingerprint_list.index(fp)" are not fast. chemfp.search module ==================== .. _chemfp_search: .. py:module:: chemfp.search The following functions and classes are in the chemfp.search module. There are three main classes of functions. The ones ending with ``*_fp`` use a query fingerprint to search a target arena. The ones ending with ``*_arena`` use a query arena to search a target arena. The ones ending with ``*_symmetric`` use an arena to search itself, except that a fingerprint is not tested against itself. These functions share the same name with very similar functions in the top-level :mod:`chemfp` module. My apologies for any confusion. The top-level functions are designed to work with both arenas and iterators as the target. They give a simple search API, and automatically process in blocks, to give a balanced trade-off between performance and response time for the first results. The functions in this module only work with an arena as the target. By default they search the entire arena before returning. If you want to process portions of the arena then you need to specify the range yourself. count_tanimoto_hits_fp ---------------------- ..
py:function:: count_tanimoto_hits_fp(query_fp, target_arena, threshold=0.7) Count the number of hits in *target_arena* at least *threshold* similar to the *query_fp* Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(chemfp.search.count_tanimoto_hits_fp(query_fp, targets, threshold=0.1)) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: an integer count count_tanimoto_hits_arena ------------------------- .. py:function:: count_tanimoto_hits_arena(query_arena, target_arena, threshold=0.7) For each fingerprint in *query_arena*, count the number of hits in *target_arena* at least *threshold* similar to it Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") counts = chemfp.search.count_tanimoto_hits_arena(queries, targets, threshold=0.1) print(counts[:10]) The result is implementation specific. You'll always be able to get its length and do an index lookup to get an integer count. Currently it's a ctypes array of longs, but it could be an array.array or Python list in the future. :param query_arena: The query fingerprints. :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: an array of counts count_tanimoto_hits_symmetric ----------------------------- .. py:function:: count_tanimoto_hits_symmetric(arena, threshold=0.7, batch_size=100) For each fingerprint in the *arena*, count the number of other fingerprints at least *threshold* similar to it A fingerprint never matches itself.
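To make the counting rule concrete, here is a pure-Python sketch which does not use chemfp. The names ``tanimoto`` and ``count_hits_symmetric`` are hypothetical, and treating two empty fingerprints as having a score of 0.0 is an assumption of the sketch. It visits only the upper triangle (j > i), so a fingerprint is never compared with itself, and it credits each hit to both rows.

```python
def tanimoto(fp1, fp2):
    """Tanimoto score between two equal-length byte string fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Assumption for this sketch: two empty fingerprints score 0.0.
    return both / either if either else 0.0


def count_hits_symmetric(fingerprints, threshold=0.7):
    """For each fingerprint, count the other fingerprints scoring >= threshold."""
    n = len(fingerprints)
    counts = [0] * n
    for i in range(n):
        # Start at i + 1: skip the diagonal and the lower triangle.
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                counts[i] += 1  # hit for row i
                counts[j] += 1  # the same hit, by symmetry, for row j
    return counts


fingerprints = [b"\x0f", b"\x07", b"\xf0", b"\x0f"]
counts = count_hits_symmetric(fingerprints, threshold=0.7)
```

This brute-force loop is quadratic in the number of fingerprints and is meant only to show which pairs are counted.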
The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. Instead, process only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. I can't detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it's useful to keep as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("targets.fps") counts = chemfp.search.count_tanimoto_hits_symmetric(arena, threshold=0.2) print(counts[:10]) The result object is implementation specific. You'll always be able to get its length and do an index lookup to get an integer count. Currently it's a ctypes array of longs, but it could be an array.array or Python list in the future. :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param batch_size: the number of rows to process before checking for a ``^C`` :type batch_size: integer :returns: an array of counts partial_count_tanimoto_hits_symmetric ------------------------------------- .. py:function:: partial_count_tanimoto_hits_symmetric(counts, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None) Compute a portion of the symmetric Tanimoto counts For most cases, use :func:`chemfp.search.count_tanimoto_hits_symmetric` instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1. *counts* is a contiguous array of integers. It should be initialized to zeros, and reused for successive calls.
The function adds counts for counts[*query_start*:*query_end*] based on computing the upper-triangle portion contained in the rectangle *query_start*:*query_end* and *target_start*:*target_end* and using symmetry to fill in the lower half. You know, this is pretty complicated. Here's the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:: import chemfp import chemfp.search from chemfp import futures import array chemfp.set_num_threads(1) # Globally disable OpenMP arena = chemfp.load_fingerprints("targets.fps") # Load the fingerprints n = len(arena) counts = array.array("i", [0]*n) with futures.ThreadPoolExecutor(max_workers=4) as executor: for row in range(0, n, 10): executor.submit(chemfp.search.partial_count_tanimoto_hits_symmetric, counts, arena, threshold=0.2, query_start=row, query_end=min(row+10, n)) print(counts) :param counts: the accumulated Tanimoto counts :type counts: a contiguous block of integers :param arena: the fingerprints. :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param query_start: the query start row :type query_start: an integer :param query_end: the query end row :type query_end: an integer, or None to mean the last query row :param target_start: the target start row :type target_start: an integer :param target_end: the target end row :type target_end: an integer, or None to mean the last target row :returns: None count_tversky_hits_fp --------------------- ..
py:function:: count_tversky_hits_fp(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0) Count the number of hits in *target_arena* at least *threshold* similar to the *query_fp* (Tversky) Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(chemfp.search.count_tversky_hits_fp(query_fp, targets, threshold=0.1)) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: an integer count count_tversky_hits_arena ------------------------ .. py:function:: count_tversky_hits_arena(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0) For each fingerprint in *query_arena*, count the number of hits in *target_arena* at least *threshold* similar to it Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") counts = chemfp.search.count_tversky_hits_arena(queries, targets, threshold=0.1, alpha=0.5, beta=0.5) print(counts[:10]) The result is implementation specific. You'll always be able to get its length and do an index lookup to get an integer count. Currently it's a ctypes array of longs, but it could be an array.array or Python list in the future. :param query_arena: The query fingerprints. :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: an array of counts count_tversky_hits_symmetric ---------------------------- ..
py:function:: count_tversky_hits_symmetric(arena, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100) For each fingerprint in the *arena*, count the number of other fingerprints at least *threshold* similar to it A fingerprint never matches itself. The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. Instead, process only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. I can't detect any performance difference between the current value and a larger value, so it seems rather pointless to have. Let me know if it's useful to keep as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("targets.fps") counts = chemfp.search.count_tversky_hits_symmetric( arena, threshold=0.2, alpha=0.5, beta=0.5) print(counts[:10]) The result object is implementation specific. You'll always be able to get its length and do an index lookup to get an integer count. Currently it's a ctypes array of longs, but it could be an array.array or Python list in the future. :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param batch_size: the number of rows to process before checking for a ``^C`` :type batch_size: integer :returns: an array of counts partial_count_tversky_hits_symmetric ------------------------------------ .. py:function:: partial_count_tversky_hits_symmetric( counts, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None) Compute a portion of the symmetric Tversky counts For most cases, use :func:`chemfp.search.count_tversky_hits_symmetric` instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1. *counts* is a contiguous array of integers.
It should be initialized to zeros, and reused for successive calls. The function adds counts for counts[*query_start*:*query_end*] based on computing the upper-triangle portion contained in the rectangle *query_start*:*query_end* and *target_start*:*target_end* and using symmetry to fill in the lower half. You know, this is pretty complicated. Here's the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:: import chemfp import chemfp.search from chemfp import futures import array chemfp.set_num_threads(1) # Globally disable OpenMP arena = chemfp.load_fingerprints("targets.fps") # Load the fingerprints n = len(arena) counts = array.array("i", [0]*n) with futures.ThreadPoolExecutor(max_workers=4) as executor: for row in range(0, n, 10): executor.submit(chemfp.search.partial_count_tversky_hits_symmetric, counts, arena, threshold=0.2, alpha=0.5, beta=0.5, query_start=row, query_end=min(row+10, n)) print(counts) :param counts: the accumulated Tversky counts :type counts: a contiguous block of integers :param arena: the fingerprints. :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param query_start: the query start row :type query_start: an integer :param query_end: the query end row :type query_end: an integer, or None to mean the last query row :param target_start: the target start row :type target_start: an integer :param target_end: the target end row :type target_end: an integer, or None to mean the last target row :returns: None threshold_tanimoto_search_fp ---------------------------- .. py:function:: threshold_tanimoto_search_fp(query_fp, target_arena, threshold=0.7) Search for fingerprint hits in *target_arena* which are at least *threshold* similar to *query_fp* The hits in the returned :class:`chemfp.search.SearchResult` are in arbitrary order.
Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(list(chemfp.search.threshold_tanimoto_search_fp(query_fp, targets, threshold=0.15))) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResult` threshold_tanimoto_search_arena ------------------------------- .. py:function:: threshold_tanimoto_search_arena(query_arena, target_arena, threshold=0.7) Search for the hits in the *target_arena* at least *threshold* similar to the fingerprints in *query_arena* The hits in the returned :class:`chemfp.search.SearchResults` are in arbitrary order. Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") results = chemfp.search.threshold_tanimoto_search_arena(queries, targets, threshold=0.5) for query_id, query_hits in zip(queries.ids, results): if len(query_hits) > 0: print(query_id, "->", ", ".join(query_hits.get_ids())) :param query_arena: The query fingerprints. :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResults` threshold_tanimoto_search_symmetric ----------------------------------- .. py:function:: threshold_tanimoto_search_symmetric(arena, threshold=0.7, include_lower_triangle=True, batch_size=100) Search for the hits in the *arena* at least *threshold* similar to the fingerprints in the arena When *include_lower_triangle* is True, compute the upper-triangle similarities, then copy the results to get the full set of results. 
When *include_lower_triangle* is False, only compute the upper triangle. The hits in the returned :class:`chemfp.search.SearchResults` are in arbitrary order. The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. Instead, process only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("queries.fps") full_result = chemfp.search.threshold_tanimoto_search_symmetric(arena, threshold=0.2) upper_triangle = chemfp.search.threshold_tanimoto_search_symmetric( arena, threshold=0.2, include_lower_triangle=False) assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2 :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param include_lower_triangle: if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix :type include_lower_triangle: boolean :param batch_size: the number of rows to process before checking for a ``^C`` :type batch_size: integer :returns: a :class:`chemfp.search.SearchResults` partial_threshold_tanimoto_search_symmetric ------------------------------------------- .. py:function:: partial_threshold_tanimoto_search_symmetric(results, arena, threshold=0.7, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0) Compute a portion of the symmetric Tanimoto search results For most cases, use :func:`chemfp.search.threshold_tanimoto_search_symmetric` instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1. *results* is a :class:`chemfp.search.SearchResults` instance which is at least as large as the arena.
It should be reused for successive updates. The function adds hits to results[*query_start*:*query_end*], based on computing the upper-triangle portion contained in the rectangle *query_start*:*query_end* and *target_start*:*target_end*. It does not fill in the lower triangle. To get the full matrix, call :func:`fill_lower_triangle`. You know, this is pretty complicated. Here's the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:: import chemfp import chemfp.search from chemfp import futures import array chemfp.set_num_threads(1) arena = chemfp.load_fingerprints("targets.fps") n = len(arena) results = chemfp.search.SearchResults(n, n, arena.ids) with futures.ThreadPoolExecutor(max_workers=4) as executor: for row in range(0, n, 10): executor.submit(chemfp.search.partial_threshold_tanimoto_search_symmetric, results, arena, threshold=0.2, query_start=row, query_end=min(row+10, n)) chemfp.search.fill_lower_triangle(results) The hits in the :class:`chemfp.search.SearchResults` are in arbitrary order. :param results: the intermediate search results :type results: a :class:`chemfp.search.SearchResults` instance :param arena: the fingerprints. :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param query_start: the query start row :type query_start: an integer :param query_end: the query end row :type query_end: an integer, or None to mean the last query row :param target_start: the target start row :type target_start: an integer :param target_end: the target end row :type target_end: an integer, or None to mean the last target row :param results_offset: use results[results_offset] as the base for the results :type results_offset: an integer :returns: None fill_lower_triangle ------------------- ..
py:function:: fill_lower_triangle(results) Duplicate each entry of *results* to its transpose This is used after the symmetric threshold search to turn the upper-triangle results into a full matrix. :param results: search results :type results: a :class:`chemfp.search.SearchResults` threshold_tversky_search_fp --------------------------- .. py:function:: threshold_tversky_search_fp(query_fp, target_arena, threshold=0.7, alpha=1.0, beta=1.0) Search for fingerprint hits in *target_arena* which are at least *threshold* similar to *query_fp* The hits in the returned :class:`chemfp.search.SearchResult` are in arbitrary order. Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(list(chemfp.search.threshold_tversky_search_fp( query_fp, targets, threshold=0.15, alpha=0.5, beta=0.5))) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResult` threshold_tversky_search_arena ------------------------------ .. py:function:: threshold_tversky_search_arena(query_arena, target_arena, threshold=0.7, alpha=1.0, beta=1.0) Search for the hits in the *target_arena* at least *threshold* similar to the fingerprints in *query_arena* The hits in the returned :class:`chemfp.search.SearchResults` are in arbitrary order. Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") results = chemfp.search.threshold_tversky_search_arena( queries, targets, threshold=0.5, alpha=0.5, beta=0.5) for query_id, query_hits in zip(queries.ids, results): if len(query_hits) > 0: print(query_id, "->", ", ".join(query_hits.get_ids())) :param query_arena: The query fingerprints. 
:type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResults` threshold_tversky_search_symmetric ---------------------------------- .. py:function:: threshold_tversky_search_symmetric(arena, threshold=0.7, alpha=1.0, beta=1.0, include_lower_triangle=True, batch_size=100) Search for the hits in the *arena* at least *threshold* similar to the fingerprints in the arena When *include_lower_triangle* is True, compute the upper-triangle similarities, then copy the results to get the full set of results. When *include_lower_triangle* is False, only compute the upper triangle. The hits in the returned :class:`chemfp.search.SearchResults` are in arbitrary order. The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. To avoid this, the function processes only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. Let me know if it really is useful for you to have as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("queries.fps") full_result = chemfp.search.threshold_tversky_search_symmetric( arena, threshold=0.2, alpha=0.5, beta=0.5) upper_triangle = chemfp.search.threshold_tversky_search_symmetric( arena, threshold=0.2, alpha=0.5, beta=0.5, include_lower_triangle=False) assert sum(map(len, full_result)) == sum(map(len, upper_triangle))*2 :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold.
:type threshold: float between 0.0 and 1.0, inclusive :param include_lower_triangle: if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix :type include_lower_triangle: boolean :param batch_size: the number of rows to process before checking for a ^C :type batch_size: integer :returns: a :class:`chemfp.search.SearchResults` partial_threshold_tversky_search_symmetric ------------------------------------------ .. py:function:: partial_threshold_tversky_search_symmetric( results, arena, threshold=0.7, alpha=1.0, beta=1.0, query_start=0, query_end=None, target_start=0, target_end=None, results_offset=0) Compute a portion of the symmetric Tversky search results For most cases, use :func:`chemfp.search.threshold_tversky_search_symmetric` instead of this function! This function is only useful for thread-pool implementations. In that case, set the number of OpenMP threads to 1. *results* is a :class:`chemfp.search.SearchResults` instance which is at least as large as the arena. It should be reused for successive updates. The function adds hits to results[*query_start*:*query_end*], based on computing the upper-triangle portion contained in the rectangle *query_start*:*query_end* and *target_start*:*target_end*. It does not fill in the lower triangle. To get the full matrix, call *fill_lower_triangle*. You know, this is pretty complicated. 
Here's the bare minimum example of how to use it correctly to process 10 rows at a time using up to 4 threads:: import chemfp import chemfp.search from chemfp import futures chemfp.set_num_threads(1) arena = chemfp.load_fingerprints("targets.fps") n = len(arena) results = chemfp.search.SearchResults(n, n, arena.ids) with futures.ThreadPoolExecutor(max_workers=4) as executor: for row in range(0, n, 10): executor.submit(chemfp.search.partial_threshold_tversky_search_symmetric, results, arena, threshold=0.2, alpha=0.5, beta=0.5, query_start=row, query_end=min(row+10, n)) chemfp.search.fill_lower_triangle(results) The hits in the :class:`chemfp.search.SearchResults` are in arbitrary order. :param results: the intermediate search results :type results: a SearchResults instance :param arena: the fingerprints. :type arena: a :class:`chemfp.arena.FingerprintArena` :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param query_start: the query start row :type query_start: an integer :param query_end: the query end row :type query_end: an integer, or None to mean the last query row :param target_start: the target start row :type target_start: an integer :param target_end: the target end row :type target_end: an integer, or None to mean the last target row :param results_offset: use results[results_offset] as the base for the results :type results_offset: an integer :returns: None knearest_tanimoto_search_fp --------------------------- .. py:function:: knearest_tanimoto_search_fp(query_fp, target_arena, k=3, threshold=0.7) Search for *k*-nearest hits in *target_arena* which are at least *threshold* similar to *query_fp* The hits in the :class:`chemfp.search.SearchResult` are ordered by decreasing similarity score.
Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(list(chemfp.search.knearest_tanimoto_search_fp(query_fp, targets, k=3, threshold=0.0))) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResult` knearest_tanimoto_search_arena ------------------------------ .. py:function:: knearest_tanimoto_search_arena(query_arena, target_arena, k=3, threshold=0.7) Search for the *k* nearest hits in the *target_arena* at least *threshold* similar to the fingerprints in *query_arena* The hits in the :class:`chemfp.search.SearchResults` are ordered by decreasing similarity score. Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") results = chemfp.search.knearest_tanimoto_search_arena(queries, targets, k=3, threshold=0.5) for query_id, query_hits in zip(queries.ids, results): if len(query_hits) >= 2: print(query_id, "->", ", ".join(query_hits.get_ids())) :param query_arena: The query fingerprints. :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResults` knearest_tanimoto_search_symmetric ---------------------------------- .. 
py:function:: knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.7, batch_size=100) Search for the *k*-nearest hits in the *arena* at least *threshold* similar to the fingerprints in the arena The hits in the :class:`SearchResults` are ordered by decreasing similarity score. The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. To avoid this, the function processes only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("queries.fps") results = chemfp.search.knearest_tanimoto_search_symmetric(arena, k=3, threshold=0.8) for (query_id, hits) in zip(arena.ids, results): print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores())) :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param include_lower_triangle: if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix :type include_lower_triangle: boolean :param batch_size: the number of rows to process before checking for a ^C :type batch_size: integer :returns: a :class:`chemfp.search.SearchResults` knearest_tversky_search_fp -------------------------- .. py:function:: knearest_tversky_search_fp(query_fp, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0) Search for *k*-nearest hits in *target_arena* which are at least *threshold* similar to *query_fp* The hits in the :class:`chemfp.search.SearchResult` are ordered by decreasing similarity score.
Example:: query_id, query_fp = chemfp.load_fingerprints("queries.fps")[0] targets = chemfp.load_fingerprints("targets.fps") print(list(chemfp.search.knearest_tversky_search_fp( query_fp, targets, k=3, threshold=0.0, alpha=0.5, beta=0.5))) :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: the target arena :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResult` knearest_tversky_search_arena ----------------------------- .. py:function:: knearest_tversky_search_arena(query_arena, target_arena, k=3, threshold=0.7, alpha=1.0, beta=1.0) Search for the *k* nearest hits in the *target_arena* at least *threshold* similar to the fingerprints in *query_arena* The hits in the :class:`chemfp.search.SearchResults` are ordered by decreasing similarity score. Example:: queries = chemfp.load_fingerprints("queries.fps") targets = chemfp.load_fingerprints("targets.fps") results = chemfp.search.knearest_tversky_search_arena( queries, targets, k=3, threshold=0.5, alpha=0.5, beta=0.5) for query_id, query_hits in zip(queries.ids, results): if len(query_hits) >= 2: print(query_id, "->", ", ".join(query_hits.get_ids())) :param query_arena: The query fingerprints. :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`chemfp.search.SearchResults` knearest_tversky_search_symmetric --------------------------------- ..
py:function:: knearest_tversky_search_symmetric(arena, k=3, threshold=0.7, alpha=1.0, beta=1.0, batch_size=100) Search for the *k*-nearest hits in the *arena* at least *threshold* similar to the fingerprints in the arena The hits in the :class:`SearchResults` are ordered by decreasing similarity score. The computation can take a long time. Python won't check for a ``^C`` until the function finishes. This can be irritating. To avoid this, the function processes only *batch_size* rows at a time before checking for a ``^C``. Note: the *batch_size* may disappear in future versions of chemfp. Let me know if it really is useful for you to keep as a user-defined parameter. Example:: arena = chemfp.load_fingerprints("queries.fps") results = chemfp.search.knearest_tversky_search_symmetric( arena, k=3, threshold=0.8, alpha=0.5, beta=0.5) for (query_id, hits) in zip(arena.ids, results): print(query_id, "->", ", ".join(("%s %.2f" % hit) for hit in hits.get_ids_and_scores())) :param arena: the set of fingerprints :type arena: a :class:`chemfp.arena.FingerprintArena` :param k: the number of nearest neighbors to find. :type k: positive integer :param threshold: The minimum score threshold. :type threshold: float between 0.0 and 1.0, inclusive :param include_lower_triangle: if False, compute only the upper triangle, otherwise use symmetry to compute the full matrix :type include_lower_triangle: boolean :param batch_size: the number of rows to process before checking for a ^C :type batch_size: integer :returns: a :class:`chemfp.search.SearchResults` contains_fp ----------- .. py:function:: contains_fp(query_fp, target_arena) Find the target fingerprints which contain the query fingerprint bits as a subset A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a :class:`chemfp.search.SearchResult` containing all of the target fingerprints in *target_arena* that contain the *query_fp*.
The SearchResult scores are all 0.0. There is currently no direct way to limit the arena search range. Instead create a subarena by using Python's slice notation on the arena then search the subarena. :param query_fp: the query fingerprint :type query_fp: a byte string :param target_arena: The target fingerprints. :type target_arena: a :class:`chemfp.arena.FingerprintArena` :returns: a SearchResult instance contains_arena -------------- .. py:function:: contains_arena(query_arena, target_arena) Find the target fingerprints which contain the query fingerprints as a subset A target fingerprint contains a query fingerprint if all of the on bits of the query fingerprint are also on bits of the target fingerprint. This function returns a :class:`chemfp.search.SearchResults` where SearchResults[i] contains all of the target fingerprints in *target_arena* that contain the fingerprint for entry *query_arena* [i]. The SearchResult scores are all 0.0. There is currently no direct way to limit the arena search range, though you can create and search a subarena by using Python's slice notation. :param query_arena: the query fingerprints :type query_arena: a :class:`chemfp.arena.FingerprintArena` :param target_arena: the target fingerprints :type target_arena: a :class:`chemfp.arena.FingerprintArena` :returns: a :class:`chemfp.search.SearchResults` instance, of the same size as query_arena SearchResults ------------- .. py:class:: SearchResults Search results for a list of query fingerprints against a target arena This acts like a list of SearchResult elements, with the ability to iterate over each search result, look them up by index, and get the number of scores. In addition, there are helper methods to iterate over each hit and to get the hit indices, scores, and identifiers directly as Python lists, sort the list contents, and more. .. py:method:: __len__() The number of rows in the SearchResults .. py:method:: __iter__() Iterate over each SearchResult hit ..
py:method:: __getitem__(i) Get the *i*-th SearchResult .. py:attribute:: SearchResults.shape Read-only attribute. The tuple (number of rows, number of columns). The number of columns is the size of the target arena. .. py:method:: iter_indices() For each hit, yield the list of target indices .. py:method:: iter_ids() For each hit, yield the list of target identifiers .. py:method:: iter_scores() For each hit, yield the list of target scores .. py:method:: iter_indices_and_scores() For each hit, yield the list of (target index, score) tuples .. py:method:: iter_ids_and_scores() For each hit, yield the list of (target id, score) tuples .. py:method:: clear_all() Remove all hits from all of the search results .. py:method:: count_all(min_score=None, max_score=None, interval="[]") Count the number of hits with a score between *min_score* and *max_score* Using the default parameters this returns the number of hits in the result. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: an integer count .. py:method:: cumulative_score_all(min_score=None, max_score=None, interval="[]") The sum of all scores in all rows which are between *min_score* and *max_score* Using the default parameters this returns the sum of all of the scores in all of the results.
With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: a floating point value .. py:method:: reorder_all(order="decreasing-score") Reorder the hits for all of the rows based on the requested *order*. The available orderings are: * increasing-score - sort by increasing score * decreasing-score - sort by decreasing score * increasing-index - sort by increasing target index * decreasing-index - sort by decreasing target index * move-closest-first - move the hit with the highest score to the first position * reverse - reverse the current ordering :param order: the name of the ordering to use :type order: string .. py:method:: to_csr(dtype=None) Return the results as a SciPy compressed sparse row matrix. The returned matrix has the same shape as the SearchResults instance and can be passed into, for example, a scikit-learn clustering algorithm. By default the scores are stored with a `dtype` of "float64". This method requires that SciPy (and NumPy) be installed. :param dtype: a NumPy numeric data type :type dtype: string or NumPy type SearchResult ------------ .. py:class:: SearchResult Search results for a query fingerprint against a target arena.
The result contains a list of hits. Hits contain a target index, score, and optional target ids. The hits can be reordered based on score or index. .. py:method:: __len__() The number of hits .. py:method:: __iter__() Iterate through the pairs of (target index, score) using the current ordering .. py:method:: clear() Remove all hits from this result .. py:method:: get_indices() The list of target indices, in the current ordering. .. py:method:: get_ids() The list of target identifiers (if available), in the current ordering .. py:method:: iter_ids() Iterate over target identifiers (if available), in the current ordering .. py:method:: get_scores() The list of target scores, in the current ordering .. py:method:: get_ids_and_scores() The list of (target identifier, target score) pairs, in the current ordering Raises a TypeError if the target IDs are not available. .. py:method:: get_indices_and_scores() The list of (target index, score) pairs, in the current ordering .. py:method:: reorder(ordering="decreasing-score") Reorder the hits based on the requested ordering. The available orderings are: * increasing-score - sort by increasing score * decreasing-score - sort by decreasing score * increasing-index - sort by increasing target index * decreasing-index - sort by decreasing target index * move-closest-first - move the hit with the highest score to the first position * reverse - reverse the current ordering :param string ordering: the name of the ordering to use .. py:method:: count(min_score=None, max_score=None, interval="[]") Count the number of hits with a score between *min_score* and *max_score* Using the default parameters this returns the number of hits in the result. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score.
The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: an integer count .. py:method:: cumulative_score(min_score=None, max_score=None, interval="[]") The sum of the scores which are between *min_score* and *max_score* Using the default parameters this returns the sum of all of the scores in the result. With a specified range this returns the sum of all of the scores in that range. The cumulative score is also known as the raw score. The default *min_score* of None is equivalent to -infinity. The default *max_score* of None is equivalent to +infinity. The *interval* parameter describes the interval end conditions. The default of "[]" uses a closed interval, where min_score <= score <= max_score. The interval "()" uses the open interval where min_score < score < max_score. The half-open/half-closed intervals "(]" and "[)" are also supported. :param min_score: the minimum score in the range. :type min_score: a float, or None for -infinity :param max_score: the maximum score in the range. :type max_score: a float, or None for +infinity :param interval: specify if the end points are open or closed. :type interval: one of "[]", "()", "(]", "[)" :returns: a floating point value .. py:method:: format_ids_and_scores_as_bytes(ids=None, precision=4) *Added in version 3.3.* Format the ids and scores as the byte string needed for simsearch output If there are no hits then the result is the empty string b"", otherwise it returns a byte string containing the tab-separated ids and scores, in the order ids[0], scores[0], ids[1], scores[1], ...
If the *ids* is not specified then the ids come from self.get_ids(). If no ids are available, a ValueError is raised. The ids must be a list of Unicode strings. The *precision* sets the number of decimal digits to use in the score output. It must be an integer value between 1 and 10, inclusive. This function is 3-4x faster than the Python equivalent, which is roughly:: ids = ids if (ids is not None) else self.get_ids() formatter = ("%s\t%." + str(precision) + "f").encode("ascii") return b"\t".join(formatter % pair for pair in zip(ids, self.get_scores())) :param ids: the identifiers to use for each hit. :type ids: a list of Unicode strings, or None to use the default :param precision: the precision to use for each score :type precision: an integer from 1 to 10, inclusive :returns: a byte string .. _chemfp.bitops: chemfp.bitops module ==================== .. py:module:: chemfp.bitops The following functions from the chemfp.bitops module provide low-level bit operations on byte and hex fingerprints. .. py:function:: byte_contains(sub_fp, super_fp) Return 1 if the on bits of *sub_fp* are also 1 bits in *super_fp*, that is, if *super_fp* contains *sub_fp*. .. py:function:: byte_contains_bit(fp, bit_index) Return True if the given bit position is on, otherwise False .. py:function:: byte_difference(fp1, fp2) Return the absolute difference (xor) between the two byte strings, fp1 ^ fp2 .. py:function:: byte_from_bitlist(fp[, num_bits=1024]) Convert a list of bit positions into a byte fingerprint, including modulo folding .. py:function:: byte_hex_tanimoto(fp1, fp2) Compute the Tanimoto similarity between the byte fingerprint *fp1* and the hex fingerprint *fp2*. Return a float between 0.0 and 1.0, or raise a ValueError if *fp2* is not a hex fingerprint .. py:function:: byte_hex_tversky(fp1, fp2, alpha=1.0, beta=1.0) Compute the Tversky index between the byte fingerprint *fp1* and the hex fingerprint *fp2*.
Return a float between 0.0 and 1.0, or raise a ValueError if *fp2* is not a hex fingerprint .. py:function:: byte_intersect(fp1, fp2) Return the intersection of the two byte strings, *fp1* & *fp2* .. py:function:: byte_intersect_popcount(fp1, fp2) Return the number of bits set in the intersection of the two byte fingerprints *fp1* and *fp2* .. py:function:: byte_popcount(fp) Return the number of bits set in the byte fingerprint *fp* .. py:function:: byte_tanimoto(fp1, fp2) Compute the Tanimoto similarity between the two byte fingerprints *fp1* and *fp2* .. py:function:: byte_to_bitlist(bitlist) Return a sorted list of the on-bit positions in the byte fingerprint .. py:function:: byte_tversky(fp1, fp2, alpha=1.0, beta=1.0) Compute the Tversky index between the two byte fingerprints *fp1* and *fp2* .. py:function:: byte_union(fp1, fp2) Return the union of the two byte strings, *fp1* | *fp2* .. py:function:: hex_contains(sub_fp, super_fp) Return 1 if the on bits of sub_fp are also on bits in super_fp, otherwise 0. Return -1 if either string is not a hex fingerprint .. py:function:: hex_contains_bit(fp, bit_index) Return True if the given bit position is on, otherwise False. This function does not validate that the hex fingerprint is actually in hex. .. py:function:: hex_difference(fp1, fp2) Return the absolute difference (xor) between the two hex strings, *fp1* ^ *fp2*. Raises a ValueError for non-hex fingerprints. .. py:function:: hex_from_bitlist(fp[, num_bits=1024]) Convert a list of bit positions into a hex fingerprint, including modulo folding .. py:function:: hex_intersect(fp1, fp2) Return the intersection of the two hex strings, *fp1* & *fp2*. Raises a ValueError for non-hex fingerprints. .. py:function:: hex_intersect_popcount(fp1, fp2) Return the number of bits set in the intersection of the two hex fingerprints *fp1* and *fp2*, or raise a ValueError if either string is a non-hex string ..
py:function:: hex_isvalid(s) Return 1 if the string *s* is a valid hex fingerprint, otherwise 0 .. py:function:: hex_popcount(fp) Return the number of bits set in a hex fingerprint *fp*, or -1 for non-hex strings .. py:function:: hex_tanimoto(fp1, fp2) Compute the Tanimoto similarity between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint .. py:function:: hex_tversky(fp1, fp2, alpha=1.0, beta=1.0) Compute the Tversky index between two hex fingerprints. Return a float between 0.0 and 1.0, or raise a ValueError if either string is not a hex fingerprint .. py:function:: hex_to_bitlist(bitlist) Return a sorted list of the on-bit positions in the hex fingerprint .. py:function:: hex_union(fp1, fp2) Return the union of the two hex strings, *fp1* | *fp2*. Raises a ValueError for non-hex fingerprints. .. py:function:: hex_encode(s) Encode the byte string or ASCII string to hex. Returns a text string. .. py:function:: hex_encode_as_bytes(s) Encode the byte string or ASCII string to hex. Returns a byte string. .. py:function:: hex_decode(s) Decode the hex-encoded value to a byte string chemfp.encodings ================ .. py:module:: chemfp.encodings Decode different fingerprint representations into chemfp form. (Currently only decoders are available. Future releases may include encoders.) The chemfp fingerprints are stored as byte strings, with the bytes in least-significant bit order (bit #0 is stored in the first/left-most byte) and with the bits in most-significant bit order (bit #0 is stored in the first/right-most bit of the first byte). Other systems use different encodings. These include: - the '0' and '1' characters, as in '00111101' - hex encoding, like '3d' - base64 encoding, like 'SGVsbG8h' - CACTVS's variation of base64 encoding plus variations of different LSB and MSB orders. This module decodes most of the fingerprint encodings I have come across.
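The byte and bit ordering described above can be illustrated with a small pure-Python sketch. The helper names ``binary_lsb_to_bytes`` and ``binary_msb_to_bytes`` are hypothetical and are not part of chemfp; they only show how a string of '0' and '1' characters would map onto the chemfp byte layout under the stated convention (bit #0 in the lowest bit of the first byte).

```python
def binary_lsb_to_bytes(text):
    """Hypothetical helper (not a chemfp function).

    Treat text[0] as bit #0, text[1] as bit #1, and so on, and pack the
    bits into bytes using the layout described above: bit #0 is the
    lowest-valued bit of the first byte.
    """
    num_bits = len(text)
    fp = bytearray((num_bits + 7) // 8)  # round up to a whole number of bytes
    for bit_index, char in enumerate(text):
        if char == "1":
            fp[bit_index // 8] |= 1 << (bit_index % 8)
        elif char != "0":
            raise ValueError("expected only '0' and '1' characters")
    return num_bits, bytes(fp)


def binary_msb_to_bytes(text):
    """Hypothetical helper: same as above, but bit #0 is the right-most character."""
    return binary_lsb_to_bytes(text[::-1])


# '00111101' read with bit #0 on the right is the single byte 0x3d.
print(binary_msb_to_bytes("00111101"))
# The same characters read with bit #0 on the left give a different byte.
print(binary_lsb_to_bytes("00111101"))
```

The two calls give different bytes for the same text, which is why the bit order of an external encoding has to be known before it is decoded.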
The fingerprint decoders return a 2-ple of the bit length and the chemfp fingerprint. The bit length is None unless the bit length is known exactly, which currently is only the case for the binary and CACTVS fingerprints. (The hex and other encoders must round the fingerprints up to a multiple of 8 bits.) from_binary_lsb --------------- .. py:function:: from_binary_lsb(text) Convert a string like '00010101' (bit 0 here is off) into '\xa8' The encoding characters '0' and '1' are in LSB order, so bit 0 is the left-most field. The result is a 2-ple of the fingerprint length and the decoded chemfp fingerprint >>> from_binary_lsb('00010101') (8, b'\xa8') >>> from_binary_lsb('11101') (5, b'\x17') >>> from_binary_lsb('00000000000000010000000000000') (29, b'\x00\x80\x00\x00') >>> from_binary_msb --------------- .. py:function:: from_binary_msb(text) Convert a string like '10101000' (bit 0 here is off) into '\xa8' The encoding characters '0' and '1' are in MSB order, so bit 0 is the right-most field. >>> from_binary_msb(b'10101000') (8, b'\xa8') >>> from_binary_msb(b'00010101') (8, b'\x15') >>> from_binary_msb(b'00111') (5, b'\x07') >>> from_binary_msb(b'00000000000001000000000000000') (29, b'\x00\x80\x00\x00') >>> from_base64 ----------- .. py:function:: from_base64(text) Decode a base64 encoded fingerprint string The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order. >>> from_base64("SGk=") (None, b'Hi') >>> from binascii import hexlify >>> hexlify(from_base64("SGk=")[1]) b'4869' >>> from_hex -------- .. py:function:: from_hex(text) Decode a hex encoded fingerprint string The encoded fingerprint must be in chemfp form, with the bytes in LSB order and the bits in MSB order. >>> from_hex(b'10f2') (None, b'\x10\xf2') >>> Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character. from_hex_msb ------------ .. 
py:function:: from_hex_msb(text) Decode a hex encoded fingerprint string where the bits and bytes are in MSB order >>> from_hex_msb(b'10f2') (None, b'\xf2\x10') >>> Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character. from_hex_lsb ------------ .. py:function:: from_hex_lsb(text) Decode a hex encoded fingerprint string where the bits and bytes are in LSB order >>> from_hex_lsb(b'102f') (None, b'\x08\xf4') >>> Raises a ValueError if the hex string is not a multiple of 2 bytes long or if it contains a non-hex character. from_cactvs ----------- .. py:function:: from_cactvs(text) Decode a 881-bit CACTVS-encoded fingerprint used by PubChem >>> from_cactvs(b"AAADceB7sQAEAAAAAAAAAAAAAAAAAWAAAAAwAAAAAAAAAAABwAAAHwIYAAAADA" + ... b"rBniwygJJqAACqAyVyVACSBAAhhwIa+CC4ZtgIYCLB0/CUpAhgmADIyYcAgAAO" + ... b"AAAAAAABAAAAAAAAAAIAAAAAAAAAAA==") (881, b'\x07\xde\x8d\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\x06\x00\x00\x00\x0c\x00\x00\x00\x00\x00\x00\x00\x00\x80\x03\x00\x00\xf8@\x18\x00\x00\x000P\x83y4L\x01IV\x00\x00U\xc0\xa4N*\x00I \x00\x84\xe1@X\x1f\x04\x1df\x1b\x10\x06D\x83\xcb\x0f)%\x10\x06\x19\x00\x13\x93\xe1\x00\x01\x00p\x00\x00\x00\x00\x00\x80\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x00\x00') >>> For format details, see ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt from_daylight ------------- .. py:function:: from_daylight(text) Decode a Daylight ASCII fingerprint >>> from_daylight(b"I5Z2MLZgOKRcR...1") (None, b'PyDaylight') See the implementation for format details. from_on_bit_positions --------------------- .. 
py:function:: from_on_bit_positions(text, num_bits=1024, separator=" ") Decode from a list of integers describing the location of the on bits >>> from_on_bit_positions("1 4 9 63", num_bits=32) (32, b'\x12\x02\x00\x80') >>> from_on_bit_positions("1,4,9,63", num_bits=64, separator=",") (64, b'\x12\x02\x00\x00\x00\x00\x00\x80') The text contains a sequence of non-negative integer values separated by the `separator` text. Bit positions are folded modulo num_bits. This is often used to convert sparse fingerprints into a dense fingerprint. Note: if you have a list of bit positions as integer values then you probably want to use :func:`chemfp.bitops.byte_from_bitlist`. .. py:module:: chemfp.fps_io chemfp.fps_io module ==================== This module is part of the private API. Do not import it directly. The function :func:`chemfp.open` returns an FPSReader if the source is an FPS file. The function :func:`chemfp.open_fingerprint_writer` returns an FPSWriter if the destination is an FPS file. FPSReader --------- .. py:class:: FPSReader FPS file reader This class implements the :class:`chemfp.FingerprintReader` API. It is also its own context manager, which automatically closes the file when the manager exits. The public attributes are: .. py:attribute:: metadata a :class:`chemfp.Metadata` instance with information about the fingerprint type .. py:attribute:: location a :class:`chemfp.io.Location` instance with parser location and state information .. py:attribute:: closed True if the file is closed, else False The FPSReader.location only tracks the "lineno" variable. .. py:method:: __iter__() Iterate through the (id, fp) pairs .. py:method:: iter_arenas(arena_size=1000) Iterate through *arena_size* fingerprints at a time, as subarenas Iterate through *arena_size* fingerprints at a time, returned as :class:`chemfp.arena.FingerprintArena` instances. The arenas are in input order and not reordered by popcount. This method helps trade off between performance and memory use.
Working with arenas is often faster than processing one fingerprint at a time, but if the file is very large then you might run out of memory, or get bored while waiting to process all of the fingerprints before getting the first answer. If *arena_size* is None then this makes an iterator which returns a single arena containing all of the fingerprints. :param arena_size: The number of fingerprints to put into each arena. :type arena_size: positive integer, or None :returns: an iterator of :class:`chemfp.arena.FingerprintArena` instances .. py:method:: save(destination, format=None, level=None) Save the fingerprints to a given destination and format The output format is based on the *format*. If the format is None then the format depends on the *destination* file extension. If the extension isn't recognized then the fingerprints will be saved in "fps" format. If the output format is "fps", "fps.gz", or "fps.zst" then *destination* may be a filename, a file object, or None; None writes to stdout. If the output format is "fpb" then *destination* must be a filename or seekable file object. Chemfp cannot save to compressed FPB files. :param destination: the output destination :type destination: a filename, file object, or None :param format: the output format :type format: None, "fps", "fps.gz", "fps.zst", or "fpb" :param level: compression level when writing .gz or .zst files :type level: an integer, or "min", "default", or "max" for compressor-specific values :returns: None .. py:method:: get_fingerprint_type() Get the fingerprint type object based on the metadata's type field This uses ``self.metadata.type`` to get the fingerprint type string then calls :func:`chemfp.get_fingerprint_type` to get and return a :class:`chemfp.types.FingerprintType` instance. This will raise a TypeError if there is no metadata, and a ValueError if the type field was invalid or the fingerprint type isn't available. :returns: a :class:`chemfp.types.FingerprintType` .. 
py:method:: close() Close the file .. py:method:: count_tanimoto_hits_fp(query_fp, threshold=0.7) Count the fingerprints which are sufficiently similar to the query fingerprint Return the number of fingerprints in the reader which are at least *threshold* similar to the query fingerprint *query_fp*. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: integer count .. py:method:: count_tanimoto_hits_arena(queries, threshold=0.7) Count the fingerprints which are sufficiently similar to each query fingerprint Returns a list containing a count for each query fingerprint in the *queries* arena. The count is the number of fingerprints in the reader which are at least *threshold* similar to the query fingerprint. The order of results is the same as the order of the queries. :param queries: query fingerprints :type queries: a :class:`.FingerprintArena` :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: list of integer counts, one for each query .. py:method:: count_tversky_hits_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0) Count the fingerprints which are sufficiently similar to the query fingerprint Return the number of fingerprints in the reader which are at least *threshold* similar to the query fingerprint *query_fp*. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :type alpha: float between 0.0 and 100.0, inclusive :type beta: float between 0.0 and 100.0, inclusive :returns: integer count .. 
py:method:: threshold_tanimoto_search_fp(query_fp, threshold=0.7) Find the fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this reader which are at least *threshold* similar to the query fingerprint *query_fp*. The hits are returned as a :class:`.SearchResult`, in arbitrary order. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` .. py:method:: threshold_tanimoto_search_arena(queries, threshold=0.7) Find the fingerprints which are sufficiently similar to each of the query fingerprints For each fingerprint in the *queries* arena, find all of the fingerprints in this reader which are at least *threshold* similar. The hits are returned as a :class:`.SearchResults`, where the hits in each :class:`.SearchResult` are in arbitrary order. :param queries: query fingerprints :type queries: a :class:`.FingerprintArena` :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResults` .. py:method:: threshold_tversky_search_fp(query_fp, threshold=0.7, alpha=1.0, beta=1.0) Find the fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this reader which are at least *threshold* similar to the query fingerprint *query_fp*. The hits are returned as a :class:`.SearchResult`, in arbitrary order. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :type alpha: float between 0.0 and 100.0, inclusive :type beta: float between 0.0 and 100.0, inclusive :returns: a :class:`.SearchResult` .. 
py:method:: knearest_tanimoto_search_fp(query_fp, k=3, threshold=0.7) Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this reader which are at least *threshold* similar to the query fingerprint, and of those, select the top *k* hits. The hits are returned as a :class:`.SearchResult`, sorted from highest score to lowest. :param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResult` .. py:method:: knearest_tanimoto_search_arena(queries, k=3, threshold=0.7) Find the k-nearest fingerprints which are sufficiently similar to each of the query fingerprints For each fingerprint in the *queries* arena, find the fingerprints in this reader which are at least *threshold* similar to the query fingerprint, and of those, select the top *k* hits. The hits are returned as a :class:`.SearchResults`, where the hits in each :class:`.SearchResult` are sorted by similarity score. :param queries: query fingerprints :type queries: a :class:`.FingerprintArena` :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :returns: a :class:`.SearchResults` .. py:method:: knearest_tversky_search_fp(query_fp, k=3, threshold=0.7, alpha=1.0, beta=1.0) Find the k-nearest fingerprints which are sufficiently similar to the query fingerprint Find all of the fingerprints in this reader which are at least *threshold* similar to the query fingerprint, and of those, select the top *k* hits. The hits are returned as a :class:`.SearchResult`, sorted from highest score to lowest.
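As a rough illustration of what a k-nearest Tversky search computes, here is a minimal pure-Python sketch. It is not chemfp's implementation and does not use the chemfp API. The helper names ``tversky`` and ``knearest_tversky`` are hypothetical. The sketch assumes that *alpha* weights the bits found only in the query, that *beta* weights the bits found only in the target, and that two fingerprints with no bits set score 0.0. With ``alpha=1.0`` and ``beta=1.0`` the score is the same as the Tanimoto similarity.

```python
import heapq


def popcount(fp):
    """Count the on bits in a byte-string fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


def tversky(query_fp, target_fp, alpha=1.0, beta=1.0):
    """Tversky index of two equal-length byte-string fingerprints.

    Assumption: alpha weights query-only bits, beta weights target-only bits.
    """
    if len(query_fp) != len(target_fp):
        raise ValueError("fingerprints must have the same length")
    common = sum(bin(q & t).count("1") for q, t in zip(query_fp, target_fp))
    only_query = popcount(query_fp) - common
    only_target = popcount(target_fp) - common
    denominator = common + alpha * only_query + beta * only_target
    if denominator == 0:
        return 0.0  # assumption: nothing to compare scores 0.0
    return common / denominator


def knearest_tversky(query_fp, id_fp_pairs, k=3, threshold=0.7,
                     alpha=1.0, beta=1.0):
    """Keep the hits scoring at least threshold, then return the top k
    as (id, score) pairs sorted from highest score to lowest."""
    hits = []
    for id, fp in id_fp_pairs:
        score = tversky(query_fp, fp, alpha, beta)
        if score >= threshold:
            hits.append((id, score))
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


targets = [("a", b"\x0f"), ("b", b"\x07"), ("c", b"\xf0"), ("d", b"\x0f")]
print(knearest_tversky(b"\x0f", targets, k=2))
```

The threshold is applied before selecting the top *k*, so fewer than *k* hits may be returned.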
:param query_fp: query fingerprint :type query_fp: byte string :param threshold: minimum similarity threshold (default: 0.7) :type threshold: float between 0.0 and 1.0, inclusive :type alpha: float between 0.0 and 100.0, inclusive :type beta: float between 0.0 and 100.0, inclusive :returns: a :class:`.SearchResult` FPSWriter --------- .. py:class:: FPSWriter Write fingerprints in FPS format. This is a subclass of :class:`chemfp.FingerprintWriter`. Instances have the following attributes: * metadata - a :class:`chemfp.Metadata` instance * format - the string 'fps' * closed - False when the file is open, else True * location - a :class:`chemfp.io.Location` instance An FPSWriter is its own context manager, and will close the output file on context exit. The Location instance supports the "recno", "output_recno", and "lineno" properties. .. py:method:: write_fingerprint(id, fp) Write a single fingerprint record with the given id and fp :param string id: the record identifier :param bytes fp: the fingerprint .. py:method:: write_fingerprints(id_fp_pairs) Write a sequence of fingerprint records :param id_fp_pairs: An iterable of (id, fingerprint) pairs. .. py:method:: close() Close the writer This will set self.closed to True. chemfp.fpb_io module ==================== This module is part of the private API. Do not import it directly. The function :func:`chemfp.open_fingerprint_writer` returns an OrderedFPBWriter if the destination is an FPB file and *reorder* is True, or an InputOrderFPBWriter if *reorder* is False. .. py:module:: chemfp.fpb_io OrderedFPBWriter ---------------- .. py:class:: OrderedFPBWriter Fingerprint writer for FPB files where the fingerprints are reordered instead of kept in input order This is a subclass of :class:`chemfp.FingerprintWriter`. Instances have the following public attributes: .. py:attribute:: metadata a :class:`chemfp.Metadata` instance .. py:attribute:: format the string 'fpb' .. 
py:attribute:: closed False when the file is open, else True Other attributes (like "alignment", "include_hash", "include_popc", "max_spool_size", and "tmpdir") are undocumented and subject to change in the future. Let me know if they are useful. An OrderedFPBWriter is also its own context manager, and will close the writer on context exit. .. py:method:: write_fingerprint(id, fp) Write a single fingerprint record with the given id and fp to the destination :param string id: the record identifier :param bytes fp: the fingerprint .. py:method:: write_fingerprints(id_fp_iter) Write a sequence of (id, fingerprint) pairs to the destination :param id_fp_iter: An iterable of (id, fingerprint) pairs. .. py:method:: close() Close the output writer InputOrderFPBWriter ------------------- .. py:class:: InputOrderFPBWriter Fingerprint writer for FPB files which preserves the input fingerprint order This is a subclass of :class:`chemfp.FingerprintWriter`. Instances have the following public attributes: .. py:attribute:: metadata a :class:`chemfp.Metadata` instance .. py:attribute:: format the string 'fpb' .. py:attribute:: closed False when the file is open, else True Other attributes (like "alignment", "include_hash", "include_popc", "max_spool_size", and "tmpdir") are undocumented and subject to change in the future. Let me know if they are useful. An InputOrderFPBWriter is also its own context manager, and will close the writer on context exit. .. py:method:: write_fingerprint(id, fp) Write a single fingerprint record with the given id and fp to the destination :param string id: the record identifier :param bytes fp: the fingerprint .. py:method:: write_fingerprints(id_fp_iter) Write a sequence of (id, fingerprint) pairs to the destination :param id_fp_iter: An iterable of (id, fingerprint) pairs. .. py:method:: close() Close the output writer This will set self.closed to True. chemfp toolkit API ================== .. 
py:module:: chemfp.toolkit Open Babel, OEChem and RDKit have different ways to read and write molecules. The chemfp toolkit API is a common wrapper API for structure I/O. The chemfp functions work with native toolkit molecules; chemfp does not have a common molecule API. (For that, use `Cinfony `_.) While the API is the same across :mod:`.openbabel_toolkit`, :mod:`.openeye_toolkit`, :mod:`.rdkit_toolkit`, and the :mod:`.text_toolkit`, there are some differences in how they work. For example, each of the toolkits has its own set of reader and writer arguments. The details are available in the documentation, and this chapter acts as a pointer to the specific toolkit documentation. name ---- .. py:attribute:: name The string "openbabel", "openeye", "rdkit", or "text". [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] software -------- .. py:attribute:: software A string like "OpenBabel/2.4.1", "OEChem/20170208", "RDKit/2016.09.3" or "chemfp/3.1". [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] is_licensed =========== .. py:function:: is_licensed () [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Check if the toolkit is licensed. get_formats =========== .. py:function:: get_formats (include_unavailable=False) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Return a list of structure formats. get_input_formats ================= .. py:function:: get_input_formats () [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Return a list of input structure formats. get_output_formats ================== .. py:function:: get_output_formats () [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Return a list of output structure formats. get_format ========== .. 
py:function:: get_format (format) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get a named format. get_input_format ================ .. py:function:: get_input_format (format) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get a named input format. get_output_format ================= .. py:function:: get_output_format (format) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get a named output format. get_input_format_from_source ============================ .. py:function:: get_input_format_from_source (source=None, format=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get a format given an input source. get_output_format_from_destination ================================== .. py:function:: get_output_format_from_destination (destination=None, format=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get a format given an output destination. read_molecules ============== .. py:function:: read_molecules (source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Read molecules from a structure file. read_molecules_from_string ========================== .. py:function:: read_molecules_from_string (content, format, id_tag=None, reader_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Read molecules from structure data stored in a string. read_ids_and_molecules ====================== .. 
py:function:: read_ids_and_molecules (source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Read ids and molecules from a structure file. read_ids_and_molecules_from_string ================================== .. py:function:: read_ids_and_molecules_from_string (content, format, id_tag=None, reader_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Read ids and molecules from structure data stored in a string. make_id_and_molecule_parser =========================== .. py:function:: make_id_and_molecule_parser (format, id_tag=None, reader_args=None, errors="strict") [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Make a specialized function which returns the id and molecule given a structure record. parse_molecule ============== .. py:function:: parse_molecule (content, format, id_tag=None, reader_args=None, errors="strict") [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Parse a structure record into a molecule. parse_id_and_molecule ===================== .. py:function:: parse_id_and_molecule (content, format, id_tag=None, reader_args=None, errors="strict") [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Parse a structure record into an id and molecule. create_string ============= .. py:function:: create_string (mol, format, id=None, writer_args=None, errors="strict") [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Convert a molecule into a Unicode string containing a structure record. create_bytes ============ .. 
py:function:: create_bytes (mol, format, id=None, writer_args=None, errors="strict") [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Convert a molecule into a byte string containing a structure record. open_molecule_writer ==================== .. py:function:: open_molecule_writer (destination=None, format=None, writer_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Create an output molecule writer, for writing to a file. open_molecule_writer_to_string ============================== .. py:function:: open_molecule_writer_to_string (format, writer_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Create an output molecule writer, for writing to a Unicode string. open_molecule_writer_to_bytes ============================= .. py:function:: open_molecule_writer_to_bytes (format, writer_args=None, errors="strict", location=None) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Create an output molecule writer, for writing to a byte string. copy_molecule ============= .. py:function:: copy_molecule (mol) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Make a copy of a toolkit molecule. add_tag ======= .. py:function:: add_tag (mol, tag, value) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Add an SD tag to the molecule. get_tag ======= .. py:function:: get_tag (mol, tag) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get an SD tag for a molecule. get_tag_pairs ============= .. 
py:function:: get_tag_pairs (mol) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get the list of tag name and tag value pairs. get_id ====== .. py:function:: get_id (mol) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Get the molecule id. set_id ====== .. py:function:: set_id (mol, id) [:ref:`openbabel_toolkit `] [:ref:`openeye_toolkit `] [:ref:`rdkit_toolkit `] [:ref:`text_toolkit `] Set the molecule id. .. py:module:: chemfp.base_toolkit chemfp.base_toolkit =================== The chemfp.base_toolkit module contains a few objects which are shared by the different toolkits. There should be no reason for you to import the module yourself. molecule I/O file metadata -------------------------- The ``metadata`` attribute of the toolkit readers and writers is a FormatMetadata instance. It contains information about the structure file. Note that this is **not** the same as the fingerprint :class:`chemfp.Metadata` instance, which contains information about the fingerprint file. FormatMetadata -------------- .. py:class:: FormatMetadata Information about the reader or writer The public attributes are: .. py:attribute:: filename the source or destination filename, the string "" for string-based I/O, or None if not known .. py:attribute:: record_format the normalized record format name. All SMILES formats are "smi", and this does not contain compression information .. py:attribute:: args the final reader_args or writer_args, after all processing, and as used by the reader and writer .. py:method:: __repr__() Return a string like 'FormatMeta(filename="cmpds.sdf.gz", record_format="sdf", args={})' Toolkit readers =============== The toolkit readers read from structure files. There are several different variations, depending on the function used to read the file. All of the readers are subclasses of :class:`chemfp.base_toolkit.BaseMoleculeReader`.
================================================================ ================================================ Function Returned reader ================================================================ ================================================ :func:`chemfp.toolkit.read_molecules` :class:`chemfp.base_toolkit.MoleculeReader` :func:`chemfp.toolkit.read_molecules_from_string` :class:`chemfp.base_toolkit.MoleculeReader` :func:`chemfp.toolkit.read_ids_and_molecules` :class:`chemfp.base_toolkit.IdAndMoleculeReader` :func:`chemfp.toolkit.read_ids_and_molecules_from_string` :class:`chemfp.base_toolkit.IdAndMoleculeReader` :func:`chemfp.text_toolkit.read_sdf_records` :class:`chemfp.base_toolkit.RecordReader` :func:`chemfp.text_toolkit.read_sdf_records_from_string` :class:`chemfp.base_toolkit.RecordReader` :func:`chemfp.text_toolkit.read_sdf_ids_and_records` :class:`chemfp.base_toolkit.IdAndRecordReader` :func:`chemfp.text_toolkit.read_sdf_ids_and_records_from_string` :class:`chemfp.base_toolkit.IdAndRecordReader` :func:`chemfp.text_toolkit.read_sdf_ids_and_values` :class:`chemfp.base_toolkit.IdAndRecordReader` :func:`chemfp.text_toolkit.read_sdf_ids_and_values_from_string` :class:`chemfp.base_toolkit.IdAndRecordReader` ================================================================ ================================================ All of the readers have the same API. The major difference is that some readers return a single object during iteration while the others (those with an "And" in the name) return a pair of objects. BaseMoleculeReader ------------------ .. py:class:: BaseMoleculeReader Base class for the toolkit readers The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the reader is open, otherwise True Readers are iterators, so iter(reader) returns itself. 
next(reader) returns either a single object or a pair of objects depending on the reader. Readers are also context managers, and call self.close() during exit. .. py:method:: close() Close the reader If the reader wasn't previously closed then close it. This will set the location properties to their final values, close any files that the reader may have opened, and set ``self.closed`` to True. .. py:class:: MoleculeReader Read structures from a file and iterate over the toolkit molecules The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the reader is open, otherwise True Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time. .. py:class:: IdAndMoleculeReader Read structures from a file and iterate over the (id, toolkit molecule) pairs The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the reader is open, otherwise True Note: the toolkit implementation is free to reuse a molecule instead of returning a new one each time. .. py:class:: RecordReader Read and iterate over records as strings The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the reader is open, otherwise True .. py:class:: IdAndRecordReader Read records from file and iterate over the (id, record string) pairs The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. 
py:attribute:: closed False if the reader is open, otherwise True Toolkit writers =============== The :func:`chemfp.open_molecule_writer` function returns a :class:`chemfp.base_toolkit.MoleculeWriter`, and :func:`chemfp.open_molecule_writer_to_string` returns a :class:`chemfp.base_toolkit.MoleculeStringWriter`. The two classes implement the :class:`chemfp.base_toolkit.BaseMoleculeWriter` API, and MoleculeStringWriter also implements getvalue(). BaseMoleculeWriter ------------------ .. py:class:: BaseMoleculeWriter The base molecule writer API, implemented by :class:`MoleculeWriter` and :class:`MoleculeStringWriter` The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the writer is open, otherwise True The writer is a context manager, which calls self.close() when the manager exits. .. py:method:: write_molecule(mol) Write a toolkit molecule :param mol: the molecule to write :type mol: a toolkit molecule .. py:method:: write_molecules(mols) Write a sequence of molecules :param mols: the molecules to write :type mols: a toolkit molecule iterator .. py:method:: write_id_and_molecule(id, mol) Write an identifier and toolkit molecule If id is None then the output uses the molecule's own id/title. Specifying the id may modify the molecule's id/title, depending on the format and toolkit. :param id: the identifier to use for the molecule :type id: string, or None :param mol: the molecule to write :type mol: a toolkit molecule .. py:method:: write_ids_and_molecules(ids_and_mols) Write a sequence of (id, molecule) pairs This function works well with :func:`chemfp.toolkit.read_ids_and_molecules()`, for example, to convert an SD file to a SMILES file, and use an alternate *id_tag* to specify an alternative identifier. :param ids_and_mols: the ids and molecules to write :type ids_and_mols: a (id string, toolkit molecule) iterator .. 
py:method:: close() Close the writer If the writer wasn't previously closed then close it. This will set the location properties to their final values, close any files that the writer may have opened, and set ``self.closed`` to True. .. py:class:: MoleculeWriter A BaseMoleculeWriter which writes molecules to a file. The public attributes are: .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the writer is open, otherwise True The writer is a context manager, which calls self.close() when the manager exits. .. py:class:: MoleculeStringWriter A BaseMoleculeWriter which writes molecules to a string. This class implements the :class:`chemfp.base_toolkit.BaseMoleculeWriter` API. .. py:attribute:: metadata a :class:`chemfp.base_toolkit.FormatMetadata` instance .. py:attribute:: location a :class:`chemfp.io.Location` instance .. py:attribute:: closed False if the writer is open, otherwise True The writer is a context manager, which calls self.close() when the manager exits. .. py:method:: getvalue() Get the string containing all of the written records. This function can also be called after the writer is closed. :returns: a string Format ------ .. py:class:: Format Information about a toolkit format. Use :func:`chemfp.toolkit.get_format` and related functions to return a Format instance. The public properties are: .. py:attribute:: toolkit_name the toolkit name; either "rdkit", "openeye", or "openbabel" .. py:attribute:: name the format name, without any compression information .. py:attribute:: compression the compression type: "" for uncompressed, "gz" for gzip .. py:attribute:: record_format the normalized record format name. All SMILES formats are "smi", and this does not contain compression information .. py:method:: __repr__() Return a string like 'Format("openeye/sdf.gz")' .. py:attribute:: Format.prefix Read-only attribute.
Return the prefix to turn an unqualified parameter into a fully qualified parameter :returns: a string like "rdkit.smi" or "openbabel.sdf" .. py:attribute:: Format.is_input_format Read-only attribute. Return True if this toolkit can read molecules in this format .. py:attribute:: Format.is_output_format Read-only attribute. Return True if this toolkit can write molecules in this format .. py:attribute:: Format.is_available Read-only attribute. Return True if this version of the toolkit understands this format For example, if your version of RDKit does not support InChI then this would return False for the "inchi" and "inchikey" formats. .. py:attribute:: Format.supports_io Read-only attribute. Return True if this format supports reading or writing records This will return False for formats like "smistring" and "inchikeystring" because those are not record-based formats. Note: I don't like this name. I may change it to ``is_record_format``. Let me know if you have ideas, or if changing the name will be a problem. .. py:method:: get_reader_args_from_text_settings(reader_settings) Process the *reader_settings* and return the *reader_args* for this format. This function exists to help convert string settings, eg, from the command-line or a configuration, into usable *reader_args*. Setting names may be fully-qualified names like "rdkit.sdf.sanitize", partially qualified names like "rdkit.*.sanitize" or "openeye.smi.delimiter", or unqualified names like "delimiter". The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format. The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects.
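The name handling described above can be sketched in plain Python. This is only an illustration, not chemfp's code: the function ``resolve_text_settings`` is hypothetical, the rule that a more specific qualifier overrides a less specific one is an assumption, and the value conversion is reduced to turning "true" and "false" into booleans, where the real method converts each argument according to its own type.

```python
def convert(text):
    """Assumption: only 'true'/'false' are converted; other values stay strings."""
    return {"true": True, "false": False}.get(text.lower(), text)


def resolve_text_settings(settings, toolkit, fmt):
    """Hypothetical sketch: map qualified setting names to unqualified ones.

    Recognized forms are "<toolkit>.<fmt>.<name>", "<toolkit>.*.<name>",
    and the unqualified "<name>". Settings qualified for another toolkit
    or format are ignored.
    """
    # Most specific prefix first (assumed priority order).
    prefixes = [f"{toolkit}.{fmt}.", f"{toolkit}.*."]
    best = {}  # unqualified name -> (rank, text value)
    for key, text in settings.items():
        if "." not in key:
            rank, name = 0, key
        else:
            for i, prefix in enumerate(prefixes):
                if key.startswith(prefix):
                    rank, name = len(prefixes) - i, key[len(prefix):]
                    break
            else:
                continue  # qualified for some other toolkit or format
        if name not in best or rank > best[name][0]:
            best[name] = (rank, text)
    return {name: convert(text) for name, (rank, text) in best.items()}


print(resolve_text_settings(
    {"rdkit.*.sanitize": "true", "delimiter": "to-eol"}, "rdkit", "smi"))
```

The actual method is called on a Format instance, which already knows its toolkit and format names.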
For example: >>> from chemfp import rdkit_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_reader_args_from_text_settings({"rdkit.*.sanitize": "true", "delimiter": "to-eol"}) {'delimiter': 'to-eol', 'sanitize': True} :param reader_settings: the reader settings :type reader_settings: a dictionary with string keys and values :returns: a dictionary of unqualified argument names as keys and processed Python values as values .. py:method:: get_writer_args_from_text_settings(writer_settings) Process *writer_settings* and return the *writer_args* for this format. This function exists to help convert string settings, eg, from the command-line or a configuration, into usable *writer_args*. Setting names may be fully-qualified names like "rdkit.sdf.kekulize", partially qualified names like "rdkit.*.delimiter" or "openeye.smi.delimiter", or unqualified names like "delimiter". The qualifiers act as a namespace so the settings can be specified without needing to know the actual toolkit or format. The function turns the format-appropriate qualified names into unqualified ones and converts the string values into usable Python objects. For example: >>> from chemfp import rdkit_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_writer_args_from_text_settings({"rdkit.*.kekuleSmiles": "true", "canonical": "false"}) {'kekuleSmiles': True, 'canonical': False} :param writer_settings: the writer settings :type writer_settings: a dictionary with string keys and values :returns: a dictionary of unqualified argument names as keys and processed Python values as values .. py:method:: get_default_reader_args() Return a dictionary of the default reader arguments The keys are unqualified (ie, without dots). >>> from chemfp import openbabel_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_default_reader_args() {'has_header': False, 'delimiter': None, 'options': None} :returns: a dictionary of string keys and Python objects for values .. 
py:method:: get_default_writer_args() Return a dictionary of the default writer arguments The keys are unqualified (i.e., without dots). >>> from chemfp import openbabel_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_default_writer_args() {'explicit_hydrogens': False, 'isomeric': True, 'delimiter': None, 'options': None, 'canonicalization': 'default'} :returns: a dictionary of string keys and Python objects for values .. py:method:: get_unqualified_reader_args(reader_args) Convert possibly qualified reader args into unqualified reader args for this format The *reader_args* dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored. The get_unqualified_reader_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified reader args dictionary for this format. >>> from chemfp import rdkit_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"}) {'delimiter': 'tab', 'has_header': False, 'sanitize': False} >>> fmt = T.get_format("can") >>> fmt.get_unqualified_reader_args({"rdkit.*.delimiter": "tab", "smi.sanitize": False, "X": "Y"}) {'delimiter': 'tab', 'has_header': False, 'sanitize': True} :param reader_args: reader arguments, which can contain qualified and unqualified arguments :type reader_args: a dictionary with string keys and Python values :returns: a dictionary of reader arguments, containing only unqualified arguments appropriate for this format. .. py:method:: get_unqualified_writer_args(writer_args) Convert possibly qualified writer args into unqualified writer args for this format The *writer_args* dictionary can be confusing because of the priority rules in how to resolve qualifiers, and because it can include irrelevant parameters, which are ignored.
The get_unqualified_writer_args function applies the qualifier resolution algorithm and removes irrelevant parameters to return a dictionary containing the equivalent unqualified writer args dictionary for this format. >>> from chemfp import rdkit_toolkit as T >>> fmt = T.get_format("smi") >>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"}) {'isomericSmiles': True, 'delimiter': 'tab', 'kekuleSmiles': True, 'allBondsExplicit': False, 'canonical': True} >>> fmt = T.get_format("can") >>> fmt.get_unqualified_writer_args({"rdkit.*.delimiter": "tab", "smi.kekuleSmiles": True, "X": "Y"}) {'isomericSmiles': False, 'delimiter': 'tab', 'kekuleSmiles': False, 'allBondsExplicit': False, 'canonical': True} :param writer_args: writer arguments, which can contain qualified and unqualified arguments :type writer_args: a dictionary with string keys and Python values :returns: a dictionary of writer arguments, containing only unqualified arguments appropriate for this format. .. py:module:: chemfp.openbabel_toolkit chemfp.openbabel_toolkit module =============================== The chemfp toolkit layer for Open Babel. .. _openbabel_toolkit.name: name ---- .. py:attribute:: name The string "openbabel". .. _openbabel_toolkit.software: software -------- .. py:attribute:: software A string like "OpenBabel/2.4.1", where the second part of the string comes from OBReleaseVersion. .. _openbabel_toolkit.is_licensed: is_licensed (openbabel_toolkit) ------------------------------- .. py:function:: is_licensed() Return True - Open Babel is always licensed :returns: True .. _openbabel_toolkit.get_formats: get_formats (openbabel_toolkit) ------------------------------- .. py:function:: get_formats(include_unavailable=False) Get the list of structure formats that Open Babel supports If *include_unavailable* is True then also include Open Babel formats which aren't available to this specific version of Open Babel.
:param include_unavailable: include unavailable formats? :type include_unavailable: True or False :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openbabel_toolkit.get_input_formats: get_input_formats (openbabel_toolkit) ------------------------------------- .. py:function:: get_input_formats() Get the list of supported Open Babel input formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openbabel_toolkit.get_output_formats: get_output_formats (openbabel_toolkit) -------------------------------------- .. py:function:: get_output_formats() Get the list of supported Open Babel output formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openbabel_toolkit.get_format: get_format (openbabel_toolkit) ------------------------------ .. py:function:: get_format(format_name) Get the named format, or raise a ValueError This will raise a ValueError if Open Babel does not implement the format *format_name* or that format is not available. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openbabel_toolkit.get_input_format: get_input_format (openbabel_toolkit) ------------------------------------ .. py:function:: get_input_format(format_name) Get the named input format, or raise a ValueError This will raise a ValueError if Open Babel does not implement the format *format_name* or that format is not an input format. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openbabel_toolkit.get_output_format: get_output_format (openbabel_toolkit) ------------------------------------- .. py:function:: get_output_format(format_name) Get the named format, or raise a ValueError This will raise a ValueError if Open Babel does not implement the format *format_name* or that format is not an output format. 
:param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openbabel_toolkit.get_input_format_from_source: get_input_format_from_source (openbabel_toolkit) ------------------------------------------------ .. py:function:: get_input_format_from_source(source=None, format=None) Get the most appropriate format given the available source and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *source* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param source: the structure data source. :type source: a filename (as a string), a file object, or None to read from stdin :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _openbabel_toolkit.get_output_format_from_destination: get_output_format_from_destination (openbabel_toolkit) ------------------------------------------------------ .. py:function:: get_output_format_from_destination(destination=None, format=None) Get the most appropriate format given the available destination and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *destination* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param destination: the structure data destination.
:type destination: a filename (as a string), a file object, or None to write to stdout :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _openbabel_toolkit.read_molecules: read_molecules (openbabel_toolkit) ---------------------------------- .. py:function:: read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads OBMol molecules from a structure file Iterate through the *format* structure records in *source*. If *format* is None then auto-detect the format based on the *source*. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the *id_tag*. It exists to make it easier to switch between reader functions.) Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around. The *reader_args* dictionary parameters depend on the format. Every Open Babel format supports an "options" entry, which is passed to SetOptions(). See the Open Babel documentation for details. Some formats support additional parameters: * SMILES and InChI * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * has_header - True or False * SDF * implementation - if "openbabel" or None, use the Open Babel record parser; if "chemfp", use chemfp's own record parser, which has better location tracking The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. See :func:`chemfp.openbabel_toolkit.read_ids_and_molecules` if you want (id, OBMol) pairs instead of just the molecules.
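The three *errors* policies described above can be illustrated without any toolkit installed. The following is a minimal sketch, assuming a hypothetical ``parse_records`` helper and a ``parse`` callable that raises ValueError on a bad record; it is not how chemfp itself is implemented.

```python
import sys


def parse_records(records, parse, errors="strict"):
    """Yield parse(record) for each record, applying an error policy.

    Hypothetical helper: it mimics the "strict"/"report"/"ignore"
    behavior described in the text and is not part of chemfp.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            result = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                sys.stderr.write("record %d: %s\n" % (recno, err))
            continue  # "report" and "ignore" both go to the next record
        yield result


# int() stands in for a record parser; "x" is the bad record.
good_values = list(parse_records(["1", "x", "3"], int, errors="ignore"))
```

With ``errors="ignore"`` the bad record is skipped silently and ``good_values`` is ``[1, 3]``; with ``errors="strict"`` the same input raises ValueError at the second record.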
:param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating OBMol molecules .. _openbabel_toolkit.read_molecules_from_string: read_molecules_from_string (openbabel_toolkit) ---------------------------------------------- .. py:function:: read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads OBMol molecules from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.openbabel_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openbabel_toolkit.read_ids_and_molecules_from_string` if you want to read (id, OBMol) pairs instead of just molecules. Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around. 
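The warning about the reader reusing its molecule instance is easy to overlook, so here is a toolkit-free sketch of the problem. The ``Mol`` class and ``reusing_reader`` generator are hypothetical stand-ins, not Open Babel or chemfp objects.

```python
import copy


class Mol:
    """Stand-in for a toolkit molecule; hypothetical, not an OBMol."""

    def __init__(self):
        self.title = ""


def reusing_reader(titles):
    """Yield the same Mol instance each time, overwriting its contents."""
    mol = Mol()
    for title in titles:
        mol.title = title  # earlier references now see the new value
        yield mol


# Keeping the yielded objects directly stores one object three times.
kept = list(reusing_reader(["a", "b", "c"]))
stale_titles = [m.title for m in kept]

# Copying each object as it is read preserves the per-record state.
copies = [copy.copy(m) for m in reusing_reader(["a", "b", "c"])]
copied_titles = [m.title for m in copies]
```

``stale_titles`` ends up as ``["c", "c", "c"]`` while ``copied_titles`` is ``["a", "b", "c"]``. With real Open Babel molecules, use :func:`chemfp.openbabel_toolkit.copy_molecule` rather than ``copy.copy``.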
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating OBMol molecules .. _openbabel_toolkit.read_ids_and_molecules: read_ids_and_molecules (openbabel_toolkit) ------------------------------------------ .. py:function:: read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads (id, OBMol molecule) pairs from a structure file See :func:`chemfp.openbabel_toolkit.read_molecules` for full parameter details. The major difference is that this returns an iterator of (id, OBMol) pairs instead of just the molecules. Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around. 
:param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, OBMol) pairs .. _openbabel_toolkit.read_ids_and_molecules_from_string: read_ids_and_molecules_from_string (openbabel_toolkit) ------------------------------------------------------ .. py:function:: read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads (id, OBMol) pairs from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.openbabel_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openbabel_toolkit.read_molecules_from_string` if you just want to read the OBMol molecules instead of (id, OBMol) pairs. Note: the reader will clear and reuse the OBMol instance. Make a copy if you want to keep the molecule around. 
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, OBMol) pairs .. _openbabel_toolkit.make_id_and_molecule_parser: make_id_and_molecule_parser (openbabel_toolkit) ----------------------------------------------- .. py:function:: make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict") Create a specialized function which takes a record and returns an (id, OBMol) pair The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OBMol for successive calls, so make a copy if you want to keep it around. However, I haven't really noticed much of a performance difference between this and :func:`chemfp.openbabel_toolkit.parse_id_and_molecule` so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.) See :func:`chemfp.openbabel_toolkit.read_molecules` for details about the other parameters. 
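The "validate once, then parse many records" pattern mentioned above is an ordinary closure. This is a minimal sketch under stated assumptions: ``make_parser`` is hypothetical, it only understands a toy SMILES record of the form ``<smiles> <id>``, and it returns the SMILES string where the real function would return an OBMol.

```python
def make_parser(format):
    """Validate the arguments once and return a specialized parse function.

    Hypothetical sketch; only the toy "smi" format is supported.
    """
    # Parameter validation happens here, one time only.
    if format != "smi":
        raise ValueError("unsupported format: %r" % (format,))

    def parser(record):
        # No validation in the per-record path.
        fields = record.split(None, 1)
        if not fields:
            raise ValueError("empty record")
        smiles = fields[0]
        id = fields[1].rstrip("\n") if len(fields) > 1 else None
        return id, smiles

    return parser


parser = make_parser("smi")
pairs = [parser(line) for line in ["CCO ethanol\n", "C methane\n", "O\n"]]
```

Here ``pairs`` is ``[("ethanol", "CCO"), ("methane", "C"), (None, "O")]``. A bad *format* is reported when the parser is created, not on every call.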
:param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a function of the form ``parser(record string) -> (id, OBMol)`` .. _openbabel_toolkit.parse_molecule: parse_molecule (openbabel_toolkit) ---------------------------------- .. py:function:: parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from the *content* string and return an OBMol molecule. *content* is a string containing a single structure record in format *format*. (Additional records are ignored). See :func:`chemfp.openbabel_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openbabel_toolkit.parse_id_and_molecule` if you want the (id, OBMol) pair instead of just the molecule. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an OBMol molecule .. _openbabel_toolkit.parse_id_and_molecule: parse_id_and_molecule (openbabel_toolkit) ----------------------------------------- .. py:function:: parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from *content* and return the (id, OBMol) pair. *content* is a string containing a single structure record in format *format*. 
(Additional records are ignored.) See :func:`chemfp.openbabel_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openbabel_toolkit.parse_molecule` if you just want the OBMol molecule and not the (id, OBMol) pair. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an (id, OBMol molecule) pair .. _openbabel_toolkit.create_string: create_string (openbabel_toolkit) --------------------------------- .. py:function:: create_string(mol, format, id=None, writer_args=None, errors="strict") Convert an OBMol into a structure record in the given format as a Unicode string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so may not be thread-safe. :param mol: the molecule to use for the output :type mol: an Open Babel molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a Unicode string .. _openbabel_toolkit.create_bytes: create_bytes (openbabel_toolkit) -------------------------------- ..
py:function:: create_bytes(mol, format, id=None, writer_args=None, errors="strict", level=None) Convert an OBMol into a structure record in the given format as a byte string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so may not be thread-safe. :param mol: the molecule to use for the output :type mol: an Open Babel molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a byte string .. _openbabel_toolkit.open_molecule_writer: open_molecule_writer (openbabel_toolkit) ---------------------------------------- .. py:function:: open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", level=None) Return a MoleculeWriter which can write Open Babel molecules to a destination. A :class:`chemfp.base_toolkit.MoleculeWriter` has the methods ``write_molecule``, ``write_molecules``, and ``write_ids_and_molecules``, which are ways to write an OBMol molecule, an OBMol molecule iterator, or an (id, OBMol molecule) pair iterator to a file. Molecules are written to *destination*. The output format can be a string like "sdf.gz" or "smi", a :class:`chemfp.base_toolkit.Format`, or Format-like object with "name" and "compression" attributes, or None to auto-detect based on the *destination*. If auto-detection is not possible, the output will be written as uncompressed SMILES. The *writer_args* dictionary parameters depend on the format. 
Every format supports an ``options`` entry, which is passed to Open Babel's ``SetOptions()``. See the Open Babel documentation for details. Some formats support additional parameters: * SMILES * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * isomeric - True to write isomeric SMILES, False or default is non-isomeric * canonicalization - True, "default", or None uses Open Babel's own canonicalization algorithm; False or "none" to use no canonicalization; "universal" generates a universal SMILES; "anticanonical" generates a SMILES with randomly assigned atom classes; "inchified" uses InChI-fied SMILES * InChI and InChIKey * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * include_id - True or default to include the id as the second column; False has no id column * SDF * always_v3000 - True to always write V3000 files; False or default to write V3000 files only if needed. * include_atom_class - True to include atom class; False or default does not * include_hcount - True to include hcount; False or default does not The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created.
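The format auto-detection described for *destination* (a name such as "sdf" plus an optional compression such as "gz", falling back to uncompressed SMILES) can be sketched as follows. Everything here is an assumption for illustration: ``FormatLike`` and ``guess_format`` are hypothetical names, the extension tables are abbreviated, and representing "no compression" as an empty string is a guess, not chemfp's documented behavior.

```python
import collections
import os

# Hypothetical stand-in for a Format-like object with "name" and
# "compression" attributes; it is not chemfp's Format class.
FormatLike = collections.namedtuple("FormatLike", "name compression")

# Assumed, abbreviated extension tables for illustration only.
_COMPRESSIONS = {".gz": "gz", ".zst": "zst"}
_FORMATS = {".smi": "smi", ".sdf": "sdf", ".inchi": "inchi"}


def guess_format(destination):
    """Guess (name, compression) from a filename, else uncompressed SMILES."""
    default = FormatLike("smi", "")
    if not isinstance(destination, str):
        return default  # e.g. None (stdout) gives nothing to inspect
    base, ext = os.path.splitext(destination.lower())
    compression = _COMPRESSIONS.get(ext, "")
    if compression:
        # Strip the compression suffix and look at the next extension.
        base, ext = os.path.splitext(base)
    name = _FORMATS.get(ext)
    if name is None:
        return default  # auto-detection is not possible
    return FormatLike(name, compression)


detected = guess_format("results.sdf.gz")
fallback = guess_format(None)
```

``detected`` is ``FormatLike(name="sdf", compression="gz")`` and ``fallback`` is the uncompressed SMILES default. Passing an explicit *format* avoids relying on the filename at all.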
:param destination: the structure destination :type destination: a filename, file object, or None to write to stdout :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats (does not affect Open Babel) :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeWriter` expecting Open Babel molecules .. _openbabel_toolkit.open_molecule_writer_to_string: open_molecule_writer_to_string (openbabel_toolkit) -------------------------------------------------- .. py:function:: open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None) Return a MoleculeStringWriter which can write Open Babel molecule records to a string. See :func:`chemfp.openbabel_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a Unicode string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting Open Babel molecules .. 
_openbabel_toolkit.open_molecule_writer_to_bytes: open_molecule_writer_to_bytes (openbabel_toolkit) ------------------------------------------------- .. py:function:: open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None, level=None) Return a MoleculeStringWriter which can write Open Babel molecule records to a byte string See :func:`chemfp.openbabel_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a byte string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats (does not affect Open Babel) :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting Open Babel molecules .. _openbabel_toolkit.copy_molecule: copy_molecule (openbabel_toolkit) --------------------------------- .. py:function:: copy_molecule(mol) Return a new OBMol molecule which is a copy of the given Open Babel molecule :param mol: the molecule to copy :type mol: an Open Babel molecule :returns: a new OBMol instance .. _openbabel_toolkit.add_tag: add_tag (openbabel_toolkit) --------------------------- .. py:function:: add_tag(mol, tag, value) Add an SD tag value to the Open Babel molecule Raises a KeyError if the tag is a special internal Open Babel name. 
:param mol: the molecule :type mol: an Open Babel molecule :param tag: the SD tag name :type tag: string :param value: the text for the tag :type value: string :returns: None .. _openbabel_toolkit.get_tag: get_tag (openbabel_toolkit) --------------------------- .. py:function:: get_tag(mol, tag) Get the named SD tag value, or None if it doesn't exist :param mol: the molecule :type mol: an Open Babel molecule :param tag: the SD tag name :type tag: string :returns: a string, or None .. _openbabel_toolkit.get_tag_pairs: get_tag_pairs (openbabel_toolkit) --------------------------------- .. py:function:: get_tag_pairs(mol) Get a list of all SD tag (name, value) pairs for the molecule :param mol: the molecule :type mol: an Open Babel molecule :returns: a list of (string name, string value) pairs .. _openbabel_toolkit.get_id: get_id (openbabel_toolkit) -------------------------- .. py:function:: get_id(mol) Get the molecule's id using Open Babel's GetTitle() :param mol: the molecule :type mol: an Open Babel molecule :returns: a string .. _openbabel_toolkit.set_id: set_id (openbabel_toolkit) -------------------------- .. py:function:: set_id(mol, id) Set the molecule's id using Open Babel's SetTitle() :param mol: the molecule :type mol: an Open Babel molecule :param id: the new id :type id: string :returns: None .. py:module:: chemfp.openeye_toolkit chemfp.openeye_toolkit module ============================= The chemfp toolkit layer for OpenEye. .. _openeye_toolkit.name: name ---- .. py:attribute:: name The string "openeye". .. _openeye_toolkit.software: software -------- .. py:attribute:: software A string like "OEChem/20170208", where the second part of the string comes from OEChemGetVersion(). .. _openeye_toolkit.is_licensed: is_licensed (openeye_toolkit) ----------------------------- .. py:function:: is_licensed() Return True if the OEChem toolkit license is valid, otherwise False. This does not check if the OEGraphSim license is valid. 
I haven't yet figured out how I want to handle that distinction. In the meanwhile you'll need to use the OEChem API yourself. :returns: True or False .. _openeye_toolkit.get_formats: get_formats (openeye_toolkit) ----------------------------- .. py:function:: get_formats(include_unavailable=False) Get the list of structure formats that OEChem supports If *include_unavailable* is True then also include OEChem formats which aren't available to this specific version of OEChem. :param include_unavailable: include unavailable formats? :type include_unavailable: True or False :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openeye_toolkit.get_input_formats: get_input_formats (openeye_toolkit) ----------------------------------- .. py:function:: get_input_formats() Get the list of supported OEChem input formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openeye_toolkit.get_output_formats: get_output_formats (openeye_toolkit) ------------------------------------ .. py:function:: get_output_formats() Get the list of supported OEChem output formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _openeye_toolkit.get_format: get_format (openeye_toolkit) ---------------------------- .. py:function:: get_format(format) Get the named format, or raise a ValueError This will raise a ValueError if OEChem does not implement the format *format_name* or that format is not available. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openeye_toolkit.get_input_format: get_input_format (openeye_toolkit) ---------------------------------- .. py:function:: get_input_format(format) Get the named input format, or raise a ValueError This will raise a ValueError if OEChem does not implement the format *format_name* or that format is not an input format. 
:param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openeye_toolkit.get_output_format: get_output_format (openeye_toolkit) ----------------------------------- .. py:function:: get_output_format(format) Get the named format, or raise a ValueError This will raise a ValueError if OEChem does not implement the format *format_name* or that format is not an output format. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _openeye_toolkit.get_input_format_from_source: get_input_format_from_source (openeye_toolkit) ---------------------------------------------- .. py:function:: get_input_format_from_source(source=None, format=None) Get the most appropriate format given the available source and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *source* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param source: the structure data source. :type source: a filename (as a string), a file object, or None to read from stdin :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _openeye_toolkit.get_output_format_from_destination: get_output_format_from_destination (openeye_toolkit) ---------------------------------------------------- .. py:function:: get_output_format_from_destination(destination=None, format=None) Get the most appropriate format given the available destination and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. 
If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *destination* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param destination: the structure data destination. :type destination: a filename (as a string), a file object, or None to write to stdout :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _openeye_toolkit.read_molecules: read_molecules (openeye_toolkit) -------------------------------- .. py:function:: read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads OEGraphMol molecules from a structure file Iterate through the *format* structure records in *source*. If *format* is None then auto-detect the format based on the *source*. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the *id_tag*. It exists to make it easier to switch between reader functions.) Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around. The *reader_args* dictionary parameters depend on the format. Every OEChem format supports: * aromaticity - one of "default", "openeye", "daylight", "tripos", "mdl", "mmff", or None * flavor - a number, string-encoded number, or flavor string A "flavor string" is a "|" or "," separated list of format-specific flavor terms. It can be as simple as "Default", or a more complex string like "Default|-ENDM|DELPHI" which for the PDB reader starts with the default settings, removes the ENDM flavor, and adds the CHARGE and RADIUS flavors.
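As a rough illustration of how a flavor string breaks down into terms, here is a minimal sketch in plain Python. The helper ``split_flavor_terms`` is hypothetical and is not part of chemfp or OEChem. It only assumes what the text above describes: terms are separated by "|" or ",", and a leading "-" marks a term to remove.

```python
import re

def split_flavor_terms(flavor):
    """Split a flavor string into (added, removed) lists of term names.

    Hypothetical helper for illustration only; this is not chemfp's parser.
    """
    added, removed = [], []
    # Terms are separated by "|" or ","
    for term in re.split(r"[|,]", flavor):
        term = term.strip()
        if not term:
            continue
        if term.startswith("-"):
            # A leading "-" means "remove this flavor term"
            removed.append(term[1:])
        else:
            added.append(term)
    return added, removed

added, removed = split_flavor_terms("Default|-ENDM|DELPHI")
print(added)    # ['Default', 'DELPHI']
print(removed)  # ['ENDM']
```

The sketch only separates the terms. How each term maps to an OEChem flavor bit, and whether term names are case-sensitive, is left to the toolkit.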
The supported input flavor terms for each format are: * SMILES - Canon, Strict, Default * sdf - Default * skc - Default * mol2, mol2h - M2H, Default * mmod - FormalCrg, Default * pdb - ALL, ALTLOC, BondOrder, CHARGE, Connect, DATA, DELPHI, END, ENDM, FORMALCHARGE, FormalCrg, ImplicitH, RADIUS, Rings, SecStruct, TER, TerMask, Default * xyz - BondOrder, Connect, FormalCrg, ImplicitH, Rings, Default * cdx - SuperAtoms, Default * oeb - Default You can also pass in a numeric value like 123 or a numeric string like "0". In addition, the SMILES record readers have limited support for the "delimiter" reader_arg: * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None Note: the first whitespace after the SMILES string will always be treated as a delimiter. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. See :func:`chemfp.openeye_toolkit.read_ids_and_molecules` if you want (id, OEGraphMol) pairs instead of just the molecules. :param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating OEGraphMol molecules .. 
_openeye_toolkit.read_molecules_from_string: read_molecules_from_string (openeye_toolkit) -------------------------------------------- .. py:function:: read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads molecules from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.openeye_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openeye_toolkit.read_ids_and_molecules_from_string` if you want to read (id, OEGraphMol) pairs instead of just molecules. Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around. :param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating OEGraphMol molecules .. _openeye_toolkit.read_ids_and_molecules: read_ids_and_molecules (openeye_toolkit) ---------------------------------------- .. py:function:: read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads (id, OEGraphMol molecule) pairs from a structure file See :func:`chemfp.openeye_toolkit.read_molecules` for full parameter details. 
The major difference is that this returns an iterator of (id, OEGraphMol) pairs instead of just the molecules. Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around. :param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, OEGraphMol) pairs .. _openeye_toolkit.read_ids_and_molecules_from_string: read_ids_and_molecules_from_string (openeye_toolkit) ---------------------------------------------------- .. py:function:: read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads (id, OEGraphMol) pairs from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.openeye_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openeye_toolkit.read_molecules_from_string` if you just want to read the OEGraphMol molecules instead of (id, OEGraphMol) pairs. Note: the reader will clear and reuse the OEGraphMol instance. Make a copy if you want to keep the molecule around. 
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, OEGraphMol) pairs .. _openeye_toolkit.make_id_and_molecule_parser: make_id_and_molecule_parser (openeye_toolkit) --------------------------------------------- .. py:function:: make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict") Create a specialized function which takes a record and returns an (id, OEGraphMol) pair The returned function is optimized for reading many records from individual strings because it only does parameter validation once. The function will reuse the OEGraphMol for successive calls, so make a copy if you want to keep it around. However, I haven't really noticed much of a performance difference between this and :func:`chemfp.openeye_toolkit.parse_id_and_molecule` so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.) See :func:`chemfp.openeye_toolkit.read_molecules` for details about the other parameters. 
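The "validate once, then reuse" idea behind a specialized parser can be sketched in plain Python. Everything in this example is hypothetical: ``make_line_parser`` is a stand-in that splits a simple "SMILES, delimiter, id" line and does not call OEChem or chemfp. It only shows that the arguments are checked when the parser is created, not on every call.

```python
def make_line_parser(delimiter="tab", errors="strict"):
    """Return a function mapping a "SMILES<delimiter>id" line to (id, smiles).

    Hypothetical stand-in for a toolkit parser; the arguments are checked
    once here rather than on every call.
    """
    separators = {"tab": "\t", "space": " "}
    if delimiter not in separators:
        raise ValueError("unsupported delimiter: %r" % (delimiter,))
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    sep = separators[delimiter]

    def parser(record):
        # No argument validation here; only the per-record work remains.
        smiles, _, rest = record.rstrip("\n").partition(sep)
        return rest, smiles

    return parser

parse = make_line_parser(delimiter="space")
print(parse("CCO ethanol"))  # ('ethanol', 'CCO')
```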
:param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a function of the form ``parser(record string) -> (id, OEGraphMol)`` .. _openeye_toolkit.parse_molecule: parse_molecule (openeye_toolkit) -------------------------------- .. py:function:: parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from the *content* string and return an OEGraphMol molecule. *content* is a string containing a single structure record in format *format*. (Additional records are ignored). See :func:`chemfp.openeye_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openeye_toolkit.parse_id_and_molecule` if you want the (id, OEGraphMol) pair instead of just the molecule. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an OEGraphMol molecule .. _openeye_toolkit.parse_id_and_molecule: parse_id_and_molecule (openeye_toolkit) --------------------------------------- .. py:function:: parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from *content* and return the (id, OEGraphMol) pair. *content* is a string containing a single structure record in format *format*. 
(Additional records are ignored.) See :func:`chemfp.openeye_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.openeye_toolkit.parse_molecule` if you just want the OEGraphMol molecule and not the (id, OEGraphMol) pair. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an (id, OEGraphMol molecule) pair .. _openeye_toolkit.create_string: create_string (openeye_toolkit) ------------------------------- .. py:function:: create_string(mol, format, id=None, writer_args=None, errors="strict") Convert an OEChem molecule into a structure record in the given format as a Unicode string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so may not be thread-safe. :param mol: the molecule to use for the output :type mol: an OEChem molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a string .. _openeye_toolkit.create_bytes: create_bytes (openeye_toolkit) ------------------------------ .. 
py:function:: create_bytes(mol, format, id=None, writer_args=None, errors="strict", level=None) Convert an OEChem molecule into a structure record in the given format as a byte string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so may not be thread-safe. :param mol: the molecule to use for the output :type mol: an OEChem molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a byte string .. _openeye_toolkit.open_molecule_writer: open_molecule_writer (openeye_toolkit) -------------------------------------- .. py:function:: open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", level=None) Return a MoleculeWriter which can write OEChem molecules to a destination. A :class:`chemfp.base_toolkit.MoleculeWriter` has the methods ``write_molecule``, ``write_molecules``, and ``write_ids_and_molecules``, which are ways to write an OEChem molecule, an OEChem molecule iterator, or an (id, OEChem molecule) pair iterator to a file. Molecules are written to *destination*. The output format can be a string like "sdf.gz" or "smi", a :class:`chemfp.base_toolkit.Format`, or Format-like object with "name" and "compression" attributes, or None to auto-detect based on the *destination*. If auto-detection is not possible, the output will be written as uncompressed SMILES. The *writer_args* dictionary parameters depend on the format.
Every OEChem format supports: * aromaticity - one of "default", "openeye", "daylight", "tripos", "mdl", "mmff", or None * flavor - a number, string-encoded number, or flavor string A "flavor string" is a "|" or "," separated list of format-specific flavor terms. It can be as simple as "Default", or a more complex string like "DEFAULT|-AtomStereo|-BondStereo|Canonical" to generate a canonical SMILES string without stereo information. The supported output flavor terms for each format are: * SMILES - AtomMaps, AtomStereo, BondStereo, Canonical, ExtBonds, Hydrogens, ImpHCount, Isotopes, Kekule, RGroups, SuperAtoms * sdf - CurrentParity, MCHG, MDLParity, MISO, MRGP, MV30, NoParity, Default * mol2, mol2h - AtomNames, AtomTypeNames, BondTypeNames, Hydrogens, OrderAtoms, Substructure, Default * sln - Default * pdb - BONDS, BOTH, CHARGE, CurrentResidues, DELPHI, ELEMENT, FORMALCHARGE, FormalCrg, HETBONDS, NoResidues, OEResidues, ORDERS, OrderAtoms, RADIUS, TER, Default * xyz - Charges, Symbols, Default * cdx - Default * mopac - CHARGES, XYZ, Default * mf - Title, Default * oeb - Default * inchi, inchikey - Chiral, FixedHLayer, Hydrogens, ReconnectedMetals, Stereo, RelativeStereo, RacemicStereo, Default You can also pass in a numeric value like 123 or a numeric string like "0". The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created.
:param destination: the structure destination :type destination: a filename, file object, or None to write to stdout :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer parameters passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats (does not affect OEChem) :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeWriter` expecting OEChem molecules .. _openeye_toolkit.open_molecule_writer_to_string: open_molecule_writer_to_string (openeye_toolkit) ------------------------------------------------ .. py:function:: open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None) Return a MoleculeStringWriter which can write OEChem molecule records to a Unicode string. See :func:`chemfp.openeye_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output string as a Unicode string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting OEChem molecules .. 
_openeye_toolkit.open_molecule_writer_to_bytes: open_molecule_writer_to_bytes (openeye_toolkit) ----------------------------------------------- .. py:function:: open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None, level=None) Return a MoleculeStringWriter which can write OEChem molecule records to a byte string. See :func:`chemfp.openeye_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a byte string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats (does not affect OEChem) :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting OEChem molecules .. _openeye_toolkit.copy_molecule: copy_molecule (openeye_toolkit) ------------------------------- .. py:function:: copy_molecule(mol) Return a new OEGraphMol which is a copy of the given OEChem molecule :param mol: the molecule to copy :type mol: an OEChem molecule :returns: a new OEGraphMol instance .. _openeye_toolkit.add_tag: add_tag (openeye_toolkit) ------------------------- .. py:function:: add_tag(mol, tag, value) Add an SD tag value to the OEChem molecule :param mol: the molecule :type mol: an OEChem molecule :param tag: the SD tag name :type tag: string :param value: the text for the tag :type value: string :returns: None .. _openeye_toolkit.get_tag: get_tag (openeye_toolkit) ------------------------- .. 
py:function:: get_tag(mol, tag) Get the named SD tag value, or None if it doesn't exist :param mol: the molecule :type mol: an OEChem molecule :param tag: the SD tag name :type tag: string :returns: a string, or None .. _openeye_toolkit.get_tag_pairs: get_tag_pairs (openeye_toolkit) ------------------------------- .. py:function:: get_tag_pairs(mol) Get a list of all SD tag (name, value) pairs for the molecule :param mol: the molecule :type mol: an OEChem molecule :returns: a list of (string name, string value) pairs .. _openeye_toolkit.get_id: get_id (openeye_toolkit) ------------------------ .. py:function:: get_id(mol) Get the molecule's id using OEChem's GetTitle() :param mol: the molecule :type mol: an OEChem molecule :returns: a string .. _openeye_toolkit.set_id: set_id (openeye_toolkit) ------------------------ .. py:function:: set_id(mol, id) Set the molecule's id using OEChem's SetTitle() :param mol: the molecule :type mol: an OEChem molecule :param id: the new id :type id: string :returns: None .. py:module:: chemfp.rdkit_toolkit chemfp.rdkit_toolkit module =========================== The chemfp toolkit layer for RDKit. .. _rdkit_toolkit.name: name ---- .. py:attribute:: name The string "rdkit". .. _rdkit_toolkit.software: software -------- .. py:attribute:: software A string like "RDKit/2016.09.3", where the second part of the string comes from rdkit.rdBase.rdkitVersion. .. _rdkit_toolkit.is_licensed: is_licensed (rdkit_toolkit) --------------------------- .. py:function:: is_licensed() Return True - RDKit is always licensed :returns: True .. _rdkit_toolkit.get_formats: get_formats (rdkit_toolkit) --------------------------- .. py:function:: get_formats(include_unavailable=False) Get the list of structure formats that RDKit supports If *include_unavailable* is True then also include RDKit formats which aren't available to this specific version of RDKit, such as the InChI formats if your RDKit installation wasn't compiled with InChI support. 
:param include_unavailable: include unavailable formats? :type include_unavailable: True or False :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _rdkit_toolkit.get_input_formats: get_input_formats (rdkit_toolkit) --------------------------------- .. py:function:: get_input_formats() Get the list of supported RDKit input formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _rdkit_toolkit.get_output_formats: get_output_formats (rdkit_toolkit) ---------------------------------- .. py:function:: get_output_formats() Get the list of supported RDKit output formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _rdkit_toolkit.get_format: get_format (rdkit_toolkit) -------------------------- .. py:function:: get_format(format) Get the named format, or raise a ValueError This will raise a ValueError if RDKit does not implement the format *format_name* or that format is not available. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _rdkit_toolkit.get_input_format: get_input_format (rdkit_toolkit) -------------------------------- .. py:function:: get_input_format(format) Get the named input format, or raise a ValueError This will raise a ValueError if RDKit does not implement the format *format_name* or that format is not an input format. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _rdkit_toolkit.get_output_format: get_output_format (rdkit_toolkit) --------------------------------- .. py:function:: get_output_format(format) Get the named output format, or raise a ValueError This will raise a ValueError if RDKit does not implement the format *format_name* or that format is not an output format. :param format_name: the format name :type format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. 
_rdkit_toolkit.get_input_format_from_source: get_input_format_from_source (rdkit_toolkit) -------------------------------------------- .. py:function:: get_input_format_from_source(source=None, format=None) Get the most appropriate format given the available source and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *source* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param source: the structure data source. :type source: a filename (as a string), a file object, or None to read from stdin :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _rdkit_toolkit.get_output_format_from_destination: get_output_format_from_destination (rdkit_toolkit) -------------------------------------------------- .. py:function:: get_output_format_from_destination(destination=None, format=None) Get the most appropriate format given the available destination and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *destination* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param destination: the structure data destination. :type destination: a filename (as a string), a file object, or None to write to stdout :param format: format information, if known. :type format: a Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. 
_rdkit_toolkit.read_molecules: read_molecules (rdkit_toolkit) ------------------------------ .. py:function:: read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads RDKit molecules from a structure file Iterate through the *format* structure records in *source*. If *format* is None then auto-detect the format based on the *source*. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the *id_tag*. It exists to make it easier to switch between reader functions.) Note: the reader returns a new RDKit molecule each time. The *reader_args* dictionary parameters depend on the format. These include: * SMILES * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * has_header - True or False * sanitize - True or default sanitizes; False for unsanitized processing * InChI * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * sanitize - True or default sanitizes; False for unsanitized processing * removeHs - True or default removes explicit hydrogens; False leaves them in the structure * logLevel - an integer log level * treatWarningAsError - True raises an exception on error; False or default keeps processing * SDF * sanitize - True or default sanitizes; False for unsanitized processing * removeHs - True or default removes explicit hydrogens; False leaves them in the structure * strictParsing - True or default for strict parsing; False for lenient parsing The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. 
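The three *errors* policies can be pictured with a small plain-Python sketch. The ``parse_records`` function is hypothetical, and it uses ``int`` as a stand-in record parser so it runs without RDKit. It is not how chemfp implements error handling, only an illustration of the behavior described above.

```python
import sys

def parse_records(records, parse, errors="strict"):
    """Yield parse(record) for each record, applying an error policy.

    Hypothetical illustration of "strict", "report", and "ignore".
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    for recno, record in enumerate(records, 1):
        try:
            result = parse(record)
        except ValueError as err:
            if errors == "strict":
                raise  # stop at the first bad record
            if errors == "report":
                sys.stderr.write("ERROR: record %d: %s. Skipping.\n" % (recno, err))
            continue  # "report" and "ignore" both go to the next record
        yield result

# int() stands in for a structure parser; "x" is the bad record.
values = list(parse_records(["1", "x", "3"], int, errors="ignore"))
print(values)  # [1, 3]
```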
See :func:`chemfp.rdkit_toolkit.read_ids_and_molecules` if you want (id, molecule) pairs instead of just the molecules. :param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating RDKit molecules .. _rdkit_toolkit.read_molecules_from_string: read_molecules_from_string (rdkit_toolkit) ------------------------------------------ .. py:function:: read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads RDKit molecules from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.rdkit_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.rdkit_toolkit.read_ids_and_molecules_from_string` if you want to read (id, RDKit) pairs instead of just molecules. 
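Here is a minimal usage sketch, assuming chemfp and RDKit are both installed. The SMILES records are made-up sample data, and the import is guarded so the snippet does nothing if the toolkit is unavailable. The "smi" format name and the ``delimiter`` and ``sanitize`` reader arguments are the ones described for :func:`chemfp.rdkit_toolkit.read_molecules`.

```python
# Two made-up SMILES records, each with a space-delimited id
content = "c1ccccc1O phenol\nCCO ethanol\n"
reader_args = {"delimiter": "space", "sanitize": True}

try:
    from chemfp import rdkit_toolkit
except ImportError:
    rdkit_toolkit = None  # chemfp or RDKit is not installed

if rdkit_toolkit is not None:
    reader = rdkit_toolkit.read_molecules_from_string(
        content, "smi", reader_args=reader_args, errors="ignore")
    for mol in reader:
        # Each item is a new RDKit molecule
        print(mol.GetNumAtoms())
```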
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating RDKit molecules .. _rdkit_toolkit.read_ids_and_molecules: read_ids_and_molecules (rdkit_toolkit) -------------------------------------- .. py:function:: read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads (id, RDKit molecule) pairs from a structure file See :func:`chemfp.rdkit_toolkit.read_molecules` for full parameter details. The major difference is that this returns an iterator of (id, RDKit molecule) pairs instead of just the molecules. :param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, RDKit molecule) pairs .. 
_rdkit_toolkit.read_ids_and_molecules_from_string: read_ids_and_molecules_from_string (rdkit_toolkit) -------------------------------------------------- .. py:function:: read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads (id, RDKit molecule) pairs from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.rdkit_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.rdkit_toolkit.read_molecules_from_string` if you just want to read the RDKit molecules instead of (id, molecule) pairs. :param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, RDKit molecule) pairs .. _rdkit_toolkit.make_id_and_molecule_parser: make_id_and_molecule_parser (rdkit_toolkit) ------------------------------------------- .. py:function:: make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict") Create a specialized function which takes a record and returns an (id, RDKit molecule) pair The returned function is optimized for reading many records from individual strings because it only does parameter validation once. 
However, I haven't really noticed much of a performance difference between this and :func:`chemfp.rdkit_toolkit.parse_id_and_molecule`, so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.) See :func:`chemfp.rdkit_toolkit.read_molecules` for details about the other parameters. :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a function of the form ``parser(record string) -> (id, RDKit molecule)`` .. _rdkit_toolkit.parse_molecule: parse_molecule (rdkit_toolkit) ------------------------------ .. py:function:: parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from the *content* string and return an RDKit molecule. *content* is a string containing a single structure record in format *format*. (Additional records are ignored.) See :func:`chemfp.rdkit_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.rdkit_toolkit.parse_id_and_molecule` if you want the (id, RDKit molecule) pair instead of just the molecule. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an RDKit molecule ..
_rdkit_toolkit.parse_id_and_molecule: parse_id_and_molecule (rdkit_toolkit) ------------------------------------- .. py:function:: parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from *content* and return the (id, RDKit molecule) pair. *content* is a string containing a single structure record in format *format*. (Additional records are ignored.) See :func:`chemfp.rdkit_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.rdkit_toolkit.parse_molecule` if you just want the RDKit molecule and not the (id, RDKit molecule) pair. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: an (id, RDKit molecule) pair .. _rdkit_toolkit.create_string: create_string (rdkit_toolkit) ----------------------------- .. py:function:: create_string(mol, format, id=None, writer_args=None, errors="strict") Convert an RDKit molecule into a structure record in the given format as a Unicode string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so it may not be thread-safe.
:param mol: the molecule to use for the output :type mol: an RDKit molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a Unicode string .. _rdkit_toolkit.create_bytes: create_bytes (rdkit_toolkit) ---------------------------- .. py:function:: create_bytes(mol, format, id=None, writer_args=None, errors="strict", level=None) Convert an RDKit molecule into a structure record in the given format as a byte string If *id* is not None then use it instead of the molecule's own title. Warning: this may briefly modify the molecule, so may not be thread-safe. :param mol: the molecule to use for the output :type mol: an RDKit molecule :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a byte string .. _rdkit_toolkit.open_molecule_writer: open_molecule_writer (rdkit_toolkit) ------------------------------------ .. py:function:: open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", level=None) Return a MoleculeWriter which can write RDKit molecules to a destination. 
A :class:`chemfp.base_toolkit.MoleculeWriter` has the methods ``write_molecule``, ``write_molecules``, and ``write_ids_and_molecules``, which are ways to write an RDKit molecule, an RDKit molecule iterator, or an (id, RDKit molecule) pair iterator to a file.

Molecules are written to *destination*. The output format can be a string like "sdf.gz" or "smi", a :class:`chemfp.base_toolkit.Format`, or Format-like object with "name" and "compression" attributes, or None to auto-detect based on the *destination*. If auto-detection is not possible, the output will be written as uncompressed SMILES.

The *writer_args* dictionary parameters depend on the format. These include:

* SMILES

  * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None
  * isomericSmiles - True to generate isomeric SMILES
  * kekuleSmiles - True to generate SMILES in Kekule form
  * canonical - True to generate a canonical SMILES
  * allBondsExplicit - True to write explicit '-' and ':' bonds, even if they can be inferred; default is False
  * allHsExplicit - True to write explicit hydrogen counts; default is False
  * cxsmiles - True to include CXSMILES annotations; default is False

* InChI and InChIKey

  * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None
  * include_id - True or default to include the id as the second column; False has no id column
  * options - an options string passed to the underlying InChI library
  * logLevel - an integer log level
  * treatWarningAsError - True raises an exception on error; False or default keeps processing

* SDF

  * includeStereo - True to include stereo information; False or default does not
  * kekulize - True or default creates the connection table with bonds in Kekule form
  * v3k - True to always export in V3000 format

The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record.
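The three *errors* settings can be pictured with a small standard-library sketch. It is only an illustration of the policy described above and is not chemfp code: ``process_all`` and ``parse_int`` are hypothetical names, and a failing ``int()`` call stands in for a record the toolkit cannot handle.

```python
import sys

def process_all(records, process, errors="strict"):
    """Apply *process* to each record using the documented error policy.

    Illustration only: `process_all` is a hypothetical helper, not part of chemfp.
    """
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))
    results = []
    for recno, record in enumerate(records, 1):
        try:
            results.append(process(record))
        except ValueError as err:
            if errors == "strict":
                raise  # "strict": stop at the first failure
            if errors == "report":
                # "report": describe the problem on stderr, then continue
                sys.stderr.write("ERROR: record %d: %s. Skipping.\n" % (recno, err))
            # "ignore": continue with the next record without any message
    return results

def parse_int(text):
    # Stand-in for real record handling; raises ValueError on bad input.
    return int(text)

ignored = process_all(["1", "x", "3"], parse_int, errors="ignore")
reported = process_all(["1", "x", "3"], parse_int, errors="report")
```

With "ignore" and "report" the bad second record is skipped and processing continues; with "strict" the same input raises the ``ValueError``.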
The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. :param destination: the structure destination :type destination: a filename, file object, or None to write to stdout :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer parameters passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeWriter` expecting RDKit molecules .. _rdkit_toolkit.open_molecule_writer_to_string: open_molecule_writer_to_string (rdkit_toolkit) ---------------------------------------------- .. py:function:: open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None) Return a MoleculeStringWriter which can write molecule records in the given format to a string. See :func:`chemfp.rdkit_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a Unicode string. 
:param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting RDKit molecules .. _rdkit_toolkit.open_molecule_writer_to_bytes: open_molecule_writer_to_bytes (rdkit_toolkit) --------------------------------------------- .. py:function:: open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None, level=None) Return a MoleculeStringWriter which can write molecule records in the given format to a byte string. See :func:`chemfp.rdkit_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a byte string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting RDKit molecules .. _rdkit_toolkit.copy_molecule: copy_molecule (rdkit_toolkit) ----------------------------- ..
py:function:: copy_molecule(mol) Return a new RDKit molecule which is a copy of the given molecule :param mol: the molecule to copy :type mol: an RDKit molecule :returns: a new RDKit Mol instance .. _rdkit_toolkit.add_tag: add_tag (rdkit_toolkit) ----------------------- .. py:function:: add_tag(mol, tag, value) Add an SD tag value to the RDKit molecule :param mol: the molecule :type mol: an RDKit molecule :param tag: the SD tag name :type tag: string :param value: the text for the tag :type value: string :returns: None .. _rdkit_toolkit.get_tag: get_tag (rdkit_toolkit) ----------------------- .. py:function:: get_tag(mol, tag) Get the named SD tag value, or None if it doesn't exist :param mol: the molecule :type mol: an RDKit molecule :param tag: the SD tag name :type tag: string :returns: a string, or None .. _rdkit_toolkit.get_tag_pairs: get_tag_pairs (rdkit_toolkit) ----------------------------- .. py:function:: get_tag_pairs(mol) Get a list of all SD tag (name, value) pairs for the molecule :param mol: the molecule :type mol: an RDKit molecule :returns: a list of (string name, string value) pairs .. _rdkit_toolkit.get_id: get_id (rdkit_toolkit) ---------------------- .. py:function:: get_id(mol) Get the molecule's id from RDKit's _Name property :param mol: the molecule :type mol: an RDKit molecule :returns: a string .. _rdkit_toolkit.set_id: set_id (rdkit_toolkit) ---------------------- .. py:function:: set_id(mol, id) Set the molecule's id as RDKit's _Name property :param mol: the molecule :type mol: an RDKit molecule :param id: the new id :type id: string :returns: None .. py:module:: chemfp.text_toolkit chemfp.text_toolkit module ========================== The text_toolkit implements the chemfp toolkit API but where the "molecules" are simple TextRecord instances which store the records as text strings. It does not use a back-end chemistry toolkit, and it cannot convert between different chemistry representations. The TextRecord is a base class. 
The actual records depend on the format, and will be one of: * :class:`.SDFRecord` * :class:`.SmiRecord` * :class:`.CanRecord` * :class:`.UsmRecord` * :class:`.SmiStringRecord` * :class:`.CanStringRecord` * :class:`.UsmStringRecord` The text toolkit will let you "convert" between the different SMILES formats, but it doesn't actually change the SMILES string. The SMILES records have the attributes ``id``, ``record`` and ``smiles``. The toolkit also knows a bit about the SD format. The SDF records have the attributes ``id``, ``id_bytes`` and ``record``, and there are methods to get SD tag values and add a tag to the end of the tag data block. The text_toolkit also supports a few SDF-specific I/O functions to read SDF records directly as a string instead of wrapped in a TextRecord. The record types also have the attributes ``encoding`` and ``encoding_errors`` which affect how the record bytes are parsed. .. _text_toolkit.name: name ---- .. py:attribute:: name The string "text" .. _text_toolkit.software: software -------- .. py:attribute:: software A string like "chemfp/3.0". .. _text_toolkit.is_licensed: is_licensed (text_toolkit) -------------------------- .. py:function:: is_licensed() Return True - chemfp's text toolkit is always licensed :returns: True .. _text_toolkit.get_formats: get_formats (text_toolkit) -------------------------- .. py:function:: get_formats(include_unavailable=False) Get the list of structure formats that chemfp's text toolkit supports This version of chemfp will always support the structure formats available to chemfp so 'include_unavailable' does not affect anything. (It may affect other toolkits.) :param include_unavailable: include unavailable formats? :value include_unavailable: True or False :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _text_toolkit.get_input_formats: get_input_formats (text_toolkit) -------------------------------- .. 
py:function:: get_input_formats() Get the list of supported chemfp text toolkit input formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _text_toolkit.get_output_formats: get_output_formats (text_toolkit) --------------------------------- .. py:function:: get_output_formats() Get the list of supported chemfp text toolkit output formats :returns: a list of :class:`chemfp.base_toolkit.Format` objects .. _text_toolkit.get_format: get_format (text_toolkit) ------------------------- .. py:function:: get_format(format_name) Get the named format, or raise a ValueError This will raise a ValueError for unknown format names. :param format_name: the format name :value format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _text_toolkit.get_input_format: get_input_format (text_toolkit) ------------------------------- .. py:function:: get_input_format(format_name) Get the named input format, or raise a ValueError This will raise a ValueError for unknown format names or if that format is not an input format. :param format_name: the format name :value format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _text_toolkit.get_output_format: get_output_format (text_toolkit) -------------------------------- .. py:function:: get_output_format(format_name) Get the named output format, or raise a ValueError This will raise a ValueError for unknown format names or if that format is not an output format. :param format_name: the format name :value format_name: a string :returns: a :class:`chemfp.base_toolkit.Format` object .. _text_toolkit.get_input_format_from_source: get_input_format_from_source (text_toolkit) ------------------------------------------- .. py:function:: get_input_format_from_source(source=None, format=None) Get the most appropriate format given the available source and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it.
If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *source* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param source: The structure data source. :type source: A filename (as a string), a file object, or None to read from stdin :param format: Format information, if known. :type format: A Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _text_toolkit.get_output_format_from_destination: get_output_format_from_destination (text_toolkit) ------------------------------------------------- .. py:function:: get_output_format_from_destination(destination=None, format=None) Get the most appropriate format given the available destination and format information If *format* is a :class:`chemfp.base_toolkit.Format` then return it. If it's a Format-like object with "name" and "compression" attributes use it to make a real Format object with the same attributes. If it's a string then use it to create a Format object. If *format* is None, use the *destination* to auto-detect the format. If auto-detection is not possible, assume it's an uncompressed SMILES file. :param destination: The structure data destination. :type destination: A filename (as a string), a file object, or None to write to stdout :param format: Format information, if known. :type format: A Format(-like) object, string, or None :returns: a :class:`chemfp.base_toolkit.Format` object .. _text_toolkit.read_molecules: read_molecules (text_toolkit) ----------------------------- ..
py:function:: read_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads TextRecord instances from a structure file Iterate through the *format* structure records in *source*. If *format* is None then auto-detect the format based on the *source*. For SD files, use *id_tag* to get the record id from the given SD tag instead of the title line. (read_molecules() will ignore the *id_tag*. It exists to make it easier to switch between reader functions.) Only the SMILES formats use the *reader_args* dictionary. The supported parameters are: * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None * has_header - True or False The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. See :func:`.read_ids_and_molecules` if you want (id, :class:`.TextRecord`) pairs instead of just the molecules. 
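As a rough picture of what the ``delimiter`` and ``has_header`` reader arguments control, here is a standard-library sketch that splits SMILES lines into (id, SMILES) pairs. It is not chemfp's parser. The function names are hypothetical, and the delimiter semantics are assumptions made for the sketch: "to-eol" takes everything after the SMILES as the id, "tab" and "space" take the second delimited field, and None is treated like "to-eol".

```python
def split_smiles_line(line, delimiter=None):
    """Return an (id, smiles) pair for one SMILES line.

    Hypothetical helper; the delimiter semantics are assumptions, see above.
    """
    line = line.rstrip("\r\n")
    if delimiter in (None, "to-eol"):
        # SMILES is the first whitespace-delimited term; the id is the rest.
        fields = line.split(None, 1)
    elif delimiter in ("tab", "\t"):
        fields = line.split("\t")
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    else:
        raise ValueError("unsupported delimiter: %r" % (delimiter,))
    if not fields:
        raise ValueError("missing SMILES")
    smiles = fields[0]
    id = fields[1] if len(fields) > 1 else None
    return id, smiles

def read_smiles_lines(lines, delimiter=None, has_header=False):
    """Yield (id, smiles) pairs, skipping the first line if has_header is True."""
    it = iter(lines)
    if has_header:
        next(it, None)  # discard the header line
    for line in it:
        if line.strip():  # this sketch skips blank lines
            yield split_smiles_line(line, delimiter)

lines = ["SMILES Name\n", "CCO ethyl alcohol\n", "c1ccccc1 benzene\n"]
pairs = list(read_smiles_lines(lines, delimiter="to-eol", has_header=True))
```

Under these assumptions the same line gives the id ``ethyl alcohol`` with "to-eol" but only ``ethyl`` with "space", which is why the delimiter choice matters for ids that contain spaces.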
:param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader parameters passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating :class:`.TextRecord` molecules .. _text_toolkit.read_molecules_from_string: read_molecules_from_string (text_toolkit) ----------------------------------------- .. py:function:: read_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads TextRecord instances from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`.read_molecules` for details about the other parameters. See :func:`.read_ids_and_molecules_from_string` if you want to read (id, :class:`.TextRecord`) pairs instead of just molecules. 
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: a :class:`chemfp.base_toolkit.MoleculeReader` iterating :class:`.TextRecord` molecules .. _text_toolkit.read_ids_and_molecules: read_ids_and_molecules (text_toolkit) ------------------------------------- .. py:function:: read_ids_and_molecules(source=None, format=None, id_tag=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict") Return an iterator that reads (id, TextRecord) pairs from a structure file See :func:`chemfp.text_toolkit.read_molecules` for full parameter details. The major difference is that this returns an iterator of (id, :class:`.TextRecord`) pairs instead of just the molecules. 
:param source: the structure source :type source: a filename, file object, or None to read from stdin :param format: the input structure format :type format: a format name string, or Format object, or None to auto-detect :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: a :class:`chemfp.text_toolkit.IdAndMoleculeReader` iterating (id, :class:`.TextRecord`) pairs .. _text_toolkit.read_ids_and_molecules_from_string: read_ids_and_molecules_from_string (text_toolkit) ------------------------------------------------- .. py:function:: read_ids_and_molecules_from_string(content, format, id_tag=None, reader_args=None, errors="strict", location=None) Return an iterator that reads (id, TextRecord) pairs from a string containing structure records *content* is a string containing 0 or more records in the format *format*. See :func:`chemfp.text_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.text_toolkit.read_molecules_from_string` if you just want to read the :class:`.TextRecord` molecules instead of (id, TextRecord) pairs.
:param content: the string containing structure records :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: a :class:`chemfp.base_toolkit.IdAndMoleculeReader` iterating (id, :class:`.TextRecord`) pairs .. _text_toolkit.make_id_and_molecule_parser: make_id_and_molecule_parser (text_toolkit) ------------------------------------------ .. py:function:: make_id_and_molecule_parser(format, id_tag=None, reader_args=None, errors="strict") Create a specialized function which takes a record and returns an (id, TextRecord) pair The returned function is optimized for reading many records from individual strings because it only does parameter validation once. However, I haven't really noticed much of a performance difference between this and :func:`chemfp.text_toolkit.parse_id_and_molecule` so I suggest you use that function directly instead of making a specialized function. (Let me know if making a specialized function is useful.) See :func:`chemfp.text_toolkit.read_molecules` for details about the other parameters. The specific :class:`.TextRecord` subclass returned depends on the format. 
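The "validate once, parse many times" idea behind a specialized parser can be sketched with a closure. This is a standard-library illustration of the pattern, not chemfp's implementation: ``make_parser`` is a hypothetical name, only SMILES-style records are handled, and the returned "molecule" is simply the SMILES string.

```python
def make_parser(format, errors="strict"):
    """Return a function mapping one record string to an (id, smiles) pair.

    Hypothetical sketch of the factory pattern; not a chemfp function.
    """
    # Parameter validation happens here, once, when the parser is created.
    if format not in ("smi", "can", "usm"):
        raise ValueError("unsupported format: %r" % (format,))
    if errors not in ("strict", "report", "ignore"):
        raise ValueError("unsupported errors value: %r" % (errors,))

    def parser(record):
        # The per-record work does no validation of format or errors.
        fields = record.strip().split(None, 1)
        if not fields:
            if errors == "strict":
                raise ValueError("empty record")
            return None, None
        smiles = fields[0]
        id = fields[1] if len(fields) > 1 else None
        return id, smiles

    return parser

parse = make_parser("smi")
results = [parse(record) for record in ("CCO ethanol\n", "C methane\n")]
```

The saving is limited to the checks done in the outer function, which is consistent with the small performance difference mentioned above.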
:param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a function of the form ``parser(record string) -> (id, text_record)`` .. _text_toolkit.parse_molecule: parse_molecule (text_toolkit) ----------------------------- .. py:function:: parse_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from the *content* string and return a TextRecord. *content* is a string containing a single structure record in format *format*. (Additional records are ignored). See :func:`chemfp.text_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.text_toolkit.parse_id_and_molecule` if you want the (id, :class:`.TextRecord`) pair instead of just the text record. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: a :class:`.TextRecord` .. _text_toolkit.parse_id_and_molecule: parse_id_and_molecule (text_toolkit) ------------------------------------ .. 
py:function:: parse_id_and_molecule(content, format, id_tag=None, reader_args=None, errors="strict") Parse the first structure record from *content* and return the (id, TextRecord) pair. *content* is a string containing a single structure record in format *format*. (Additional records are ignored.) See :func:`chemfp.text_toolkit.read_molecules` for details about the other parameters. See :func:`chemfp.text_toolkit.parse_molecule` if you just want the :class:`.TextRecord` and not the (id, TextRecord) pair. :param content: the string containing a structure record :type content: a string :param format: the input structure format :type format: a format name string, or Format object :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: reader arguments passed to the underlying toolkit :type reader_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :returns: an (id, :class:`.TextRecord` molecule) pair .. _text_toolkit.create_string: create_string (text_toolkit) ---------------------------- .. py:function:: create_string(mol, format, id=None, writer_args=None, errors="strict") Convert a TextRecord into a structure record in the given format as a Unicode string If *id* is not None then use it instead of the molecule's own id.
:param mol: the molecule to use for the output :type mol: a :class:`.TextRecord` :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :returns: a Unicode string .. _text_toolkit.create_bytes: create_bytes (text_toolkit) --------------------------- .. py:function:: create_bytes(mol, format, id=None, writer_args=None, errors="strict", level=None) Convert a TextRecord into a structure record in the given format as a byte string If *id* is not None then use it instead of the molecule's own id. :param mol: the molecule to use for the output :type mol: a :class:`.TextRecord` :param format: the output structure format :type format: a format name string, or Format object :param id: an alternate record id :type id: a string, or None to use the molecule's own id :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a byte string .. _text_toolkit.open_molecule_writer: open_molecule_writer (text_toolkit) ----------------------------------- .. py:function:: open_molecule_writer(destination=None, format=None, writer_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", level=None) Return a MoleculeWriter which can write TextRecord instances to a destination. 
A :class:`chemfp.base_toolkit.MoleculeWriter` has the methods ``write_molecule``, ``write_molecules``, and ``write_ids_and_molecules``, which are ways to write a :class:`.TextRecord`, a TextRecord iterator, or an (id, TextRecord) pair iterator to a file. TextRecords are written to *destination*. The output format can be a string like "sdf.gz" or "smi", a :class:`chemfp.base_toolkit.Format`, or Format-like object with "name" and "compression" attributes, or None to auto-detect based on the *destination*. If auto-detection is not possible, the output will be written as uncompressed SMILES. That said, the text toolkit doesn't know how to convert between SMILES and SDF formats, and will raise an exception if you try. The *writer_args* is only used for the "smi", "can", and "usm" output formats. The only supported parameter is:: * delimiter - one of "tab", "space", "to-eol", the space or tab characters, or None The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created.
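The format resolution described above (an explicit format wins, otherwise the destination name is inspected, otherwise uncompressed SMILES) can be sketched as follows. This is an assumption-laden illustration rather than chemfp's detection code: ``guess_format``, the ``Format`` namedtuple, the two lookup tuples, and the use of an empty string for "no compression" are all hypothetical.

```python
import collections

# Hypothetical stand-in for a Format-like object with "name" and "compression".
Format = collections.namedtuple("Format", "name compression")

# Assumed lookup tables for this sketch; the real toolkit may know other names.
KNOWN_NAMES = ("smi", "can", "usm", "sdf")
KNOWN_COMPRESSIONS = ("gz", "zst")

def guess_format(destination=None, format=None):
    """Resolve a Format from an explicit format or from the destination name."""
    if format is not None:
        if isinstance(format, str):
            # A string like "sdf.gz" carries both the name and the compression.
            name, _, compression = format.partition(".")
            return Format(name, compression)
        # A Format-like object: copy its two attributes.
        return Format(format.name, format.compression)
    if isinstance(destination, str):
        parts = destination.lower().split(".")
        compression = ""
        if len(parts) > 1 and parts[-1] in KNOWN_COMPRESSIONS:
            compression = parts.pop()
        if len(parts) > 1 and parts[-1] in KNOWN_NAMES:
            return Format(parts[-1], compression)
    # Auto-detection was not possible: fall back to uncompressed SMILES.
    return Format("smi", "")

from_name = guess_format("results.sdf.gz")
from_string = guess_format(format="smi")
fallback = guess_format(None)
```

In this sketch a destination of None (stdout) or a name with an unrecognized extension both end at the uncompressed SMILES fallback.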
:param destination: the structure destination :type destination: a filename, file object, or None to write to stdout :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param encoding: the byte encoding :type encoding: string (typically 'utf8' or 'latin1') :param encoding_errors: how to handle decoding failure :type encoding_errors: string (typically 'strict', 'ignore', or 'replace') :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeWriter` expecting :class:`.TextRecord` instances .. _text_toolkit.open_molecule_writer_to_string: open_molecule_writer_to_string (text_toolkit) --------------------------------------------- .. py:function:: open_molecule_writer_to_string(format, writer_args=None, errors="strict", location=None) Return a MoleculeStringWriter which can write TextRecord instances to a string. See :func:`chemfp.text_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a Unicode string. 
:param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting :class:`.TextRecord` instances .. _text_toolkit.open_molecule_writer_to_bytes: open_molecule_writer_to_bytes (text_toolkit) -------------------------------------------- .. py:function:: open_molecule_writer_to_bytes(format, writer_args=None, errors="strict", location=None, level=None) Return a MoleculeStringWriter which can write TextRecord instances to a byte string. See :func:`chemfp.text_toolkit.open_molecule_writer` for full parameter details. Use the writer's :meth:`chemfp.base_toolkit.MoleculeStringWriter.getvalue` to get the output as a byte string. :param format: the output structure format :type format: a format name string, or Format(-like) object, or None to auto-detect :param writer_args: writer arguments passed to the underlying toolkit :type writer_args: a dictionary :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track writer state information :type location: a :class:`chemfp.io.Location` object, or None :param level: compression level to use for compressed formats :type level: None, a positive integer, or one of the strings 'min', 'default', or 'max' :returns: a :class:`chemfp.base_toolkit.MoleculeStringWriter` expecting :class:`.TextRecord` instances .. _text_toolkit.copy_molecule: copy_molecule (text_toolkit) ---------------------------- ..
py:function:: copy_molecule(mol) Return a new TextRecord which is a copy of the given TextRecord :param mol: the text record :type mol: a :class:`.TextRecord` :returns: a new :class:`.TextRecord` .. _text_toolkit.add_tag: add_tag (text_toolkit) ---------------------- .. py:function:: add_tag(mol, tag, value) Add an SD tag value to the TextRecord If the *mol* is in "sdf" format then this will modify ``mol.record`` to append the new *tag* and *value* to the end of the tag block. The other tags will not be modified, including tags with the same tag name. :param mol: the text record :type mol: a :class:`.TextRecord` :param string tag: the SD tag name :param string value: the text for the tag :returns: None .. _text_toolkit.get_tag: get_tag (text_toolkit) ---------------------- .. py:function:: get_tag(mol, tag) Get the named SD tag value, or None if it doesn't exist If the *mol* is in "sdf" format then this will return the corresponding tag value from ``mol.record``, or None if the tag does not exist. If the record is in any other format then it will return None. :param mol: the molecule :type mol: a :class:`.TextRecord` :param tag: the SD tag name :type tag: string :returns: a string, or None .. _text_toolkit.get_tag_pairs: get_tag_pairs (text_toolkit) ---------------------------- .. py:function:: get_tag_pairs(mol) Get a list of all SD tag (name, value) pairs for the TextRecord If the *mol* is in "sdf" format then this will return the list of (tag, value) pairs in ``mol.record``, where the *tag* and *value* are strings. If the record is in any other format then it will return an empty list. :param mol: the molecule :type mol: a :class:`.TextRecord` :returns: a list of (tag name, tag value) pairs .. _text_toolkit.get_id: get_id (text_toolkit) --------------------- .. py:function:: get_id(mol) Get the molecule's id from the TextRecord's id field This is the toolkit-portable way to get ``mol.id``. :param mol: the molecule :type mol: a TextRecord :returns: a string ..
_text_toolkit.set_id: set_id (text_toolkit) --------------------- .. py:function:: set_id(mol, id) Set the TextRecord's id to the new id This is the toolkit-portable way to write ``mol.id = id``. Note: this does not modify ``mol.record``. Use :func:`chemfp.text_toolkit.create_string` or similar text_toolkit functions to get the record text with a new identifier. :param mol: the molecule :type mol: a :class:`.TextRecord` :param id: the new id :type id: string :returns: None .. _text_toolkit.read_sdf_records: read_sdf_records (text_toolkit) ------------------------------- .. py:function:: read_sdf_records(source=None, reader_args=None, compression=None, errors="strict", location=None, block_size=327680) Return an iterator that reads each record from an SD file as a string. Iterate through the records in *source*, which must be in SD format. If *compression* is None or "auto" then auto-detect the compression type based on *source*, and default to uncompressed when it can't be determined. Use "gz" when the input is gzip compressed, and "none" or "" if uncompressed. The *reader_args* parameter is currently unused. It exists for future compatibility. The *errors* parameter specifies how to handle errors. "strict" raises an exception, "report" sends a message to stderr and goes to the next record, and "ignore" goes to the next record. The *location* parameter takes a :class:`chemfp.io.Location` instance. If None then a default Location will be created. The *block_size* parameter is the number of bytes to read from the SD file. The current implementation reads a block, iterates through the records in the block, then prepends any remaining text to the start of the next block. You shouldn't need to change this parameter, but if you do, please let me know. Note: to prevent accidental memory consumption if the input is in the wrong format, a complete record must be found within the first 327680 bytes or 5*block_size bytes, whichever is larger.
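The block-based reading strategy described above can be sketched in a few lines of standalone Python. This is only an illustration of the idea, not chemfp's parser: ``iter_sdf_records`` is a hypothetical name, it works on text rather than bytes, it treats any line containing only ``$$$$`` as the end of a record (so it would be fooled by tag data with that value), and it omits the memory-consumption limit.

```python
import io

def iter_sdf_records(infile, block_size=327680):
    """Yield each "$$$$"-terminated record from a text file object.

    Hypothetical sketch of block-based reading, not chemfp's implementation.
    """
    terminator = "\n$$$$\n"
    leftover = ""
    while True:
        block = infile.read(block_size)
        if not block:
            break
        # Prepend the incomplete text left over from the previous block.
        text = leftover + block
        start = 0
        while True:
            end = text.find(terminator, start)
            if end == -1:
                break
            end += len(terminator)
            yield text[start:end]
            start = end
        # Whatever follows the last complete record waits for the next block.
        leftover = text[start:]
    if leftover.strip():
        raise ValueError("incomplete final record")

record_1 = "first\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
record_2 = "second\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"

# A deliberately tiny block size forces records to span several blocks.
records = list(iter_sdf_records(io.StringIO(record_1 + record_2), block_size=16))
```

Because the leftover text is carried into the next block, a record (or even the ``$$$$`` line itself) may straddle a block boundary without being lost.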
The parser has only a basic understanding of the SD format. It knows how to handle the counts line, the SKP property, and even tag data with the value '$$$$'. It is not a full validator and it does not know chemistry. WARNING: the parser does not yet handle the MS Windows newline convention. See :func:`.read_sdf_ids_and_records` if you want (id, record) pairs, and :func:`.read_sdf_ids_and_values` if you want (id, tag data) pairs. See :func:`.read_sdf_records_from_string` to read from a string instead of a file or file-like object. :param source: the SDF source :type source: a filename, file object, or None to read from stdin :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.RecordReader` iterating over each record as a string .. _text_toolkit.read_sdf_ids_and_records: read_sdf_ids_and_records (text_toolkit) --------------------------------------- .. py:function:: read_sdf_ids_and_records(source=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680) Return an iterator that reads the (id, record string) pairs from an SD file See :func:`.read_sdf_records` for most parameter details. That function iterates over the records, while this one iterates over the (id, record) pairs. By default the id comes from the title line. Use *id_tag* to get the record id from the given SD tag instead. See :func:`.read_sdf_ids_and_values` if you want to read an identifier and tag value, or two tag values, instead of returning the full record.
:param source: the SDF source :type source: a filename, file object, or None to read from stdin :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndRecordReader` iterating (id, record string) pairs .. _text_toolkit.read_sdf_ids_and_values: read_sdf_ids_and_values (text_toolkit) -------------------------------------- .. py:function:: read_sdf_ids_and_values(source=None, id_tag=None, value_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680) Return an iterator that reads the (id, tag value string) pairs from an SD file See :func:`.read_sdf_records` for most parameter details. That function iterates over the records, while this one iterates over the (id, tag value) pairs. By default this uses the title line for both the id and tag value strings. Use *id_tag* and *value_tag*, respectively, to use a given tag value instead. If a tag doesn't exist then None will be used. 
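The title-line default and the None fallback described above can be illustrated without chemfp. The helpers below are hypothetical (``parse_tags`` and ``id_and_value`` are not chemfp functions) and deliberately simplified: only the first line of each tag value is kept, and only the first occurrence of a tag name is used.

```python
def parse_tags(record):
    """Return a dict mapping each SD tag name to the first line of its value.

    Hypothetical, simplified parser; not chemfp's implementation.
    """
    tags = {}
    lines = record.split("\n")
    for i, line in enumerate(lines):
        # A data header looks like "> <NAME>"; the value starts on the next line.
        lt = line.find("<")
        gt = line.rfind(">")
        if line.startswith(">") and 0 < lt < gt:
            name = line[lt + 1:gt]
            if name not in tags and i + 1 < len(lines):
                tags[name] = lines[i + 1]
    return tags

def id_and_value(record, id_tag=None, value_tag=None):
    """Return the (id, value) pair for one SD record string."""
    title = record.split("\n", 1)[0]
    tags = parse_tags(record)
    # None means "use the title line"; a missing tag gives None.
    record_id = title if id_tag is None else tags.get(id_tag)
    value = title if value_tag is None else tags.get(value_tag)
    return record_id, value

record = (
    "caffeine\n  sketch\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <MW>\n194.19\n\n"
    "$$$$\n"
)

default_pair = id_and_value(record)
mw_pair = id_and_value(record, value_tag="MW")
missing_pair = id_and_value(record, id_tag="no-such-tag")
```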
:param source: the SDF source :type source: a filename, file object, or None to read from stdin :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param value_tag: SD tag containing the value :type value_tag: string, or None to use the record title :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndRecordReader` iterating (id, value string) pairs .. _text_toolkit.read_sdf_records_from_string: read_sdf_records_from_string (text_toolkit) ------------------------------------------- .. py:function:: read_sdf_records_from_string(content, reader_args=None, compression=None, errors="strict", location=None, block_size=327680) Return an iterator that reads each record from a string containing SD records See :func:`.read_sdf_records` for the parameter details. The main difference is that this function reads from *content*, which is a string containing 0 or more SDF records. If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, and the compression option is not supported. If content is a byte string then the records will be returned as byte strings, and compression is supported. See :func:`.read_sdf_ids_and_records_from_string` to read (id, record) pairs and :func:`.read_sdf_ids_and_values_from_string` to read (id, tag value) pairs.
:param content: a string containing zero or more SD records :type content: string or bytes :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.RecordReader` iterating over each record as a string .. _text_toolkit.read_sdf_ids_and_records_from_string: read_sdf_ids_and_records_from_string (text_toolkit) --------------------------------------------------- .. py:function:: read_sdf_ids_and_records_from_string(content=None, id_tag=None, reader_args=None, compression=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680) Return an iterator that reads the (id, record) pairs from a string containing SD records This function reads the records from *content*, which is a string containing 0 or more SDF records. It iterates over the (id, record) pairs. By default the id comes from the first line of the SD record. Use *id_tag* to use a given tag value instead. See :func:`.read_sdf_records` for details about the other parameters. If content is a (Unicode) string then it must only contain ASCII characters, the records will be returned as strings, the compression option is not supported, and the encoding and encoding_errors parameters are ignored. If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id. 
:param content: a string containing zero or more SD records :type content: string or bytes :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndRecordReader` iterating over the (id, record string) pairs .. _text_toolkit.read_sdf_ids_and_values_from_string: read_sdf_ids_and_values_from_string (text_toolkit) -------------------------------------------------- .. py:function:: read_sdf_ids_and_values_from_string(content=None, id_tag=None, value_tag=None, compression=None, reader_args=None, errors="strict", location=None, encoding="utf8", encoding_errors="strict", block_size=327680) Return an iterator that reads the (id, value) pairs from a string containing SD records This function reads the records from *content*, which is a string containing 0 or more SDF records. It iterates over the (id, value) pairs, which by default both contain the title line. Use *id_tag* and *value_tag*, respectively, to use a given tag value instead. If a tag doesn't exist then None will be used. If content is a (Unicode) string then it must only contain ASCII characters, the compression option is not supported, and the encoding and encoding_errors parameters are ignored. If content is a byte string then the records will be returned as byte strings, compression is supported, and the encoding and encoding_errors parameters are used to parse the id and value. See :func:`.read_sdf_records` for details about the other parameters. 
:param content: a string containing zero or more SD records :type content: string or bytes :param id_tag: SD tag containing the record id :type id_tag: string, or None to use the record title :param value_tag: SD tag containing the value :type value_tag: string, or None to use the record title :param reader_args: currently ignored :type reader_args: currently ignored :param compression: the data content compression method :type compression: one of "auto", "none", "", or "gz" :param errors: specify how to handle errors :type errors: one of "strict", "report", or "ignore" :param location: object used to track parser state information :type location: a :class:`chemfp.io.Location` object, or None :returns: a :class:`chemfp.base_toolkit.IdAndRecordReader` iterating over the (id, value) pairs .. _text_toolkit.get_sdf_tag: get_sdf_tag (text_toolkit) -------------------------- .. py:function:: get_sdf_tag(sdf_record, tag) Return the value for a named tag in an SDF record string Get the value for the tag named *tag* from the string *sdf_record* containing an SD record. :param string sdf_record: an SD record :param string tag: a tag name :returns: the corresponding tag value as a string, or None .. _text_toolkit.add_sdf_tag: add_sdf_tag (text_toolkit) -------------------------- .. py:function:: add_sdf_tag(sdf_record, tag, value) Add an SD tag value to an SD record string This will append the new *tag* and *value* to the end of the tag data block in the *sdf_record* string. :param string sdf_record: an SD record :param string tag: a tag name :param string value: the new tag value :returns: a new SD record string with the new tag and value .. _text_toolkit.get_sdf_tag_pairs: get_sdf_tag_pairs (text_toolkit) -------------------------------- .. py:function:: get_sdf_tag_pairs(sdf_record) Return the (tag, value) entries in the SDF record string Parse the *sdf_record* and return the tag data as a list of (tag, value) pairs. 
The type of the returned strings will be the same as the type of the input sdf_record string. :param string sdf_record: an SDF record :returns: a list of (tag, value) pairs .. _text_toolkit.get_sdf_id: get_sdf_id (text_toolkit) ------------------------- .. py:function:: get_sdf_id(sdf_record) Return the id for the SDF record string The id is the first line of the *sdf_record*. A future version of this function may support an *id_tag* parameter. Let me know if that would be useful. The returned id string will have the same type as the input sdf_record. :param string sdf_record: an SD record :returns: the first line of the SD record .. _text_toolkit.set_sdf_id: set_sdf_id (text_toolkit) ------------------------- .. py:function:: set_sdf_id(sdf_record, id) Set the id of the SDF record string to a new value Set the first line of *sdf_record* to the new *id*, which must not contain a newline. The sdf_record and the id must have the same string type. :param string sdf_record: an SDF record :param string id: the new id chemfp._text_toolkit module (private) ===================================== .. py:module:: chemfp._text_toolkit As you might have inferred from the leading "_" in "_text_toolkit", this is not a public module. There is no reason for you to import it directly, the module name is subject to change, and even the location of the classes is subject to change. The reason why I even bring it up is because the :mod:`chemfp.text_toolkit` returns class instances from this module, so you might well wonder about them. TextRecord ---------- .. py:class:: TextRecord Base class for the text_toolkit 'molecules', which work with the records as text. The :mod:`chemfp.text_toolkit` implements the toolkit API, but it doesn't know chemistry. Instead of returning real molecule objects, with atoms and bonds, it returns TextRecord subclass instances that hold the record as a text string.
As an implementation detail (which means it's subject to change) there is a subclass for each of the supported formats. * :class:`SDFRecord` - holds "sdf" records * :class:`SmiRecord` - holds "smi" records (the full line from a "smi" SMILES file) * :class:`CanRecord` - holds "can" records (the full line from a "can" SMILES file) * :class:`UsmRecord` - holds "usm" records (the full line from a "usm" SMILES file) * :class:`SmiStringRecord` - holds "smistring" records (only the "smistring" SMILES string; no id) * :class:`CanStringRecord` - holds "canstring" records (only the "canstring" SMILES string; no id) * :class:`UsmStringRecord` - holds "usmstring" records (only the "usmstring" SMILES string; no id) All of the classes have the following attributes: .. py:attribute:: id The record identifier as a Unicode string, or None if there is no identifier .. py:attribute:: id_bytes The record identifier as a byte string, or None if there is no identifier .. py:attribute:: record The record, as a string. For the smistring, canstring, and usmstring formats, this is only the SMILES string. .. py:attribute:: record_format One of "sdf", "smi", "can", "usm", "smistring", "canstring", or "usmstring". The SMILES classes have an attribute: .. py:attribute:: smiles The SMILES string component of the record. .. py:method:: add_tag(tag, value) Add an SD tag value to the TextRecord This method does nothing if the record is not an "sdf" record. :param tag: the SD tag name :type tag: string :param value: the text for the tag :type value: string :returns: None .. py:method:: get_tag(tag) Get the named SD tag value, or None if it doesn't exist or is not an "sdf" record. :param tag: the SD tag name :type tag: byte or Unicode string :returns: a Unicode string, or None .. py:method:: get_tag_as_bytes(tag) Get the named SD tag value, or None if it doesn't exist or is not an "sdf" record. :param tag: the SD tag name :type tag: byte string :returns: a byte string, or None ..
py:method:: get_tag_pairs() Get a list of all SD tag (name, value) pairs for the TextRecord using Unicode strings This function returns an empty list if the record is not an "sdf" record. :returns: a list of (Unicode string name, Unicode string value) pairs .. py:method:: get_tag_pairs_as_bytes() Get a list of all SD tag (name, value) pairs for the TextRecord using byte strings This function returns an empty list if the record is not an "sdf" record. :returns: a list of (byte string name, byte string value) pairs .. py:method:: copy() Return a new record which is a copy of the given record SDFRecord --------- .. py:class:: SDFRecord Holds an SDF record. See :class:`chemfp._text_toolkit.TextRecord` for API details SmiRecord --------- .. py:class:: SmiRecord Holds an "smi" record. See :class:`chemfp._text_toolkit.TextRecord` for API details CanRecord --------- .. py:class:: CanRecord Holds a "can" record. See :class:`chemfp._text_toolkit.TextRecord` for API details UsmRecord --------- .. py:class:: UsmRecord Holds a "usm" record. See :class:`chemfp._text_toolkit.TextRecord` for API details SmiStringRecord --------------- .. py:class:: SmiStringRecord Holds an "smistring" record. See :class:`chemfp._text_toolkit.TextRecord` for API details CanStringRecord --------------- .. py:class:: CanStringRecord Holds a "canstring" record. See :class:`chemfp._text_toolkit.TextRecord` for API details UsmStringRecord --------------- .. py:class:: UsmStringRecord Holds a "usmstring" record. See :class:`chemfp._text_toolkit.TextRecord` for API details chemfp.io module ================ .. py:module:: chemfp.io This module implements a single public class, :class:`Location`, which tracks parser state information, including the location of the current record in the file. The other functions and classes are undocumented, should not be used, and may change in future releases. Location -------- ..
py:class:: Location Get location and other internal reader and writer state information A Location instance gives a way to access information like the current record number, line number, and molecule object:: >>> import chemfp >>> with chemfp.read_molecule_fingerprints("RDKit-MACCS166", ... "ChEBI_lite.sdf.gz", id_tag="ChEBI ID") as reader: ... for id, fp in reader: ... if id == "CHEBI:3499": ... print("Record starts at line", reader.location.lineno) ... print("Record byte range:", reader.location.offsets) ... print("Number of atoms:", reader.location.mol.GetNumAtoms()) ... break ... [08:18:12] S group MUL ignored on line 103 Record starts at line 3599 Record byte range: (138171, 141791) Number of atoms: 36 The supported properties are: * filename - a string describing the source or destination * lineno - the line number for the start of the record * mol - the toolkit molecule for the current record * offsets - the (start, end) byte positions for the current record * output_recno - the number of records written successfully * recno - the current record number * record - the record as a text string * record_format - the record format, like "sdf" or "can" Most of the readers and writers do not support all of the properties. Unsupported properties return None. The *filename* is a read/write attribute and the other attributes are read-only. If you don't pass a location to the readers and writers then they will create a new one based on the source or destination, respectively. You can also pass in your own Location, created as ``Location(filename)`` if you have an actual filename, or ``Location.from_source(source)`` or ``Location.from_destination(destination)`` if you have a more generic source or destination. .. py:method:: __init__(filename=None) Use *filename* as the location's filename .. py:method:: from_source(cls, source) Create a Location instance based on the source If *source* is a string then it's used as the filename.
If *source* is None then the location filename is "<stdin>". If *source* is a file object then its ``name`` attribute is used as the filename, or None if there is no attribute. .. py:method:: from_destination(cls, destination) Create a Location instance based on the destination If *destination* is a string then it's used as the filename. If *destination* is None then the location filename is "<stdout>". If *destination* is a file object then its ``name`` attribute is used as the filename, or None if there is no attribute. .. py:method:: __repr__() Return a string like 'Location("<stdout>")' .. py:attribute:: Location.first_line Read-only attribute. The first line of the current record .. py:attribute:: Location.filename Read/write attribute. A string which describes the source or destination. This is usually the source or destination filename but can be a string like "<stdin>" or "<stdout>". .. py:attribute:: Location.mol Read-only attribute. The molecule object for the current record .. py:attribute:: Location.offsets Read-only attribute. The (start, end) byte offsets, starting from 0. *start* is the record start byte position and *end* is one byte past the last byte of the record. .. py:attribute:: Location.output_recno Read-only attribute. The number of records actually written to the file or string. The value ``recno - output_recno`` is the number of records sent to the writer but which had an error and could not be written to the output. .. py:attribute:: Location.recno Read-only attribute. The current record number For writers this is the number of records sent to the writer, and output_recno is the number of records successfully written to the file or string. .. py:attribute:: Location.record Read-only attribute. The current record as an uncompressed text string .. py:attribute:: Location.record_format Read-only attribute. The record format name .. py:method:: where() Return a human readable description about the current reader or writer state.
The description will contain the filename, line number, record number, and up to the first 40 characters of the first line of the record, if those properties are available.
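As a rough illustration of how such a description could be assembled, here is a standalone sketch. The function name ``describe_location`` and the exact wording of the message are assumptions made for this example. The text which ``where()`` actually produces may differ. The sketch only shows the two rules stated above: unavailable properties are left out, and the first line is cut to 40 characters.

```python
def describe_location(filename=None, lineno=None, recno=None, first_line=None):
    """Build a description from whichever properties are available.

    Hypothetical helper with an invented message format; not chemfp's where().
    """
    parts = []
    if filename is not None:
        parts.append("file %r" % (filename,))
    if lineno is not None:
        parts.append("line %d" % (lineno,))
    if recno is not None:
        parts.append("record #%d" % (recno,))
    if first_line is not None:
        # Use at most the first 40 characters of the record's first line.
        parts.append("first line starts %r" % (first_line[:40],))
    return ", ".join(parts)

full_message = describe_location("ChEBI_lite.sdf.gz", 3599, 12, "x" * 50)
partial_message = describe_location(filename="ChEBI_lite.sdf.gz", recno=12)
```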