.. _simsearch:

simsearch command-line options
====================================

The following comes from ``simsearch --help``:

.. code-block:: none

  Usage: simsearch [OPTIONS] TARGET_FILENAME

    Search an FPS or FPB file for similar fingerprints.

  Options:
    -k, --k-nearest, --k K          Select the k nearest neighbors (use 'all'
                                    for all neighbors)
    -t, --threshold FLOAT RANGE     Minimum similarity score threshold
                                    [0.0<=x<=1.0]
    --beta FLOAT                    Tversky beta parameter (default: the value
                                    of --alpha)
    --alpha FLT                     Tversky alpha parameter (default: 1.0)
    -q, --queries PATH              Filename containing the query fingerprints
    --NxN                           Use the targets as the queries, and exclude
                                    the self-similarity term
    --query TEXT                    query as a structure record (default format:
                                    'smi')
    --hex-query, --hex HEX_STR      query in hex
    --query-id STR                  id for the query or hex-query (default:
                                    'Query1')
    --query-format, --in FORMAT     input query format (default uses the file
                                    extension, else 'fps')
    --target-format FORMAT          input target format (default uses the file
                                    extension, else 'fps')
    --query-type STRING             fingerprint type string if the queries are
                                    structures (default: use the target
                                    fingerprint type)
    --id-tag NAME                   tag containing the record id if --query-
                                    format is an SD file)
    --errors [strict|report|ignore]
                                    how should structure parse errors be
                                    handled? (default=ignore)
    --delimiter VALUE               Delimiter style for SMILES and InChI files.
                                    Forces '-R delimiter=VALUE'.
    --has-header                    Skip the first line of a SMILES or InChI
                                    file. Forces '-R has_header=1'.
    -R NAME=VALUE                   Specify a reader argument
    --cxsmiles / --no-cxsmiles      Use --no-cxsmiles to disable the default
                                    support for CXSMILES extensions. Forces '-R
                                    cxsmiles=1' or '-R cxsmiles=0'.
    -o, --output FILENAME           output filename (default is stdout)
    --out FORMAT                    Output format. One of 'simsearch', 'csv',
                                    'tsv', or 'npz' (default: based on filename,
                                    or 'simsearch')
    --include-metadata / --no-metadata
                                    With --no-metadata, do not include header
                                    metadata in 'simsearch' output format.
    --include-empty / --no-include-empty
                                    In csv or tsv output, include a line for
                                    queries with no hits (the default)
    --empty-target-id STR           In csv or tsv output, the target id for a
                                    query with no hits (default: '*')
    --empty-score STR               In csv or tsv output, the score for a query
                                    with no hits (default: 'NaN')
    --precision [1|2|3|4|5|6|7|8|9|10]
                                    Number of digits in Tanimoto score (default:
                                    based on the fingerprint size)
    -c, --count                     Report counts
    -j, --num-threads INT           The number of threads to use. -1 means all
                                    available cores. This option overrides
                                    $OMP_NUM_THREADS. (default: -1)
    -b, --batch-size INTEGER RANGE  Number of fingerprints to process at a time
                                    [x>=1]
    --scan                          Scan the file to find matches (low memory
                                    overhead)
    --memory                        Build and search an in-memory data structure
                                    (faster for multiple queries)
    --no-mmap                       Don't use mmap to read uncompressed FPB
                                    files. May give better performance on
                                    networked file systems, at the expense of
                                    higher memory use.
    --times / --no-times            Write timing information to stderr
    --progress / --no-progress      Show a progress bar (default: show unless
                                    the output is a terminal)
    --version                       Show the version and exit.
    --license-check                 Check the license and report results to
                                    stdout.
    --license-file FILENAME         Specify a chemfp license file
    --traceback                     Print the traceback on KeyboardInterrupt
    --version                       Show the version and exit.
    --help                          Show this message and exit.

    Examples:

    * Find the nearest 2 ChEMBL fingerprints given a SMILES string. Write the
    results to stdout in "simsearch" format, each query and its hits on one
    line:

      % simsearch --query c1ccccc1P chembl_30.fpb -k 2
      #Simsearch/1
      #num_bits=2048
      #type=Tanimoto k=4 threshold=0.0
      #software=chemfp/4.1
      #targets=chembl_30.fpb
      4	Query1	CHEMBL119405	0.4666667	CHEMBL14092	0.4285714

    * Generate an NxN matix and save the results in an npy file compatible
    with a SciPy sparse matrix.

      % simsearch --NxN distinct.fps -o distinct.npz

    * Use query fingerprints from a file (in FPS format) to search target
    fingerprints (in FPB format), for fingerprints with a Tanimoto similarity
    of at least 0.4. Write the matches to stdout in "csv" with one row for
    each query hit. If there are no query hits then use "*" (the default) for
    the target id and specify "NA" for the score.

      % simsearch --queries queries.fps targets.fpb --threshold 0.41 \
           --out csv --empty-score NA
      query_id,target_id,score
      22525101,22525003,0.4261364
      22525101,22525019,0.4224599
      22525101,22525016,0.9161290
      22525102,*,NA
      22525103,*,NA
      22525104,22525016,0.4100418

    * Do the same search but save the results to a tsv (tab-separated) file.
    The format is inferred from the output filename.

      % simsearch --queries queries.fps targets.fpb --threshold 0.41 \
           --empty-score NA -o results.tsv --no-progress
      % head -7 results.tsv
      query_id	target_id	score
      22525101	22525003	0.4261364
      22525101	22525019	0.4224599
      22525101	22525016	0.9161290
      22525102	*	NA
      22525103	*	NA
      22525104	22525016	0.4100418
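
The csv output in the examples above can be post-processed with a few lines
of Python. The following is a minimal sketch and not part of chemfp: the
function name ``read_simsearch_csv`` is hypothetical, and it assumes the
``query_id,target_id,score`` header shown above and the default
``--empty-target-id`` of ``'*'``. It groups the hits by query id and maps a
query with no hits to an empty list, so the ``--empty-score`` text is never
converted to a number.

.. code-block:: python

   import csv
   import io


   def read_simsearch_csv(lines, empty_target_id="*"):
       """Group (target_id, score) hits by query id.

       ``lines`` is any iterable of text lines, such as an open file.
       Rows whose target id equals ``empty_target_id`` are treated as
       "no hits" placeholders (an assumption based on the help text).
       """
       hits = {}
       for row in csv.DictReader(lines):
           query_hits = hits.setdefault(row["query_id"], [])
           if row["target_id"] == empty_target_id:
               continue  # placeholder row: keep the query, add no hit
           query_hits.append((row["target_id"], float(row["score"])))
       return hits


   # Sample data copied from the csv example in the help text.
   sample = """\
   query_id,target_id,score
   22525101,22525003,0.4261364
   22525101,22525019,0.4224599
   22525101,22525016,0.9161290
   22525102,*,NA
   22525103,*,NA
   22525104,22525016,0.4100418
   """

   hits = read_simsearch_csv(io.StringIO(sample))
   for query_id, query_hits in hits.items():
       print(query_id, len(query_hits), query_hits)

To read a saved file instead, pass an open file object, for example
``read_simsearch_csv(open("results.csv", newline=""))``. With
``--no-include-empty`` there are no placeholder rows, so queries without hits
would simply be absent from the returned dictionary.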