Help for the command-line tools
The chemfp command-line tools are:
- fpcat - merge multiple fingerprint files into one
- ob2fps - use Open Babel to generate fingerprints
- oe2fps - use OEChem/OEGraphSim to generate fingerprints
- rdkit2fps - use RDKit to generate fingerprints
- sdf2fps - extract fingerprints from an SD file
- simsearch - search a fingerprint file for similar fingerprints
fpcat command-line options
The following comes from fpcat --help:
usage: fpcat [-h] [--in FORMAT] [--merge] [-o FILENAME] [--out FORMAT]
[--reorder] [--preserve-order] [--alignment N]
[--show-progress] [--max-spool-size SIZE] [--tmpdir DIRNAME]
[--version]
[filename [filename ...]]
Combine multiple fingerprint files into a single file.
positional arguments:
filename input fingerprint filenames (default: use stdin)
optional arguments:
-h, --help show this help message and exit
--in FORMAT input fingerprint format. One of fps, fps.gz, or fpb.
(default guesses from filename or is fps)
--merge assume the input fingerprint files are in popcount
order and do a merge sort
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output fingerprint format. One of fps, fps.gz, or fpb.
(default guesses from output filename, or is 'fps')
--reorder reorder the output fingerprints by popcount (default
for FPB output)
--preserve-order save the output fingerprints in the same order as the
input (default for FPS output)
--alignment N alignment size when saving a FPB file (default=8)
--show-progress show progress
--max-spool-size SIZE
use temporary files for extra storage space for huge
FPB files (default uses RAM)
--tmpdir DIRNAME directory for the temporary files (default uses the
system temp directory)
--version show program's version number and exit
Examples:
fpcat can be used to convert between FPS and FPB formats. This is
handy if you want to see what's inside of an FPB file:
fpcat fingerprints.fpb
You can also use fpcat to make an FPB file from an FPS file:
fpcat fingerprints.fps -o fingerprints.fpb
You might have generated a set of FPS files that you want to merge
into a single FPB file. For example, you might have used GNU parallel
to generate an FPS file for each of the PubChem files:
fpcat Compound_*.fps -o pubchem.fpb
By default the FPB format sorts the fingerprints by popcount. (Use
--preserve-order if you really want to preserve the input order.) The
sort overhead for PubChem uses about 10 GB of RAM. If you don't have
that much memory then ask fpcat to use less memory:
fpcat --max-spool-size 1GB Compound_*.fps -o pubchem.fpb
This will use about 2 GB of RAM and the --tmpdir for the rest. (Yes,
it would be nice if I could get those two memory size numbers to
match.)
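As a rough illustration of what "sorted by popcount" means, here is a minimal Python sketch. It is not chemfp's implementation: the popcount helper and the sample records are hypothetical, and fingerprints are assumed to be plain byte strings.

```python
def popcount(fp: bytes) -> int:
    """Count the number of bits set to 1 in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")

# Hypothetical (id, fingerprint) records, listed in input order.
records = [
    ("id1", bytes.fromhex("ff01")),  # 9 bits set
    ("id2", bytes.fromhex("0001")),  # 1 bit set
    ("id3", bytes.fromhex("0f00")),  # 4 bits set
]

# sorted() is stable, so records with the same popcount keep their input order.
by_popcount = sorted(records, key=lambda rec: popcount(rec[1]))
ordered_ids = [rec_id for rec_id, _ in by_popcount]
```

A sort like this needs every record in memory (or spooled to disk) before the first one can be written, which is why reordering costs more than --preserve-order.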
The --merge option is experimental. Use it when every input file is
already in popcount order, because the sorted output is then a simple
merge of the individual sorted inputs. However, this option opens all
input files at the same time, which may exceed your resource limit on
file descriptors. The current implementation also requires a lot of
disk seeks, so it is slow for many files.
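The merge idea can be sketched with Python's heapq.merge. This is only an illustration under the assumption that each input is an iterable of (id, fingerprint) pairs already in popcount order; the helper names are hypothetical and this is not how fpcat is implemented.

```python
import heapq

def popcount(fp: bytes) -> int:
    """Count the number of bits set to 1 in a byte-string fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")

def merge_by_popcount(*sorted_inputs):
    """Lazily merge inputs that are each already sorted by popcount."""
    return heapq.merge(*sorted_inputs, key=lambda rec: popcount(rec[1]))

# Two hypothetical inputs, each already in popcount order.
file_a = [("a1", b"\x01"), ("a2", b"\x0f")]  # popcounts 1 and 4
file_b = [("b1", b"\x03"), ("b2", b"\xff")]  # popcounts 2 and 8

merged_ids = [rec_id for rec_id, _ in merge_by_popcount(file_a, file_b)]
```

A merge of this kind only ever holds one pending record per input, but it must keep every input open until the merge finishes, which matches the file descriptor caveat above.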
ob2fps command-line options
The following comes from ob2fps --help:
usage: ob2fps [-h]
[--FP2 | --FP3 | --FP4 | --MACCS | --substruct | --rdmaccs | --rdmaccs/1]
[--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [-R NAME=VALUE]
[--delimiter {tab,whitespace,to-eol,space}] [--has-header]
[--version]
[filenames [filenames ...]]
Generate FPS or FPB fingerprints from a structure file using OpenBabel
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--FP2 linear fragments up to 7 atoms
--FP3 SMARTS patterns specified in the file patterns.txt
--FP4 SMARTS patterns specified in the file
SMARTS_InteLigand.txt
--MACCS Open Babel's implementation of the MACCS 166 keys
--substruct ChemFP substructure fingerprints
--rdmaccs, --rdmaccs/2
166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default autodetects from the
filename extension)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
-R NAME=VALUE specify a reader argument
--delimiter {tab,whitespace,to-eol,space}
delimiter style for SMILES and InChI files. Alias for
'-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI file. Alias
for '-R has_header=1'
--version show program's version number and exit
By default the Open Babel structure reader determines the file format
and compression type based on the filename extension. Unknown
filename extensions are treated as uncompressed SMILES files.
If the data comes from stdin, or the guess based on extension name is
wrong, then use the "--in FORMAT" option to change the default input format.
For example:
--in smi
--in sdf.gz
The most common format names are:
  File Type      Valid FORMAT names
  ---------      ------------------
  SMILES         smi, can, usm - append ".gz" for gzip'ed files
  InChI          inchi - append ".gz" for gzip'ed files
  SDF (native)   sdf - gzip compression is handled automatically
  SDF (chemfp)   sdf - append ".gz" suffix for gzip'ed files
  MOL2           mol2 - gzip compression is handled automatically
  PDB            pdb - gzip compression is handled automatically
  MacroModel     mmod - gzip compression is handled automatically
For a full list of formats, see http://openbabel.org/wiki/List_of_extensions .
Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.
chemfp uses its own parsers to find SMILES and InChI records, which are
passed on to Open Babel for processing. These give chemfp better error
reporting and control. However, unlike the normal Open Babel parsers, they
do not automatically recognize gzip files, so the format name must include
the ".gz" suffix to read compressed formats.
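The role of the ".gz" suffix in a format name can be sketched as follows. This is a hypothetical helper, not chemfp code: it assumes the only compression suffix is ".gz" and that the records are UTF-8 text.

```python
import gzip

def split_format(format_name: str):
    """Split a format name such as 'smi.gz' into (base format, is_compressed)."""
    if format_name.endswith(".gz"):
        return format_name[:-3], True
    return format_name, False

def open_records(filename: str, format_name: str):
    """Open a record file as text, decompressing only if the format name says so."""
    base_format, compressed = split_format(format_name)
    opener = gzip.open if compressed else open
    return base_format, opener(filename, "rt", encoding="utf-8")
```

Because the decision comes from the format name rather than from the file contents, a gzip'ed SMILES file read with "--in smi" would be treated as plain text; "--in smi.gz" is needed.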
By default chemfp uses Open Babel's native SDF reader. It also supports
an alternate implementation using chemfp's low-level SDF record parser.
To use chemfp's record parser, use the 'implementation' reader argument:
-R implementation=chemfp
All formats support Open Babel's 'options' OBConversion argument. This is a
compact string like 'ab"btext"', which in this case sets option 'a' to
True, and option 'b' to text "btext".
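To make the compact option string concrete, here is a small Python sketch that parses it according to the description above. It assumes each option is a single character, optionally followed by a double-quoted value; Open Babel's real parser may handle more cases, and parse_compact_options is a hypothetical name.

```python
def parse_compact_options(text: str) -> dict:
    """Parse a string like 'ab"btext"' into {'a': True, 'b': 'btext'}."""
    options = {}
    i = 0
    while i < len(text):
        name = text[i]
        i += 1
        if i < len(text) and text[i] == '"':
            # A quoted value follows; str.index raises ValueError if it is unterminated.
            end = text.index('"', i + 1)
            options[name] = text[i + 1:end]
            i = end + 1
        else:
            # No value given, so the option is a simple on/off flag.
            options[name] = True
    return options

parsed = parse_compact_options('ab"btext"')
```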
You will need to consult the Open Babel documentation and implementation
for details on the options available to each format.
oe2fps command-line options
The following comes from oe2fps --help:
usage: oe2fps [-h] [--path] [--circular] [--tree] [--numbits INT]
[--minbonds INT] [--maxbonds INT] [--minradius INT]
[--maxradius INT] [--atype ATYPE] [--btype BTYPE] [--maccs166]
[--substruct] [--rdmaccs] [--rdmaccs/1] [--aromaticity NAME]
[--id-tag NAME] [--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [-R NAME=VALUE]
[--delimiter {tab,whitespace,to-eol,space}] [--version]
[filenames [filenames ...]]
Generate FPS or FPB fingerprints from a structure file using OEChem
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--aromaticity NAME use the named aromaticity model (same as '-R
aromaticity=NAME')
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default guesses from filename)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
-R NAME=VALUE specify a reader argument
--delimiter {tab,whitespace,to-eol,space}
delimiter style for SMILES and InChI files. Alias for
'-R delimiter=VALUE'.
--version show program's version number and exit
path, circular, and tree fingerprints:
--path generate path fingerprints (default)
--circular generate circular fingerprints
--tree generate tree fingerprints
--numbits INT number of bits in the fingerprint (default=4096)
--minbonds INT minimum number of bonds in the path or tree
fingerprint (default=0)
--maxbonds INT maximum number of bonds in the path or tree
fingerprint (path default=5, tree default=4)
--minradius INT minimum radius for the circular fingerprint
(default=0)
--maxradius INT maximum radius for the circular fingerprint
(default=5)
--atype ATYPE atom type flags, described below (default=Default)
--btype BTYPE bond type flags, described below (default=Default)
166 bit MACCS substructure keys:
--maccs166 generate MACCS fingerprints
881 bit ChemFP substructure keys:
--substruct generate ChemFP substructure fingerprints
ChemFP version of the 166 bit RDKit/MACCS keys:
--rdmaccs, --rdmaccs/2
generate 166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
ATYPE is one or more of the following, separated by the '|' character
Arom AtmNum Chiral EqArom EqHBAcc EqHBDon EqHalo FCharge HCount HvyDeg
Hyb InRing
The following shorthand terms and expansions are also available:
DefaultPathAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo
DefaultCircularAtom = AtmNum|Arom|Chiral|FCharge|HCount|EqHalo
DefaultTreeAtom = AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb
and 'Default' selects the correct value for the specified fingerprint.
Examples:
--atype Default
--atype "Arom|AtmNum|FCharge|HCount"
--atype Arom,AtmNum,FCharge,HCount
BTYPE is one or more of the following, separated by the '|' character
Chiral InRing Order
The following shorthand terms and expansions are also available:
DefaultPathBond = Order|Chiral
DefaultCircularBond = Order
DefaultTreeBond = Order
and 'Default' selects the correct value for the specified fingerprint.
Examples:
--btype Default
--btype "Order|InRing"
To simplify command-line use, a comma may be used instead of a '|' to
separate different fields. Example:
--atype AtmNum,HvyDegree
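The ATYPE syntax described above (terms separated by '|' or ',', plus shorthand expansions) can be sketched in Python. This is an illustration only: expand_atype is a hypothetical helper, it does not validate the names, and the real oe2fps parser may behave differently.

```python
import re

# Shorthand expansions copied from the help text above.
ATOM_SHORTHANDS = {
    "DefaultPathAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb|EqHalo",
    "DefaultCircularAtom": "AtmNum|Arom|Chiral|FCharge|HCount|EqHalo",
    "DefaultTreeAtom": "AtmNum|Arom|Chiral|FCharge|HvyDeg|Hyb",
}

def expand_atype(text: str, fingerprint_kind: str = "Path") -> list:
    """Split on '|' or ',' and expand shorthand terms, keeping first-seen order."""
    names = []
    for term in re.split(r"[|,]", text):
        if not term:
            continue  # ignore empty fields such as a trailing separator
        if term == "Default":
            # 'Default' depends on which fingerprint family was requested.
            term = f"Default{fingerprint_kind}Atom"
        for name in ATOM_SHORTHANDS.get(term, term).split("|"):
            if name not in names:
                names.append(name)
    return names
```

With this sketch, "Arom|AtmNum|FCharge|HCount" and "Arom,AtmNum,FCharge,HCount" give the same list. The comma form avoids the need for shell quoting.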
OEChem guesses the input structure format based on the filename
extension and assumes SMILES for structures read from stdin.
Use "--in FORMAT" to select an alternative, where FORMAT is one of:
  File Type      Valid FORMATs (use gz if compressed)
  ---------      ------------------------------------
  SMILES         smi, can, usm, smi.gz, can.gz, usm.gz
  SDF            sdf, mol, sdf.gz, mol.gz
  SKC            skc, skc.gz
  CDK            cdk, cdk.gz
  MOL2           mol2, mol2.gz
  PDB            pdb, pdb.gz
  MacroModel     mmod, mmod.gz
  OEBinary v2    oeb, oeb.gz
  InChI          inchi, inchi.gz
Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.
Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format. All formats
handle the following two reader arguments:
aromaticity - one of 'openeye', 'daylight', 'tripos', 'mdl', or 'mmff'
(this can also be set via the older '--aromaticity' command-line option)
flavor - a '|' or ',' separated list of flavor names, or a numeric value.
A leading '-' means to remove the given flavor. Examples include:
o Canon,Strict -- the bitwise merger of the format's Canon and Strict values
o DEFAULT|-Kekule -- the format's DEFAULT flavor but without the Kekule bits
(every flavor has a DEFAULT)
o 42 -- the specific OEChem flavor value 42
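The flavor syntax can be sketched as a bitwise merge. The flag values below are invented for illustration and are not the real OEChem values, parse_flavor is a hypothetical helper, and left-to-right processing of the terms is an assumption.

```python
import re

# Made-up bit values for illustration only; real OEChem flavors differ per format.
EXAMPLE_FLAVORS = {
    "DEFAULT": 0b0111,
    "Canon": 0b0001,
    "Strict": 0b0010,
    "Kekule": 0b0100,
}

def parse_flavor(text: str, flavors: dict = EXAMPLE_FLAVORS) -> int:
    """Turn a flavor string into an integer bit mask."""
    if text.isdigit():
        return int(text)  # a plain number is used as-is
    value = 0
    for term in re.split(r"[|,]", text):
        if term.startswith("-"):
            value &= ~flavors[term[1:]]  # a leading '-' removes those bits
        else:
            value |= flavors[term]       # otherwise merge the bits in
    return value
```

With these example values, "Canon,Strict" and "DEFAULT|-Kekule" both evaluate to 3, and "42" evaluates to 42.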
  Format       Reader arguments
  ------       ----------------
  smi, can,    flavor using 'Canon', 'Strict', and 'DEFAULT'
  & usm        delimiter -- one of 'to-eol', 'tab', 'whitespace', or 'space'
  sdf          the only flavor is 'DEFAULT'
  skc          the only flavor is 'DEFAULT'
  mol2         flavor using 'M2H'
  mol2h        flavor using 'M2H'
  mmod         flavor using 'FormalCrg'
  pdb          flavor using 'ALL', 'BondOrder', 'CHARGE', 'Connect', 'DATA',
               'END', 'ENDM', 'FORMALCHARGE', 'FormalCrg', 'ImplicitH',
               'RADIUS', 'Rings', 'SecStruct', and 'TER'
  xyz          flavor using 'BondOrder', 'Connect', 'FormalCrg', 'ImplicitH',
               and 'Rings'
  cdx          flavor using 'SuperAtom'
  oeb          the only flavor is 'DEFAULT'
See http://docs.eyesopen.com/toolkits/cpp/oechemtk/molreadwrite.html#flavored-input-and-output
for a description of available flavors for each format.
rdkit2fps command-line options
The following comes from rdkit2fps --help:
usage: rdkit2fps [-h] [--fpSize FPSIZE] [--RDK] [--minPath INT]
[--maxPath INT] [--nBitsPerHash INT] [--useHs 0|1] [--morgan]
[--radius INT] [--useFeatures 0|1] [--useChirality 0|1]
[--useBondTypes 0|1] [--torsions] [--targetSize INT]
[--pairs] [--minLength INT] [--maxLength INT] [--maccs166]
[--avalon] [--isQuery 0_or_1] [--bitFlags INT] [--pattern]
[--substruct] [--rdmaccs] [--rdmaccs/1] [--id-tag NAME]
[--in FORMAT] [-o FILENAME] [--out FORMAT]
[--errors {strict,report,ignore}] [-R NAME=VALUE]
[--delimiter {tab,whitespace,to-eol,space}] [--has-header]
[--version]
[filenames [filenames ...]]
Generate FPS or FPB fingerprints from a structure file using RDKit
positional arguments:
filenames input structure files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--fpSize FPSIZE number of bits in the fingerprint. Default of 2048 for
RDK, Morgan, topological torsion, atom pair, and
pattern fingerprints, and 512 for Avalon fingerprints
--id-tag NAME tag name containing the record id (SD files only)
--in FORMAT input structure format (default guesses from filename)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=ignore)
-R NAME=VALUE specify a reader argument
--delimiter {tab,whitespace,to-eol,space}
delimiter style for SMILES and InChI files. Alias for
'-R delimiter=VALUE'.
--has-header Skip the first line of a SMILES or InChI file. Alias
for '-R has_header=1'
--version show program's version number and exit
RDKit topological fingerprints:
--RDK generate RDK fingerprints (default)
--minPath INT minimum number of bonds to include in the subgraph
(default=1)
--maxPath INT maximum number of bonds to include in the subgraph
(default=7)
--nBitsPerHash INT number of bits to set per path (default=2)
--useHs 0|1 include information about the number of hydrogens on
each atom (default=1)
RDKit Morgan fingerprints:
--morgan generate Morgan fingerprints
--radius INT radius for the Morgan algorithm (default=2)
--useFeatures 0|1 use chemical-feature invariants (default=0)
--useChirality 0|1 include chirality information (default=0)
--useBondTypes 0|1 include bond type information (default=1)
RDKit Topological Torsion fingerprints:
--torsions generate Topological Torsion fingerprints
--targetSize INT number of bonds per torsion (default=4)
RDKit Atom Pair fingerprints:
--pairs generate Atom Pair fingerprints
--minLength INT minimum bond count for a pair (default=1)
--maxLength INT maximum bond count for a pair (default=30)
166 bit MACCS substructure keys:
--maccs166 generate MACCS fingerprints
Avalon fingerprints:
--avalon generate Avalon fingerprints
--isQuery 0_or_1 is the fingerprint for a query structure? (1 if yes, 0
if no) (default=0)
--bitFlags INT bit flags, SSSBits are 32767 and similarity bits are
15761407 (default=15761407)
RDKit Pattern fingerprints:
--pattern generate (substructure) pattern fingerprints
ChemFP's version of the 881 bit PubChem substructure keys:
--substruct generate ChemFP substructure fingerprints
ChemFP version of the 166 bit RDKit/MACCS keys:
--rdmaccs, --rdmaccs/2
generate 166 bit RDKit/MACCS fingerprints (version 2)
--rdmaccs/1 use the version 1 definition for --rdmaccs
This program guesses the input structure format and the compression
based on the filename extension. If the guess fails then it assumes
the input is an uncompressed SMILES file.
If the data comes from stdin, or the guess based on extension name is
wrong, then use "--in" to change the default input format. The
supported format extensions are:
  File Type      Valid FORMATs (use gz if compressed)
  ---------      ------------------------------------
  SMILES         smi, can, usm, smi.gz, can.gz, usm.gz
  SDF            sdf, sdf.gz
  InChI          inchi, inchi.gz
Note: chemfp-2.0 removed the "ism" input format type. Use "smi" instead.
Use the '-R' reader arguments option to pass in format-specific structure
reader arguments. The details depend on the specific format.
* All of the input formats implement the 'sanitize' option. Use
"-R sanitize=false" to disable the default sanitization.
* The SMILES formats use the 'delimiter' option to specify the
delimiter type. The default is 'to-eol'. The other values are
"tab", "whitespace", and "space". Use "-R delimiter=whitespace"
to match RDKit's native delimiter style.
* The SDF format supports two additional reader arguments:
    * 'strictParsing'; use "-R strictParsing=false" to disable strict parsing
    * 'removeHs'; use "-R removeHs=false" to keep all of the hydrogens
* The InChI format supports four additional reader arguments:
    * 'delimiter' works the same as it does for the SMILES formats
    * 'removeHs' works the same as it does for the SDF format
    * 'treatWarningAsError'; use "-R treatWarningAsError=true" to convert all warnings into errors
    * 'logLevel' specifies the RDKit/InChI library log level, as an integer
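The help text names the four delimiter styles but does not define them. The Python sketch below shows one plausible reading, and every rule in it is an assumption: 'to-eol' takes the first whitespace-separated term as the SMILES and the rest of the line as the id, while the other styles split on tabs, on any whitespace, or on single spaces and use the second field as the id. split_smiles_line is a hypothetical helper, not a chemfp function.

```python
def split_smiles_line(line: str, delimiter: str = "to-eol"):
    """Split one SMILES-file line into (smiles, id) under the assumed rules."""
    line = line.rstrip("\n")
    if delimiter == "to-eol":
        parts = line.split(None, 1)   # id is everything after the first whitespace
    elif delimiter == "whitespace":
        parts = line.split()[:2]      # id is the second whitespace-separated field
    elif delimiter == "tab":
        parts = line.split("\t")[:2]  # id is the second tab-separated field
    elif delimiter == "space":
        parts = line.split(" ")[:2]   # id is the second space-separated field
    else:
        raise ValueError(f"unknown delimiter style: {delimiter!r}")
    smiles = parts[0] if parts else ""
    record_id = parts[1] if len(parts) > 1 else None
    return smiles, record_id
```

Under these assumptions the line "CCO ethyl alcohol" gets the id "ethyl alcohol" with 'to-eol' but only "ethyl" with 'whitespace', which is the practical difference between the default and the RDKit-style setting.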
sdf2fps command-line options
The following comes from sdf2fps --help:
usage: sdf2fps [-h] [--id-tag TAG] [--fp-tag TAG] [--in FORMAT]
[--num-bits INT] [--errors {strict,report,ignore}]
[-o FILENAME] [--out FORMAT] [--software TEXT] [--type TEXT]
[--version] [--binary] [--binary-msb] [--hex] [--hex-lsb]
[--hex-msb] [--base64] [--cactvs] [--daylight]
[--decoder DECODER] [--pubchem]
[filenames [filenames ...]]
Extract a fingerprint tag from an SD file and generate FPS or FPB fingerprints
positional arguments:
filenames input SD files (default is stdin)
optional arguments:
-h, --help show this help message and exit
--id-tag TAG get the record id from TAG instead of the first line
of the record
--fp-tag TAG get the fingerprint from tag TAG (required)
--in FORMAT Specify if the input SD file is uncompressed or gzip
compressed
--num-bits INT use the first INT bits of the input. Use only when the
last 1-7 bits of the last byte are not part of the
fingerprint. Unexpected errors will occur if these
bits are not all zero.
--errors {strict,report,ignore}
how should structure parse errors be handled?
(default=strict)
-o FILENAME, --output FILENAME
save the fingerprints to FILENAME (default=stdout)
--out FORMAT output structure format (default guesses from output
filename, or is 'fps')
--software TEXT use TEXT as the software description
--type TEXT use TEXT as the fingerprint type description
--version show program's version number and exit
Fingerprint decoding options:
--binary Encoded with the characters '0' and '1'. Bit #0 comes
first. Example: 00100000 encodes the value 4
--binary-msb Encoded with the characters '0' and '1'. Bit #0 comes
last. Example: 00000100 encodes the value 4
--hex Hex encoded. Bit #0 is the first bit (1<<0) of the
first byte. Example: 01f2 encodes the value \x01\xf2 =
498
--hex-lsb Hex encoded. Bit #0 is the eighth bit (1<<7) of the
first byte. Example: 804f encodes the value \x01\xf2 =
498
--hex-msb Hex encoded. Bit #0 is the first bit (1<<0) of the
last byte. Example: f201 encodes the value \x01\xf2 =
498
--base64 Base-64 encoded. Bit #0 is first bit (1<<0) of first
byte. Example: AfI= encodes value \x01\xf2 = 498
--cactvs CACTVS encoding, based on base64 and includes a
version and bit length
--daylight Daylight encoding, which is a base64 variant
--decoder DECODER import and use the DECODER function to decode the
fingerprint
shortcuts:
--pubchem decode CACTVS substructure keys used in PubChem. Same
as --software=CACTVS/unknown --type 'CACTVS-E_SCREEN/1.0 extended=2'
--fp-tag=PUBCHEM_CACTVS_SUBSKEYS --cactvs
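The simpler decoding options can be reproduced with a few lines of Python, which may help when deciding which option matches the data. The functions below are an illustrative sketch written from the descriptions and examples in the help text, not chemfp's own decoders, and the function names are hypothetical.

```python
import base64

def decode_binary(text: str) -> bytes:
    """--binary: '0'/'1' characters, bit #0 comes first."""
    out = bytearray((len(text) + 7) // 8)
    for i, char in enumerate(text):
        if char == "1":
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

def decode_binary_msb(text: str) -> bytes:
    """--binary-msb: '0'/'1' characters, bit #0 comes last."""
    return decode_binary(text[::-1])

def decode_hex(text: str) -> bytes:
    """--hex: bit #0 is the first bit (1<<0) of the first byte."""
    return bytes.fromhex(text)

def decode_hex_lsb(text: str) -> bytes:
    """--hex-lsb: same byte order as --hex, but bits reversed within each byte."""
    return bytes(int(f"{byte:08b}"[::-1], 2) for byte in bytes.fromhex(text))

def decode_hex_msb(text: str) -> bytes:
    """--hex-msb: bit #0 is the first bit of the last byte, so reverse the bytes."""
    return bytes.fromhex(text)[::-1]

def decode_base64(text: str) -> bytes:
    """--base64: standard base64, same bit layout as --hex."""
    return base64.b64decode(text)
```

Each function reproduces the example given for its option: the two binary examples decode to the single byte with value 4, and the hex, hex-lsb, hex-msb, and base64 examples all decode to the bytes \x01\xf2.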
simsearch command-line options
The following comes from simsearch --help:
usage: simsearch [-h] [-k K_NEAREST] [-t THRESHOLD] [--alpha ALPHA]
[--beta BETA] [--queries QUERIES] [--NxN] [--query QUERY]
[--hex-query HEX_QUERY] [--query-id QUERY_ID]
[--query-format FORMAT] [--target-format FORMAT]
[-o FILENAME] [-c] [-b BATCH_SIZE] [--scan] [--memory]
[--times] [--version]
target_filename
Search an FPS or FPB file for similar fingerprints
positional arguments:
target_filename target filename
optional arguments:
-h, --help show this help message and exit
-k K_NEAREST, --k-nearest K_NEAREST
select the k nearest neighbors (use 'all' for all
neighbors)
-t THRESHOLD, --threshold THRESHOLD
minimum similarity score threshold
--alpha ALPHA Tversky alpha parameter (default: 1.0)
--beta BETA Tversky beta parameter (default: the value of --alpha)
--queries QUERIES, -q QUERIES
filename containing the query fingerprints
--NxN use the targets as the queries, and exclude the self-
similarity term
--query QUERY query as a structure record (default format: 'smi')
--hex-query HEX_QUERY
query in hex
--query-id QUERY_ID id for the query or hex-query (default: 'Query1')
--query-format FORMAT, --in FORMAT
input query format (default uses the file extension,
else 'fps')
--target-format FORMAT
input target format (default uses the file extension,
else 'fps')
-o FILENAME, --output FILENAME
output filename (default is stdout)
-c, --count report counts
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size
--scan scan the file to find matches (low memory overhead)
--memory build and search an in-memory data structure (faster
for multiple queries)
--times report load and execution times to stderr
--version show program's version number and exit
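To show how --alpha, --beta, -k, and -t interact, here is a minimal Python sketch of a Tversky search over byte-string fingerprints. It uses the standard Tversky index, which reduces to Tanimoto when alpha and beta are both 1.0. The function names are hypothetical, returning 0.0 for an empty denominator is a choice made for this sketch, and none of this reflects how simsearch is implemented or optimized.

```python
def tversky(query: bytes, target: bytes, alpha: float = 1.0, beta=None) -> float:
    """Tversky index: |Q&T| / (|Q&T| + alpha*|Q-T| + beta*|T-Q|)."""
    if beta is None:
        beta = alpha  # mirrors the help text: beta defaults to the value of alpha
    q = int.from_bytes(query, "big")
    t = int.from_bytes(target, "big")
    both = bin(q & t).count("1")
    only_query = bin(q & ~t).count("1")
    only_target = bin(t & ~q).count("1")
    denominator = both + alpha * only_query + beta * only_target
    return both / denominator if denominator else 0.0

def search(query: bytes, targets, k, threshold: float, alpha: float = 1.0, beta=None):
    """Return up to k (score, id) hits with score >= threshold, best first.

    Pass k=None to keep every hit at or above the threshold.
    """
    hits = [(tversky(query, fp, alpha, beta), target_id) for target_id, fp in targets]
    hits = [hit for hit in hits if hit[0] >= threshold]
    hits.sort(key=lambda hit: -hit[0])  # stable sort, so ties keep target order
    return hits if k is None else hits[:k]

# Hypothetical targets and query.
targets = [("t1", b"\x0f"), ("t2", b"\x03"), ("t3", b"\xf0")]
top_hits = search(b"\x0f", targets, k=2, threshold=0.4)
```

Setting alpha=0.0 and beta=1.0 in this sketch ignores bits that are only in the query, so a target whose bits are all present in the query scores 1.0.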