-
Notifications
You must be signed in to change notification settings - Fork 12
/
mode_query.txt
395 lines (291 loc) · 16.7 KB
/
mode_query.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
SYNOPSIS
metacache query <database>
metacache query <database> <sequence file/directory>... [OPTION]...
metacache query <database> [OPTION]... <sequence file/directory>...
DESCRIPTION
Map sequences (short reads, long reads, genome fragments, ...)
to their most likely taxon of origin.
BASIC PARAMETERS
<database> database file name;
A MetaCache database contains taxonomic information and
min-hash signatures of reference sequences (complete
genomes, scaffolds, contigs, ...).
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences (short
reads, long reads, contigs, complete genomes, ...) that
shall be classified.
* If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
* If no input filenames or directories are given,
MetaCache will run in interactive query mode. This can be
used to load the database into memory only once and then
query it multiple times with different query options.
MAPPING RESULTS OUTPUT
-out <file> Redirect output to file <file>.
If not specified, output will be written to stdout. If
more than one input file was given all output will be
concatenated into one file.
-split-out <file> Generate output and statistics for each input file
separately. For each input file <in> an output file with
name <file>_<in> will be written.
PAIRED-END READ HANDLING
-pairfiles Interleave paired-end reads from two consecutive files, so
that the nth read from file m and the nth read from file
m+1 will be treated as a pair. If more than two files are
provided, their names will be sorted before processing.
Thus, the order defined by the filenames determines the
pairing not the order in which they were given in the
command line.
-pairseq Two consecutive sequences (1+2, 3+4, ...) from each file
will be treated as paired-end reads.
-insertsize <#> Maximum insert size to consider.
default: sum of lengths of the individual reads
READ FILTERS
-min-readlen <#> Minimum length of reads (in bp). Reads shorter than the
given number will not be classified.
default: no limit
-max-readlen <#> Maximum length of reads (in bp). Reads longer than the
given number will not be classified.
default: no limit
CLASSIFICATION
-lowest <rank> Do not classify on ranks below <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: sequence
-highest <rank> Do not classify on ranks above <rank>
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
default: domain
-hitmin <t> Sets classification threshhold to <t>.
A read will not be classified if less than t features from
the database match. Higher values will increase precision
at the expense of sensitivity.
default: 0
-hitdiff <d> Sets candidate LCA threshhold to <d> percent.
Influences if only candidate with the most hits will be
used as classification result or if taxa of other
candidates will be considered.
All candidate (taxa) will be included that have at least
d% as many hits above the hit-min threshold as the
candidate with the most hits.
default: 100
-maxcand <#> maximum number of reference taxon candidates to consider
for each query;
A large value can significantly decrease the querying
speed!.
default: 2
-cov-percentile <p>
Remove the p-th percentile of hit reference sequences with
the lowest coverage. Classification is done using only the
remaining reference sequences. This can help to reduce
false positives, especially whenyour input data has a high
sequencing coverage.
This feature decreases the querying speed!
default: off
GENERAL OUTPUT FORMATTING
-no-summary Dont't show result summary & mapping statistics at the end
of the mapping output
default: off
-no-query-params Don't show query settings at the beginning of the mapping
output
default: off
-no-err Suppress all error messages.
default: off
CLASSIFICATION RESULT FORMATTING
-no-map Don't report classification for each individual query
sequence; show summaries only (useful for quick tests).
default: off
-mapped-only Don't list unclassified reads/read pairs.
default: off
-taxids Print taxon ids in addition to taxon names.
default: off
-taxids-only Print taxon ids instead of taxon names.
default: off
-omit-ranks Do not print taxon rank names.
default: off
-separate-cols Prints *all* mapping information (rank, taxon name, taxon
ids) in separate columns (see option '-separator').
default: off
-separator <text> Sets string that separates output columns.
default: '\t|\t'
-comment <text> Sets string that precedes comment (non-mapping) lines.
default: '# '
-queryids Show a unique id for each query.
Note that in paired-end mode a query is a pair of two read
sequences. This option will always be activated if option
'-hits-per-ref' is given.
default: off
-lineage Report complete lineage for per-read classification
starting with the lowest rank found/allowed and ending
with the highest rank allowed. See also options '-lowest'
and '-highest'.
default: off
ANALYSIS: ABUNDANCES
-abundances <file>
Show absolute and relative abundance of each taxon.
If a valid filename is given, the list will be written to
this file.
default: off
-abundance-per <rank>
Show absolute and relative abundances for each taxon on
one specific rank.
Classifications on higher ranks will be estimated by
distributing them down according to the relative
abundances of classifications on or below the given rank.
(Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain)
If '-abundances <file>' was given, this list will be
printed to the same file.
default: off
ANALYSIS: RAW DATABASE HITS
-tophits For each query, print top feature hits in database.
default: off
-allhits For each query, print all feature hits in database.
default: off
-locations Show locations in candidate reference sequences.
Activates option '-tophits'.
default: off
-hits-per-ref <file>
Shows a list of all hits for each reference sequence.
If this condensed list is all you need, you should
deactive the per-read mapping output with '-no-map'.
If a valid filename is given after '-hits-per-ref', the
list will be written to a separate file.
Option '-queryids' will be activated and the lowest
classification rank will be set to 'sequence'.
default: off
ANALYSIS: ALIGNMENTS
-align Show semi-global alignment to best candidate reference
sequence.
Original files of reference sequences must be available.
This feature decreases the querying speed!
default: off
ADVANCED: GROUND TRUTH BASED EVALUATION
-ground-truth Report correct query taxa if known.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also present in the
database.
This feature decreases the querying speed!
default: off
-precision Report precision & sensitivity by comparing query taxa
(ground truth) and mapped taxa.
Queries need to have either a 'taxid|<number>' entry in
their header or a sequence id that is also found in the
database.
This feature decreases the querying speed!
default: off
-taxon-coverage Report true/false positives and true/false negatives.This
option turns on '-precision', so ground truth data needs
to be available.
This feature decreases the querying speed!
default: off
ADVANCED: CUSTOM QUERY SKETCHING (SUBSAMPLING)
-kmerlen <k> number of nucleotides/characters in a k-mer
default: determined by database
-sketchlen <s> number of features (k-mer hashes) per sampling window
default: determined by database
-winlen <w> number of letters in each sampling window
default: determined by database
-winstride <l> distance between window starting positions
default: determined by database
ADVANCED: DATABASE MODIFICATION
-max-locations-per-feature <#>
maximum number of reference sequence locations to be
stored per feature;
If the value is too high it will significantly impact
querying speed. Note that an upper hard limit is always
imposed by the data type used for the hash table bucket
size (set with compilation macro
'-DMC_LOCATION_LIST_SIZE_TYPE').
default: 254
-remove-overpopulated-features
Removes all features that have reached the maximum allowed
amount of locations per feature. This can improve querying
speed and can be used to remove non-discriminative
features.
default: off
Not available in the GPU version.
-remove-ambig-features <rank>
Removes all features that have more distinct reference
sequence on the given taxonomic rank than set by
'-max-ambig-per-feature'. This can decrease the database
size significantly at the expense of sensitivity. Note
that the lower the given taxonomic rank is, the more
pronounced the effect will be.
Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain
default: off
Not available in the GPU version.
-max-ambig-per-feature <#>
Maximum number of allowed different reference sequence
taxa per feature if option '-remove-ambig-features' is
used.
Not available in the GPU version.
-max-load-fac <factor>
maximum hash table load factor;
This can be used to trade off larger memory consumption
for speed and vice versa. A lower load factor will improve
speed, a larger one will improve memory efficiency.
default: 0.800000
Not available in the GPU version.
ADVANCED: PERFORMANCE TUNING / TESTING
-threads <#> Sets the maximum number of parallel threads to use.default
(on this machine): 32
-batch-size <#> Process <#> many queries (reads or read pairs) per thread
at once.
default (on this machine): 4096
-query-limit <#> Classify at max. <#> queries (reads or read pairs) per
input file.
default: no limit
EXAMPLES
Query all sequences in 'myreads.fna' against pre-built database 'refseq':
metacache query refseq myreads.fna -out results.txt
Query all sequences in multiple files against database 'refseq':
metacache query refseq reads1.fna reads2.fna reads3.fna
Query all sequence files in folder 'test' againgst database 'refseq':
metacache query refseq test
Query multiple files and folder contents against database 'refseq':
metacache query refseq file1.fna folder1 file2.fna file3.fna folder2
Perform a precision test and show all ranks for each classification result:
metacache query refseq reads.fna -precision -allranks -out results.txt
Load database in interactive query mode, then query multiple read batches
metacache query refseq
reads1.fa reads2.fa -pairfiles -insertsize 400
reads3.fa -pairseq -insertsize 300
reads4.fa -lineage
OUTPUT FORMAT
MetaCache's default read mapping output format is:
read_header | rank:taxon_name
This will not be changed in the future to avoid breaking anyone's
pipelines. Command line options won't change in the near future for the
same reason. The following table shows some of the possible mapping
layouts with their associated command line arguments:
read mapping layout command line arguments
--------------------------------------- ---------------------------------
read_header | taxon_id -taxids-only -omit-ranks
read_header | taxon_name -omit-ranks
read_header | taxon_name(taxon_id) -taxids -omit-ranks
read_header | taxon_name | taxon_id -taxids -omit-ranks -separate-cols
read_header | rank:taxon_id -taxids-only
read_header | rank:taxon_name
read_header | rank:taxon_name(taxon_id) -taxids
read_header | rank | taxon_id -taxids-only -separate-cols
read_header | rank | taxon_name -separate-cols
read_header | rank | taxon_name | taxon_id -taxids -separate-cols
Note that the separator '\t|\t' can be changed to something else with
the command line option '-separator <text>'.
Note that the default lowest taxon rank is 'sequence'. Sequence-level taxon
ids have negative numbers in order to not interfere with NCBI taxon ids.
Each reference sequence is added as its own taxon below the
lowest known NCBI taxon for that sequence. If you do not want to classify
at sequence-level, you can set a higher rank as lowest classification rank
with the '-lowest' command line option: '-lowest species' or
'-lowest subspecies' or '-lowest genus', etc.