| 1 | +# SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation |
| 2 | + |
| 3 | + |
| 4 | +- **Documents:** [http://bioinf.shenwei.me/seqkit](http://bioinf.shenwei.me/seqkit) |
| 5 | +([**Usage**](http://bioinf.shenwei.me/seqkit/usage/), |
| 6 | +[**FAQ**](http://bioinf.shenwei.me/seqkit/faq/), |
| 7 | +[**Tutorial**](http://bioinf.shenwei.me/seqkit/tutorial/), |
| 8 | +and |
| 9 | +[**Benchmark**](http://bioinf.shenwei.me/seqkit/benchmark/)) |
| 10 | +- **Source code:** [https://github.com/shenwei356/seqkit](https://github.com/shenwei356/seqkit) |
| 12 | +([MIT License](https://github.com/shenwei356/seqkit/blob/master/LICENSE)) |
| 13 | +- **Latest version:** [Releases](https://github.com/shenwei356/seqkit/releases), |
| 14 | +[Download](http://bioinf.shenwei.me/seqkit/download/), |
| 16 | +[Bioconda](https://anaconda.org/bioconda/seqkit) |
| 17 | +- **[Please cite](#citation):** [doi:10.1371/journal.pone.0163962](https://doi.org/10.1371/journal.pone.0163962), |
| 18 | +[Google Scholar](https://scholar.google.com/citations?view_op=view_citation&hl=en&user=wHF3Lm8AAAAJ&citation_for_view=wHF3Lm8AAAAJ:zYLM7Y9cAGgC) |
| 19 | +- **Others**: [BioTreasury](https://biotreasury.rjmart.cn/#/tool?id=10081) |
| 20 | + |
| 21 | +## Features |
| 22 | + |
| 23 | +- **Easy to install** ([download](http://bioinf.shenwei.me/seqkit/download/)) |
| 24 | + - Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64) |
| 25 | + - Lightweight and out-of-the-box, no dependencies, no compilation, no configuration |
| 26 | + - `conda install -c bioconda seqkit` |
| 27 | +- **Easy to use** |
| 28 | + - Ultrafast (see [technical-details](http://bioinf.shenwei.me/seqkit/usage/#technical-details-and-guides-for-use) and [benchmark](http://bioinf.shenwei.me/seqkit/benchmark)) |
| 29 | + - Seamlessly parsing both FASTA and FASTQ formats |
| 30 | + - Supporting (`gzip`/`xz`/`zstd`/`bzip2` compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines |
| 31 | + - Reproducible results (configurable random seed in `sample` and `shuffle`) |
| 32 | + - Supporting custom sequence ID via regular expression |
| 33 | + - Supporting [Bash/Zsh autocompletion](http://bioinf.shenwei.me/seqkit/download/#shell-completion) |
| 34 | +- **Versatile commands** ([usages and examples](http://bioinf.shenwei.me/seqkit/usage/)) |
| 35 | + - Practical functions supported by [38 subcommands](#subcommands) |
| 36 | + |
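The piping and reproducibility features above can be sketched as follows. This is a minimal illustration, not an official recipe: the file names (`demo.fa`, `sample1.fa`, `sample2.fa`) and the seed value are hypothetical, and the block skips the SeqKit calls when `seqkit` is not on `PATH`.

```shell
# Build a tiny FASTA file so the example is self-contained (demo.fa is a hypothetical name).
printf '>seq1\nACGTACGTAC\n>seq2\nACGT\n>seq3\nGGGGCCCCAAAATTTT\n' > demo.fa

if command -v seqkit >/dev/null 2>&1; then
    # Pipe one subcommand into another: keep sequences of at least 8 bases,
    # then summarize what is left (stats reads STDIN when no file is given).
    seqkit seq -m 8 demo.fa | seqkit stats

    # Sample twice with the same seed (-s); the two outputs are expected to match.
    seqkit sample -p 0.5 -s 11 demo.fa > sample1.fa
    seqkit sample -p 0.5 -s 11 demo.fa > sample2.fa
    cmp sample1.fa sample2.fa && echo "samples are identical"
else
    echo "seqkit not found; skipping the SeqKit calls"
fi
```

Fixing the seed is what makes `sample` and `shuffle` repeatable across runs, which is useful when a subsample feeds a downstream analysis that must be reproduced later.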
| 37 | + |
| 38 | +## Installation |
| 39 | + |
| 40 | +Go to [Download Page](http://bioinf.shenwei.me/seqkit/download) for more download options and changelogs, or |
| 41 | +install via conda: |
| 42 | + |
| 43 | + conda install -c bioconda seqkit |
| 44 | + |
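After installing, a quick check that the binary is reachable might look like the sketch below. The `status` variable is only an illustration; `seqkit version` prints the installed version.

```shell
# Check whether seqkit is on PATH before relying on it in scripts.
if command -v seqkit >/dev/null 2>&1; then
    seqkit version          # print the installed version
    status="installed"
else
    status="missing"        # hypothetical marker used only in this sketch
    echo "seqkit is not on PATH; see the Download Page or install via conda"
fi
echo "seqkit status: $status"
```
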
| 45 | +## Subcommands |
| 46 | + |
| 47 | +|Category |Command |Function |Input |Strand-sensitivity|Multi-threads| |
| 48 | +|:----------------|:-------------------------------------------------------------------|:--------------------------------------------------------------------------------------------|:--------------|:-----------------|:------------| |
| 49 | +|Basic operation |[seq](https://bioinf.shenwei.me/seqkit/usage/#seq) |Transform sequences: extract ID/seq, filter by length/quality, remove gaps… |FASTA/Q | | | |
| 50 | +| |[stats](https://bioinf.shenwei.me/seqkit/usage/#stats) |Simple statistics: #seqs, min/max_len, N50, Q20%, Q30%… |FASTA/Q | |✓ | |
| 51 | +| |[subseq](https://bioinf.shenwei.me/seqkit/usage/#subseq) |Get subsequences by region/gtf/bed, including flanking sequences |FASTA/Q |+ or/and - | | |
| 52 | +| |[sliding](https://bioinf.shenwei.me/seqkit/usage/#sliding) |Extract subsequences in sliding windows |FASTA/Q |+ only | | |
| 53 | +| |[faidx](https://bioinf.shenwei.me/seqkit/usage/#faidx) |Create the FASTA index file and extract subsequences (with more features than samtools faidx)|FASTA |+ or/and - | | |
| 54 | +| |[translate](https://bioinf.shenwei.me/seqkit/usage/#translate) |Translate DNA/RNA to protein sequence |FASTA/Q |+ or/and - | | |
| 55 | +| |[watch](https://bioinf.shenwei.me/seqkit/usage/#watch) |Monitoring and online histograms of sequence features |FASTA/Q | | | |
| 56 | +| |[scat](https://bioinf.shenwei.me/seqkit/usage/#scat) |Real time concatenation and streaming of fastx files |FASTA/Q | |✓ | |
| 57 | +|Format conversion|[fq2fa](https://bioinf.shenwei.me/seqkit/usage/#fq2fa) |Convert FASTQ to FASTA format |FASTQ | | | |
| 58 | +| |[fx2tab](https://bioinf.shenwei.me/seqkit/usage/#fx2tab) |Convert FASTA/Q to tabular format |FASTA/Q | | | |
| 59 | +| |[fa2fq](https://bioinf.shenwei.me/seqkit/usage/#fa2fq) |Retrieve corresponding FASTQ records by a FASTA file |FASTA/Q |+ only | | |
| 60 | +| |[tab2fx](https://bioinf.shenwei.me/seqkit/usage/#tab2fx) |Convert tabular format to FASTA/Q format |TSV | | | |
| 61 | +| |[convert](https://bioinf.shenwei.me/seqkit/usage/#convert) |Convert FASTQ quality encoding between Sanger, Solexa and Illumina |FASTA/Q | | | |
| 62 | +|Searching |[grep](https://bioinf.shenwei.me/seqkit/usage/#grep) |Search sequences by ID/name/sequence/sequence motifs, mismatch allowed |FASTA/Q |+ and - |partly, -m | |
| 63 | +| |[locate](https://bioinf.shenwei.me/seqkit/usage/#locate) |Locate subsequences/motifs, mismatch allowed |FASTA/Q |+ and - |partly, -m | |
| 64 | +| |[amplicon](https://bioinf.shenwei.me/seqkit/usage/#amplicon) |Extract amplicon (or specific region around it), mismatch allowed |FASTA/Q |+ and - |partly, -m | |
| 65 | +| |[fish](https://bioinf.shenwei.me/seqkit/usage/#fish) |Look for short sequences in larger sequences |FASTA/Q |+ and - | | |
| 66 | +|Set operation |[sample](https://bioinf.shenwei.me/seqkit/usage/#sample) |Sample sequences by number or proportion |FASTA/Q | | | |
| 67 | +| |[rmdup](https://bioinf.shenwei.me/seqkit/usage/#rmdup) |Remove duplicated sequences by ID/name/sequence |FASTA/Q |+ and - | | |
| 68 | +| |[common](https://bioinf.shenwei.me/seqkit/usage/#common) |Find common sequences of multiple files by id/name/sequence |FASTA/Q |+ and - | | |
| 69 | +| |[duplicate](https://bioinf.shenwei.me/seqkit/usage/#duplicate) |Duplicate sequences N times |FASTA/Q | | | |
| 70 | +| |[split](https://bioinf.shenwei.me/seqkit/usage/#split) |Split sequences into files by id/seq region/size/parts (mainly for FASTA) |FASTA preferred| | | |
| 71 | +| |[split2](https://bioinf.shenwei.me/seqkit/usage/#split2) |Split sequences into files by size/parts (FASTA, PE/SE FASTQ) |FASTA/Q | | | |
| 72 | +| |[head](https://bioinf.shenwei.me/seqkit/usage/#head) |Print first N FASTA/Q records |FASTA/Q | | | |
| 73 | +| |[head-genome](https://bioinf.shenwei.me/seqkit/usage/#head-genome) |Print sequences of the first genome with common prefixes in name |FASTA/Q | | | |
| 74 | +| |[range](https://bioinf.shenwei.me/seqkit/usage/#range) |Print FASTA/Q records in a range (start:end) |FASTA/Q | | | |
| 75 | +| |[pair](https://bioinf.shenwei.me/seqkit/usage/#pair) |Patch up paired-end reads from two fastq files |FASTA/Q | | | |
| 76 | +|Edit |[replace](https://bioinf.shenwei.me/seqkit/usage/#replace) |Replace name/sequence by regular expression |FASTA/Q |+ only | | |
| 77 | +| |[rename](https://bioinf.shenwei.me/seqkit/usage/#rename) |Rename duplicated IDs |FASTA/Q | | | |
| 78 | +| |[concat](https://bioinf.shenwei.me/seqkit/usage/#concat) |Concatenate sequences with same ID from multiple files |FASTA/Q |+ only | | |
| 79 | +| |[restart](https://bioinf.shenwei.me/seqkit/usage/#restart) |Reset start position for circular genome |FASTA/Q |+ only | | |
| 80 | +| |[mutate](https://bioinf.shenwei.me/seqkit/usage/#mutate) |Edit sequence (point mutation, insertion, deletion) |FASTA/Q |+ only | | |
| 81 | +| |[sana](https://bioinf.shenwei.me/seqkit/usage/#sana) |Sanitize broken single line FASTQ files |FASTQ | | | |
| 82 | +|Ordering |[sort](https://bioinf.shenwei.me/seqkit/usage/#sort) |Sort sequences by id/name/sequence/length |FASTA preferred| | | |
| 83 | +| |[shuffle](https://bioinf.shenwei.me/seqkit/usage/#shuffle) |Shuffle sequences |FASTA preferred| | | |
| 84 | +|BAM processing |[bam](https://bioinf.shenwei.me/seqkit/usage/#bam) |Monitoring and online histograms of BAM record features |BAM | | | |
| 85 | +|Miscellaneous |[sum](https://bioinf.shenwei.me/seqkit/usage/#sum) |Compute message digest for all sequences in FASTA/Q files |FASTA/Q | |✓ | |
| 86 | +| |[merge-slides](https://bioinf.shenwei.me/seqkit/usage/#merge-slides)|Merge sliding windows generated from seqkit sliding |TSV | | | |
| 87 | + |
| 88 | +Notes: |
| 89 | + |
| 90 | +- Strand-sensitivity: |
| 91 | + - `+ only`: only processing on the positive/forward strand. |
| 92 | + - `+ and -`: searching on both strands. |
| 93 | + - `+ or/and -`: depends on users' flags/options/arguments. |
| 94 | +- Multi-threads: The default of 4 threads is fast enough for most commands; some commands can benefit from extra threads. |
| 95 | + |
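As a rough illustration of the thread note above, the sketch below raises the thread count with the global `-j` (`--threads`) flag. The file name `demo_threads.fa`, the value `8`, and the search pattern are hypothetical; whether extra threads actually help depends on the command and the data.

```shell
# Small input so the example runs anywhere (demo_threads.fa is a hypothetical name).
printf '>seq1\nACGTACGTAC\n>seq2\nGTACGTAC\n' > demo_threads.fa

threads=8   # hypothetical value; the notes above give 4 as the default

if command -v seqkit >/dev/null 2>&1; then
    # stats is marked as multi-threaded in the table.
    seqkit stats -j "$threads" demo_threads.fa

    # grep is marked "partly, -m": search by sequence (-s) for a pattern (-p),
    # allowing one mismatch (-m 1).
    seqkit grep -s -m 1 -p ACGTACGT -j "$threads" demo_threads.fa
else
    echo "seqkit not found; skipping the SeqKit calls"
fi
```
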
| 96 | +## Citation |
| 97 | + |
| 98 | +**W Shen**, S Le, Y Li\*, F Hu\*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. |
| 99 | +***PLOS ONE***. [doi:10.1371/journal.pone.0163962](https://doi.org/10.1371/journal.pone.0163962). |
| 100 | +<span class="__dimensions_badge_embed__" data-doi="10.1371/journal.pone.0163962" data-style="small_rectangle"></span> |
| 101 | + |
| 102 | +## Contributors |
| 103 | + |
| 104 | +- [Wei Shen](https://github.com/shenwei356) |
| 105 | +- [Botond Sipos](https://github.com/botond-sipos): `bam`, `scat`, `fish`, `sana`, `watch`. |
| 106 | +- [others](https://github.com/shenwei356/seqkit/graphs/contributors) |
| 107 | + |
| 108 | +## Acknowledgements |
| 109 | + |
| 110 | +We thank [Lei Zhang](https://github.com/jameslz) for testing SeqKit, |
| 111 | +and also thank [Jim Hester](https://github.com/jimhester/), |
| 112 | +author of [fasta_utilities](https://github.com/jimhester/fasta_utilities), |
| 113 | +for advice on early performance improvements for FASTA parsing |
| 114 | +and [Brian Bushnell](https://twitter.com/BBToolsBio), |
| 115 | +author of [BBMap](https://sourceforge.net/projects/bbmap/), |
| 116 | +for advice on naming SeqKit and adding accuracy evaluation in benchmarks. |
| 117 | +We also thank Nicholas C. Wu from the Scripps Research Institute, |
| 118 | +USA for commenting on the manuscript |
| 119 | +and [Guangchuang Yu](http://guangchuangyu.github.io/) |
| 120 | +from State Key Laboratory of Emerging Infectious Diseases, |
| 121 | +The University of Hong Kong, HK for advice on the manuscript. |
| 122 | + |
| 123 | +We thank [Li Peng](https://github.com/penglbio) for reporting many bugs. |
| 124 | + |
| 125 | +We thank [Klaus Post](https://github.com/klauspost) for his fantastic packages ( |
| 126 | +[compress](https://github.com/klauspost/compress) and [pgzip](https://github.com/klauspost/pgzip) |
| 127 | +) which accelerate gzip file reading and writing. |
| 128 | + |
| 129 | +## Contact |
| 130 | + |
| 131 | +[Create an issue](https://github.com/shenwei356/seqkit/issues) to report bugs, |
| 132 | +propose new functions or ask for help. |
| 133 | + |
| 134 | +## License |
| 135 | + |
| 136 | +[MIT License](https://github.com/shenwei356/seqkit/blob/master/LICENSE) |
| 137 | + |
| 138 | +## Starchart |
| 139 | + |
| 140 | +<img src="https://starchart.cc/shenwei356/seqkit.svg" alt="Stargazers over time" style="max-width: 100%"> |
| 141 | + |