Merge pull request #113 from blachlylab/develop
Release v0.13.3+htslib-1.13
jblachly authored Oct 1, 2021
2 parents 5babdfa + 26b1663 commit 0f284df
Showing 121 changed files with 24,030 additions and 4,094 deletions.
19 changes: 18 additions & 1 deletion .github/workflows/unittests.yml
@@ -22,7 +22,7 @@ jobs:

- name: Install htslib deps
run: |
sudo apt-get install -y build-essential autoconf zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev
sudo apt-get update && sudo apt-get install -y build-essential autoconf zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev
- name: Get htslib commit
id: get-htslib-commit
@@ -49,10 +49,27 @@ jobs:
sudo make install
sudo ldconfig
- name: Setup additional test files
run: |
cd htslib
cd test
cd tabix
bgzip -c gff_file.gff > gff_file.gff.gz
tabix gff_file.gff.gz
bgzip -c bed_file.bed > bed_file.bed.gz
tabix bed_file.bed.gz
bgzip -c vcf_file.vcf > vcf_file.vcf.gz
tabix vcf_file.vcf.gz
- name: Run tests
run: dub -q test -b=unittest-cov
env:
LIBRARY_PATH: /usr/local/lib

- name: Run safety tests
run: dub -q test -c=unittest-safety
env:
LIBRARY_PATH: /usr/local/lib

- name: Upload coverage
run: bash <(curl -s https://codecov.io/bash)
1 change: 0 additions & 1 deletion .gitignore
@@ -20,7 +20,6 @@ vcfwriter
bug_minimal
sam_iter
csam_iter
vcf
dhtslib-test-unittest

# DUB
106 changes: 45 additions & 61 deletions README.md
@@ -6,85 +6,81 @@ dhtslib

# Overview

D bindings and convenience wrappers for [htslib](https://github.com/samtools/htslib),
the most widely-used library for manipulation of high-throughput sequencing data.
`dhtslib` provides D bindings, high-level abstractions, and additional functionality for [htslib](https://github.com/samtools/htslib), the most widely-used library for manipulation of high-throughput sequencing data. We currently support Linux and OSX. Windows support is still in progress (see #38). More extensive documentation can be found at our [gitbook](https://blachlylab.gitbook.io/dhtslib/).

# Installation

Add `dhtslib` as a dependency to `dub.json`:

```
"dependencies": {
"dhtslib": "~>0.10.0",
"dhtslib": "~>0.12.3+htslib-1.10",
}
```
(version number 0.10.0 is example; see https://dub.pm/package-format-json)
(version number 0.12.3 is an example; `+htslib-1.10` represents the compatible htslib version; see https://dub.pm/package-format-json)
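
The version string above has two parts: the dhtslib release and, after the `+`, the htslib version it targets. As a minimal sketch (the variable names are illustrative and not part of dhtslib or dub), the two parts can be separated with plain shell parameter expansion:

```shell
# Sketch: split a dhtslib version string into its two parts.
# "ver" is the example value from the dub.json snippet above.
ver="0.12.3+htslib-1.10"

# Everything before the first '+' is the dhtslib release.
pkg_version="${ver%%+*}"

# Everything after '+htslib-' is the compatible htslib version.
htslib_version="${ver##*+htslib-}"

echo "dhtslib $pkg_version targets htslib $htslib_version"
```

In SemVer, the part after `+` is build metadata and does not affect version precedence, so here it serves as a label for the htslib version rather than as part of the ordering.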

# Requirements

## Dynamically linking to htslib (default)
A system installation of htslib v1.10 or 1.11 is required.

# Statically linking to htslib
`libhts.a` needs to be added to your project's source files.
Remember to link to all dynamic libraries configured when htslib was built. This may
include bz2, lzma, zlib, deflate, crypto, pthreads, curl.
Finally, if statically linking, the `-lhts` flag needs to be removed from compilation
by selecting the dub configuration `source-static` as the dub configuration type for dhtslib
within your own project's dub configuration file:

```
"subConfigurations": {
"dhtslib": "source-static"
},
```
## htslib
A system installation of htslib >= v1.10 is required. You can find detailed install instructions [here](htslib.md).
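
One way to check the minimum-version requirement before building is sketched below. The helper name `version_ge` is hypothetical, and the sketch assumes GNU `sort -V` is available; the commented usage further assumes `pkg-config` can locate the `htslib.pc` file installed by htslib.

```shell
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort); this is an assumption about the host.
version_ge() {
    # If B sorts first (or the two are equal), then A >= B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage, assuming pkg-config can find htslib.pc:
#   installed="$(pkg-config --modversion htslib)"
#   version_ge "$installed" 1.10 || echo "htslib $installed is too old" >&2
```

A version sort is used instead of a plain string comparison because `1.9` would otherwise compare as newer than `1.10`.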

# Usage
`dhtslib` usage information and examples can be found [here](usage.md).

## D API (OOP Wrappers)
## Dhtslib API (OOP Wrappers)

Object-oriented, idiomatic D wrappers are available for:

* SAM/BAM/CRAM files and streams (`dhtslib.sam`)
* VCF/BCF files (`dhtslib.vcf`)
* BGZF compressed files (`dhtslib.bgzf`)
* FASTA indexes (`dhtslib.faidx`)
* SAM/BAM/CRAM files and streams (`dhtslib.sam`)
* Tabix-indexed files (`dhtslib.tabix`)
* VCF/BCF files (`dhtslib.vcf`)

For example, this provides access to BGZF files by line as a consumable InputRange.
Or, for BAM files, the ability to query for a range (e.g. "chr1:1000000-2000000") and obtain an InputRange over the BAM records.
For most file type readers, indexing (`["coordinates"]`) queries return ranges of records. There are multiple options, including
`["chr1", 10_000_000 .. 20_000_000]` and `["chr1:10000000-20000000"]`.
See the documentation for more details.
Additional functionality is provided for:

* GFF(2|3) files and streams (`dhtslib.gff`)
* BED files and streams (`dhtslib.bed`)
* FASTQ files and streams (`dhtslib.fastq`)
* Compile-time coordinate system (`dhtslib.coordinates`) to avoid off-by-one errors

All htslib bindings can be found under the `htslib` namespace (in prior versions they were under `dhtslib.htslib`). These can be used directly as you would with `htslib`.


## htslib API

Direct bindings to htslib C API are available as submodules under `dhtslib.htslib`.
Naming remains the same as the original `.h` include files.
For example, `import dhtslib.htslib.faidx` for direct access to the C function calls.
The current compatible versions are 1.10+
Direct bindings to the htslib C API are available as submodules under the `htslib` namespace. Naming remains the same as the original `.h` include files. For example, `import htslib.faidx` for direct access to the C function calls. Where the OOP wrappers manage their own data along with the D garbage collector, these functions use traditional C memory management (or lack thereof). The current compatible htslib versions are 1.10+.

Currently implemented:

* bgzf
* cram (untested)
* faidx
* hfile
* hts\_endian
* hts\_expr (untested)
* hts\_log
* hts\_os (untested)
* hts
* hts\_log
* kbitset (untested)
* kfunc (untested)
* knetfile (untested)
* kroundup
* kstring
* regidx
* sam
* synced\_bcf\_reader (untested)
* tbx
* thread\_pool (untested)
* vcf
* vcf\_sweep (untested)
* vcfutils (untested)

Missing or work-in-progress:

* Some CRAM specific functions, although much CRAM functionality works with `sam_` functions
* hfile
* kbitset, kfunc, khash, klist, knetfile, kseq, ksort (mostly used internally anyway)
* synced\_bcf\_reader
* vcf\_sweep
* vcfutils
* khash (see [dklib](https://github.com/blachlylab/dklib)), klist, kseq, ksort (mostly used internally anyway)

[dstep](https://github.com/jacob-carlborg/dstep) has matured and is an incredibly powerful tool for machine-assisted C-to-D translation. We've used dstep for the majority of bindings since version v0.11.0. After dstep translation, we port inline functions by hand as they are not translated, tweak some macros into templates (done although dstep already does an amazing job on simple `#define` macros translating to D templates!), and update the documentation comments to ddoc format.

# FAQ

@@ -98,40 +94,28 @@ Yes
**A**:
bioD, as a more general bioinformatics framework, is more comparable to bio-python, bio-ruby, bio-rust, etc.
bioD does have some excellent hts file format (BGZF and SAM) handling, and at one time sambamba, which relied on it, was faster than samtools.
However, the development resources poured into `htslib` overall are tremendous, and we with to leverage that rather than writing VCF, tabix, etc. code from scratch.
However, the development resources poured into `htslib` overall are tremendous, and we wish to leverage that rather than writing VCF, tabix, etc. code from scratch.

**Q**: How does this compare to bio-Rust's htslib bindings?

**A**: We love Rust, but dhtslib has way more complete bindings and more and better high level constructs :smile:

**Q**: Why were htslib bindings ported by hand instead of using a C header/bindings translator as in hts-nim or rust-htslib?

**A**:
Whereas dstep and dpp are incredibly convenient for binding creation, we created these by hand from htslib `.h` files for several reasons.
First, this gave the authors of dhtslib a better familiarity with the htslib API including letting us get to know several lesser-known and internal functions.
Second, some elements (particularly `#define` macros) are difficult or impossible in some cases for machines to translate, or translate into efficient code; here we were sometimes able to replace these macros with smarter replacements than a simple macro-expansion-direct-translation. (**see 2020 update below -- dstep translates simple #defines into templates**)
Likewise, we were able to turn certain `#defines` and pseudo-generic functions into D templates, and to `pragma(inline, true)` them.
Finally, instead of dumping all the bindings into an interface file, we left the structure of the file intact to make it easier for the D developer to read the source file as the htslib authors intended the C headers to be read. In addition, this leaves docstring/documentation comments intact, whereas in other projects the direct API has no comments and the developer must refer to the C headers.

**(2020 UPDATE)** [dstep](https://github.com/jacob-carlborg/dstep) has matured and is an incredibly powerful tool for machine-assisted C-to-D translation. We've used dstep for the majority of bindings in the new `htslib-110` branch. After dstep translation, we still need to port inline functions by hand (done), tweak some macros into templates (done although dstep already does an amazing job on simple `#define` macros translating to D templates!), backport some fixes for Windows platforms and update the documentation comments to ddoc format.

**A**: We love Rust, but dhtslib has way more complete bindings and more and better high level constructs :smile:. We have also implemented a novel compile-time type-safe coordinate system to mostly avoid off-by-one errors.

**Q**: Why am I getting a segfault?

**A**:
It's easy to get a segfault by using the direct C API incorrectly. Or possibly correctly. We have tried to eliminate most of this (use after free, etc.) in the OOP wrappers. If you are getting a segfault you cannot understand when using purely the high-level D API, please post an issue.
It's easy to get a segfault by using the direct C API incorrectly. Or possibly correctly. We have tried to eliminate most of this (use after free, etc.) in the OOP wrappers via reference counting. If you are getting a segfault you cannot understand when using purely the high-level D API, please post an issue.


# Bugs and Warnings

Zero-based versus one-based coordinates. Zero-based coordinates are used internally and also by the API for BCF/VCF and SAM/BAM types.
The `faidx` C API expects one-based coordinates; we have built this as a template for the user to specify the coordinate system.
See documentation for more details.
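
To make the off-by-one risk concrete, here is a small sketch of converting a zero-based, half-open interval into the one-based, closed region string used by htslib-style queries such as `chr1:1000000-2000000`. The function name `to_region` is hypothetical and is not part of dhtslib.

```shell
# to_region CONTIG START END
# START and END are zero-based, half-open: [START, END).
# Prints a one-based, closed region string: CONTIG:START+1-END.
to_region() {
    printf '%s:%d-%d\n' "$1" "$(( $2 + 1 ))" "$3"
}

to_region chr1 999999 2000000   # prints chr1:1000000-2000000
```

Only the start shifts by one. The end stays the same because the last base included by a half-open end is `END - 1` in zero-based terms, which is `END` in one-based terms.
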
Do not call `hts_log_*` with `ctx` as anything other than a string literal from a destructor, as it is potentially allocating via `toStringz`.

Do not call `hts_log_*` from a destructor, as it is potentially allocating via `toStringz`.

# Programs made with dhtslib
1. [fade](https://github.com/blachlylab/fade): Fragmentase Artifact Detection and Elimination
2. [recontig](https://github.com/blachlylab/recontig): a program to convert different bioinformatics data types from one reference naming convention to another, e.g. UCSC to Ensembl (chr1 to 1)

# See Also
# Related projects

1. [gff3d](https://github.com/blachlylab/gff3d) GFF3 record reader/writer
2. [dklib](https://github.com/blachlylab/dklib) Templatized port of attractivechaos' klib, used extensively in htslib
1. [gff3d](https://github.com/blachlylab/gff3d): GFF3 record reader/writer
2. [dklib](https://github.com/blachlylab/dklib): Templatized port of attractivechaos' klib, used extensively in htslib
17 changes: 17 additions & 0 deletions UPGRADING_HTSLIB.md
@@ -0,0 +1,17 @@
Instructions for upgrading dhtslib's API based on changes in htslib
===================================================================

```
cd htslib
git pull
# insert new version here
git checkout 1.13
git submodule update
# insert old version here
git --no-pager diff 1.12 -- '*.h' > htslib.1.13.diff
```
Look over the output diff and pull over any changes in header files under the htslib folder that we have a corresponding D file for. Pull over any and all changes, including documentation and license changes.

For new header files, you will need to use dstep to generate bindings. New bindings must be manually inspected and any functions designated as `static inline` will have to be ported by hand from the C header.
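
To separate new headers (which need fresh dstep bindings) from modified ones (whose changes are ported into the existing D file), a sketch like the following may help. The function names are hypothetical; `git diff --name-status` and the `'*.h'` pathspec are standard git. It assumes it is run inside the htslib checkout and that header paths contain no whitespace.

```shell
# list_header_changes OLD NEW
# Prints "<status><TAB><path>" for every header that differs between two tags.
# Status A = added in NEW, M = modified, D = deleted.
list_header_changes() {
    git --no-pager diff --name-status "$1" "$2" -- '*.h'
}

# new_headers OLD NEW
# Prints only the headers added in NEW, i.e. candidates for new bindings.
new_headers() {
    list_header_changes "$1" "$2" | awk '$1 == "A" { print $2 }'
}

# Example, using the tags from the instructions above:
#   new_headers 1.12 1.13
```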
14 changes: 14 additions & 0 deletions coordinates/dub.json
@@ -0,0 +1,14 @@
{
"name": "coordinates",
"targetType": "library",
"configurations": [
{
"name": "source",
"targetType": "library"
},
{
"name": "unittest",
"debugVersions": ["dhtslib_unittest"]
}
]
}