Python wrapper for SeQuiLa: Distributed analytics for genomics based on Apache Spark!

biodatageeks/pysequila

pysequila

pysequila is a Python entry point to SeQuiLa, an ANSI-SQL compliant solution for efficient processing of sequencing reads and querying of genomic intervals, built on top of Apache Spark. Range joins, depth of coverage and pileup computations are the bread and butter of NGS analysis, but the high volume of data makes them run very slowly or even fail to complete.
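
To make "depth of coverage" concrete, here is a minimal pure-Python sketch of what the computation returns for a handful of reads on a single contig. It is only an illustration of the concept, not SeQuiLa's distributed implementation; the function name `depth_of_coverage` and the 1-based, inclusive `(start, end)` read representation are assumptions made for this example.

```python
from collections import defaultdict


def depth_of_coverage(reads):
    """Return (start, end, depth) blocks for 1-based, inclusive read intervals.

    Hypothetical helper for illustration only; it is not part of pysequila.
    """
    # Record +1 where a read starts and -1 just past the position where it ends.
    deltas = defaultdict(int)
    for start, end in reads:
        deltas[start] += 1
        deltas[end + 1] -= 1

    blocks = []
    depth = 0
    prev_pos = None
    # Sweep the positions in order, emitting a block whenever the depth may change.
    for pos in sorted(deltas):
        if prev_pos is not None and depth > 0:
            blocks.append((prev_pos, pos - 1, depth))
        depth += deltas[pos]
        prev_pos = pos
    return blocks


reads = [(1, 5), (3, 8), (10, 12)]
print(depth_of_coverage(reads))
# [(1, 2, 1), (3, 5, 2), (6, 8, 1), (10, 12, 1)]
```

A single-machine sweep like this is fine for a few thousand reads; with whole-genome BAM files the same computation has to be distributed, which is the case SeQuiLa targets.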

Requirements

  • Python 3.7, 3.8, 3.9

Features

  • custom data sources for bioinformatics file formats (BAM, CRAM, VCF)
  • depth of coverage calculations
  • pileup calculations
  • reads filtering
  • efficient range joins
  • other utility functions
  • support for both SQL and the DataFrame/Dataset API
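
As a rough illustration of what a range join computes, the sketch below pairs up intervals from two small lists whenever they lie on the same contig and overlap. It uses a naive nested loop, which is exactly the quadratic approach that becomes impractical at scale; it does not show how SeQuiLa performs the join. The names `overlaps` and `range_join` and the `(contig, start, end)` tuple layout are assumptions made for this example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def range_join(left, right):
    """Naive range join over (contig, start, end) tuples.

    Hypothetical helper for illustration only; it is not part of pysequila.
    Compares every left interval with every right interval: O(len(left) * len(right)).
    """
    return [
        (l, r)
        for l in left
        for r in right
        if l[0] == r[0] and overlaps(l[1], l[2], r[1], r[2])
    ]


reads = [("chr1", 100, 200), ("chr1", 300, 400)]
targets = [("chr1", 150, 350), ("chr2", 100, 200)]
for pair in range_join(reads, targets):
    print(pair)
# (('chr1', 100, 200), ('chr1', 150, 350))
# (('chr1', 300, 400), ('chr1', 150, 350))
```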

Setup

$ python -m pip install --user pysequila
or
(venv)$ python -m pip install pysequila
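
One way to check that the installation succeeded is to ask Python for the installed package version. The sketch below relies only on the standard library's `importlib.metadata`, which is available from Python 3.8 onward (on 3.7 the `importlib_metadata` backport would be needed); the helper name `installed_version` is an assumption made for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the pysequila version if the package is installed, otherwise None.
print(installed_version("pysequila"))
```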

Usage

$ python
>>> from pysequila import SequilaSession
>>> ss = SequilaSession \
  .builder \
  .config("spark.jars.packages", "org.biodatageeks:sequila_2.12:1.1.0") \
  .config("spark.driver.memory", "2g") \
  .getOrCreate()
>>> # register the BAM file as a table
>>> ss.sql(
      """
      CREATE TABLE IF NOT EXISTS reads
      USING org.biodatageeks.sequila.datasources.BAM.BAMDataSource
      OPTIONS(path "/features/data/NA12878.multichrom.md.bam")
      """)
>>> # compute depth of coverage with SQL
>>> ss.sql("SELECT * FROM coverage('reads', 'NA12878', '/features/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta')")
>>> # or using the DataFrame/Dataset API
>>> ss.coverage("/features/data/NA12878.multichrom.md.bam", "/features/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta")