
REAT

REAT is an RNA editing analysis toolkit designed with a focus on performance, low memory footprint, and ease of use.

Refer to Wikipedia for a classification and basic overview of known editing events. REAT can handle all types of edits (e.g., A->I, C->U, etc.), except for editing by insertions or deletions.

The target platform for REAT is an x86-64 computer running Linux; working on Windows or ARM is not guaranteed.

Please feel free to open issues regarding bugs, installation problems, feature requests, or unintuitive behavior. Quality, correctness, and ease of use are vital to us.

Not released yet

REAT is under active development and has not yet been officially released. Despite good test coverage (both unit and integration), the tool still lacks user feedback. In other words, use it at your own risk.

Features

Summarizing editing for provided regions or separate loci
Efficient multithreading
Strand prediction for unstranded libraries
Autoref: simple yet useful inference of single nucleotide polymorphisms (SNP)
Editing Index (EI) for a given set of ROIs
Flexible filtering options with reasonable default settings

See the Details section for a more in-depth explanation of some features.

Limitations

Here is a list of known limitations; feel free to open an issue if one of them is critical for you:

Potential RNA editing by insertions/deletions is ignored
No Python or R interface
Lack of support for ARM and Windows/macOS builds

Installation

Pre-built binaries

For a quick start, you can try the pre-built REAT binary posted via GitHub releases. Besides the tagged versions, we also provide the latest tested build from the main branch.

Note that you will have to add executable permissions to the file: sudo chmod +x reat.

Cargo

The Rust package manager, cargo, is the recommended way to build REAT from source:

cargo install --git https://github.com/alnfedorov/reat
reat --help

Follow this page to install the Rust toolchain if you don't have one.

In addition, you will need CMake (for zlib-ng), which should be available in most package managers (e.g., apt install cmake).

Basic usage

REAT supports two modes: ROI-based and site-based.

ROI mode

ROI stands for Region of Interest, a REAT mode in which edits are summarized for each provided genomic region. For instance, it is useful for studying the editing of Alu repeats.

Example:

BED-like file with ROIs (repeats.bed):

chrY	76783396	76783529	RSINE   .    -   ... # The rest of the columns are ignored
... # Remaining regions omitted

Command:

reat rois --input techrep1.bam techrep2.bam --rois repeats.bed \
          --reference GRCm38.fa --stranding "f/s" --threads $(nproc)

Output columns:

contig, start, end, strand, name - ROI coordinate and name from the input file
trstrand - transcription strand; predicted for unstranded libraries and deduced from the design for stranded experiments
coverage - number of unique reads covering ROI (after applying all filters)
#X - number of X nucleotides in the sequence of a given ROI (always forward strand sequence)
X->Y - the total number of events observed in a given ROI where a reference nucleotide X was replaced by Y. That is, A->A is the number of A matches, and A->G is the total number of observed A->I edits

Note that X->Y notation always denotes matches/mismatches relative to the forward strand. For example, T->C mismatches for reverse strand ROI are, in fact, A->G RNA mismatches.

If the autoref feature is enabled, edits are summarized after correcting for any potential SNPs.
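The forward-strand convention described above can be illustrated with a short Python sketch. It is not part of REAT; the function name `to_transcript_view` is hypothetical and only shows how a forward-strand `X->Y` column maps to the mismatch seen on the transcribed RNA.

```python
# Complement map used to translate forward-strand notation to the transcript view.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_transcript_view(mismatch: str, trstrand: str) -> str:
    """Translate a forward-strand 'X->Y' column name into transcript notation.

    Hypothetical helper for illustration only; REAT itself always reports
    forward-strand columns.
    """
    ref, alt = mismatch.split("->")
    if trstrand == "+":
        return mismatch
    if trstrand == "-":
        return f"{COMPLEMENT[ref]}->{COMPLEMENT[alt]}"
    raise ValueError("unknown strand: the transcript view is undefined")


print(to_transcript_view("T->C", "-"))  # A->G
print(to_transcript_view("A->G", "+"))  # A->G
```

In practice this means that A->I editing shows up in the A->G column for + strand ROIs and in the T->C column for - strand ROIs.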

ROI editing index

For a given set of regions, one can always calculate an editing index (EI) for all possible matches and mismatches.

For example, the EI for A->G mismatches is formally defined as follows: EI(A->G) = P(transcribed ROIs are A->G edited) = (forward(A->G) + reverse(T->C)) / (forward(A->A + A->C + A->G + A->T) + reverse(T->A + T->C + T->G + T->T)).

In other words, it is the probability that a given A in the provided ROIs will be A->G edited after transcription.

Note that the EI is strand independent as it is defined for the transcribed regions of RNA. For example, to study A->I editing, you need to refer exclusively to the A->G column and completely ignore T->C.

If a given set of ROIs represents all Alu repeats in the genome, then the A->G EI is the so-called Alu Editing Index (AEI).
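As a rough illustration of the formula above, the following Python sketch computes an EI from per-ROI counts. It is not REAT's implementation: the function name `editing_index` and the dictionary layout are assumptions, and skipping ROIs with an unknown strand is also an assumption based on the formula using only forward and reverse ROIs.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def editing_index(rois, ref="A", alt="G"):
    """Return (EI(ref->alt), number of unstranded ROIs).

    Each ROI is a dict with a 'trstrand' key and forward-strand 'X->Y' counts,
    mirroring the ROI output columns. The layout is hypothetical.
    """
    edited = total = unstranded = 0
    for roi in rois:
        if roi["trstrand"] == "+":
            r, a = ref, alt
        elif roi["trstrand"] == "-":
            # Reverse-strand ROIs contribute the complementary mismatch.
            r, a = COMPLEMENT[ref], COMPLEMENT[alt]
        else:
            unstranded += 1  # assumption: ROIs without a strand are skipped
            continue
        edited += roi.get(f"{r}->{a}", 0)
        total += sum(roi.get(f"{r}->{y}", 0) for y in "ACGT")
    ei = edited / total if total else float("nan")
    return ei, unstranded


rois = [
    {"trstrand": "+", "A->A": 90, "A->G": 10},
    {"trstrand": "-", "T->T": 45, "T->C": 5},
    {"trstrand": ".", "A->A": 100},
]
print(editing_index(rois))  # (0.1, 1)
```

Here 15 edited positions out of 150 covered A positions give an EI of 0.1, and one ROI is left out because its strand is unknown.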

Example:

for bamfile in *.bam
do
  reat rois --input "$bamfile" --rois alu-repeats.bed \
            --reference hg19.fasta --threads $(nproc) \
            --stranding "s/f" --annotation gencode.gff3.gz \
            --ei=AEI.csv --saveto /dev/null # append Alu editing index values and skip ROI results completely
done

Output columns:

name - name of the experiment
ROI-file - path to the file with ROIs
unstranded - number of ROIs for which no transcription strand was deduced/predicted
X->Y - editing index for X->Y pair

One can call REAT multiple times with the same CSV file to append rows to the EI table.

Site mode

The REAT site-based mode is a classic scenario for estimating RNA editing for each genomic locus.

Example:

reat site --input rnaseq.bam \
          --reference hg38.fa --threads $(nproc) \
          --stranding "u" --annotation gencode.gff3.gz \
          --saveto /dev/stdout

Output columns:

contig, pos - coordinate of the locus
trstrand - transcription strand; predicted for unstranded libraries and deduced from the design for stranded experiments
refnuc - reference nucleotide from the FASTA assembly
prednuc - predicted reference nucleotide (the assembly nucleotide if the autoref feature is disabled)
X - the total number of sequenced nucleotides X; X is one of [A, C, G, T].

Similarly to the ROI mode, the reference and sequenced nucleotides X are always reported with respect to the forward strand. That is, a minus strand locus with ten A's corresponds to ten sequenced T's from RNA fragments.
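The site columns above can be combined into a per-locus editing frequency. The sketch below is a hedged Python example, not a REAT feature: the function name `site_edit_frequency` and the in-memory row layout are assumptions, and only the column names come from the list above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def site_edit_frequency(row, ref="A", alt="G"):
    """Fraction of sequenced nucleotides supporting ref->alt in the transcript view.

    `row` is a dict keyed by the site-mode column names (hypothetical layout).
    Returns None when the strand is unknown, the locus is not a `ref` site on
    the transcribed strand, or nothing was sequenced.
    """
    if row["trstrand"] == "+":
        r, a = ref, alt
    elif row["trstrand"] == "-":
        # Columns are forward-strand, so a minus-strand A site is reported as T.
        r, a = COMPLEMENT[ref], COMPLEMENT[alt]
    else:
        return None
    if row["prednuc"] != r:
        return None
    coverage = sum(row[n] for n in "ACGT")
    return row[a] / coverage if coverage else None


row = {"contig": "chr1", "pos": 100, "trstrand": "-",
       "refnuc": "T", "prednuc": "T", "A": 0, "C": 3, "G": 0, "T": 9}
print(site_edit_frequency(row))  # 0.25
```

The minus-strand locus above has three forward-strand C's out of twelve sequenced nucleotides, which corresponds to 25% A->G editing on the RNA.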

Details

Strand prediction

To predict the transcription strand for ROIs/loci in unstranded experiments, REAT uses two strategies.

First, the ROI/locus strand is derived from the strand of overlapping genes/exons (this only works if a genome annotation is provided). This approach can be summarized in the following table:

overlapping genes	overlapping exons	predicted strand
+	+	+
-	-	-
+/-	+	+
+/-	-	-
+/-	+/-	.

That is, REAT checks overlapping genes first. If there are genes on both the + and the - strand, overlapping exons are considered. In the worst-case scenario, an unknown (.) strand is returned.
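The decision table above can be written as a small Python function. This is a sketch of the table only, not REAT's code; the name `strand_from_annotation` is hypothetical, and returning "." when no genes overlap is an assumption because the table does not list that case.

```python
def strand_from_annotation(gene_strands, exon_strands):
    """Predict the strand from the strands of overlapping genes and exons.

    Both arguments are iterables of '+' / '-' (hypothetical interface).
    """
    genes, exons = set(gene_strands), set(exon_strands)
    if len(genes) == 1:
        # All overlapping genes agree on the strand.
        return next(iter(genes))
    if genes == {"+", "-"} and len(exons) == 1:
        # Genes disagree, but all overlapping exons agree.
        return next(iter(exons))
    # Ambiguous or missing annotation (the latter is an assumption).
    return "."


print(strand_from_annotation({"+"}, {"+"}))            # +
print(strand_from_annotation({"+", "-"}, {"-"}))       # -
print(strand_from_annotation({"+", "-"}, {"+", "-"}))  # .
```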

Second, for ROIs / loci for which REAT could not predict the strand from the annotation, REAT attempts to derive the strand based on the observed A->I editing.

For the + strand transcripts, A->I edits are A->G mismatches, and for the - strand, T->C mismatches. Note that in many cases, this heuristic fails (no A->I editing at all), and such ROIs / loci will be left unstranded in the final table.
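A minimal Python sketch of this fallback is shown below. The exact decision rule and thresholds used by REAT are not described here, so the `min_edits` parameter, the tie handling, and the function name `strand_from_editing` are all assumptions made for illustration.

```python
def strand_from_editing(a_to_g, t_to_c, min_edits=1):
    """Guess the strand from forward-strand A->G and T->C mismatch counts.

    Hypothetical rule: pick the strand whose A->I signature dominates and
    leave the ROI/locus unstranded when there is no signal or a tie.
    """
    if max(a_to_g, t_to_c) < min_edits or a_to_g == t_to_c:
        return "."
    return "+" if a_to_g > t_to_c else "-"


print(strand_from_editing(12, 1))  # +
print(strand_from_editing(0, 7))   # -
print(strand_from_editing(0, 0))   # .
```

The last call shows the failure mode mentioned above: without any A->I editing there is nothing to base the prediction on.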

Autoref

With sufficient coverage, we can automatically adjust the reference sequence for observed SNVs based on RNA-seq data. REAT uses a simple heuristic for this: if coverage is sufficient and the frequency of the most abundant nucleotide exceeds the threshold, then the reference nucleotide is the most abundant nucleotide at that locus.

Note that the hyper-editing flag allows one to skip A->G and T->C corrections in order to explore potentially hyper-edited ROIs/loci.
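The heuristic can be sketched in a few lines of Python. This is an illustration rather than REAT's implementation: the function name `autoref`, the parameter names, and the default thresholds (`min_coverage=20`, `min_freq=0.95`) are hypothetical and do not reflect REAT's actual defaults.

```python
def autoref(refnuc, counts, min_coverage=20, min_freq=0.95, hyperediting=False):
    """Return the predicted reference nucleotide for one locus.

    `counts` maps each of A, C, G, T to the number of sequenced nucleotides.
    Thresholds are placeholder values chosen for this example.
    """
    coverage = sum(counts.values())
    if coverage < min_coverage:
        return refnuc  # not enough evidence to override the assembly
    top = max(counts, key=counts.get)
    if counts[top] / coverage <= min_freq:
        return refnuc  # most abundant nucleotide does not exceed the threshold
    if hyperediting and (refnuc, top) in {("A", "G"), ("T", "C")}:
        return refnuc  # keep potential hyper-editing instead of calling a SNP
    return top


counts = {"A": 0, "C": 0, "G": 30, "T": 0}
print(autoref("A", counts))                     # G
print(autoref("A", counts, hyperediting=True))  # A
```

With the hyper-editing behavior switched on, a fully A->G converted locus keeps its assembly nucleotide, so the edits remain visible in the output.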

How are `N`s handled?

N is routinely used to indicate unknown nucleotides in assemblies and sequencing data. Here are a few notes on how Ns are handled by REAT:

Ignored in sequencing data (reads). That is, X->N mismatches are always ignored
For an unknown reference nucleotide N, all canonical sequenced nucleotides (A, C, G, T) are treated as mismatches
In rois mode, N reference positions are skipped

Note that the points above apply to Ns that remain after Autoref (if enabled). In most cases, Ns will be replaced by an appropriate nucleotide during the Autoref pass.

What are include/exclude lists?

In short, these lists specify DNA regions that are either the only ones analyzed (include) or skipped completely (exclude). That is, REAT counts only positions inside the provided include regions and ignores every position inside the exclude regions.

Regardless of the overlap with excluded/included regions, ROIs are printed with their original coordinates and names to keep them distinguishable in subsequent analysis. This is what makes include/exclude regions different from simply subtracting/intersecting ROIs with them: the original ROIs are not split in the output.
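The difference can be made concrete with a small Python sketch. It is only a model of the behavior described above: the function name `counted_positions` is hypothetical, intervals are assumed to be half-open (start, end) pairs on a single contig as in BED files, and applying the include list before the exclude list is an assumption.

```python
def counted_positions(roi, include=None, exclude=()):
    """Return the positions of `roi` that would contribute to the counts.

    `roi` and the list entries are half-open (start, end) intervals.
    `include=None` means that no include list was provided.
    """
    start, end = roi
    positions = set(range(start, end))
    if include is not None:
        allowed = set()
        for s, e in include:
            # Clip each include region to the ROI boundaries.
            allowed.update(range(max(s, start), min(e, end)))
        positions &= allowed
    for s, e in exclude:
        positions -= set(range(s, e))
    return sorted(positions)


roi = (100, 110)
print(roi, counted_positions(roi, exclude=[(103, 106)]))
# (100, 110) [100, 101, 102, 106, 107, 108, 109]
```

The ROI is still reported as a single record with coordinates 100-110, even though only seven of its ten positions are counted.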

TODO:

Must:

Check if the integration tests are correct.
Test the include/exclude functionality.
Command to create a BED track with edits based on the output for each site.
More unit-tests.
Provide a description of the CLI arguments in the README.

Maybe:

Auto-inference for an exclude list based on the provided fasta (homopolymers, special simple repeats). A module to do this on the fly for autoref corrected positions?
Handle heterozygous sites and known SNPs (excluding RNA based, i.e. cDNA)
Robust error handling (remove unwrap/expect)
Lower autoref/coverage thresholds
Remove PCR duplicates in PE libraries by default?
Count each mate as one half in PE experiments?
Better stranding prediction algorithm (include downstream regions)

Want:

Joint analysis of several samples
Advanced EI for a set of sites / particular repeats (currently can be done using Python/R)
Base Quality Score Recalibration
Indels Realignment
Statistical inference for replicated editing sites / group of sites
