REAT is an RNA editing analysis toolkit designed with a focus on performance, low memory footprint, and ease of use.
Refer to Wikipedia for a classification and basic overview of known editing events. REAT can handle all types of edits (e.g., A->I, C->U, etc.), except for editing by insertions or deletions.
The target platform for REAT is an x86-64 computer running Linux; working on Windows or ARM is not guaranteed.
Please, feel free to open issues regarding bugs, installation issues, feature requests, or unintuitive behavior. Quality, correctness, and ease of use are vital to us.
REAT is under active development and has not yet been officially released. Despite good test coverage (both unit and integration), the tool still lacks user feedback, so use it at your own risk.
- Summarizing editing for provided regions or separate loci
- Efficient multithreading
- Strand prediction for unstranded libraries
- Autoref: simple yet useful inference of single nucleotide polymorphisms (SNPs)
- Editing Index (EI) for the given set of ROI
- Flexible filtering options with reasonable default settings
See the details section for a more in-depth explanation of some features.
Here is a list of known limitations; feel free to open an issue if one of them is critical for you:
- Potential RNA editing by insertions/deletions is ignored
- No Python or R interface
- Lack of support for ARM and Windows/macOS builds
For a quick start, you can try the pre-built REAT binary posted via GitHub releases. Besides the tagged versions, we also provide the latest tested build from the main branch.
Note that you will have to add executable permissions to the file: `sudo chmod +x reat`.
The Rust package manager cargo is the recommended way to build REAT from source:
cargo install --git https://github.com/alnfedorov/reat
reat --help
Follow this page to install the Rust toolchain if you don't have one.
In addition, one will need CMake (for zlib-ng), which should be available in most package managers (e.g., `apt install cmake`).
REAT supports two modes: ROI-based and site-based.
ROI stands for Region of Interest, a REAT mode in which edits are summarized for each provided genomic region. For instance, it is useful when studying the editing of Alu repeats.
Example:
- BED-like file with ROIs (repeats.bed):
chrY 76783396 76783529 RSINE . - ... # The rest of the columns are ignored
... # Remaining regions omitted
- Command:
reat rois --input techrep1.bam techrep2.bam --rois repeats.bed \
--reference GRCm38.fa --stranding "f/s" --threads $(nproc)
Output columns:
- contig, start, end, strand, name - ROI coordinate and name from the input file
- trstrand - transcription strand; predicted for unstranded libraries and deduced from the design for stranded experiments
- coverage - number of unique reads covering ROI (after applying all filters)
- #X - number of X nucleotides in the sequence of a given ROI (always forward strand sequence)
- X->Y - the total number of events observed in a given ROI where a reference nucleotide X was replaced by Y. That is, A->A is the number of A matches, and A->G denotes the total number of observed A->I edits
Note that X->Y notation always denotes matches/mismatches relative to the forward strand. For example, T->C mismatches for reverse strand ROI are, in fact, A->G RNA mismatches.
If the autoref feature is enabled, edits are summarized after correcting for any potential SNPs.
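The forward-strand convention above can be illustrated with a small helper that translates a reported `X->Y` column into the mismatch seen on the transcribed RNA. This is a minimal sketch for downstream analysis, not part of REAT; the function name `to_transcript_mismatch` is hypothetical.

```python
# Complement table for the four canonical nucleotides.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_transcript_mismatch(mismatch: str, trstrand: str) -> str:
    """Translate a forward-strand 'X->Y' label to the transcribed strand.

    Hypothetical helper: for '-' strand ROIs both nucleotides are
    complemented; '+' and unknown ('.') strands are returned unchanged.
    """
    ref, alt = mismatch.split("->")
    if trstrand == "-":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}->{alt}"


# A T->C column of a reverse-strand ROI is an A->G mismatch on the RNA.
print(to_transcript_mismatch("T->C", "-"))  # prints "A->G"
```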
For a given set of regions, one can always calculate an editing index (EI) for all possible matches and mismatches.
For example, formally EI for mismatches A->G is defined as follows: EI(A->G) = P(transcribed ROIs are A->G edited) = ( forward(A->G) + reverse(T->C)) / (forward(A->A + A->C + A->G + A->T) + reverse(T->A + T->C + T->G + T->T)).
In other words, it is the probability that a given A in the provided ROI will be A->G edited after the transcription.
Note that the EI is strand independent as it is defined for the transcribed regions of RNA. For example, to study A->I editing, you need to refer exclusively to the A->G column and completely ignore T->C.
If a given set of ROIs represents all Alu repeats in the genome, then the A->G EI is the so-called Alu Editing Index (AEI).
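The EI(A->G) definition above can be sketched in a few lines of Python. The sketch assumes the per-ROI counts were already summed separately over forward-strand and reverse-strand ROIs into two dictionaries keyed by the `X->Y` column names; the function name and the numbers in the usage example are made up for illustration.

```python
def editing_index_a_to_g(forward: dict, reverse: dict) -> float:
    """EI(A->G) from summed counts of '+' strand and '-' strand ROIs.

    `forward` and `reverse` map 'X->Y' column names to total counts.
    A->G edits of reverse-strand ROIs appear as T->C on the forward strand.
    """
    edited = forward["A->G"] + reverse["T->C"]
    total = sum(forward[f"A->{n}"] for n in "ACGT") + sum(
        reverse[f"T->{n}"] for n in "ACGT"
    )
    # No covered A positions at all: the index is undefined.
    return edited / total if total else float("nan")


# Made-up counts, for illustration only.
fwd = {"A->A": 90, "A->C": 0, "A->G": 10, "A->T": 0}
rev = {"T->A": 0, "T->C": 5, "T->G": 0, "T->T": 95}
print(editing_index_a_to_g(fwd, rev))  # prints 0.075
```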
Example:
for bamfile in *.bam
do
  reat rois --input "$bamfile" --rois alu-repeats.bed \
    --reference hg19.fasta --threads $(nproc) \
    --stranding "s/f" --annotation gencode.gff3.gz \
    --ei=AEI.csv --saveto /dev/null # append Alu editing index values and skip ROI results completely
done
Output columns:
- name - name of the experiment
- ROI-file - path to the file with rois
- unstranded - number of ROIs for which no transcription strand was deduced/predicted
- X->Y - editing index for X->Y pair
One can call REAT multiple times with the same CSV file to append rows to the EI table.
The REAT site-based mode is a classic scenario for estimating RNA editing for each genomic locus.
Example:
reat site --input rnaseq.bam \
--reference hg38.fa --threads $(nproc) \
--stranding "u" --annotation gencode.gff3.gz \
--saveto /dev/stdout
Output columns:
- contig,pos - coordinate of the locus
- trstrand - transcription strand; predicted for unstranded libraries and deduced from the design for stranded experiments
- refnuc - reference nucleotide from the FASTA assembly
- prednuc - predicted reference nucleotide (assembly nucleotide if the autoref feature is disabled)
- X - the total number of sequenced nucleotides X; X is one of [A, C, G, T].
Similarly to the ROI mode, the reference and sequenced nucleotides X are always reported with respect to the forward strand. That is, a minus strand locus with ten A's corresponds to ten sequenced T's from RNA fragments.
To predict transcription strand for ROI/loci in unstranded experiments, REAT uses two strategies.
First, the ROI/locus strand is derived from the strand of overlapping genes/exons (this only works if a genome annotation is provided). This approach can be summarized in the following table:
overlapping genes | overlapping exons | predicted strand |
---|---|---|
+ | + | + |
- | - | - |
+/- | + | + |
+/- | - | - |
+/- | +/- | . |
That is, REAT checks overlapping genes first. If there are genes on both the + and the - strand, exons are considered. In the worst-case scenario, an unknown (`.`) strand is returned.
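The decision table above can be written as a small function. This is a sketch of the described logic, not REAT's actual code: the function name is hypothetical, strands are passed as sets of `"+"`/`"-"`, and returning `.` when nothing overlaps is an assumption, since the table does not cover that case.

```python
def strand_from_annotation(gene_strands: set, exon_strands: set) -> str:
    """Predict the strand from the strands of overlapping genes and exons."""
    # Genes on a single strand decide immediately.
    if gene_strands == {"+"}:
        return "+"
    if gene_strands == {"-"}:
        return "-"
    # Genes on both strands: fall back to the overlapping exons.
    if gene_strands == {"+", "-"}:
        if exon_strands == {"+"}:
            return "+"
        if exon_strands == {"-"}:
            return "-"
    # Ambiguous exons, or no overlap at all (assumption): unknown strand.
    return "."
```

For example, `strand_from_annotation({"+", "-"}, {"-"})` corresponds to the fourth row of the table and yields `-`.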
Second, for ROIs / loci for which REAT could not predict the strand from the annotation, REAT attempts to derive the strand based on the observed A->I editing.
For the + strand transcripts, A->I edits are A->G mismatches, and for the - strand, T->C mismatches. Note that in many cases, this heuristic fails (no A->I editing at all), and such ROIs / loci will be left unstranded in the final table.
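The idea behind the editing-based heuristic can be sketched as follows. This is only an illustration: the exact decision rule and thresholds used by REAT are not described here, so the function name, the `min_mismatches` parameter, and its default value are assumptions.

```python
def strand_from_editing(a_to_g: int, t_to_c: int, min_mismatches: int = 10) -> str:
    """Guess the transcription strand from A->G and T->C mismatch counts.

    Hypothetical rule: require a minimum number of supporting mismatches,
    then pick the strand whose A->I signature dominates.
    """
    # Too little evidence of A->I editing: leave the ROI/locus unstranded.
    if max(a_to_g, t_to_c) < min_mismatches:
        return "."
    if a_to_g > t_to_c:
        return "+"  # A->I edits show up as A->G on the + strand
    if t_to_c > a_to_g:
        return "-"  # ... and as T->C on the - strand
    return "."  # tie: no decision
```

With no A->I editing at all, both counts are zero and the function returns `.`, which matches the failure case noted above.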
With sufficient coverage, we can automatically adjust the reference sequence for observed SNVs based on RNA-seq data. REAT uses a simple heuristic for this: if coverage is sufficient and the frequency of the most abundant nucleotide exceeds the threshold, then the reference nucleotide is the most abundant nucleotide at that locus.
Note that the hyper-editing flag allows one to skip A->G and T->C corrections to explore potentially hyperedited ROIs/loci.
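The Autoref heuristic described above can be sketched for a single locus as below. The function name, parameter names, and default thresholds are assumptions chosen for illustration and are not REAT's actual defaults; "sufficient coverage" is read as `coverage >= min_coverage`.

```python
def autoref(refnuc: str, counts: dict, min_coverage: int = 10,
            min_freq: float = 0.95, hyperediting: bool = False) -> str:
    """Return the (possibly corrected) reference nucleotide for one locus.

    `counts` maps each sequenced nucleotide (A, C, G, T) to its count.
    Thresholds are hypothetical.
    """
    coverage = sum(counts.values())
    if coverage < min_coverage:
        return refnuc  # not enough data to override the assembly
    top = max(counts, key=counts.get)
    if counts[top] / coverage <= min_freq:
        return refnuc  # no nucleotide exceeds the frequency threshold
    # With the hyper-editing flag, A->G and T->C corrections are skipped
    # so that heavily edited sites are not mistaken for SNPs.
    if hyperediting and (refnuc, top) in {("A", "G"), ("T", "C")}:
        return refnuc
    return top
```

For instance, a locus with reference `A` and twenty sequenced `G`s is corrected to `G` by default, but stays `A` when `hyperediting=True`.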
`N` is routinely used to indicate unknown nucleotides in assemblies and sequencing data. Here are a few notes on how `N`s are handled by REAT:
- Ignored in sequencing data (reads). That is, X->N mismatches are always ignored
- For unknown nucleotide `N`, all canonical nucleotides (A, C, G, T) are treated as mismatches
- In rois mode, `N` reference positions are skipped
Note that the above notes apply to `N`s after Autoref (if enabled). That is, in most cases, `N`s will be replaced by an appropriate nucleotide during the Autoref pass.
In short, these lists specify DNA regions that will be completely included in or excluded from the analysis. That is, REAT either counts only positions inside the provided regions (include) or skips them entirely (exclude).
Regardless of the overlap with excluded/included regions, ROIs will be printed with their original coordinates and names to make them distinguishable in the subsequent analysis. This is what makes using include/exclude regions different from simply subtracting them from, or intersecting them with, the ROIs: the original ROIs won't be split in the output.
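The effect of include/exclude lists on a single position can be sketched as follows. This is an illustration of the described behavior rather than REAT's implementation: the function name is hypothetical, intervals are assumed to be half-open BED-style `(start, end)` pairs on one contig, and applying both lists together (include first, then exclude) is an assumption.

```python
def is_counted(pos: int, include=None, exclude=()) -> bool:
    """Decide whether a position contributes to the ROI summary.

    `include=None` means no include list was provided.
    """
    def inside(intervals) -> bool:
        return any(start <= pos < end for start, end in intervals)

    # With an include list, only positions inside it are counted.
    if include is not None and not inside(include):
        return False
    # Positions inside any excluded region are always skipped.
    return not inside(exclude)


# An ROI spanning 100..200 keeps its coordinates in the output;
# only positions 100..149 would contribute to its counts here.
roi = (100, 200)
counted = [p for p in range(*roi) if is_counted(p, exclude=[(150, 300)])]
```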
- Check if the integration tests are correct.
- Test the include/exclude functionality.
- Command to create a BED track with edits based on the output for each site.
- More unit-tests.
- Provide CLI args description in the README
- Auto-inference for an exclude list based on the provided fasta (homopolymers, special simple repeats). A module to do this on the fly for autoref corrected positions?
- Handle heterozygous sites and known SNPs (excluding RNA based, i.e. cDNA)
- Robust error handling (remove unwrap/expect)
- Lower autoref/coverage thresholds
- Remove PCR duplicates in PE libraries by default?
- Count each mate as one half in PE experiments?
- Better stranding prediction algorithm (include downstream regions)
- Joint analysis of several samples
- Advanced EI for a set of sites / particular repeats (currently can be done using Python/R)
- Base Quality Score Recalibration
- Indels Realignment
- Statistical inference for replicated editing sites / group of sites