buscadoR

The goal of buscadoR is to find plant receptor proteins in information from from hmmer, deeptmhmm and BLAST to find signal peptides, repeats, transmembrane and ectodomains then applies a heuristic to classify into receptor classes.

Prerequisites

buscadoR requires input from hmmer, deeptmhmm and BLAST. A Google Colab notebook is available to help you run these time consuming searches easily. Go to the Notebook to run searches before proceeding. If you wish to run these locally please see the section Running buscadoR searches locally”.

This R package assumes that you have already performed those searches and have the necessary files to hand.

Installation

You can install buscadoR from GitHub with:

# install.packages("devtools")
devtools::install_github("TeamMacLean/buscadoR")

Compiling the results, classifying the proteins

The search results can be combined and the RLK classes predicted with the buscar function. Pass in all the input files, including the original FASTA file.

Note that you can pass either gzipped (.gz) files or plain text files.

library(buscadoR)

fasta_file <- "arabidopsis_proteome.fa"
hmmer_file <- "hmmer_results.txt.gz"
deeptmhmm_file <- "deeptmhmm_results.txt.gz"
blast_file <- "blast_results.txt.gz"

b <- buscar(hmmer = hmmer_file, deeptmhmm = deeptmhmm_file, blast = blast_file, fasta = fasta_file,
            kinase_pfam_min = 100, blast_min = 50)

Examining the results

A set of helper functions to help you extract data into a useable result is provided.

A summary table of each type found can be generated

mesa(b)

group	n
LRR-RK	111
LRR-RP	44
LRR+OTHER-RK	83
LRR+OTHER-RP	23
OTHER-RK	145
OTHER-RP	59
TM-KINASE	11
TM-PROTEIN	12
unclassified_RK	12
unclassified_RP	3

BuscadoR RLK Finding results

A tibble (dataframe) of one row per receptor protein found (ideal for exporting) can be created

to_tibble(b)
#> # A tibble: 503 × 11
#>    seq_name group evide…¹ cut_s…² tm_st…³ tm_end pfams…⁴ pfams…⁵ pfams…⁶ blast…⁷
#>    <chr>    <chr> <chr>     <int>   <int>  <int> <chr>   <chr>   <chr>   <chr>  
#>  1 AT1G497… uncl… t1           55     285    305 Pkinas… PF0006… 361-62… AT2G48…
#>  2 AT1G658… LRR+… t7           27     440    460 PK_Tyr… PF0771… 525-79… AT4G22…
#>  3 AT1G657… OTHE… PFAM         27     436    456 PK_Tyr… PF0771… 522-79… AT1G61…
#>  4 AT1G556… LRR-… PFAM+B…      24     777    797 LRR_8;… PF1385… 102-16… AT1G68…
#>  5 AT1G108… LRR-… PFAM+B…      32     259    279 Pkinas… PF0006… 360-62… AT1G56…
#>  6 AT1G705… OTHE… PFAM+B…      20     261    281 PK_Tyr… PF0771… 326-58… AT1G70…
#>  7 AT1G705… OTHE… PFAM+B…      29     261    281 Pkinas… PF0006… 327-54… AT1G19…
#>  8 AT1G162… OTHE… PFAM+B…      23     316    336 PK_Tyr… PF0771… 392-65… AT1G16…
#>  9 AT1G172… LRR-… PFAM+B…      44     708    728 LRR_8;… PF1385… 94-150… AT4G31…
#> 10 AT1G172… LRR+… t6           23     735    755 Pkinas… PF0006… 800-10… AT5G25…
#> # … with 493 more rows, 1 more variable: blast_loc <chr>, and abbreviated
#> #   variable names ¹evidence, ²cut_site, ³tm_start, ⁴pfams_hit, ⁵pfams_acc,
#> #   ⁶pfams_loc, ⁷blast_hit

Raw search results from the databases can be extracted

pfam_results(b)
#> # A tibble: 4,831 × 11
#>    seq_name    hit    acc      eval seq_f…¹ seq_to hit_f…² hit_to base_…³ b_type
#>    <chr>       <chr>  <chr>   <dbl>   <dbl>  <dbl>   <int>  <int> <chr>   <chr> 
#>  1 AT1G17600.1 LRR_3  PF07… 2.5e- 8     604    622       1     19 PF07725 LRR_P…
#>  2 AT1G12680.1 Pkina… PF00… 7.1e-72     107    355       1    264 PF00069 KINAS…
#>  3 AT1G12680.1 PK_Ty… PF07… 2.7e-36     109    348       3    254 PF07714 KINAS…
#>  4 AT1G12680.1 Kinas… PF14… 1  e- 9     209    343     154    288 PF14531 KINAS…
#>  5 AT1G12680.1 ABC1   PF03… 2.6e- 7     173    251     142    219 PF03109 KINAS…
#>  6 AT1G73450.1 Pkina… PF00… 1.3e-51     841   1143       1    264 PF00069 KINAS…
#>  7 AT1G73450.1 PK_Ty… PF07… 2.7e-19     843   1051       3    212 PF07714 KINAS…
#>  8 AT1G09440.1 Pkina… PF00… 3.7e-50     158    426       2    260 PF00069 KINAS…
#>  9 AT1G09440.1 PK_Ty… PF07… 6.3e-48     159    428       3    259 PF07714 KINAS…
#> 10 AT1G73460.1 Pkina… PF00… 1.4e-51     858   1160       1    264 PF00069 KINAS…
#> # … with 4,821 more rows, 1 more variable: pfam_length <dbl>, and abbreviated
#> #   variable names ¹seq_from, ²hit_from, ³base_acc

also deeptmhmm_results(), blast_results()

Each set of putative proteins can be rendered as a plot

dibujar(b, which = "TM-PROTEIN")

also use LRR-RK, LRR-RP, LRR+OTHER-RK, LRR+OTHER-RP, OTHER-RK, OTHER-RP, TM-KINASE, TM-PROTEIN, unclassified_RK, unclassified_RP

Exporting results

A tibble (dataframe) of one row per receptor protein can be created and written out

res <- to_tibble(b)
readr::write_csv(res, "my_results.csv")

Annotated FASTA sequences can be exported

write_seqs(b, "my_seqs.fa")

Other Stuff

A dataframe compatible with the drawProteins package here on bioconductor can be created for further plotting work. Note the plot is a ggplot2 object and can be styled using that package too.

dp <- as_drawProteins(b)

The definitions used for the classification can be returned

definiciones()
#> # A tibble: 64 × 10
#>    group    evidence SP    TM    kinas…¹ lrr_p…² other…³ at_lr…⁴ at_ot…⁵ at_un…⁶
#>    <chr>    <chr>    <lgl> <lgl> <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
#>  1 LRR-RK   PFAM     TRUE  TRUE  TRUE    TRUE    FALSE   FALSE   FALSE   TRUE   
#>  2 LRR-RK   BLAST    TRUE  TRUE  TRUE    FALSE   FALSE   TRUE    FALSE   TRUE   
#>  3 LRR-RK   PFAM+BL… TRUE  TRUE  TRUE    TRUE    FALSE   TRUE    FALSE   TRUE   
#>  4 LRR-RP   PFAM     TRUE  TRUE  FALSE   TRUE    FALSE   FALSE   FALSE   TRUE   
#>  5 LRR-RP   BLAST    TRUE  TRUE  FALSE   FALSE   FALSE   TRUE    FALSE   TRUE   
#>  6 LRR-RP   PFAM+BL… TRUE  TRUE  FALSE   TRUE    FALSE   TRUE    FALSE   TRUE   
#>  7 OTHER-RK PFAM     TRUE  TRUE  TRUE    FALSE   TRUE    FALSE   FALSE   TRUE   
#>  8 OTHER-RK BLAST    TRUE  TRUE  TRUE    FALSE   FALSE   FALSE   TRUE    TRUE   
#>  9 OTHER-RK PFAM+BL… TRUE  TRUE  TRUE    FALSE   TRUE    FALSE   TRUE    TRUE   
#> 10 OTHER-RP PFAM     TRUE  TRUE  FALSE   FALSE   TRUE    FALSE   FALSE   TRUE   
#> # … with 54 more rows, and abbreviated variable names ¹kinase_pfam, ²lrr_pfam,
#> #   ³other_pfam, ⁴at_lrr_blast, ⁵at_other_blast, ⁶at_unspec_blast

Running `buscadoR` searches locally

If you wish to run the searches locally the output formats for each program need to be set carefully to work with buscar().

`hmmer`

The PFAM domain hmms used for the hmmer search are provided here as a file buscador_pfam_hmm. Alternatively you can parse the latest versions from the PFAM-A database. The required IDs can be listed from buscadoR as they are held in the data objects lrr_pfam, non_lrr_pfam and kinase_pfams.

When running hmmscan use --domtblout as the output option.

`deeptmhmm`

There are no options for deeptmhmm. When running this locally, a single output file is created for each sequence. Just concatenate these into one file for input into buscar()

`blastp`

The ecto-domains for use as query sequence in the blastp are available here as a fasta buscador_ecto_fasta

Use blastp with these as the query and the proteins of interest as the subject, select outfmt 6. The resulting text file can be used as input to buscar()

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.github		.github
R		R
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
buscadoR.Rproj		buscadoR.Rproj
buscadoR.ipynb		buscadoR.ipynb
class_defn_matrix.csv		class_defn_matrix.csv
logo.png		logo.png
logo2.jpg		logo2.jpg
logo3.jpg		logo3.jpg
renv.lock		renv.lock
saved.rds		saved.rds

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

buscadoR

Prerequisites

Installation

Compiling the results, classifying the proteins

Examining the results

Exporting results

Other Stuff

Running `buscadoR` searches locally

`hmmer`

`deeptmhmm`

`blastp`

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

buscadoR

Prerequisites

Installation

Compiling the results, classifying the proteins

Examining the results

Exporting results

Other Stuff

Running buscadoR searches locally

hmmer

deeptmhmm

blastp

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Running `buscadoR` searches locally

`hmmer`

`deeptmhmm`

`blastp`

Packages