-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathextended_example.Rmd
404 lines (325 loc) · 13 KB
/
extended_example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
---
title: "Plotting the distribution of taxa"
author: "Sur Herrera Paredes"
date: "`r Sys.Date()`"
output:
github_document:
toc: TRUE
---
# Introduction
In this extended example I go through every step to produce a relative
abundance barplot that represents bacterial communities living in
individual hosts.
Bacterial communities are everywhere, and when we characterize them
it is important to describe they overall taxonomic structure as that
gives us clues as to what types of functions might be performed by the
community. At the same time, it is important to show the variability
of these communities and thus it is useful to plot them at the lowest
aggregation level possible.
In this example, I utilize data from a big experiment that
was published [here](https://www.nature.com/articles/nature11237). In
that experiment, we planted individual *Arabidopsis thaliana* plants in
individual pots. The pots each had one of two types of natural soil,
each from two different seasons. We planted eight different accessions in
each of those soils, and plants were harvested at two developmental stages.
Additionally we had unplanted soil only pots, these are the "soil" samples
in the example. For each individual plant, we harvested two fractions
(i.e. E & R), one which we called the Endophytic Compartment (E) and
corresponds to the interior of the root after removing the outer cell wall,
and another which we called Rhizosphere (R) which is the soil within 1mm of
the plant root. So **E** samples contains bacteria inside the root, and **R**
samples contain bacteria immediately surrounding the root.
Plant-bacteria interactions in the root are incredibly important because
the root is both the gut and the brain of the plant. Microbes there
can benefit of the products of the plant photosynthesis as sources
of nutrition, and can also provide chemistry that the plant couldn't perform
by itself. However, the story is more complicated because microbial competition
and the plant immune system also provide a fertile evolutionary environment
for antagonistic interactions.
Ultimately understanding and being able to manipulate plant-bacteria
interactions has a lot of implications as hunger is one of the most pressing
problems of humanity with 800 million people living in hunger. Agriculture
is our only sustainable tool against hunger, and even though it currently
employs a quarter of the World population it has not been enough to tackle
this challenge.
# Getting ready
First you need to get the data. If you haven't check the
[README](https://github.com/surh/scip_barplot/blob/master/README.md) file
of the GitHub repository of the workshop. It is also recommended that
you watch the YouTube [video](https://www.youtube.com/watch?v=siIoupAnILk)
that runs through the example. Finally, you will need to install the
`tidyverse` package.
Once you have everything you need, start an R session and load the tidyverse
package:
```{r}
library(tidyverse)
```
# Read data
First read the OTU table. You may need to change the file path
to wherever you downloaded the files in your machine.
```{r}
Tab <- read_tsv("data/rhizo/otu_table.tsv")
Tab
```
The code above reads the file into a `tibble`, which is a type
of `data.frame` that has some neat additional properties. You
don't need to concern yourself too much with the differences.
The code above also produces a warning, indicating that `read_tsv`
tried to guess the types of data in each column of the table. It
guessed correctly but you should always specify the expected columns
with the option `col_types` (use `?read_tsv` for additional details).
```{r}
Tab <- read_tsv("data/rhizo/otu_table.tsv",
col_types = cols(otu_id = col_character(),
.default = col_number()))
Tab
```
# Basic barplot
We need to think back to the original figure and reformat our data to have one
column for the x-axis and another for the y-axis. This is a requirement for
`ggplot2`. We can to that with `pivot_longer`, a function of the `tidyverse`.
```{r}
Tab %>%
pivot_longer(-otu_id, names_to = "sample_id", values_to = "count")
```
In the code above the options `samples_to` and `names_to` indicate the
names of the new columns in the new tibble.
Lets create a smaller subset of the data to make some basic plots
```{r}
dat <- Tab %>%
pivot_longer(-otu_id, names_to = "sample_id", values_to = "count") %>%
filter(otu_id %in% c("OTU_14834", "OTU_18567", "OTU_14402", "OTU_14822"))
print(dat)
```
Now we create a basic barplot that shows relative abundances of the 4
selected OTUs
```{r}
dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill")
```
Now we re-make the plot with some beautification
```{r}
dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90))
```
# Adding sample metadata
Read sample metadata
```{r}
Meta <- read_tsv("data/rhizo/sample_metadata.tsv",
col_types = cols(.default = col_character()))
Meta
```
We **join** our long format dataset with the sample metadata table
```{r}
print(dat)
dat <- dat %>%
left_join(Meta, by = "sample_id")
dat
```
We add facts with `facet_grid`
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill") +
facet_grid(~ fraction) +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(color = "black"))
p1
```
We set `scales="free_x"` to plot a different set of samples on each
facet.
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill") +
facet_grid(~ fraction, scales = "free_x") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(color = "black"))
p1
```
We repeat the plot with some beautification
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill", width = 1) +
facet_grid(~ fraction, scales = "free_x") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(color = "black"),
strip.text = element_text(face = "bold"),
strip.background = element_blank())
p1
```
We can add an extra variable to the facet
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill", width = 1) +
facet_grid(~ fraction + soil, scales = "free_x") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(color = "black"),
strip.text = element_text(face = "bold"),
strip.background = element_blank())
p1
```
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = otu_id), stat = "identity", position = "fill", width = 1) +
facet_grid(~ fraction + soil, scales = "free_x", space = "free_x") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(color = "black"),
strip.text = element_text(face = "bold"),
strip.background = element_blank())
p1
```
## Excercise 1
Re-make the plot above but using accession instead of soil type in the facets
## Excercise 2
Re-make the plot above but show all three variables (fraction, accession and
soil) in the facets. What happens if you change the order of the terms
in the facet formulas? is any of the orders better?
# Adding OTU taxonomy
```{r}
Tax <- read_tsv("data/rhizo/otu_taxonomy.tsv",
col_types = cols(otu_id = col_character(),
Phylum = col_character()))
Tax
```
Let's include all OTUs in our data table now
```{r}
dat <-Tab %>%
pivot_longer(-otu_id, names_to = "sample_id",
values_to = "count")
print(dat)
```
As we did before with sample metadata, we use `left_join` to add OTU
taxonomic information to our data table
```{r}
dat <- dat %>%
left_join(Tax, by = "otu_id")
print(dat)
```
Let's make our basic plot, but using `Phylum` instead of `otu_id`
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = Phylum), stat = "identity", position = "fill", width = 1) +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(face = "bold"))
p1
```
Can we find better colors
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = Phylum), stat = "identity", position = "fill", width = 1) +
scale_fill_brewer(palette = "Paired") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(face = "bold"))
p1
```
I want phyla to be ordered by abundance and "unclassified" to be
at the end
Code to get order not shown but results are in `mean_freqs`
```{r, echo=FALSE}
mean_freqs <- dat %>%
group_by(sample_id, Phylum) %>%
summarise(phyl_count = sum(count),
.groups = 'drop') %>%
group_by(sample_id) %>%
summarise(phyl_freq = phyl_count / sum(phyl_count),
Phylum = Phylum,
.groups = 'drop') %>%
group_by(Phylum) %>%
summarise(mean = mean(phyl_freq),
.groups = 'drop') %>%
arrange(desc(mean))
print(mean_freqs)
```
Create a vector with the desired order of phyla and use it to convert
phylum column to a factor.
```{r}
phyla_order <- c("Proteobacteria",
"Actinobacteria",
"Bacteroidetes",
"Acidobacteria",
"Firmicutes",
"Cyanobacteria",
"Verrucomicrobia",
"Gemmatimonadetes",
"Armatimonadetes",
"Chloroflexi",
"unclassified")
dat <- dat %>%
mutate(Phylum = factor(Phylum, levels = phyla_order))
dat
```
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
geom_bar(aes(fill = Phylum), stat = "identity", position = "fill", width = 1) +
scale_fill_brewer(palette = "Paired") +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(face = "bold"))
p1
```
## Excercise
Use `scale_color_manual` to manually select a good set of colors for this plot
# Combining sample metadata and OTU taxonomy
We can use `lef_join` twice in a row to get a combinde data table with all
the information
```{r}
dat <- Tab %>%
pivot_longer(-otu_id, names_to = "sample_id", values_to = "count") %>%
left_join(Meta, by = "sample_id") %>%
left_join(Tax, by = "otu_id")
print(dat)
```
To get the same order as above we convert the Phylum column to a factor
```{r}
dat <- dat %>%
mutate(Phylum = factor(Phylum, levels = phyla_order))
dat
```
Now we can combine facets and phylum color and ordering in just one plot
```{r}
p1 <- dat %>%
ggplot(aes(x = sample_id, y = count)) +
facet_grid(~ fraction + soil, scales = "free_x", space = "free_x") +
geom_bar(aes(fill = Phylum), stat = "identity", position = "fill", width = 1) +
scale_y_continuous(name = "Relative abundance", labels = scales::percent) +
scale_fill_brewer(palette = "Paired") +
theme(axis.text.x = element_text(angle = 90, size = 6),
axis.text.y = element_text(color = "black"),
strip.text = element_text(face = "bold"),
strip.background = element_blank())
p1
```
We save the plot to a file
```{r}
ggsave("rhizo_phylo_distribution.png", p1, width = 8, height = 4)
```
# Extra excercises
Look at the files at [data/hmp_v13](data/hmp_v13) which contain much bigger
data tables generated from the Human Microbiome Project (HMP).
Can you make similar plots illustrating the bacterial taxonomic distributions
of the Stool and Saliva? Are there any differences by sex?
Can you order the samples by Proteobacteria relative abundance in the
rhizosphere dataset? what about the Bacteroidetes abundance in the
HMP dataset? (HINT: Use the same approach we used to sort taxonomic groups)
# Session info
```{r}
sessionInfo()
```