
Commit dc346fb

authored
Add files via upload
new scripts for don't use boxplot for binomial data and histogram for small n.
1 parent 70eb815 commit dc346fb

2 files changed

Lines changed: 221 additions & 0 deletions


Scripts/BoxPlot_for_Binomial.Rmd

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
title: "BoxPlot_for_Binomial"
author: "Chenxin Li"
date: "2024-12-10"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Friends don't let friends use boxplot for bimodal data

# Packages
```{r}
library(tidyverse)
library(RColorBrewer)
library(ggbeeswarm)
library(patchwork)
```

## Data
```{r}
set.seed(666)
```

```{r}
# group1: a single normal distribution centered at 5
data1 <- data.frame(
  response = rnorm(n = 100, mean = 5, sd = 2)
) %>%
  mutate(group = "group1")

# group2: a mixture of two normal distributions (two modes)
data2 <- data.frame(
  response = c(
    rnorm(n = 50, mean = 2.5, sd = 1),
    rnorm(n = 50, mean = 7.5, sd = 1)
  )) %>%
  mutate(group = "group2")

# group3: a mixture of three normal distributions (three modes)
data3 <- data.frame(
  response = c(
    rnorm(n = 33, mean = 2, sd = 0.5),
    rnorm(n = 33, mean = 5, sd = 0.5),
    rnorm(n = 33, mean = 8, sd = 0.5)
  )) %>%
  mutate(group = "group3")
```

## Bad example
```{r}
Box <- rbind(
  data1,
  data2,
  data3
) %>%
  ggplot(aes(x = group, y = response)) +
  geom_boxplot(aes(fill = group), alpha = 0.8, width = 0.7) +
  scale_fill_manual(values = brewer.pal(8, "Set2")) +
  labs(title = "Very similar!") +
  theme_classic()

Box
```
## Good example
```{r}
Dots <- rbind(
  data1,
  data2,
  data3
) %>%
  ggplot(aes(x = group, y = response)) +
  geom_quasirandom(aes(color = group), alpha = 0.8) +
  scale_color_manual(values = brewer.pal(8, "Set2")) +
  labs(title = "I guess not!") +
  theme_classic()

Dots
```
# Wrap them together
```{r}
wrap_plots(
  Box, Dots,
  nrow = 1
) &
  theme(legend.position = "none")

ggsave("../Results/BoxPlots_for_binomial.svg", width = 5, height = 2.5)
ggsave("../Results/BoxPlots_for_binomial.png", width = 5, height = 2.5)
```
Before making a box plot, check the distribution of your data.
Because box plots summarize the data by the median and quartiles,
they cannot reveal bimodal data (or, by extension, data with any number of multiple modes).
Plotting all the data points using `geom_quasirandom()` from the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) is the best practice for small to moderate sample sizes (fewer than tens of thousands of observations),
as dots are robust to small sample sizes,
whereas distribution-based graphics such as violin plots and histograms are not.
See [this section](https://github.com/cxli233/FriendsDontLetFriends#2-friends-dont-let-friends-make-violin-plots-for-small-sample-sizes) and [this section](https://github.com/cxli233/FriendsDontLetFriends/tree/main?tab=readme-ov-file#friends-dont-let-friends-use-histogram-for-small-sample-sizes) for details.
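
To see why the three box plots look so alike, it can help to print the numbers a box plot is built from. The sketch below is not part of the committed script: it rebuilds the three groups with the same parameters in base R (the names `all_data` and `box_stats` are made up for this illustration) and tabulates the minimum, quartiles, and maximum of each group.

```r
set.seed(666)

# Rebuild the three groups: one, two, and three modes respectively.
all_data <- rbind(
  data.frame(group = "group1", response = rnorm(n = 100, mean = 5, sd = 2)),
  data.frame(group = "group2", response = c(rnorm(n = 50, mean = 2.5, sd = 1),
                                            rnorm(n = 50, mean = 7.5, sd = 1))),
  data.frame(group = "group3", response = c(rnorm(n = 33, mean = 2, sd = 0.5),
                                            rnorm(n = 33, mean = 5, sd = 0.5),
                                            rnorm(n = 33, mean = 8, sd = 0.5)))
)

# Minimum, quartiles, and maximum per group: one row per group.
box_stats <- t(sapply(
  split(all_data$response, all_data$group),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1)
))

round(box_stats, 2)
```

Each group is reduced to five numbers, and nothing in those five numbers records how many modes the group has. That information only shows up once the individual points are drawn.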

Scripts/Histogram_for_small_n.Rmd

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
---
title: "Histogram_for_small_n"
author: "Chenxin Li"
date: "2024-12-10"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Friends don't let friends use histogram for small sample sizes

# Packages
```{r}
library(tidyverse)
library(RColorBrewer)
library(ggbeeswarm)
library(viridis)
library(patchwork)
```

# Data
```{r}
set.seed(666)
```

```{r}
# Three samples from the same standard normal distribution,
# differing only in sample size.
n10 <- data.frame(
  response = rnorm(n = 10)
) %>%
  mutate(group = "n = 10")

n100 <- data.frame(
  response = rnorm(n = 100)
) %>%
  mutate(group = "n = 100")

n1000 <- data.frame(
  response = rnorm(n = 1000)
) %>%
  mutate(group = "n = 1000")
```

# Graphs
```{r}
bins10 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 10, width = 0.7, color = "white", alpha = 0.8,
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[1]) +
  labs(title = "10 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins10
```

```{r}
bins30 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 30, width = 0.7, color = "white",
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[4]) +
  labs(title = "30 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins30
```

```{r}
bins50 <- rbind(
  n10, n100, n1000
) %>%
  ggplot(aes(x = response)) +
  facet_wrap(~group, scales = "free", ncol = 1) +
  geom_histogram(bins = 50, width = 0.7, color = "white",
                 fill = viridis(n = 8, begin = 0.1, end = 0.8)[7]) +
  labs(title = "50 bins") +
  theme_classic() +
  theme(panel.spacing = unit(1, "lines"),
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(hjust = 0))

bins50
```

# Wrap them
```{r}
wrap_plots(
  bins10, bins30, bins50 +
    labs(caption = "\nWow, the appearance does change with different bin numbers."),
  ncol = 3
) &
  theme(plot.caption = element_text(size = 10))

ggsave("../Results/Histogram_for_small_n.svg", height = 6, width = 8)
ggsave("../Results/Histogram_for_small_n.png", height = 6, width = 8)
```
I've seen histograms proposed as a replacement for bar plots.
However, a serious caveat is that histograms are not robust to the number of bins for small (and even moderate) sample sizes.
What is a histogram anyway? In a histogram, we first divide the range of the data into a defined number of bins.
Then we count how many observations fall into each bin and graph the counts.

In this example, I sampled _the same_ normal distribution 3 times with different sample sizes (n = 10, 100, and 1000).
Even though the samples came from _the same_ normal distribution, the histograms look quite different depending on the number of bins.
To showcase this, I plotted histograms with 10, 30, and 50 bins.

First of all, a histogram makes no sense for small sample sizes. With small sample sizes (n < 30), the better practice is to graph all data points.
Second of all, you can see that the shape of the histogram is only robust to changing the number of bins when the sample size is fairly large (like 1000).
Even at n = 100, the appearance of the histogram can change drastically as the number of bins changes.
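
The bin-then-count step described above can be sketched in a few lines of base R. This is a simplified stand-in and not what `geom_histogram()` does internally: `count_bins()` is a hypothetical helper written for this illustration that cuts the observed range into equal-width bins, and ggplot2 may place its bin edges differently. The samples are also drawn fresh here, so they are not necessarily the same values as in the script.

```r
set.seed(666)

# Hypothetical helper (not from ggplot2): cut the observed range of x into
# `bins` equal-width bins and count the observations in each one.
count_bins <- function(x, bins) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  table(cut(x, breaks = breaks, include.lowest = TRUE))
}

x10 <- rnorm(n = 10)
x1000 <- rnorm(n = 1000)

# Share of empty bins for each sample size (rows) and bin number (columns).
empty_share <- sapply(c(10, 30, 50), function(b) {
  c(n10 = mean(count_bins(x10, b) == 0),
    n1000 = mean(count_bins(x1000, b) == 0))
})
colnames(empty_share) <- paste(c(10, 30, 50), "bins")

round(empty_share, 2)
```

With only 10 observations, at least 20 of 30 bins (and at least 40 of 50) have to be empty no matter how the data fall, so the shape of the histogram mostly reflects where the bin edges happen to land rather than the underlying distribution.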
