ModernDive_book/05-regression.Rmd at v2 · moderndive/ModernDive_book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
(ref:moderndivepart) Statistical Modeling with `moderndive`

```{r regression-conditional-text, echo=FALSE, results="asis", purl=FALSE}
if (is_latex_output()) {
  cat("# (PART) (ref:moderndivepart) {-}")
} else {
  cat("# (PART) Statistical Modeling with moderndive {-} ")
}
```

# Simple Linear Regression {#regression}

```{r regression-setup, include=FALSE, purl=FALSE}
# Used to define Learning Check numbers:
chap <- 5
lc <- 0

# Set R code chunk defaults:
opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  warning = FALSE,
  message = TRUE,
  tidy = FALSE,
  purl = TRUE,
  out.width = "\\textwidth",
#  fig.height = 4,
  fig.align = "center"
)

# Set output digit precision
options(scipen = 99, digits = 3)

# In kable printing replace all NA's with blanks
options(knitr.kable.NA = "")

# Set random number generator see value for replicable pseudo-randomness.
set.seed(76)
```

We have introduced data visualization in Chapter \@ref(viz), data wrangling in Chapter \@ref(wrangling), and data importing and "tidy" data in Chapter \@ref(tidy). In this chapter, we work with **regression**, a method that helps us study the relationship between an *outcome variable* or *response* and one or more *explanatory variables* or *regressors*.
The method starts by proposing a *statistical model*. Data is then collected and used to estimate the coefficients or parameters for the model, and these results are typically used for two purposes:

1. For **explanation** when we want to describe how changes in one or more of the regressors are associated with changes in the response, quantify those changes, establish which of the regressors truly have an association with the response, or determine whether the model used to describe the relationship between the response and the explanatory variables seems appropriate.
1. For **prediction** when we want to determine, based on the observed values of the regressors, what will the value of the response be? We are not concerned about how all the regressors relate and interact with one another or with the response, we simply want as good predictions as possible.

As an illustration, assume that we want to study the relationship between blood pressure and potential risk factors such as daily salt intake, age, and physical activity levels.
The response is blood pressure, and the regressors are the risk factors.
If we use linear regression for explanation, we may want to determine whether reducing daily salt intake has a real effect on lowering blood pressure, or by how much blood pressure decreases if an individual reduces their salt intake by half.
This information may help target individuals of a specific age group with advice on dietary changes to manage blood pressure.
On the other hand, if we use linear regression for prediction, we would like to determine, as accurately as possible, the blood pressure of a given individual based on the data collected about their salt intake, age, and physical activity levels. In this chapter, we will use linear regression for explanation.

The most basic and commonly-used type of regression is *linear regression*\index{regression!linear}.
Linear regression involves a *numerical* response and one or more regressors that can be *numerical* or *categorical*. It is called linear regression because the **statistical model** that describes the relationship between the expected response and the regressors is assumed to be linear. In particular, when the model has a single regressor, the linear regression is the equation of a line.
Linear regression is the foundation for almost any other type of regression or related method.

In Chapter \@ref(regression), we introduce linear regression with only one regressor. In Section \@ref(model1), the explanatory variable is numerical. This scenario is known as *simple linear regression*. In Section \@ref(model2), the explanatory variable is categorical.

In Chapter \@ref(multiple-regression) on multiple regression, we extend these ideas and work with models with two explanatory variables.  In Section \@ref(model4), we work with two numerical explanatory variables. In Section \@ref(model3), we work with one numerical and one categorical explanatory variable and study the model with and without interactions.

In Chapter \@ref(inference-for-regression) on inference for regression, we revisit the regression models and analyze the results using *statistical inference*, a method discussed in Chapters \@ref(sampling), \@ref(confidence-intervals), and \@ref(hypothesis-testing) on sampling, bootstrapping and confidence intervals, and hypothesis testing and $p$-values, respectively. The focus there is also be on using linear regression for prediction instead of explanation.

We begin with regression with a single explanatory variable\index{regression!basic}. We also introduce the *correlation coefficient*, discuss "correlation versus causation," and determine whether the model *fits* the data observed.


### Needed packages {-#reg-packages}

We now load all the packages needed for this chapter (this assumes you've already installed them). In this chapter, we introduce some new packages:

1. The `tidyverse` "umbrella" [@R-tidyverse] package. Recall from our discussion in Section \@ref(tidyverse-package) that loading the `tidyverse` package by running `library(tidyverse)` loads the following commonly used data science packages all at once:
    + `ggplot2` for data visualization
    + `dplyr` for data wrangling
    + `tidyr` for converting data to "tidy" format
    + `readr` for importing spreadsheet data into R
    + As well as the more advanced `purrr`, `tibble`, `stringr`, and `forcats` packages
2. The `moderndive` \index{R packages!moderndive} package of datasets and functions for tidyverse-friendly introductory linear regression as well as a data frame summary function.

If needed, read Section \@ref(packages) for information on how to install and load R packages.

```{r regression-load-packages, message=FALSE}
library(tidyverse)
library(moderndive)
```


```{r regression-load-internal, message=FALSE, echo=FALSE, purl=FALSE}
# Packages needed internally, but not in text.
library(mvtnorm)
library(broom)
library(kableExtra)
library(patchwork)
```


## One numerical explanatory variable {#model1}

Before we introduce the model needed for simple linear regression, we present an example.
Why do some countries exhibit high fertility rates while others have significantly lower ones? Are there correlations between fertility rates and life expectancy across different continents and nations? Could underlying socioeconomic factors be influencing these trends?

These are all questions that are of interest to demographers and policy makers, as understanding fertility rates is important for planning and development. By analyzing the dataset of UN member states, which includes variables such as country codes (ISO), fertility rates, and life expectancy for 2022, researchers can uncover patterns and make predictions about fertility rates based on life expectancy.

```{r regression-dynamic-text, echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
n_countries <- un_member_states_2024 |> nrow()
```

In this section, we aim to explain differences in fertility rates as a function of one numerical variable: life expectancy. Could it be that countries with higher life expectancy also have lower fertility rates? Could it be instead that countries with higher life expectancy tend to have higher fertility rates? Or could it be that there is no relationship between life expectancy and fertility rates? We answer these questions by modeling the relationship between fertility rates and life expectancy using *simple linear regression* \index{regression!simple linear} where we have:

1. A numerical outcome variable $y$ (the country's fertility rate) and
2. A single numerical explanatory variable $x$ (the country's life expectancy).


### Exploratory data analysis {#model1EDA}

The data on the `r n_countries` current UN member states (as of 2024) can be found in the `un_member_states_2024` data frame included in the `moderndive` package. However, to keep things simple we include only those rows that don't have missing data with `na.omit()` and `select()` only the subset of the variables we'll consider in this chapter, and save this data in a new data frame called `UN_data_ch5`:

```{r regression-create-UN_data_ch5}
UN_data_ch5 <- un_member_states_2024 |>
  select(iso,
         life_exp = life_expectancy_2022,
         fert_rate = fertility_rate_2022,
         obes_rate = obesity_rate_2016)|>
  na.omit()
```

```{r regression-create-n_demo_ch5, include=FALSE}
n_demo_ch5 <- nrow(UN_data_ch5)
```

A crucial step before doing any kind of analysis or modeling is performing an *exploratory data analysis*, \index{data analysis!exploratory} or EDA for short. EDA gives you a sense of the distributions of the individual variables in your data, whether any potential relationships exist between variables, whether there are outliers and/or missing values, and (most importantly) how to build your model. Here are three common steps in an EDA:

1. Most crucially, looking at the raw data values.
1. Computing summary statistics, such as means, medians, and maximums.
1. Creating data visualizations.

We perform the first common step in an exploratory data analysis: looking at the raw data values. Because this step seems so trivial, unfortunately many data analysts ignore it. However, getting an early sense of what your raw data looks like can often prevent many larger issues down the road.

You can do this by using RStudio's spreadsheet viewer or by using the `glimpse()` function as introduced in Subsection \@ref(exploredataframes) on exploring data frames:

```{r regression-glimpse-UN_data_ch5}
glimpse(UN_data_ch5)
```

Observe that ``Rows: `r n_demo_ch5` `` indicates that there are `r n_demo_ch5` rows/observations in `UN_data_ch5` after filtering out the missing values, where each row corresponds to one observed country/member state. It is important to note that the *observational unit* \index{observational unit} is an individual country. Recall from Subsection \@ref(exploredataframes) that the observational unit is the "type of thing" that is being measured by our variables.

A full description of all the variables included in `un_member_states_2024` can be found by reading the associated help file (run [`?un_member_states_2024`](https://moderndive.github.io/moderndive/reference/un_member_states_2024.html) in the console). Let's describe only the `r UN_data_ch5 |> ncol()` variables we selected in `UN_data_ch5`:

1. `iso`: An identification variable used to distinguish between the `r n_demo_ch5` countries in the filtered dataset.
2. `fert_rate`: A numerical variable representing the country's fertility rate in 2022 corresponding to the expected number of children born per woman in child-bearing years. This is the outcome variable $y$ of interest.
3. `life_exp`: A numerical variable representing the country's average life expectancy in 2022 in years. This is the primary explanatory variable $x$ of interest.
4. `obes_rate`: A numerical variable representing the country's obesity rate in 2016. This will be another explanatory variable $x$ that we use in the *Learning check* at the end of this subsection.

```{r regression-create-sample_size, echo=FALSE}
sample_size <- 5
```


An alternative way to look at the raw data values is by choosing a random sample of the rows in `UN_data_ch5` by piping it into the `slice_sample()` \index{R packages!dplyr!slice\_sample()} function from the `dplyr` package. Here we set the `n` argument to be ``r sample_size``, indicating that we want a random sample of `r sample_size` rows. We display the results in Table \@ref(tab:random-countries). Note that due to the random nature of the sampling, you will likely end up with a different subset of `r sample_size` rows.

```{r regression-sample-rows, eval=FALSE}
UN_data_ch5 |>
  slice_sample(n = 5)
```

```{r random-countries, echo=FALSE, purl=FALSE}
UN_data_ch5 |>
  slice_sample(n = sample_size) |>
  kbl(
    digits = 3,
    caption = paste("A random sample of", sample_size,
                    "out of the 193 total countries (181 without missing data)"),
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

We have looked at the raw values in our `UN_data_ch5` data frame and got a preliminary sense of the data. We can now compute summary statistics. We start by computing the mean and median of our numerical outcome variable `fert_rate` and our numerical explanatory variable `life_exp`. We do this by using the `summarize()` function from `dplyr` along with the `mean()` and `median()` summary functions we saw in Section \@ref(summarize).

```{r regression-compute-mean, eval=FALSE}
UN_data_ch5 |>
  summarize(mean_life_exp = mean(life_exp),
            mean_fert_rate = mean(fert_rate),
            median_life_exp = median(life_exp),
            median_fert_rate = median(fert_rate))
```

```{r regression-compute-mean-sized, echo=FALSE}
UN_data_ch5 |>
  summarize(mean_life_exp = mean(life_exp),
            mean_fert_rate = mean(fert_rate),
            median_life_exp = median(life_exp),
            median_fert_rate = median(fert_rate)) |>
  kbl() |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles?

Typing all these summary statistic functions in `summarize()` would be long and tedious. Instead, we use the convenient [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function in the `moderndive` \index{R packages!moderndive!tidy\_summary()} package. This function takes in a data frame, summarizes it, and returns commonly used summary statistics in tidy format. We take our `UN_data_ch5` data frame, `select()` only the outcome and explanatory variables `fert_rate` and `life_exp`, and pipe them into the `tidy_summary` function:

```{r regression-select-vars, eval=FALSE}
UN_data_ch5 |>
  select(fert_rate, life_exp) |>
  tidy_summary()
```

```{r regression-select-vars-sized, echo=FALSE}
UN_data_ch5 |>
  select(fert_rate, life_exp) |>
  tidy_summary() |>
  kbl() |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

We can also do this more directly by providing which `columns` we'd like a summary of inside the `tidy_summary()` function:

```{r regression-alt, eval=FALSE}
UN_data_ch5 |>
  tidy_summary(columns = c(fert_rate, life_exp))
```

```{r regression-conditional, echo=FALSE}
UN_data_ch5 |>
  tidy_summary(columns = c(fert_rate, life_exp)) |>
  kbl() |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```


Both return the same results for the numerical variables `fert_rate` and `life_exp`:

- `column`: the name of the column being summarized
- `n`: the number of non-missing values
- `group`: `NA` (missing) for numerical columns, but will break down a categorical variable into its levels
- `type`: which type of column is it (`numeric`, `character`, `factor`, or `logical`)
- `min`: the *minimum* value
- `Q1`: the 1st quartile: the value at which 25% of observations are smaller than it (the *25th percentile*)
- `mean`: the average value for measuring central tendency
- `median`: the 2nd quartile: the value at which 50% of observations are smaller than it (the *50th percentile*)
- `Q3`: the 3rd quartile: the value at which 75% of observations are smaller than it (the *75th percentile*)
- `max`: the *maximum* value
- `sd`: the standard deviation value for measuring spread


```{r regression-create-summary_df, echo=FALSE}
summary_df <- UN_data_ch5 |>
  select(fert_rate, life_exp) |>
  tidy_summary() |>
  filter(type == "numeric")
fert_summary_df <- summary_df  |>
  filter(column == "fert_rate")
life_summary_df <- summary_df  |>
  filter(column == "life_exp")
```

Looking at this output, we can see how the values of both variables distribute. For example, the median fertility rate was `r fert_summary_df$median[1]`, whereas the median life expectancy was `r life_summary_df$median[1]` years. The middle 50% of fertility rates was between `r fert_summary_df$Q1[[1]]` and `r fert_summary_df$Q3[[1]]` (the first and third quartiles), and the middle 50% of life expectancies was from `r life_summary_df$Q1[[1]]` to `r life_summary_df$Q3[[1]]`.

The `tidy_summary()` function only returns what are known as *univariate* \index{univariate} summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist *bivariate* \index{bivariate} summary statistics: functions that take in two variables and return some summary of those two variables.

In particular, when the two variables are numerical, we can compute the \index{correlation (coefficient)} *correlation coefficient*. Generally speaking, *coefficients* are quantitative expressions of a specific phenomenon. A *correlation coefficient* measures the *strength of the linear relationship between two numerical variables*. Its value goes from -1 and 1 where:

* -1 indicates a perfect *negative relationship*:
  As one variable increases, the value of the other variable tends to go down, following a straight line.
* 0 indicates no relationship:
  The values of both variables go up/down independently of each other.
* +1 indicates a perfect *positive relationship*:
  As the value of one variable goes up, the value of the other variable tends to go up as well in a linear fashion.

Figure \@ref(fig:correlation1) gives examples of nine different correlation coefficient values for hypothetical numerical variables $x$ and $y$.

```{r correlation1, echo=FALSE, fig.cap="Nine different correlation coefficients.", fig.height=ifelse(knitr::is_latex_output(), 3, 4), purl=FALSE}
correlation <- c(-0.9999, -0.9, -0.75, -0.3, 0, 0.3, 0.75, 0.9, 0.9999)
n_sim <- 100
values <- NULL
for (i in seq_along(correlation)) {
  rho <- correlation[i]
  sigma <- matrix(c(5, rho * sqrt(50), rho * sqrt(50), 10), 2, 2)
  sim <- rmvnorm(
    n = n_sim,
    mean = c(20, 40),
    sigma = sigma
  ) |>
    as.data.frame() |>
    as_tibble() |>
    mutate(correlation = round(rho, 2))

  values <- bind_rows(values, sim)
}

corr_plot <- ggplot(data = values, mapping = aes(V1, V2)) +
  geom_point() +
  facet_wrap(~correlation, ncol = 3) +
  labs(x = "x", y = "y") +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  )

if (is_latex_output()) {
  corr_plot +
    theme(
      strip.text = element_text(colour = "black"),
      strip.background = element_rect(fill = "grey93")
    )
} else {
  corr_plot
}
```

For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between $x$ and $y$, but it is not as strong as the negative linear relationship between $x$ and $y$ when the correlation coefficient is -0.9 or -1.

The correlation coefficient can be computed using the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) \index{R packages!moderndive!get\_correlation()} function in the `moderndive` package. In this case, the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient.

We put the name of the outcome variable on the left-hand side of the `~` "tilde" sign, while putting the name of the explanatory variable on the right-hand side. This is known as R's \index{R!formula notation} *formula notation*. We will use this same "formula" syntax with regression later in this chapter.

```{r regression-demo-code}
UN_data_ch5 |>
  get_correlation(formula = fert_rate ~ life_exp)
```

An alternative way to compute correlation is to use the `cor()` summary function within a `summarize()`:

```{r regression-summarize, eval=FALSE}
UN_data_ch5 |>
  summarize(correlation = cor(fert_rate, life_exp))
```

```{r regression-summarize-v2, echo=FALSE, purl=FALSE}
cor_ch5 <- UN_data_ch5 |>
  summarize(correlation = cor(fert_rate, life_exp,
                              use = "complete.obs")) |>
  round(3) |>
  pull()
```

In our case, the correlation coefficient of `r cor_ch5` indicates that the relationship between fertility rate and life expectancy is "moderately negative." There is a certain amount of subjectivity in interpreting correlation coefficients, especially those that are not close to the extreme values of -1, 0, and 1. To develop your intuition about correlation coefficients, play the "Guess the Correlation" 1980's style video game mentioned in Subsection \@ref(additional-resources-basic-regression).

We now perform the last step in EDA: creating data visualizations. Since both the `fert_rate` and `life_exp` variables are numerical, a scatterplot is an appropriate graph to visualize this data. We do this using `geom_point()` and display the result in Figure \@ref(fig:numxplot1). Furthermore, we set the `alpha` value to `0.1` to check for any overplotting.

```{r numxplot1, fig.cap="Scatterplot of relationship of life expectancy and fertility rate.", fig.height=ifelse(knitr::is_latex_output(), 4.5, 5)}
ggplot(UN_data_ch5,
       aes(x = life_exp, y = fert_rate)) +
  geom_point(alpha = 0.1) +
  labs(x = "Life Expectancy", y = "Fertility Rate")
```

We do not see much for overplotting due to little to no overlap in the points. Most life expectancy entries appear to fall between 70 and 80 years, while most fertility rate entries fall between 1.5 and 3.5 births. Furthermore, while opinions may vary, it is our opinion that the relationship between fertility rate and life expectancy is "moderately negative." This is consistent with our earlier computed correlation coefficient of `r cor_ch5`.

We build on the scatterplot in Figure \@ref(fig:numxplot1) by adding a "best-fitting" line: of all possible lines we can draw on this scatterplot, it is the line that "best" fits through the cloud of points. We do this by adding a new `geom_smooth(method = "lm", se = FALSE)` layer to the `ggplot()` code that created the scatterplot in Figure \@ref(fig:numxplot1). The `method = "lm"` argument sets the line to be a "**l**inear **m**odel." The `se = FALSE` \index{R packages!ggplot2!geom\_smooth()} argument suppresses _standard error_ uncertainty bars. (We'll define the concept of _standard error_ later in Subsection \@ref(sampling-variation).)

```{r numxplot3, fig.cap="Scatterplot of life expectancy and fertility rate with regression line.", message=FALSE, fig.height=ifelse(knitr::is_latex_output(), 4, 5)}
ggplot(UN_data_ch5, aes(x = life_exp, y = fert_rate)) +
  geom_point(alpha = 0.1) +
  labs(x = "Life Expectancy",
    y = "Fertility Rate",
    title = "Relationship of life expectancy and fertility rate") +
  geom_smooth(method = "lm", se = FALSE)
```

The line in the resulting Figure \@ref(fig:numxplot3) is called a "regression line." The regression line \index{regression!line} is a visual summary of the relationship between two numerical variables, in our case the outcome variable `fert_rate` and the explanatory variable `life_exp`. The negative slope of the blue line is consistent with our earlier observed correlation coefficient of `r cor_ch5` suggesting that there is a negative relationship between these two variables: as a country's population has higher life expectancy it tends to have a lower fertility rate. We'll see later, however, that while the correlation coefficient and the slope of a regression line always have the same sign (positive or negative), they typically do not have the same value.

Furthermore, a regression line is "best-fitting" in that it minimizes some mathematical criteria. We present these mathematical criteria in Subsection \@ref(leastsquares), but we suggest you read this subsection only after first reading the rest of this section on regression with one numerical explanatory variable.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Conduct a new exploratory data analysis with the same outcome variable $y$ being `fert_rate` but with `obes_rate` as the new explanatory variable $x$. Remember, this involves three things:

(a) Looking at the raw data values.
(a) Computing summary statistics.
(a) Creating data visualizations.

What can you say about the relationship between obesity rate and fertility rate based on this exploration?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
What is the main purpose of performing an exploratory data analysis (EDA) before fitting a regression model?

- A. To predict future values.
- B. To understand the relationship between variables and detect potential issues.
- C. To create more variables.
- D. To generate random samples.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which of the following is correct about the correlation coefficient?

- A. It ranges from -2 to 2.
- B. It only measures the strength of non-linear relationships.
- C. It ranges from -1 to 1 and measures the strength of linear relationships.
- D. It is always zero.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


### Simple linear regression {#model1table}

You may recall from secondary/high school algebra that the equation of a line is $y = a + b\cdot x$. (Note that the $\cdot$ symbol is equivalent to the $\times$ "multiply by" mathematical symbol. We'll use the $\cdot$ symbol in the rest of this book as it is more succinct.) It is defined by two coefficients $a$ and $b$. The intercept coefficient $a$ is the value of $y$ when $x = 0$. The slope coefficient $b$ for $x$ is the increase in $y$ for every increase of one in $x$. This is also called the "rise over run."

However, when defining a regression line like the one in Figure \@ref(fig:numxplot3), we use slightly different notation: the equation of the regression line is $\widehat{y} = b_0 + b_1 \cdot x$ \index{regression!equation of a line}. The intercept coefficient is $b_0$, so $b_0$ is the value of $\widehat{y}$ when $x = 0$. The slope coefficient for $x$ is $b_1$, i.e., the increase in $\widehat{y}$ for every increase of one in $x$. Why do we put a "hat" on top of the $y$? It's a form of notation commonly used in regression to indicate that we have a \index{regression!fitted value} "fitted value," or the value of $y$ on the regression line for a given $x$ value as discussed further in Subsection \@ref(model1points).

We know that the regression line in Figure \@ref(fig:numxplot3) has a negative slope $b_1$ corresponding to our explanatory $x$ variable `life_exp`. Why? Because as countries tend to have higher `life_exp` values, they tend to have lower `fert_rate` values. However, what is the numerical value of the slope $b_1$? What about the intercept $b_0$?  We do not compute these two values by hand, but rather we use a computer!

We can obtain the values of the intercept $b_0$ and the slope for `life_exp` $b_1$ by outputting the *linear regression coefficients*. This is done in two steps:

1. We first "fit" the linear regression model using the `lm()` function and save it in `demographics_model`.
1. We get the regression coefficients by applying `coef()`  to `demographics_model`.

```{r regression-lm-fertility, eval=FALSE}
# Fit regression model:
demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5)
# Get regression coefficients
coef(demographics_model)
```
```{r regression-lm-fertility-alt, echo=FALSE, purl=FALSE}
demographics_model <- lm(fert_rate ~ life_exp,
                         data = UN_data_ch5)
demo_line <- demographics_model |>
  get_regression_table() |>
  pull(estimate)
```

We first focus on interpreting the regression coefficients, and later revisit the code that produced it. The coefficients are the intercept $b_0 = `r demo_line[1]`$ and the slope $b_1 = `r demo_line[2]`$ for `life_exp`. Thus the equation of the regression line in Figure \@ref(fig:numxplot3) follows:

$$
\begin{aligned}
\widehat{y} &= b_0 + b_1 \cdot x\\
\widehat{\text{fertility}\_\text{rate}} &= b_0 + b_{\text{life}\_\text{expectancy}} \cdot \text{life}\_\text{expectancy}\\
&= `r demo_line[1] |> round(3)` + (`r demo_line[2]`) \cdot \text{life}\_\text{expectancy}
\end{aligned}
$$

The intercept $b_0$ = `r demo_line[1]` is the average fertility rate $\widehat{y}$ = $\widehat{\text{fertility}\_\text{rate}}$ for those countries that had a `life_exp` of 0. Or in graphical terms, where the line intersects the $y$ axis for $x$ = 0. Note, however, that while the \index{regression!equation of a line!intercept} intercept of the regression line has a mathematical interpretation, it has no *practical* interpretation here, since observing a `life_exp` of 0 is impossible. Furthermore, looking at the scatterplot with the regression line in Figure \@ref(fig:numxplot3), no countries had a life expectancy anywhere near 0.

Of greater interest is the \index{regression!equation of a line!slope} slope $b_{\text{life}\_\text{expectancy}}$ for `life_exp` of `r demo_line[2]`. This summarizes the relationship between the fertility rate and life expectancy variables. Note that the sign is negative, suggesting a negative relationship between these two variables. This means countries with higher life expectancies tend to have lower fertility rates. Recall that the correlation coefficient is `r cor_ch5`. They both have the same negative sign, but have a different value. Recall also that the correlation's interpretation is the "strength of linear association." The \index{regression!interpretation of the slope} slope's interpretation is a little different:

> For every increase of 1 unit in `life_exp`, there is an *associated* decrease of, *on average*, `r abs(demo_line[2])` units of `fert_rate`.

We only state that there is an *associated* increase and not necessarily a *causal* increase. Perhaps it may not be that higher life expectancies directly cause lower fertility rates. Instead, wealthier countries could tend to have stronger educational backgrounds, improved health, a higher standard of living, and have lower fertility rates, while at the same time these wealthy countries also tend to have higher life expectancies. Just because two variables are strongly associated, it does not necessarily mean that one causes the other. This is summed up in the often-quoted phrase, "correlation is not necessarily causation."  We discuss this idea further in Subsection \@ref(correlation-is-not-causation).

Furthermore, we say that this associated decrease is *on average* `r abs(demo_line[2])` units of `fert_rate`, because you might have two countries whose `life_exp` values differ by 1 unit, but their difference in fertility rates may not be exactly $`r demo_line[2]`$. What the slope of $`r demo_line[2]`$ is saying is that across all possible countries, the *average* difference in fertility rate between two countries whose life expectancies differ by one is $`r demo_line[2]`$.

Now that we have learned how to compute the equation for the regression line in Figure \@ref(fig:numxplot3) using the model coefficient values and how to interpret the resulting intercept and slope, we revisit the code that generated these coefficients:

```{r regression-lm-fertility-alt2, eval=FALSE}
# Fit regression model:
demographics_model <- lm(fert_rate ~ life_exp, data = UN_data_ch5)
# Get regression coefficients:
coef(demographics_model)
```

First, we "fit" the linear regression model to the `data` using the `lm()` \index{lm()} function and save this as `demographics_model`. When we say "fit," we mean "find the best fitting line to this data." `lm()` stands for "linear model" and is used as `lm(y ~ x, data = df_name)` where:

* `y` is the outcome variable, followed by a tilde `~`. In our case, `y` is set to `fert_rate`.
* `x` is the explanatory variable. In our case, `x` is set to `life_exp`.
* The combination of `y ~ x` is called a *model formula*. (Note the order of `y` and `x`.) In our case, the model formula is `fert_rate ~ life_exp`. We saw such model formulas earlier with the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) function in Subsection \@ref(model1EDA).
* `df_name` is the name of the data frame that contains the variables `y` and `x`. In our case, `data` is the `UN_data_ch5` data frame.

Second, we take the saved model in `demographics_model` and apply the `coef()` function to it to obtain the regression coefficients. This gives us the components of the regression equation line: the intercept $b_0$ and the slope $b_1$.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Fit a simple linear regression using `lm(fert_rate ~ obes_rate, data = UN_data_ch5)` where `obes_rate` is the new explanatory variable $x$. Learn about the "best-fitting" line from the regression coefficients by applying the `coef()` function. How do the regression results match up with your earlier exploratory data analysis?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does the intercept term $b_0$ represent in simple linear regression?

- A. The change in the outcome for a one-unit change in the explanatory variable.
- B. The predicted value of the outcome when the explanatory variable is zero.
- C. The standard error of the regression.
- D. The correlation between the outcome and explanatory variables.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What best describes the "slope" of a simple linear regression line?

- A. The increase in the explanatory variable for a one-unit increase in the outcome.
- B. The average of the explanatory variable.
- C. The change in the outcome for a one-unit increase in the explanatory variable.
- D. The minimum value of the outcome variable.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What does a negative slope in a simple linear regression indicate?

- A. The outcome variable decreases as the explanatory variable increases.
- B. The explanatory variable remains constant as the outcome variable increases.
- C. The correlation coefficient is zero.
- D. The outcome variable increases as the explanatory variable increases.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


### Observed/fitted values and residuals {#model1points}

We just saw how to get the value of the intercept and the slope of a regression line from the output of the `coef()` function. Now instead say we want information on individual observations. For example, we focus on the 21st of the `r n_demo_ch5` countries in the `UN_data_ch5` data frame in Table \@ref(tab:country-21). This corresponds to the UN member state of Bosnia and Herzegovina (BIH).

```{r country-21, echo=FALSE, purl=FALSE}
index <- which(UN_data_ch5$iso == "BIH")
target_point <- demographics_model |>
  get_regression_points() |>
  slice(index)
x <- target_point$life_exp
y <- target_point$fert_rate
y_hat <- target_point$fert_rate_hat
resid <- target_point$residual
UN_data_ch5 |>
  slice(index) |>
  kbl(
#    digits = 4,
    caption = "Data for the 21st country out of 193",
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

What is the value $\widehat{y}$ on the regression line corresponding to this country's `life_exp` value of `r x`? In Figure \@ref(fig:numxplot4) we mark three values corresponding to these results for Bosnia and Herzegovina and give their statistical names:

* Circle: The *observed value* $y$ = `r y` is this country's actual fertility rate.
* Square: The *fitted value* $\widehat{y}$ is the value on the regression line for $x =  \texttt{life\_exp} = `r x`$, computed with the intercept and slope in the previous regression table:

$$\widehat{y} = b_0 + b_1 \cdot x = `r demo_line[1]` + (`r demo_line[2]`) \cdot `r x` = `r y_hat`$$
* Arrow: The length of this arrow is the *residual* \index{regression!residual} and is computed by subtracting the fitted value $\widehat{y}$ from the observed value $y$. The residual can be thought of as a model's error or "lack of fit" for a particular observation.  In the case of this country, it is $y - \widehat{y} = `r y` - `r y_hat` = `r resid`$.

```{r numxplot4, echo=FALSE, fig.cap="Example of observed value, fitted value, and residual.", fig.height=ifelse(knitr::is_latex_output(), 3, 4), message=FALSE, purl=FALSE}
best_fit_plot <- ggplot(UN_data_ch5, aes(x = life_exp, y = fert_rate)) +
  geom_point(color = "grey") +
  labs(
    x = "Life Expectancy", y = "Fertility Rate",
    title = "Relationship of Fertility Rate and Life Expectancy"
  ) +
  geom_smooth(method = "lm", se = FALSE) +
  annotate("point", x = x, y = y_hat, col = "red", shape = 15, size = 4) +
  annotate("segment",
           x = x, xend = x, y = y, yend = y_hat, color = "blue",
           arrow = arrow(type = "closed", length = unit(0.04, "npc"))
  ) +
  annotate("point", x = x, y = y, col = "red", size = 4)
best_fit_plot
```

Now say we want to compute both the fitted value $\widehat{y} = b_0 + b_1 \cdot x$ and the residual $y - \widehat{y}$ for *all* `r n_demo_ch5` UN member states with complete data as of 2024. Recall that each country corresponds to one of the `r n_demo_ch5` rows in the `UN_data_ch5` data frame and also one of the `r n_demo_ch5` points in the regression plot in Figure \@ref(fig:numxplot4).

We could repeat the previous calculations we performed by hand `r n_demo_ch5` times, but that would be tedious and time consuming. Instead, we use a computer with the `get_regression_points()` function. We apply the `get_regression_points()` function to `demographics_model`, which is where we saved our `lm()` model in the previous section. In Table \@ref(tab:regression-points-1) we present the results of only the 21st through 24th courses for brevity.

```{r regression-reg-points, eval=FALSE}
regression_points <- get_regression_points(demographics_model)
regression_points
```

```{r regression-points-alt, echo=FALSE, purl=FALSE}
set.seed(76)
regression_points <- get_regression_points(demographics_model)
regression_points |>
  slice(c(index, index + 1, index + 2, index + 3)) |>
  kable(
    digits = 3,
    caption = "Regression points (for only the 21st through 24th countries)",
    booktabs = TRUE,
    linesep = ""
  )

# This code is used for dynamic non-static in-line text output purposes
n_regression_points <- regression_points |> nrow()
fert_rate_24 <- regression_points$fert_rate[24]
life_exp_24 <- regression_points$life_exp[24] |> format(round(2), nsmall = 3)
fert_rate_hat_24 <- regression_points$fert_rate_hat[24] |> round(3) |> format(nsmall = 3)
residual_24 <- regression_points$residual[24] |> round(3) |> format(nsmall = 3)
```

This function is an example of what is known in computer programming as a *wrapper function*. \index{functions!wrapper} It takes other pre-existing functions and "wraps" them into a single function that hides its inner workings. This concept is illustrated in Figure \@ref(fig:moderndive-figure-wrapper).

```{r moderndive-figure-wrapper, echo=FALSE, fig.cap="The concept of a wrapper function.", out.width="50%", purl=FALSE}
include_graphics("images/shutterstock/wrapper_function.png")
```

So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details "under the hood of the car." In our regression modeling example, the `get_regression_points()` function takes a saved `lm()` linear regression model as input and returns a data frame of the regression predictions as output. If you are interested in learning more about the `get_regression_points()` function's inner workings, check out Subsection \@ref(underthehood).

We inspect the individual columns and match them with the elements of Figure \@ref(fig:numxplot4):

* The `fert_rate` column represents the observed outcome variable $y$. This is the y-position of the `r n_regression_points` black points.
* The `life_exp` column represents the values of the explanatory variable $x$. This is the x-position of the `r n_regression_points` black points.
* The `fert_rate_hat` column represents the fitted values $\widehat{y}$. This is the corresponding value on the regression line for the `r n_regression_points` $x$ values.
* The `residual` column represents the residuals $y - \widehat{y}$. This is the `r n_regression_points` vertical distances between the `r n_regression_points` black points and the regression line.

Just as we did for the 21st country in the `UN_data_ch5` dataset (in the first row of the table), we repeat the calculations for the 24th country (in the fourth row of Table \@ref(tab:regression-points-1)). This corresponds to the country of Brunei (BRN):

* `fert_rate` $= `r fert_rate_24`$ is the observed `fert_rate` $y$ for this country.
* `life_exp` $= `r life_exp_24`$ is the value of the explanatory variable `life_exp` $x$ for Brunei.
* `fert_rate_hat` $= `r fert_rate_hat_24` = `r demo_line[1]` + (`r demo_line[2]`) \cdot `r life_exp_24`$ is the fitted value $\widehat{y}$ on the regression line for this country.
* `residual` $= `r residual_24` = `r fert_rate_24` - `r fert_rate_hat_24`$ is the value of the residual for this country. In other words, the model's fitted value was off by `r residual_24` fertility rate units for Brunei.

If you like, you can skip ahead to Subsection \@ref(leastsquares) to learn about the processes behind what makes "best-fitting" regression lines. As a primer, a "best-fitting" line refers to the line that minimizes the *sum of squared residuals* out of all possible lines we can draw through the points. In Section \@ref(model2), we'll discuss another common scenario of having a categorical explanatory variable and a numerical outcome variable.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is a "wrapper function" in the context of statistical modeling in R?

- A. A function that directly fits a regression model without using any other functions.
- B. A function that combines other functions to simplify complex operations and provide a user-friendly interface.
- C. A function that removes missing values from a dataset before analysis.
- D. A function that only handles categorical data in regression models.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Generate a data frame of the residuals of the *Learning check* model where you used `obes_rate` as the explanatory $x$ variable.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Which of the following statements is true about the regression line in a simple linear regression model?

- A. The regression line represents the average of the outcome variable.
- B. The regression line minimizes the sum of squared differences between the observed and predicted values.
- C. The regression line always has a slope of zero.
- D. The regression line is only useful when there is no correlation between variables.

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```

## One categorical explanatory variable {#model2}

It is an unfortunate truth that life expectancy is not the same across all countries in the world. International development agencies are interested in studying these differences in life expectancy in the hopes of identifying where governments should allocate resources to address this problem. In this section, we explore differences in life expectancy in two ways:

1. Differences between continents: Are there significant differences in average life expectancy between the six populated continents of the world: Africa, North America, South America, Asia, Europe, and Oceania?
1. Differences within continents: How does life expectancy vary within the world's five continents? For example, is the spread of life expectancy among the countries of Africa larger than the spread of life expectancy among the countries of Asia?

To answer such questions, we use an updated version of the `gapminder` data frame we visualized in Figure \@ref(fig:gapminder) in Subsection \@ref(gapminder) on the grammar of graphics. This updated `un_member_states_2024` data we saw earlier in this chapter. It is included in the `moderndive` \index{R packages!moderndive} package and has international development statistics such as life expectancy, GDP per capita, and population for `r n_countries` countries for years near 2024. We use this data for basic regression again, but now using an explanatory variable $x$ that is categorical, as opposed to the numerical explanatory variable model we used in the previous Section \@ref(model1):

1. A numerical outcome variable $y$ (a country's life expectancy) and
1. A single categorical explanatory variable $x$ (the continent that the country is a part of).

When the explanatory variable $x$ is categorical, the concept of a "best-fitting" regression line is a little different than the one we saw previously in Section \@ref(model1) where the explanatory variable $x$ was numerical. We study these differences shortly in Subsection \@ref(model2table), but first we conduct an exploratory data analysis.


### Exploratory data analysis {#model2EDA}

The data on the `r n_countries` countries can be found in the `un_member_states_2024` data frame included in the `moderndive` package. However, to keep things simple, we `select()` only the subset of the variables we'll consider in this chapter and focus only on rows where we have no missing values with `na.omit()`. We'll save this data in a new data frame called `gapminder2022`:

```{r regression-create-gapminder2022, message=FALSE}
gapminder2022 <- un_member_states_2024 |>
  select(country, life_exp = life_expectancy_2022, continent, gdp_per_capita) |>
  na.omit()
```

```{r regression-create-life_exp_worldwide, echo=FALSE, purl=FALSE}
# Hidden: internally compute mean and median life expectancy
life_exp_worldwide <- gapminder2022 |>
  summarize(median = median(life_exp, na.rm = TRUE),
            mean = mean(life_exp, na.rm = TRUE))
mean_africa <- gapminder2022 |>
  filter(continent == "Africa") |>
  summarize(mean_africa = mean(life_exp)) |>
  pull(mean_africa)

# This code is used for dynamic non-static in-line text output purposes
gapminder2022_rows <- gapminder2022 |> nrow()
```

We perform the first common step in an exploratory data analysis: looking at the raw data values. You can do this by using RStudio's spreadsheet viewer or by using the `glimpse()` command as introduced in Subsection \@ref(exploredataframes) on exploring data frames:

```{r regression-glimpse-gapminder2022}
glimpse(gapminder2022)
```

Observe that ``Rows: `r gapminder2022_rows` `` indicates that there are `r gapminder2022_rows` rows/observations in `gapminder2022`, where each row corresponds to one country. In other words, the *observational unit* is an individual country. Furthermore, observe that the variable `continent` is of type `<fct>`, which stands for _factor_, which is R's way of encoding categorical variables.

A full description of all the variables included in `un_member_states_2024` can be found by reading the associated help file (run [`?un_member_states_2024`](https://moderndive.github.io/moderndive/reference/un_member_states_2024.html) in the console). However, we fully describe only the `r ncol(gapminder2022)` variables we selected in `gapminder2022`:

1. `country`: An identification variable of type character/text used to distinguish the `r gapminder2022_rows` countries in the dataset.
1. `life_exp`: A numerical variable of that country's life expectancy at birth. This is the outcome variable $y$ of interest.
1. `continent`: A categorical variable with five levels. Here "levels" correspond to the possible categories: Africa, Asia, Americas, Europe, and Oceania. This is the explanatory variable $x$ of interest.
1. `gdp_per_capita`: A numerical variable of that country's GDP per capita in US inflation-adjusted dollars that we'll use as another outcome variable $y$ in the *Learning check* at the end of this subsection.

We next look at a random sample of three out of the `r gapminder2022_rows` countries in Table \@ref(tab:model2-data-preview).

```{r regression-alt2, eval=FALSE}
gapminder2022 |> sample_n(size = 3)
```
```{r model2-data-preview, echo=FALSE, purl=FALSE}
gapminder2022 |>
  sample_n(3) |>
  kbl(
    digits = 3,
    caption = "Random sample of 3 out of 188 countries",
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

Random sampling will likely produce a different subset of 3 rows for you than what's shown. Now that we have looked at the raw values in our `gapminder2022` data frame and got a sense of the data, we compute summary statistics. We again apply `tidy_summary()` from the `moderndive` package. Recall that this function takes in a data frame, summarizes it, and returns commonly used summary statistics. We take our `gapminder2022` data frame, `select()` only the outcome and explanatory variables `life_exp` and `continent`, and pipe them into `tidy_summary()` in Table \@ref(tab:lifeexp-cont):

```{r regression-select-vars-alt, eval=FALSE}
gapminder2022 |> select(life_exp, continent) |> tidy_summary()
```

```{r lifeexp-cont, echo=FALSE}
gapminder2022 |>
  select(life_exp, continent) |>
  tidy_summary() |>
  kbl(
    caption = "Summary of life expectancy and continent variables",
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 9, 16),
    latex_options = c("HOLD_position")
  )
```

```{r regression-alt2-dup1, include=FALSE}
# For discussion in bullet points below
gapminder2022 |> count(continent)
```

The `tidy_summary()` output now reports summaries for categorical variables and for the numerical variables we reviewed before. Let's focus just on discussing the results for the categorical `factor` variable `continent`:

- `n`: The number of non-missing entries for each group
- `group`: Breaks down a categorical variable into its unique levels. For this variable, it is corresponding to Africa, Asia, North and South America, Europe, and Oceania.
- `type`: The data type of the variable. Here, it is a `factor`.
- `min` to `sd`: These are missing since calculating the five-number summary, the mean, and standard deviation for categorical variables doesn't make sense.

```{r regression-assign-median_life_exp, echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
median_life_exp <- life_summary_df$median |> round(2)
mean_life_exp <- life_summary_df$mean |> round(2)
```

Turning our attention to the summary statistics of the numerical variable `life_exp`, we observe that the global median life expectancy in 2022 was `r median_life_exp`. Thus, half of the world's countries (`r floor(n_countries/2)` countries) had a life expectancy of less than `r median_life_exp`. The mean life expectancy of `r mean_life_exp` is lower, however. Why is the mean life expectancy lower than the median?

We can answer this question by performing the last of the three common steps in an exploratory data analysis: creating data visualizations. We visualize the distribution of our outcome variable $y$ = `life_exp` in Figure \@ref(fig:lifeexp2022hist).

```{r lifeexp2022hist, echo=TRUE, fig.cap="Histogram of life expectancy in 2022.", fig.height=ifelse(knitr::is_latex_output(), 3, 4)}
ggplot(gapminder2022, aes(x = life_exp)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Life expectancy",
       y = "Number of countries",
       title = "Histogram of distribution of worldwide life expectancies")
```

We see that this data is *left-skewed*, also known as *negatively* \index{skew} skewed: there are a few countries with low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers; hence, the median is greater than the mean in this case.

Remember, however, that we want to compare life expectancies both between continents and within continents. In other words, our visualizations need to incorporate some notion of the variable `continent`. We can do this easily with a faceted histogram. Recall from Section \@ref(facets) that facets allow us to split a visualization by the different values of another variable. We display the resulting visualization in Figure \@ref(fig:catxplot0b) by adding a \index{R packages!ggplot2!facet\_wrap()} `facet_wrap(~ continent, nrow = 2)` layer.

```{r regression-facet-hist-lifeexp, eval=FALSE}
ggplot(gapminder2022, aes(x = life_exp)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Life expectancy",
       y = "Number of countries",
       title = "Histogram of distribution of worldwide life expectancies") +
  facet_wrap(~ continent, nrow = 2)
```

```{r catxplot0b, echo=FALSE, fig.cap="Life expectancy in 2022 by continent (faceted).", fig.height=ifelse(knitr::is_latex_output(), 4, 4), purl=FALSE}
faceted_life_exp <- ggplot(gapminder2022, aes(x = life_exp)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(
    x = "Life expectancy", y = "Number of countries",
    title = "Histogram of distribution of worldwide life expectancies"
  ) +
  facet_wrap(~continent, nrow = 2)

# Make the text black and reduce darkness of the grey in the facet labels
if (is_latex_output()) {
  faceted_life_exp +
    theme(
      strip.text = element_text(colour = "black"),
      strip.background = element_rect(fill = "grey93")
    )
} else {
  faceted_life_exp
}
```

Observe that unfortunately the distribution of African life expectancies is much lower than the other continents. In Europe, life expectancies tend to be higher and furthermore do not vary as much. On the other hand, both Asia and Africa have the most variation in life expectancies.

Recall that an alternative method to visualize the distribution of a numerical variable split by a categorical variable is by using a side-by-side boxplot. We map the categorical variable `continent` to the $x$-axis and the different life expectancies within each continent on the $y$-axis in Figure \@ref(fig:catxplot1).

```{r catxplot1, fig.cap="Life expectancy in 2022 by continent (boxplot).", fig.height=ifelse(knitr::is_latex_output(), 2.5, 4)}
ggplot(gapminder2022, aes(x = continent, y = life_exp)) +
  geom_boxplot() +
  labs(x = "Continent", y = "Life expectancy",
       title = "Life expectancy by continent")
```

Some people prefer comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram. This is because we can make quick comparisons between the categorical variable's levels with imaginary horizontal lines. For example, observe in Figure \@ref(fig:catxplot1) that we can quickly convince ourselves that Europe has the highest median life expectancies by drawing an imaginary horizontal line near $y = 81$. Furthermore, as we observed in the faceted histogram in Figure \@ref(fig:catxplot0b), Africa and Asia have the largest variation in life expectancy as evidenced by their large interquartile ranges (the size of the boxes).

It’s important to remember, however, that the solid lines in the middle of the boxes correspond to the medians (the middle value) rather than the mean (the average). So, for example, if you look at Asia, the solid line denotes the median life expectancy of around 75 years. This tells us that half of all countries in Asia have a life expectancy below 75 years, whereas half have a life expectancy above 75 years. We compute the median and mean life expectancy for each continent with a little more data wrangling and display the results in Table \@ref(tab:catxplot0).

```{r regression-grouped-summary, eval=TRUE, results='hide'}
life_exp_by_continent <- gapminder2022 |>
  group_by(continent) |>
  summarize(median = median(life_exp), mean = mean(life_exp))
life_exp_by_continent
```
```{r catxplot0, echo=FALSE, purl=FALSE}
life_exp_by_continent |>
  kbl(
    digits = 3,
    caption = "Life expectancy by continent",
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

Observe the order of the second column `median` life expectancy: Africa is lowest, Europe the highest, and the others have similar medians between Africa and Europe. This ordering corresponds to the ordering of the solid black lines inside the boxes in our side-by-side boxplot in Figure \@ref(fig:catxplot1).

```{r regression-create-life_exp_model, echo=FALSE, purl=FALSE}
# This code is used for dynamic non-static in-line text output purposes
# Coding model earlier to calculate the intercepts etc below
life_exp_model <- lm(life_exp ~ continent, data = gapminder2022)

# Africa / Intercept
intercept <- get_regression_table(life_exp_model) |>
  filter(term == "intercept") |>
  pull(estimate) |>
  round(2)

mean_africa <- intercept

# Americas
offset_north_america <- get_regression_table(life_exp_model) |>
  filter(term == "continent: North America") |>
  pull(estimate) |>
  round(2)
mean_north_america <- intercept + offset_north_america

offset_south_america <- get_regression_table(life_exp_model) |>
  filter(term == "continent: South America") |>
  pull(estimate) |>
  round(2)
mean_south_america <- intercept + offset_south_america

# Asia
offset_asia <- get_regression_table(life_exp_model) |>
  filter(term == "continent: Asia") |>
  pull(estimate) |>
  round(2)
mean_asia <- intercept + offset_asia

# Europe
offset_europe <- get_regression_table(life_exp_model) |>
  filter(term == "continent: Europe") |>
  pull(estimate) |>
  round(2)
mean_europe <- intercept + offset_europe

# Oceania
offset_oceania <- get_regression_table(life_exp_model) |>
  filter(term == "continent: Oceania") |>
  pull(estimate) |>
  round(2)
mean_oceania <- intercept + offset_oceania
```

We now turn our attention to the values in the third column `mean`. Using Africa's mean life expectancy of `r intercept` as a *baseline for comparison*, we start making comparisons to the mean life expectancies of the other four continents and put these values in Table \@ref(tab:continent-mean-life-expectancies), which we'll revisit later on in this section.

1. For Asia, it is `r mean_asia` - `r intercept` = `r offset_asia` years higher.
1. For Europe, it is `r mean_europe` - `r intercept` = `r offset_europe` years higher.
1. For North America, it is `r mean_north_america` - `r intercept` = `r offset_north_america` years higher.
1. For Oceania, it is `r mean_oceania` - `r intercept` = `r offset_oceania` years higher.
1. For South America, it is `r mean_south_america` - `r intercept` = `r offset_south_america` years higher.

```{r continent-mean-life-expectancies, echo=FALSE, purl=FALSE}
gapminder2022 |>
  group_by(continent) |>
  summarize(mean = mean(life_exp, na.rm = TRUE)) |>
  mutate(`Difference versus Africa` = mean - mean_africa) |>
  kbl(
    digits = 2,
    caption = "Mean life expectancy by continent and relative differences from mean for Africa",
    booktabs = TRUE,
    linesep = ""
  ) |>
  kable_styling(
    font_size = ifelse(is_latex_output(), 10, 16),
    latex_options = c("HOLD_position")
  )
```

```{block, type="learncheck", purl=FALSE}
\vspace{-0.15in}
**_Learning check_**
\vspace{-0.1in}
```

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Conduct a new exploratory data analysis with the same explanatory variable $x$ being `continent` but with `gdp_per_capita` as the new outcome variable $y$. What can you say about the differences in GDP per capita between continents based on this exploration?

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** When using a categorical explanatory variable in regression, what does the baseline group represent?

  - A. The group with the highest mean
  - B. The group chosen for comparison with all other groups
  - C. The group with the most data points
  - D. The group with the lowest standard deviation

```{block, type="learncheck", purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```


### Linear regression {#model2table}

In Subsection \@ref(model1table) we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable $y$ and a numerical explanatory variable $x$. In our life expectancy example, we now instead have a categorical explanatory variable `continent`. Our model will not yield a "best-fitting" regression line like in Figure \@ref(fig:numxplot3), but rather *offsets* relative to a baseline for comparison.\index{offset}

As we did in Subsection \@ref(model1table) when studying the relationship between fertility rates and life expectancy, we output the regression coefficients for this model. Recall that this is done in two steps:

1. We "fit" the linear regression model using `lm(y ~ x, data)` and save it in `life_exp_model`.
1. We get the regression coefficients by applying the `coef()` function to `life_exp_model`.

```{r regression-fit-lm}
life_exp_model <- lm(life_exp ~ continent, data = gapminder2022)
coef(life_exp_model)
```

We once again focus on the values in these coefficient values. Why are there now 6 entries? We break them down one by one:

1. `intercept` corresponds to the mean life expectancy of countries in Africa of `r intercept` years.

1. `continentAsia` corresponds to countries in Asia and the value +`r offset_asia` is the same difference in mean life expectancy relative to Africa we displayed in Table \@ref(tab:continent-mean-life-expectancies). In other words, the mean life expectancy of countries in Asia is $`r intercept` + `r offset_asia` = `r mean_asia`$.
1. `continentEurope` corresponds to countries in Europe and the value +`r offset_europe` is the same difference in mean life expectancy relative to Africa we displayed in Table \@ref(tab:continent-mean-life-expectancies). In other words, the mean life expectancy of countries in Europe is $`r intercept` + `r offset_europe` = `r mean_europe`$.
1. `continentNorth America` corresponds to countries in North America and the value +`r offset_north_america` is the same difference in mean life expectancy relative to Africa we displayed in Table \@ref(tab:continent-mean-life-expectancies). In other words, the mean life expectancy of countries in North America is $`r intercept` + `r offset_north_america` = `r mean_north_america`$.
1. `continentOceania` corresponds to countries in Oceania and the value +`r offset_oceania` is the same difference in mean life expectancy relative to Africa we displayed in Table \@ref(tab:continent-mean-life-expectancies). In other words, the mean life expectancy of countries in Oceania is $`r intercept` + `r offset_oceania` = `r mean_oceania`$.
1. `continentSouth America` corresponds to countries in South America and the value +`r offset_south_america` is the same difference in mean life expectancy relative to Africa we displayed in Table \@ref(tab:continent-mean-life-expectancies). In other words, the mean life expectancy of countries in South America is $`r intercept` + `r offset_south_america` = `r mean_south_america`$.

To summarize, the 6 values for the regression coefficients correspond to the "baseline for comparison" continent Africa (the intercept) as well as five "offsets" from this baseline for the remaining 5 continents: Asia, Europe, North America, Oceania, and South America.

You might be asking at this point why was Africa chosen as the "baseline for comparison" group. This is the case for no other reason than it comes first alphabetically of the six continents; by default R arranges factors/categorical variables in alphanumeric order. You can change this baseline group to be another continent if you manipulate the variable `continent`'s factor "levels" using the `forcats` package. See [Chapter 15](https://r4ds.had.co.nz/factors.html) of *R for Data Science* [@rds2016] for examples.

We now write the equation for our fitted values $\widehat{y} = \widehat{\text{life exp}}$.

$$
\begin{aligned}
\widehat{y} = \widehat{\text{life exp}} &= b_0 + b_{\text{Asia}}\cdot\mathbb{1}_{\text{Asia}}(x) + b_{\text{Europe}}\cdot\mathbb{1}_{\text{Europe}}(x) \\
& \qquad + b_{\text{North America}}\cdot\mathbb{1}_{\text{North America}}(x) + b_{\text{Oceania}}\cdot\mathbb{1}_{\text{Oceania }}(x) \\
& \qquad + b_{\text{South America}}\cdot\mathbb{1}_{\text{South America}}(x)\\
&= `r intercept` + `r offset_asia`\cdot\mathbb{1}_{\text{Asia}}(x) + `r offset_europe`\cdot\mathbb{1}_{\text{Euro}}(x)  \\
& \qquad + `r offset_north_america`\cdot\mathbb{1}_{\text{North America}}(x) + `r offset_oceania`\cdot\mathbb{1}_{\text{Oceania}}(x) \\
& \qquad + `r offset_south_america`\cdot\mathbb{1}_{\text{South America}}(x)
\end{aligned}
$$

Whoa! That looks daunting! Don't fret, however, as once you understand what all the elements mean, things simplify greatly. First, $\mathbb{1}_{A}(x)$ is what's known in mathematics as an "indicator function." It returns only one of two possible values, 0 and 1, where

$$
\mathbb{1}_{A}(x) = \left\{
\begin{array}{ll}
1 & \text{if } x \text{ is in } A \\
0 & \text{if } \text{otherwise} \end{array}
\right.
$$

In a statistical modeling context, this is also known as a *dummy variable*. \index{dummy variable} In our case, we consider the first such indicator variable $\mathbb{1}_{\text{Amer}}(x)$. This indicator function returns 1 if a country is in the Asia, 0 otherwise:

$$
\mathbb{1}_{\text{Amer}}(x) = \left\{
\begin{array}{ll}
1 & \text{if } \text{country } x \text{ is in Asia} \\
0 & \text{otherwise}\end{array}
\right.
$$

Second, $b_0$ corresponds to the intercept as before; in this case, it is the mean life expectancy of all countries in Africa. Third, the $b_{\text{Asia}}$, $b_{\text{Europe}}$, $b_{\text{North America}}$, $b_{\text{Oceania}}$, and $b_{\text{South America}}$ represent the 5 "offsets relative to the baseline for comparison" in the regression coefficients.