---
title: "AE 9: Odds"
author: "Notes"
format: pdf
editor: visual
---
## Packages
```{r}
#| label: load-pkgs-data
#| message: false
library(tidyverse)
library(tidymodels)
library(knitr)
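# Read the Framingham data, keep the two variables of interest,
# drop rows with missing values, and store the outcome as a factor
heart_disease <- read_csv(here::here("data/framingham.csv")) %>%
  select(totChol, TenYearCHD) %>%
  drop_na() %>%
  mutate(high_risk = as.factor(TenYearCHD)) %>%
  select(totChol, high_risk)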
```
## Linear regression vs. logistic regression
State whether a linear regression model or logistic regression model is more appropriate for each scenario:
1. Use age and education to predict if a randomly selected person will vote in the next election.
2. Use budget and run time (in minutes) to predict a movie's total revenue.
3. Use age and sex to calculate the probability a randomly selected adult will visit Duke Health in the next year.
## Heart disease
### Data: Framingham study
This data set is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.
We want to use total cholesterol to predict whether a randomly selected adult is at high risk of heart disease in the next 10 years.
- `high_risk`:
  - 1: High risk of having heart disease in next 10 years
  - 0: Not high risk of having heart disease in next 10 years
- `totChol`: total cholesterol (mg/dL)
### Outcome: `high_risk`
```{r}
#| out-width: "70%"
ggplot(data = heart_disease, aes(x = high_risk)) +
  geom_bar() +
  scale_x_discrete(labels = c("1" = "High risk", "0" = "Not high risk")) +
  labs(
    title = "Distribution of 10-year risk of heart disease",
    x = NULL
  )
```
```{r}
heart_disease %>%
count(high_risk)
```
### Calculating probability and odds
1. What is the probability a randomly selected person in the study is **not** high risk for heart disease?
```{r}
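# Proportion of people in each risk group
heart_disease %>%
  count(high_risk) %>%
  mutate(p = n / sum(n))

# Extract the proportion for the "not high risk" group (level "0")
p_not <- heart_disease %>%
  count(high_risk) %>%
  mutate(p = n / sum(n)) %>%
  filter(high_risk == "0") %>%
  pull(p)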
```
The probability that a randomly selected person in the study is not high risk for heart disease is `r round(p_not, 3)`.
2. What are the **odds** a randomly selected person in the study is **not** high risk for heart disease?
```{r}
odds_not <- p_not / (1 - p_not)
odds_not
```
The odds that a randomly selected person in the study is not high risk for heart disease are `r round(odds_not, 3)`.
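
The conversion between probability and odds used above can be sketched with two small helper functions. The function names `prob_to_odds()` and `odds_to_prob()` and the probability 0.8 are illustrative assumptions; they are not part of the exercise or the Framingham data.

```{r}
# Hypothetical helpers (names are illustrative, not from the exercise)
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

# Made-up probability, not a value from the Framingham data
p_example <- 0.8
odds_example <- prob_to_odds(p_example)
odds_example                # 4, i.e. odds of 4 to 1
odds_to_prob(odds_example)  # converts back to 0.8
```

The two functions are inverses of each other, so converting a probability to odds and back returns the original probability.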
### Logistic regression model
Fit a logistic regression model to understand the relationship between total cholesterol and risk for heart disease.
Let $\pi_i$ be the probability that adult $i$ is high risk.
The statistical model is
$$\log\Big(\frac{\pi_i}{1-\pi_i}\Big) = \beta_0 + \beta_1 \times TotChol_i$$
```{r}
# The "glm" engine fits a binomial (logistic) model by default,
# so no family argument is needed in fit()
heart_disease_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(high_risk ~ totChol, data = heart_disease)

tidy(heart_disease_fit) %>% kable(digits = 3)
```
3. Write the regression equation. Round to 3 digits.
$$\log\Big(\frac{\pi_i}{1-\pi_i}\Big) = -2.894 + 0.005 \times TotChol_i$$
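
Solving the fitted equation for $\pi_i$ shows how a predicted log-odds value maps to odds and then to a probability. This is the standard inverse-logit algebra, written here with the rounded coefficients from the equation above. Exponentiating both sides gives the odds:

$$\frac{\pi_i}{1-\pi_i} = \exp(-2.894 + 0.005 \times TotChol_i)$$

Rearranging for $\pi_i$ gives the probability:

$$\pi_i = \frac{\exp(-2.894 + 0.005 \times TotChol_i)}{1 + \exp(-2.894 + 0.005 \times TotChol_i)}$$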
### Calculating log-odds, odds and probabilities
Suppose a randomly selected person has a total cholesterol of 250 mg/dL. Based on the model:
4. What are the log-odds they are high risk for heart disease?
```{r}
new_person <- tibble(totChol = 250)
# With the "glm" engine, type = "raw" returns predictions on the
# link (log-odds) scale
log_odds <- predict(heart_disease_fit, new_data = new_person, type = "raw")
log_odds
```
5. What are the odds they are high risk for heart disease?
```{r}
odds <- exp(log_odds)
odds
```
6. What is the probability they are high risk for heart disease?
*Use the odds to calculate your answer.*
```{r}
# using odds
odds / (1 + odds)
# using predict
predict(heart_disease_fit, new_data = new_person, type = "prob")
```
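
As a rough check on the `predict()` output, the same three quantities can be computed by hand from the rounded regression equation. This is a sketch that assumes the coefficients rounded to 3 digits are close enough; because the slope is rounded to 0.005, the results will differ slightly from the model's predictions. The object names are illustrative.

```{r}
# Coefficients rounded to 3 digits, taken from the regression equation
b0_rounded <- -2.894
b1_rounded <- 0.005

# Log-odds, odds, and probability for a total cholesterol of 250 mg/dL
log_odds_hand <- b0_rounded + b1_rounded * 250
odds_hand <- exp(log_odds_hand)
prob_hand <- odds_hand / (1 + odds_hand)

c(log_odds = log_odds_hand, odds = odds_hand, prob = prob_hand)
```

Base R's `plogis()` applies the same log-odds-to-probability transformation in one step, so `plogis(log_odds_hand)` should match `prob_hand`.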
### Comparing observations
Suppose a person's cholesterol changes from 250 mg/dL to 200 mg/dL.
7. How do you expect the log-odds that this person is high risk for heart disease to change?
```{r}
new_people <- tibble(totChol = c(250, 200))
log_odds <- predict(heart_disease_fit, new_data = new_people, type = "raw")
log_odds
```
8. How do you expect the odds that this person is high risk for heart disease to change?
```{r}
# odds (recomputed for both cholesterol values)
odds <- exp(log_odds)
odds
# probabilities
## using odds
odds / (1 + odds)
## using predict
predict(heart_disease_fit, new_data = new_people, type = "prob")
```
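
Because the model is linear in the log-odds, the expected change depends only on the slope and the difference in cholesterol, not on the starting value. A minimal sketch, again assuming the slope rounded to 3 digits (0.005) from the regression equation; the object names are illustrative.

```{r}
# Slope rounded to 3 digits, from the regression equation
slope_rounded <- 0.005

# Change in cholesterol: from 250 mg/dL down to 200 mg/dL
chol_change <- 200 - 250

# Additive change in the log-odds
log_odds_change <- slope_rounded * chol_change
log_odds_change

# Multiplicative change in the odds (an odds ratio)
odds_ratio <- exp(log_odds_change)
odds_ratio
```

An odds ratio below 1 means the odds of being high risk are expected to decrease. Dividing the two predicted odds, `exp(log_odds)[2] / exp(log_odds)[1]`, should give approximately the same value, up to rounding of the slope.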