DS303_SP25/vowel_project_pt3.qmd at main · AGreb21/DS303_SP25 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Vowel Analysis Final Report"
author:
  - "Student Name"
  - "DS303, SP25"
  - "Prof. Amber Camp"
date: 3/14/25
format: html
toc: true
editor: visual
theme: spacelab
---

## Vowel Analysis Final Report

### Load packages

```{r, echo = FALSE, message = FALSE}
library(tidyverse)
library(lme4)
library(lmerTest)

# install.packages("phonR")
library(phonR)
```

## Load data

Load your personal data (make sure you update from P102 -\> your P#)

```{r}
# read in data
P102 <- read_csv("data/P102.csv")

# convert variables to factor where needed
convert_to_factor <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.factor)
  return(df)
}

P102 <- convert_to_factor(P102, c("ppt", "word", "ipa", "arpa", "onset", "offset", "environment", "real_word", "sex", "ethnicity", "birthplace", "home_state", "years_lived_hi", "L1", "exposure_langs_yn", "exposure_langs", "speak_other", "item_num", "rep"))

# remove a couple words you won't be needing
P102 <- P102 %>%
  dplyr::filter(!word %in% c("cot", "caught")) # added dplyr to specify which 'filter' to use

```

Class data:

```{r}
# read in data
all_data <- read_csv("data/DS303_combined.csv")

# convert variables to factor where needed
all_data <- convert_to_factor(all_data, c("ppt", "word", "ipa", "arpa", "onset", "offset", "environment", "real_word", "sex", "ethnicity", "birthplace", "home_state", "years_lived_hi", "L1", "exposure_langs_yn", "exposure_langs", "speak_other", "item_num", "rep"))

# remove a couple words you won't be needing
all_data <- all_data %>%
  dplyr::filter(!word %in% c("cot", "caught"))

```

## Explain the Data

(1 point)

In paragraph form:

-   Describe where the data comes from
-   Summarize the contents of the data (how many observations, participants, items, etc.)
-   Mention any pre-processing steps taken. For example, I pre-processed this data by removing words that were obviously mispronounced before even sending it to you. Then, above, you converted certain variables to factor and removed the words "cot" and "caught", which are not relevant to your investigation. Have you done any additional processing?

**This data was collected from all the people in our class speaking the series of words into the microphone where it was then processed by Amber, Lydia and Logan into the data set we now have here. There were 13 different participants (people in the class) and 32 different words that were spoken/collected, 3 times each. Each participant also filled out a survey regarding age, gender, ethnicity, etc.. so we could analyze the different affects of those variables. in the pre-processing of the data words that got mispronounced were removed, as well as the words "cot" and "caught" due to their similar pronunciation.**

## Variables of Interest

(1 point)

For this project, you will explore and analyze the [**class-wide data set**]{.underline}. In paragraph form:

-   Briefly introduce the purpose of this project
-   Identify and explain your variables of interest
-   State research questions or hypotheses about this data

**The purpose of this project is to analyze and evaluate different aspects of speech, and more specifically with vowels. The variables of interest for me that I will be looking closer at are sex, ethnicity, and age. I think that age will not really have an impact on the f1 and f2 variables, however i do think sex and ethnicity will have significant impact.**

## EDA and Vowel Plots

(3 points)

-   Generate two vowel plots using `phonR`: one for your own data, and one for class data

-   In a couple sentences, state your observations. Do you see any patterns or differences?\]

These are the two models that we made in the vowel project part 2, the first one being my data, and the second one being the whole class data

-   Include at least one visual that supports your hypothesis/justifies your models below, and explain

```{r}
## clean up outliers
# convert f1 and f2 values to z-scores
# z-scores help normalize the data
P102_clean <- P102 %>%
  group_by(ipa) %>%
  mutate(
    f1_z = (f1 - mean(f1)) / sd(f1), # basic z-score transformation
    f2_z = (f2 - mean(f2)) / sd(f2)
  ) %>%
  filter(abs(f1_z) <= 2, abs(f2_z) <= 2) # 2 to 3sd is typical for this type of data

## plot the trimmed data
with(P102_clean, plotVowels(f1, f2, ipa, plot.tokens = TRUE, pch.tokens = ipa, cex.tokens = 1.2, alpha.tokens = 0.2, plot.means = TRUE, pch.means = ipa, cex.means = 2, var.col.by = ipa, ellipse.line = TRUE, pretty = TRUE))

# looks a lot cleaner, right?
```

```{r}
## remove outliers
all_clean <- all_data %>%
  group_by(ppt, ipa) %>% # notice that we added ppt as a grouping
  mutate(
    f1_z = (f1 - mean(f1)) / sd(f1),
    f2_z = (f2 - mean(f2)) / sd(f2)
  ) %>%
  filter(abs(f1_z) <= 1.25, abs(f2_z) <= 1.25)

# plot clean data
with(all_clean, plotVowels(f1, f2, ipa, plot.tokens = TRUE, pch.tokens = ipa, cex.tokens = 1.2, alpha.tokens = 0.2, plot.means = TRUE, pch.means = ipa, cex.means = 2, var.col.by = ipa, ellipse.line = TRUE, pretty = TRUE))
```

```{r}

```

## Model Selection and Justification

(3 points)

-   You will build and analyze **two different statistical models** to investigate the relationship between your predictors and outcome variable

-   The two models should differ in some way (e.g., one could include an additional or interaction term while the other does not)

-   What statistical models will you use to investigate the relationship between your predictors and outcome variable? (linear vs. logistic regression? mixed effects model?)

-   Why did you select these models?

-   Which variable(s) are included?

I'm going to use linear regression models because f1 and f2 are continuous numerical variables, and this will allow me to compare how age, sex, and ethnicity affect the vowel variables. the first model will include all of age sex and ethnicity ass additive predictors, and the second model will include an interaction between age and sex to test whether the effect of age on f1 and f2 varies by sex.

```{r}
model1_f1 <- lm(f1 ~ age + ethnicity + sex, data = all_clean)
model1_f2 <- lm(f2 ~ age + ethnicity + sex, data = all_clean)

summary(model1_f1)
summary(model1_f2)

model2_f1 <- lm(f1 ~ age * sex + ethnicity, data = all_clean)
model2_f2 <- lm(f2 ~ age * sex + ethnicity, data = all_clean)

summary(model2_f1)
summary(model2_f2)


```

## Model Comparisons and Best Fit

(3 points)

-   Build and run both models and display their summaries

-   Compare the two models, assess model fit, and determine the better fitting one

model 1 seems to be the better fitting model because the interaction term doesn't significantly improve the models fit

```{r}
model1_f1 <- lm(f1 ~ age + ethnicity + sex, data = all_clean)
model1_f2 <- lm(f2 ~ age + ethnicity + sex, data = all_clean)

summary(model1_f1)
summary(model1_f2)

model2_f1 <- lm(f1 ~ age * sex + ethnicity, data = all_clean)
model2_f2 <- lm(f2 ~ age * sex + ethnicity, data = all_clean)

summary(model2_f1)
summary(model2_f2)
```

## Interpretation of Results

(3 points)

I will do this interpretation and discussion/conclusion based off of model 1 as that's the one i chose

-   Interpret coefficients and significance
-   Explain how the predictor variable(s) influence the outcome

For each additional year of age, f1 increases by 8.42 and f2 decreases by 31.45 however that is not statistically significant.

For all the different ethnicity there is not any significant trends for f1, however for f2 asian mixed and white ethnicity all have way lower f2 values, and those are statistically significant

sex is a statistically significant predictor for both f1 and f2, with males having a 135.04 lower f1 value than females, and males having a 276.79 lower f2 value compared to females

## Discussion and Conclusion

(3 points)

-   Summarize key findings
-   Discuss implications
-   Mention limitations

For f1 sex was the only significant predictor of the variables i testes, with males tending to have lower f1 values. And for f2 both sex and ethnicity were significant predictors, with males from Asian mixed and White ethnic groups having lower f2 values. Sex seems to be the only consistent predictor i fund among the variables i tested with males having way lower values for both, which makes sense given males tend to have deeper/lower voices. Some limitations on this data set was that the participants were very limited, having only 13 participants because that's all the people that are in the class. I would definitely want to see more tests done on a larger scale with. much more data to see how it compares to what i found.