<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Introduction to Linear Modelling</title>
<meta charset="utf-8" />
<meta name="author" content="Dr. Laurie Baker" />
<meta name="date" content="2021-03-08" />
<link href="libs/anchor-sections-1.0/anchor-sections.css" rel="stylesheet" />
<script src="libs/anchor-sections-1.0/anchor-sections.js"></script>
<link rel="stylesheet" href="xaringan-themer2.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: title-slide, center, bottom
# Introduction to Linear Modelling
<img src="images/tidydata_5.jpg" title="Data Analysis assembly line: Wrangle, Visualise, Model" alt="Data Analysis assembly line: Wrangle, Visualise, Model" width="550" height="70%" style="display: block; margin: auto;" />
## Data Science Campus
### Dr. Laurie Baker
#### Artwork by @allison_horst
### `r Sys.Date()`
---
# You
* Are familiar with R.
* Are new to linear modelling or haven't covered it in a while.
* Are new to linear modelling in R.
<img src="images/tidydata_5.jpg" title="Data Analysis assembly line: Wrangle, Visualise, Model" alt="Data Analysis assembly line: Wrangle, Visualise, Model" width="600" height="70%" style="display: block; margin: auto;" />
Artwork by @allison_horst
???
I'm assuming you:
* Are familiar with R.
* Are new to linear modelling or haven't covered it in a while.
* Are new to linear modelling in R.
---
# Getting started
* For this adventure you'll need the `tidyverse` meta-package, `broom` to tidy our models, and `GGally` to plot our coefficients.
```r
#install.packages("tidyverse")
#install.packages("broom")
#install.packages("GGally")
library(tidyverse)
library(broom)
library(GGally)
```
* And we'll be working with the data in `faculty-data.csv`.
???
We'll be using the tidyverse, broom, and GGally packages for this workshop. We will be working with the data in faculty-data.csv.
If it's the first time you've used the packages you will want to uncomment and run the install.packages code. If you have already downloaded the packages you can just call library.
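As an alternative to uncommenting the install lines by hand, here is a minimal sketch that checks which of the workshop packages are missing. The helper name `missing_packages()` is made up for this illustration; it relies only on base R's `requireNamespace()`.

```r
# Hypothetical helper (not part of the workshop material): return the names of
# the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

workshop_pkgs <- c("tidyverse", "broom", "GGally")
to_install <- missing_packages(workshop_pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out so that running the chunk never changes your library without you choosing to.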
---
class: inverse, middle, center
# Exploratory Data Analysis
???
The first section of the course is on Exploratory Data Analysis
---
# Learning Objectives
* What is the difference between a continuous and categorical variable?
--
* What is variation and covariation?
--
* Where does Exploratory Data Analysis fit in with analysis?
--
* How to use plots to explore variation in
  * A continuous variable
  * A categorical variable
--
* How to use plots to explore covariation between
  * Two continuous variables
  * A categorical and continuous variable.
???
The learning objectives for this section are to understand:
* What is the difference between a continuous and categorical variable?
* What is variation and covariation?
* Where does Exploratory Data Analysis fit in with analysis?
* How to use plots to explore variation in a continuous and a categorical variable
And finally
* How to use plots to explore covariation between two continuous variables, and between a categorical and a continuous variable.
---
## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Numerical Variables: Continuous or Discrete?
<img src="images/continuous_discrete.png" title="Numerical variables can be continuous or discrete. Continuous variables are measured data that can have infinite values within a possible range. An example is chick height (e.g. 3.1 inches) and weight (34.16 grams). For discrete variables observations can only exist at limited values, often counts (e.g. 8 legs and 4 spots)." alt="Numerical variables can be continuous or discrete. Continuous variables are measured data that can have infinite values within a possible range. An example is chick height (e.g. 3.1 inches) and weight (34.16 grams). For discrete variables observations can only exist at limited values, often counts (e.g. 8 legs and 4 spots)." width="550" height="70%" style="display: block; margin: auto;" />
Artwork by @allison_horst
---
## <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Categorical Variables: Nominal, Ordinal, Binary?
<img src="images/nominal_ordinal_binary.png" title="Categorical variables can be nominal: unordered descriptions (e.g. turtles, snail, butterfly), ordinal: ordered descriptions (unhappy, okay, awesome), or binary: only 2 mutually exclusive outcomes (dinosaurs = extinct, sharks = alive)." alt="Categorical variables can be nominal: unordered descriptions (e.g. turtles, snail, butterfly), ordinal: ordered descriptions (unhappy, okay, awesome), or binary: only 2 mutually exclusive outcomes (dinosaurs = extinct, sharks = alive)." width="600" height="70%" style="display: block; margin: auto;" />
Artwork by @allison_horst
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 576 512"><path d="M208 0c-29.9 0-54.7 20.5-61.8 48.2-.8 0-1.4-.2-2.2-.2-35.3 0-64 28.7-64 64 0 4.8.6 9.5 1.7 14C52.5 138 32 166.6 32 200c0 12.6 3.2 24.3 8.3 34.9C16.3 248.7 0 274.3 0 304c0 33.3 20.4 61.9 49.4 73.9-.9 4.6-1.4 9.3-1.4 14.1 0 39.8 32.2 72 72 72 4.1 0 8.1-.5 12-1.2 9.6 28.5 36.2 49.2 68 49.2 39.8 0 72-32.2 72-72V64c0-35.3-28.7-64-64-64zm368 304c0-29.7-16.3-55.3-40.3-69.1 5.2-10.6 8.3-22.3 8.3-34.9 0-33.4-20.5-62-49.7-74 1-4.5 1.7-9.2 1.7-14 0-35.3-28.7-64-64-64-.8 0-1.5.2-2.2.2C422.7 20.5 397.9 0 368 0c-35.3 0-64 28.6-64 64v376c0 39.8 32.2 72 72 72 31.8 0 58.4-20.7 68-49.2 3.9.7 7.9 1.2 12 1.2 39.8 0 72-32.2 72-72 0-4.8-.5-9.5-1.4-14.1 29-12 49.4-40.6 49.4-73.9z"/></svg> Categorical variables are **nominal** if they have no order, **ordinal** if there is an order, and **binary** if there are only two options.
???
- A `categorical variable` is
  - `nominal` if its categories have no order (e.g. 'Ghana' and 'Uruguay')
  - `ordinal` if its categories have an order (e.g. 'low', 'medium', and 'high' income)
  - `binary` if it has only two mutually exclusive categories (e.g. 'extinct' and 'alive').
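In R these distinctions map onto factors. The sketch below uses small made-up vectors (the example values come from this slide, the object names are assumptions) to show how an ordered factor differs from an unordered one.

```r
# Nominal: categories with no inherent order.
country <- factor(c("Ghana", "Uruguay", "Ghana"))

# Ordinal: categories with an order, declared via `levels` and `ordered = TRUE`.
income <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

# Binary: only two mutually exclusive categories.
status <- factor(c("extinct", "alive", "alive"), levels = c("extinct", "alive"))

is.ordered(country)    # is the order meaningful?
is.ordered(income)
income[1] < income[2]  # comparisons only make sense for ordered factors
nlevels(status)
```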
---
# Read in the data
```r
salaries <- read_csv("../data/faculty-data.csv")
salaries <- salaries %>%
  mutate(department = as_factor(department))
```
???
Let's read in faculty-data.csv and store it in the data.frame `salaries`. We then convert department to a factor so that R treats it as a categorical variable.
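If you do not have the CSV to hand, the same two steps can be tried on a tiny file. This is only a sketch: the file contents are made up, and it uses base R's `read.csv()` and `factor()` in place of `read_csv()` and `as_factor()`.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ids,department,salary",
             "1,sociology,45000",
             "2,biology,56000",
             "3,sociology,46000"),
           csv_path)

# Base R stand-in for read_csv() followed by mutate(as_factor()).
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

# Keep the levels in order of first appearance rather than alphabetical order.
toy$department <- factor(toy$department, levels = unique(toy$department))

str(toy)
levels(toy$department)
```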
---
## Explore the data
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> What are the variables in our data set?
```r
head(salaries)
```
```
## # A tibble: 6 x 6
##     ids department   bases experience raises salary
##   <dbl> <fct>        <dbl>      <dbl>  <dbl>  <dbl>
## 1     1 sociology   39012.          3  2122. 45379.
## 2     2 biology     51872.          9   542. 56747.
## 3     3 english     64341.          3   543. 65971.
## 4     4 informatics 68975.          2  1737. 72449.
## 5     5 statistics  78262.          9   470. 82492.
## 6     6 sociology   40527.          3  1884. 46180.
```
???
We can explore our data by looking at the first six rows using the function `head`. From the output, what are the variables in our data set?
Answer:
- ids = id for each individual
- department = department (e.g. sociology, biology etc.)
- bases = starting salary
- experience = years of experience
- raises = raise per year
- salary = current salary
---
## Explore the data
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> How many departments are in the data?
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> How many employees are in each department?
```r
summary(salaries)
```
```
##       ids              department     bases         experience
##  Min.   :  1.00   sociology  :20   Min.   :37247   Min.   :0.00
##  1st Qu.: 25.75   biology    :20   1st Qu.:48071   1st Qu.:3.00
##  Median : 50.50   english    :20   Median :59432   Median :5.00
##  Mean   : 50.50   informatics:20   Mean   :60226   Mean   :4.82
##  3rd Qu.: 75.25   statistics :20   3rd Qu.:72671   3rd Qu.:7.25
##  Max.   :100.00                    Max.   :87010   Max.   :9.00
##      raises           salary
##  Min.   : 450.6   Min.   :42827
##  1st Qu.: 481.6   1st Qu.:54746
##  Median : 540.4   Median :62375
##  Mean   :1033.5   Mean   :65448
##  3rd Qu.:1741.0   3rd Qu.:78632
##  Max.   :2177.4   Max.   :91342
```
???
We can use `summary` to get a feel for the spread of our data, including the range of salaries and the number of individuals in each department.
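The two questions on the slide can also be answered directly. A minimal sketch, assuming a made-up data frame `toy` with a factor column called `department`:

```r
# Toy data (made up for illustration): one row per staff member.
toy <- data.frame(
  ids = 1:6,
  department = factor(c("sociology", "biology", "english",
                        "sociology", "biology", "english"))
)

n_departments <- nlevels(toy$department)  # number of distinct departments
staff_per_dept <- table(toy$department)   # number of rows in each department

n_departments
staff_per_dept
```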
---
## Exercise
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> What does each row in the data.frame represent?
```r
#View(______)
```
--
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Which variables are numeric?
--
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Are they continuous or discrete?
--
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Which variables are categorical?
--
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Are they nominal, ordinal or binary?
???
We can use View (with a capital V) to inspect our data. What does each row in our data.frame represent?
Answer: One employee at the University (i.e. a staff member)
Looking at the output from summary:
- Which variables are numeric?
Answer: bases, experience, raises, salary
- Which variables are categorical?
Answer: This is a tricky one. In fact ids (although represented as numbers) are categorical. Department is also categorical.
- Are the categorical variables nominal, ordinal or binary?
Answer: Again, this is a tricky one. Both variables are nominal. The id number does not tell us anything about the person (e.g. whether they are more experienced), so it is nominal.
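The point about ids can be checked in code. In this sketch the data frame `toy` is made up for illustration; it shows that how R stores a column and what the column means are separate questions.

```r
toy <- data.frame(
  ids = 1:4,
  department = factor(c("sociology", "biology", "english", "biology")),
  experience = c(3, 9, 3, 2),
  salary = c(45000, 56000, 66000, 72000)
)

# Which columns does R store as numbers?
stored_as_numeric <- sapply(toy, is.numeric)
stored_as_numeric

# ids is stored as a number but acts as a label, so we can make that explicit.
toy$ids <- factor(toy$ids)
sapply(toy, is.numeric)
```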
---
## What is variation and covariation?
- `Variation`: the tendency of the values of a variable to change from measurement to measurement.
--
- `Covariation`: the tendency of the values of a variable to change with the values of another variable.
???
Introduce the slide: Now we know more about the structure of our data we can explore the variation and covariation in the variables. Knowing the variation and covariation between variables can help us to understand the spread of the data and potential relationships in the data that may give insight into modelling.
Variation is the tendency of values of a variable to change from measurement to measurement.
Covariation is the tendency of values of a variable to change with the values of another variable.
(Lead in to next slide)
Visualisation is a great initial tool to explore these relationships further. How you visualise this variation depends on whether the variable is categorical or continuous.
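Alongside plots, variation and covariation can be summarised numerically. The sketch below uses simulated values (an assumption for illustration, not the workshop data) with base R's `var()`, `cov()` and `cor()`.

```r
set.seed(42)  # make the simulation reproducible

# Simulated data: salary rises with experience, plus random noise.
experience <- sample(0:9, size = 100, replace = TRUE)
salary <- 50000 + 1500 * experience + rnorm(100, mean = 0, sd = 3000)

# Variation: how much one variable changes from measurement to measurement.
var(salary)
sd(salary)

# Covariation: how two variables change together.
cov(experience, salary)
cor(experience, salary)  # covariance rescaled to lie between -1 and 1
```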
---
## Where does Exploratory Data Analysis fit?
<img src="images/tidydata_5.jpg" title="Data Analysis assembly line: Wrangle, Visualise, Model" alt="Data Analysis assembly line: Wrangle, Visualise, Model" width="550" height="70%" style="display: block; margin: auto;" />
--
- Hypothesis generation
--
- Data exploration
--
- Formal statistical testing
???
So where does exploratory data analysis fit into our analysis workflow and how can we build it in effectively?
Data Visualisation is helpful for
- Hypothesis generation,
- Data exploration,
- and is supported by formal statistical testing.
---
class: inverse, middle, center
# Using plots to explore variation and covariation
???
In this section we'll explore how we can use plotting to explore variation and covariation within our data.
To choose the most suitable visualisation to explore variation in a variable and covariation between variables, we need to consider the type of variable(s), e.g. continuous or categorical.
---
# Single continuous variable
.left-code[
```r
ggplot(data = salaries) +
  geom_density(aes(salary), fill = "blue")
```
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot1a-1.png" width="75%" height="70%" />
]
???
To explore the variation in a single continuous variable, such as salary, we can use a density plot. The area under the density curve is equal to one. The height of the curve shows how densely the data are concentrated around each value, in this case of salary.
What do you notice from this plot about professors' salaries?
Answer: The data is bimodal: it has two peaks. Many professors have a salary near 55,000 USD, and another large group has a salary of about 82,000 USD.
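The claim that the area under a density curve is one can be checked numerically. This is a sketch on simulated bimodal salaries (made-up values), using base R's `density()` as a stand-in for the kernel density estimate that `geom_density()` draws.

```r
set.seed(1)

# Simulated bimodal salaries (made-up values, loosely echoing the two peaks).
salary <- c(rnorm(60, mean = 55000, sd = 4000),
            rnorm(40, mean = 82000, sd = 4000))

# Kernel density estimate evaluated on a grid of salary values.
dens <- density(salary)

# Approximate the area under the curve with the trapezium rule.
widths <- diff(dens$x)
heights <- (head(dens$y, -1) + tail(dens$y, -1)) / 2
area <- sum(widths * heights)
area
```

The result is very close to, but not exactly, one because the curve is only evaluated on a finite grid.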
---
# Single categorical variable
.left-code[
```r
ggplot(data = salaries) +
  geom_bar(aes(x = department, fill = department)) +
  labs(x = "Department", y = "# Staff", fill = "Department")
```
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot2a-1.png" width="75%" height="70%" />
]
???
For categorical variables we may want to visualise the counts of individuals in different categories. In this case we can look at the number of individuals in each department using a bar plot.
What do you notice about the data from this plot?
Answer: Each department has 20 individuals. This means that the data are evenly balanced across departments.
---
# Two continuous variables
.left-code[
```r
ggplot(data = salaries) +
  geom_point(aes(x = experience, y = salary, color = department))
```
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot3a-1.png" width="75%" height="70%" />
]
???
To visualise two continuous variables we can use a scatter plot. In this case we can make a scatterplot to look at how salary (y-axis) changes with years of experience (x-axis). This is done using geom_point().
Q: Can you spot the third variable?
A: Department.
---
# Two continuous variables
* **positive relationship**: as one variable increases, the other variable increases
???
Plotting two continuous variables, we can see how they change in relation to each other. Between the variables we may see:
* **positive relationship**: as one variable increases, the other variable increases
--
* **negative relationship**: as one variable increases, the other decreases
???
* **negative relationship**: as one variable increases, the other decreases
--
* **no relationship**: no discernible pattern of change in one variable with the other.
???
* **no relationship**: no discernible pattern of change in one variable with the other.
--
* **non-linear relationship**: we may also be able to pick out other patterns, e.g. *polynomials*.
???
* **non-linear relationship**: we may also be able to pick out other patterns, e.g. *polynomials*.
---
# Do salaries increase with experience?
.left-code[
```r
ggplot(data = salaries) +
  geom_point(aes(x = experience, y = salary, color = department))
```
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot4a-1.png" width="75%" height="70%" />
]
???
What is the relationship between salary and experience?
Answer: positive relationship
Does it change by department?
Answer: Yes, salaries increase more steeply with experience in some departments than in others.
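"Steeper" can be put into numbers by estimating a slope within each group. The sketch below is an illustration on simulated data with two made-up departments; it uses `lm()`, which is introduced properly in the next section, and is not the analysis used in this workshop.

```r
set.seed(7)

# Simulated data: two made-up departments with different yearly raises.
toy <- data.frame(
  department = rep(c("dept_a", "dept_b"), each = 50),
  experience = rep(0:9, times = 10)
)
true_raise <- ifelse(toy$department == "dept_a", 500, 2000)
toy$salary <- 50000 + true_raise * toy$experience + rnorm(100, sd = 1000)

# Fit salary ~ experience separately within each department and keep the slope.
slopes <- sapply(split(toy, toy$department), function(d) {
  coef(lm(salary ~ experience, data = d))[["experience"]]
})
slopes
```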
---
# A categorical and continuous variable
.left-code[
```r
ggplot(data = salaries) +
  geom_boxplot(aes(x = department, y = salary, color = department))
```
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot5a-1.png" width="75%" height="70%" />
]
???
We can use box plots to visualise a categorical and a continuous variable. In this case we can look at the spread of salary for each department.
* A box plot gives us a visual representation of the distribution of numeric data using quartiles. It can be a good way to see how the data is spread and to identify potential outliers.
* The line in the middle of the box shows the median (second quartile).
* The edges of the box are the first and third quartiles (25\% and 75\%); the distance between them is the interquartile range (IQR).
* The whiskers extend to the most extreme observations that lie within 1.5 x IQR of the box, i.e. no lower than (Q1 - 1.5 x IQR) and no higher than (Q3 + 1.5 x IQR). Points beyond the whiskers are plotted individually as potential outliers.
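The quantities behind a box plot can be computed by hand. This sketch uses a small made-up vector and the default definition of `quantile()`; other software may compute quartiles slightly differently.

```r
# Made-up salaries (in thousands) with one unusually high value.
salary <- c(42, 45, 47, 50, 52, 55, 58, 60, 95)

q1 <- unname(quantile(salary, 0.25))
q3 <- unname(quantile(salary, 0.75))
iqr <- q3 - q1  # same as IQR(salary)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
whiskers <- range(salary[salary >= lower_fence & salary <= upper_fence])

# Observations outside the fences are drawn as individual points.
outliers <- salary[salary < lower_fence | salary > upper_fence]

c(q1 = q1, median = median(salary), q3 = q3, iqr = iqr)
whiskers
outliers
```

Here the value 95 lies above the upper fence, so the upper whisker stops at 60 and 95 is flagged as a potential outlier.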
---
# A categorical and continuous variable
.left-code[
```r
ggplot(data = salaries) +
  geom_violin(aes(x = department, y = salary, fill = department))
```
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Which departments have the highest salaries?
<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M504 256c0 136.997-111.043 248-248 248S8 392.997 8 256C8 119.083 119.043 8 256 8s248 111.083 248 248zM262.655 90c-54.497 0-89.255 22.957-116.549 63.758-3.536 5.286-2.353 12.415 2.715 16.258l34.699 26.31c5.205 3.947 12.621 3.008 16.665-2.122 17.864-22.658 30.113-35.797 57.303-35.797 20.429 0 45.698 13.148 45.698 32.958 0 14.976-12.363 22.667-32.534 33.976C247.128 238.528 216 254.941 216 296v4c0 6.627 5.373 12 12 12h56c6.627 0 12-5.373 12-12v-1.333c0-28.462 83.186-29.647 83.186-106.667 0-58.002-60.165-102-116.531-102zM256 338c-25.365 0-46 20.635-46 46 0 25.364 20.635 46 46 46s46-20.636 46-46c0-25.365-20.635-46-46-46z"/></svg> Do you think the departments are very different from one another?
]
.right-plot[
<img src="linear_modelling_slides_files/figure-html/first-plot6a-1.png" width="75%" height="70%" />
]
???
Violin plots are what you get when you cross a box plot with a density plot. If you were to rotate the plot and look at the top half of the plot (it's symmetrical), this would be the same as the density plot.
From violin plots we can get a better idea of the spread and shape of the data. For instance, look at the biology department. The distribution of salaries is concentrated around 55,000, but more of the data is below 55,000 than above it.
---
class: inverse, middle, center
# Model Basics and Construction
???
In the next section we will cover model basics and construction.
---
# Learning objectives
* What is a model family and fitted model?
???
In this section, we will become familiar with:
* What is a model family and fitted model?
--
* What is the difference between a response and an explanatory variable?
???
* What is the difference between a response and an explanatory variable?
--
* How to construct a linear model in R.
???
* How to construct a linear model in R.
--
* What are the slope and intercept in a linear model?
???
* What are the slope and intercept in a linear model?
--
* Picking out key information from the model table
???
* Picking out key information from the model table
--
* How to extract specific parameters from the model object
???
* How to extract specific parameters from the model object
---
## Model families and fitted models
A goal of models is to partition data into **patterns** and **residuals**.
1. **Family of models:**
* Expresses a precise, but generic, pattern that you wish to capture.
* E.g., a straight line `\(y = ax + b\)` or quadratic curve `\(y = ax^2 + bx + c\)`.
???
* Ideally your model will
* capture **signals** (i.e. patterns)
* and ignore **noise** (i.e. random variation).
A goal of models is to partition data into **patterns** and **residuals**.
There are two key parts to a model:
1. Family of models:
* Define a family of models that expresses a precise, but generic, pattern that you wish to capture. For example, a straight line `\(y = ax + b\)` or a quadratic curve `\(y = ax^2 + bx + c\)`. Here `\(x\)` and `\(y\)` are known variables from your data, and `\(a\)`, `\(b\)`, and `\(c\)` are parameters that can vary to capture different patterns.
--
2. **Fitted model**
* The **model family** is expressed as an equation.
* In model fitting, the different parameters are able to take on different values to **adjust** the shape of the **fitted line** to the data.
???
2. Fitted model
* After you've chosen your model family, the next step is to generate a fitted model from that family that is closest to your data.
* the **model family** is expressed as an equation, where the different parameters are able to take on different values to adjust the shape of the fitted line to the data.
--
* N.B. The fitted model is just the closest model from a family of models.
???
N.B. The fitted model is just the closest model from a family of models.
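To make the distinction concrete, the sketch below treats "all straight lines" as the family and compares a few hand-picked members with the one that `lm()` selects. The data are simulated, and the helper names `line_family()` and `ssr()` are made up for this illustration; "closest" is measured here by the sum of squared residuals, which is the criterion least squares uses.

```r
set.seed(123)

# Simulated data from a straight line plus noise (assumed values).
x <- seq(0, 9, length.out = 50)
y <- 3 + 2 * x + rnorm(50, sd = 1)

# The model family: every straight line y = a * x + b.
line_family <- function(x, a, b) a * x + b

# Sum of squared residuals: how far a member of the family is from the data.
ssr <- function(a, b) sum((y - line_family(x, a, b))^2)

# A few hand-picked members of the family...
candidates <- data.frame(a = c(0, 1, 2, 3), b = c(10, 5, 3, 0))
candidates$ssr <- mapply(ssr, candidates$a, candidates$b)
candidates

# ...and the fitted model: the member that lm() finds by least squares.
fit <- lm(y ~ x)
fitted_ssr <- sum(residuals(fit)^2)
fitted_ssr
```

No hand-picked line can have a smaller sum of squared residuals than the fitted one, which is what "closest model from the family" means in this setting.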
---
## Model families and fitted models: an analogy
<img src="images/clothesline.jpg" height="300px" />
<img src="images/tailor_fit.jpg" height="300px" />
* Models are garments and fitting models is like tailoring.
???
If we use clothes as an analogy, the family of models is like an assortment of garments you could choose to 'clothe' the data in. Just as some clothes will be more suitable than others depending on what you wish to do (e.g. a nice dress for a wedding), the same is true for models. The type of model will depend on the type of data you have and what you wish to do with your analysis.
Model fitting is similar to getting a garment tailored. Just as you might make alterations to improve the fit of a garment, you will adapt the chosen model to get a better fit to the data.
---
## What is the difference between a response and an explanatory variable?
* **Response variable**: the measured variable you are trying to explain
* `Dependent`, `target` (machine learning)
???
In data science you will see a number of terms used to refer to the same things. Here, we will use `response variable` to refer to the measured variable you are trying to explain. We will use `explanatory variables` to refer to the measured variables that we use to try to explain the response variable. Other terms that you may come across for these concepts include:
* **response variable**: `dependent`, `target` (machine learning)
--
* **Explanatory variables**: measured variables that we use to try to explain the response variable.
* `Independent`, `features` (machine learning)
???
* **explanatory variables**: `independent`, `features` (machine learning)
--
* Generally the response variable is shown on the y axis.
???
Generally the response variable is shown on the y axis
---
## Linear Models
Linear models take the mathematical form:
`\(y = ax + b\)`
* `\(y\)` is the response variable
* `\(a\)` is the slope
* `\(x\)` is an explanatory variable
* `\(b\)` is the intercept.
???
Linear regression is one of the most important and widely used regression techniques. Its main advantages are its simple structure and the ease of interpreting its results.
The linear model takes the mathematical form:
`\(y = ax + b\)`
where
* `\(y\)` is the response variable
* `\(a\)` is the slope
* `\(x\)` is an explanatory variable
* `\(b\)` is the intercept.
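
As a minimal sketch of this form, the R snippet below evaluates `y = ax + b` for a few values of `x`. The values of `a` and `b` are arbitrary and chosen only for illustration; changing them picks out a different line from the same family.

```r
# Illustrative values only: the slope and intercept are not taken from any data.
a <- 2               # slope
b <- 1               # intercept
x <- c(0, 1, 2, 3)   # explanatory variable

# Each value of y is a times x plus b.
y <- a * x + b
y
```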
---
## Linear Models
The intercept b and the slope a are generally denoted by the regression parameters `\(\hat{\beta_{0}}\)` and `\(\hat{\beta_{1}}\)` respectively. The `\(\hat{}\)` indicates that these are estimated.
`$$y_{i} = \hat{\beta_{0}}+\hat{\beta_{1}}x_{i}+\epsilon_{i}$$`
Notation:
???
The intercept b and the slope a are denoted by the regression parameters beta zero hat and beta one hat respectively. The hat indicates that these are estimated parameters.
--
* `\(y_{i}\)` is the response variable.
???
where y i is the response variable.
--
* `\(\hat{\beta_{0}}\)` is the intercept of the line that best fits the data.
???
where beta zero hat is the intercept of the line that best fits the data.
--
* `\(\hat{\beta_{1}}\)` is the slope of the line that best fits the data.
???
where beta one hat is the slope of the line that best fits the data.
--
* `\(x_{i}\)` is the explanatory variable.
???
where x i is the explanatory variable.
--
* `\(\epsilon_{i}\)` is the error term.
???
and epsilon i is the error term.
--
* `\(_i\)` subscript indicates that the value can vary across cases/individuals/observations.
???
where the i subscript indicates that the value can vary across cases, individuals or observations depending on the model.
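
To make the notation concrete, here is a minimal sketch using simulated data. The true intercept, slope and noise level are made-up values; the point is only that `lm()` returns estimates (the "hat" values) that are close to, but not exactly equal to, the values used to generate the data.

```r
set.seed(42)  # make the simulated example reproducible

n <- 200
beta_0 <- 5   # assumed true intercept (made up)
beta_1 <- 2   # assumed true slope (made up)

x <- runif(n, min = 0, max = 10)       # explanatory variable
epsilon <- rnorm(n, mean = 0, sd = 1)  # error term
y <- beta_0 + beta_1 * x + epsilon     # response variable

fit <- lm(y ~ x)

# Named vector: "(Intercept)" is beta zero hat, "x" is beta one hat.
estimates <- coef(fit)
estimates
```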
---
# Three different hypotheses
1. Starting salary and the rate at which salaries increase are the **same** across departments.
???
With our models we will explore three different hypotheses about our data:
1. Starting salary and the rate at which salaries increase are the **same** across departments.
--
2. Starting salary **differs** by department, but the rate at which salaries increase is the **same**.
???
2. Starting salary **differs** by department, but the rate at which salaries increase is the **same**.
--
3. Starting salary **differs** by department and the rate at which salaries increase is **different**.
???
3. Starting salary **differs** by department and the rate at which salaries increase is **different**.
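
One way these three hypotheses might be written as R model formulas is sketched below. The column names `salary`, `experience` and `department` are assumptions about the data set, and the formulas are a plain `lm()`-style reading of the hypotheses, not necessarily the exact models used elsewhere in these slides.

```r
# Hypothesis 1: one intercept and one slope shared by everyone.
h1 <- salary ~ experience

# Hypothesis 2: a separate intercept per department, one shared slope.
h2 <- salary ~ experience + department

# Hypothesis 3: a separate intercept and a separate slope per department.
h3 <- salary ~ experience * department

# List the terms that the third formula expands to.
h3_terms <- attr(terms(h3), "term.labels")
h3_terms
```

The `*` operator expands to both main effects plus their interaction, which is what lets the slope differ by department.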
---
# Hierarchical model animation
<iframe src="http://mfviz.com/hierarchical-models/" width="95%" height="90%" frameBorder="0"></iframe>
???
Our example today is inspired by the data visualisation project created by Michael Freeman.
---
class: inverse, middle, center
# Starting salary and the rate at which salaries increase are the **same** across departments.
???
Let's look at the first hypothesis: that starting salary and the rate at which salaries increase are the same across departments.
---
## How to construct a linear model in R?
Form:
```r
m1 <- lm(y ~ x, data = df)             # one explanatory variable
m1 <- lm(y ~ x1 + x2 + x3, data = df)  # several explanatory variables
```
???
We can construct a linear model in R using the `lm()` function. First you specify the formula as response variable, tilde, explanatory variable(s), and then you pass the data frame with data = df.
--
* Our hypothesis: starting salary and the rate at which salaries increase are the **same** across departments.
* What are our response and explanatory variables in this case?
???
We are interested in how salary increases with years of experience. Our hypothesis is that the starting salary and the rate at which salaries increase are the same across departments.
What are our response and explanatory variables in this case?
--
```r
m1 <- lm(salary ~ experience, data = salaries)
```
???
The response variable is the salary and the explanatory variable is experience.
--
`$$salary_{i}=\hat{\beta_{0}}+\hat{\beta_{1}}experience_{i}+\epsilon_{i}$$`
???
Here it is in its mathematical form.
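
The link between the `lm()` call and the equation can be checked directly. The sketch below uses a small simulated data frame as a stand-in for `salaries` (the real data are not reproduced here, so every number is invented) and confirms that each observed salary equals the fitted intercept plus the fitted slope times experience, plus a residual.

```r
set.seed(1)  # reproducible stand-in data; all values are invented

salaries_sim <- data.frame(experience = runif(50, min = 0, max = 10))
salaries_sim$salary <- 60000 + 700 * salaries_sim$experience +
  rnorm(50, mean = 0, sd = 10000)

m_sim <- lm(salary ~ experience, data = salaries_sim)

# Rebuild the fitted line by hand from the estimated coefficients.
beta_hat <- coef(m_sim)
by_hand <- beta_hat[["(Intercept)"]] +
  beta_hat[["experience"]] * salaries_sim$experience

# Fitted values match the equation; adding the residuals gives back the data.
all.equal(unname(fitted(m_sim)), by_hand)
all.equal(unname(fitted(m_sim) + resid(m_sim)), salaries_sim$salary)
```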
---
class: inverse, middle, center
# Assessing Model Fit
???
In this next section we're going to look at how to assess model fit.
---
# Learning Objectives
* How to pick out key information from the summary table of a fitted model.
???
The learning objectives of the next section are: how to pick out key information from the summary table of a fitted model.
--
* How to inspect model residuals to assess model fit.
???
How to inspect the model residuals to assess model fit.
--
* How to use Adjusted R-squared and AIC to compare models.
???
How to use Adjusted R-squared and AIC to compare models.
---
# Picking out key information
```r
summary(m1)
```
```
##
## Call:
## lm(formula = salary ~ experience, data = salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20694 -11479 -3459 13420 24053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62154.0 2564.3 24.238 <2e-16 ***
## experience 683.5 456.9 1.496 0.138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13140 on 98 degrees of freedom
## Multiple R-squared: 0.02232, Adjusted R-squared: 0.01235
## F-statistic: 2.238 on 1 and 98 DF, p-value: 0.1379
```
???
The summary gives us a range of diagnostic information about the model we’ve fitted:
* **Call:** the function call and arguments that were used to create the model.
* **Residuals:** a summary of the residuals, i.e. the differences between the observed values and the predicted values.
* **Estimate:** coefficient estimates for the intercept and explanatory variables.
* **Std. Error:** standard errors (i.e. an estimate of the uncertainty) of the coefficient estimates.
* **t value:** t-statistic for the test of whether the coefficient differs from 0.
* **Pr(>|t|):** p-value for the t-statistic, giving the significance of the coefficient.
* **Residual standard error:** a measure of the variation of the observations around the regression line.
* **Multiple R-squared/Adj. R-squared:** the proportion of the variance in the observed values explained by the model. The Adj. R-squared takes into account the number of explanatory variables in the model.
* **F-statistic, p-value:** an overall test of whether the model explains more variation than a model with only an intercept.
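
Each of these quantities can also be pulled out of the summary object rather than read off the printout. This is a sketch on simulated stand-in data (all values invented), using base R accessors.

```r
set.seed(1)  # reproducible stand-in data; all values are invented

sim <- data.frame(experience = runif(50, min = 0, max = 10))
sim$salary <- 60000 + 700 * sim$experience + rnorm(50, mean = 0, sd = 10000)

m_sim <- lm(salary ~ experience, data = sim)
s <- summary(m_sim)

coef_table <- coef(s)  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table
s$sigma                # residual standard error
s$r.squared            # Multiple R-squared
s$adj.r.squared        # Adjusted R-squared
s$fstatistic           # F-statistic with its degrees of freedom
```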
---
# Key Information from the Model Summary
* **Intercept interpretation:** the estimated starting salary (at zero years of experience) is 62154 USD.
* **Slope interpretation:** with every additional year of experience an employee's salary increases by an estimated 683.5 USD.
* **Amount of variation explained:** Overall the model is a poor fit, with only about 1.2 percent of the variation explained by the model, as shown by the Adjusted R-squared of 0.01235.
???
* **Intercept interpretation:** the estimated starting salary (at zero years of experience) is 62154 USD.
* **Slope interpretation:** with every additional year of experience an employee's salary increases by an estimated 683.5 USD.
* **Amount of variation explained:** Overall the model is a poor fit, with only about 1.2 percent of the variation explained by the model, as shown by the Adjusted R-squared of 0.01235.
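
As a quick arithmetic check on these interpretations, the sketch below plugs the estimates reported in the summary into the fitted line. The choice of five years of experience is arbitrary.

```r
intercept <- 62154.0      # intercept estimate reported in the summary
slope <- 683.5            # experience estimate reported in the summary
adj_r_squared <- 0.01235  # Adjusted R-squared reported in the summary

# Predicted salary after 5 years of experience (5 is an arbitrary example).
predicted_5 <- intercept + slope * 5
predicted_5               # 65571.5

# Adjusted R-squared expressed as a percentage of variation explained.
adj_r_squared * 100       # 1.235
```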
---
# Extracting the fitted model
Getting the fitted model.
```r
m1_results <- broom::augment(m1)
head(m1_results)
```
```
## # A tibble: 6 x 8
## salary experience .fitted .resid .std.resid .hat .sigma .cooksd