Companion R package for the textbook Exploratory Data Analysis for Machine Learning by Tony Thrall.
You can install the development version of eda4mlr from GitHub:
# install.packages("remotes")
remotes::install_github("tthrall/eda4mlr")

The package provides the following datasets:
| Dataset | Observations | Variables | Description |
|---|---|---|---|
| `handedness` | 6 | 3 | Handedness counts by sex for chi-squared independence test |
| `lg_competencies` | 7 | 4 | Seven data science competency areas from the IC CRG framework |
| `lg_course_prereq` | 19 | 3 | Course prerequisite edges (course → course) with rationale |
| `lg_courses` | 15 | 3 | Fifteen courses providing complete coverage of all skills |
| `lg_has_skill` | 41 | 3 | Learner skill edges (learner → skill) with proficiency level |
| `lg_learners` | 6 | 5 | Six fictional learner profiles with varying backgrounds |
| `lg_prerequisite` | 22 | 2 | Skill prerequisite edges (skill → skill) for conceptual dependencies |
| `lg_proficiency_levels` | 5 | 4 | Five-level proficiency scale (None through Master) with guidance |
| `lg_requires_skill` | 25 | 3 | Work role skill requirements (role → skill) with minimum proficiency |
| `lg_schema` | 6 | 5 | Knowledge graph schema defining node types, edge types, and constraints |
| `lg_skills` | 18 | 6 | Eighteen knowledge and skill areas (KSAs) from the data science competency framework |
| `lg_teaches` | 19 | 3 | Course teaching edges (course → skill) with proficiency ceiling |
| `lg_work_roles` | 3 | 4 | Three work roles: Data Analyst, Data Scientist, AI/ML Specialist |
| `lit_digest` | 4 | 2 | 1936 Literary Digest poll predictions vs. actual result |
| `mnist_example` | 10 | 5 | Sample of 10 MNIST handwritten digit images, one per digit 0-9 (7,840 rows in long format) |
| `mnist_test` | 1,000 | 5 | Subset of 1,000 images from the MNIST test database of handwritten digits (784,000 rows in long format) |
| `mnist_train` | 1,000 | 5 | Subset of 1,000 images from the MNIST training database of handwritten digits (784,000 rows in long format) |
| `nb10` | 100 | 2 | Repeated weighings of a standard weight (deficit in micrograms below 10g) |
| `oecd_bli` | 36 | 26 | OECD Better Life indicators by country (2015) |
| `oecd_bli_indicators` | 24 | 5 | Metadata for OECD Better Life indicators |
| `olympic_running` | 312 | 4 | Olympic track event winning times (1896-2016) |
| `portacaval_studies` | 3 | 4 | Study counts by design type and reported improvement level |
| `portacaval_survival` | 2 | 3 | Survival rates comparing randomized vs. non-randomized designs |
| `salk_blind` | 3 | 3 | Randomized controlled double-blind trial results |
| `salk_nfip` | 3 | 4 | NFIP observed-control design results |
| `truman_dewey` | 4 | 5 | 1948 election polling predictions vs. actual result |
| `ucb_admissions` | 24 | 4 | Graduate admissions by department and sex (Simpson's paradox) |
| `us_elections` | 14 | 6 | Gallup poll accuracy for US presidential elections (1952-2004) |
| `wine_quality` | 6,497 | 13 | Wine physicochemical properties and quality ratings |
Handedness counts by sex from Freedman, Pisani & Purves (2007), used to illustrate chi-squared tests for independence.
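As a minimal sketch of the test this dataset is meant to illustrate, the example below enters the six counts listed for `handedness` by hand, so it runs without the package installed. The object names `handedness_counts`, `tab`, and `test` are illustrative choices and not part of the package; base R's `xtabs()` and `chisq.test()` do the work.

```r
# Counts copied by hand from the handedness table (sex x handedness).
handedness_counts <- data.frame(
  sex   = rep(c("male", "female"), each = 3),
  hnd   = rep(c("right", "left", "ambi"), times = 2),
  count = c(934L, 113L, 20L, 1070L, 92L, 8L)
)

# Cross-tabulate into a 2 x 3 contingency table of counts.
tab <- xtabs(count ~ sex + hnd, data = handedness_counts)

# Pearson chi-squared test of independence between sex and handedness.
# For a 2 x 3 table the degrees of freedom are (2 - 1) * (3 - 1) = 2.
test <- chisq.test(tab)

test$statistic   # chi-squared statistic
test$parameter   # degrees of freedom
test$p.value     # small values are evidence against independence
test$expected    # expected counts under independence
```

With the package installed, the same call should work on `xtabs(count ~ sex + hnd, data = handedness)`, given the column names shown in the printed tibble.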
library(eda4mlr)
data(handedness)
handedness
#> # A tibble: 6 × 3
#> sex hnd count
#> <chr> <chr> <int>
#> 1 male right 934
#> 2 male left 113
#> 3 male ambi 20
#> 4 female right 1070
#> 5 female left 92
#> 6 female ambi 8

The `learning_graph` object is a knowledge graph for skills-based learning in data science, based on the IC Data Science Competency Resource Guide (2023) with structure inspired by Workera.ai's skills intelligence platform. It demonstrates graph theory concepts including directed acyclic graphs, bipartite structures, weighted edges, and path algorithms.

The knowledge graph contains five node types (skills, courses, learners, work roles, competencies) and five edge types (`has_skill`, `requires_skill`, `prerequisite`, `course_prereq`, `teaches`). The complete object is available as `learning_graph`, with individual components exported as `lg_*` tibbles for convenience.
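To make the directed-acyclic-graph idea concrete, here is a small sketch of a topological sort (Kahn's algorithm) over prerequisite edges in base R. The edge list is a made-up toy example, and the column names `from` and `to` are assumptions for illustration; the actual columns of `lg_prerequisite` may differ, so adapt the names before applying this to the package data.

```r
# Hypothetical prerequisite edges: `from` should be learned before `to`.
edges <- data.frame(
  from = c("programming", "programming", "statistics", "data_wrangling"),
  to   = c("data_wrangling", "modeling", "modeling", "visualization"),
  stringsAsFactors = FALSE
)

# Kahn's algorithm: repeatedly emit a node with no unmet prerequisites.
topo_sort <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))
  in_degree <- setNames(integer(length(nodes)), nodes)
  for (v in edges$to) in_degree[v] <- in_degree[v] + 1L

  queue <- names(in_degree)[in_degree == 0L]
  ordered <- character(0)
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    ordered <- c(ordered, v)
    # Removing v satisfies one prerequisite of each of its successors.
    for (w in edges$to[edges$from == v]) {
      in_degree[w] <- in_degree[w] - 1L
      if (in_degree[w] == 0L) queue <- c(queue, w)
    }
  }
  # Nodes left unprocessed imply a cycle, so the graph is not a DAG.
  if (length(ordered) < length(nodes)) stop("cycle detected: not a DAG")
  ordered
}

study_order <- topo_sort(edges)
study_order
```

Any order returned this way places every skill after all of its prerequisites, which is the property a learning path needs.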
data(learning_graph)
# Structure overview
names(learning_graph)
#> [1] "metadata" "proficiency_levels" "nodes" "edges"
# Access nodes
names(learning_graph$nodes)
#> [1] "competencies" "skills" "work_roles" "courses" "learners"
# Access edges
names(learning_graph$edges)
#> [1] "has_skill" "requires_skill" "prerequisite" "course_prereq" "teaches"
# Example: view skills
learning_graph$nodes$skills
#> # A tibble: 18 × 6
#> skill_id skill_tag skill_name cmp_id k_or_s description
#> <int> <chr> <chr> <int> <chr> <chr>
#> 1 1 algorithms Algorithms 1 k Knowledge of designing...
#> 2 2 programming Programming 1 s Skill in programming...
#> ...
# Or use individual exports
data(lg_skills)
data(lg_prerequisite)

Sample images from the MNIST database of handwritten digits, useful for demonstrating image data and dimension reduction.
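The sketch below shows how long-format pixel data can be reshaped into one row per image and passed to a principal component analysis. It uses randomly generated pixels rather than the package data, and the column names `image_id`, `row`, `col`, and `value` are hypothetical; check `names(mnist_train)` for the real ones before adapting it.

```r
set.seed(1)
n_images <- 20L
side <- 28L  # MNIST images are 28 x 28 pixels

# Hypothetical long-format layout: one row per pixel of each image.
pixels <- expand.grid(
  col = seq_len(side),
  row = seq_len(side),
  image_id = seq_len(n_images)
)
pixels$value <- sample(0:255, nrow(pixels), replace = TRUE)

# Wide matrix: one row per image, one column per pixel (row-major order).
wide <- t(vapply(
  split(pixels, pixels$image_id),
  function(img) as.numeric(img$value[order(img$row, img$col)]),
  numeric(side * side)
))

# Recover a single image as a 28 x 28 matrix indexed as [row, col].
img1 <- matrix(wide[1, ], nrow = side, byrow = TRUE)

# Dimension reduction: project each image onto the first two principal components.
pca <- prcomp(wide, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]
dim(scores)
```

With random pixels the components carry no meaning; on real digit images, plotting `scores` coloured by digit label is a common first look at the data.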
data(mnist_example)
data(mnist_train)
data(mnist_test)
# mnist_example contains one image per digit (0-9)
# mnist_train and mnist_test contain 1000 images each
dim(mnist_train)
#> [1] 1000 5

Well-being indicators for 36 countries across 11 dimensions (housing, income, jobs, education, environment, health, etc.).
data(oecd_bli)
data(oecd_bli_indicators)
# See available indicators
oecd_bli_indicators$indicator
#> [1] "Stakeholder Engagement" "Voter Turnout"
#> [3] "Air Pollution" "Water Quality"
#> ...

Fastest running times for Olympic track events from 1896 to 2016, for men and women across seven distances.
data(olympic_running)
head(olympic_running)
#> # A tibble: 6 × 4
#> year length sex time
#> <int> <int> <chr> <dbl>
#> 1 1896 100 male 12
#> ...

Physicochemical measurements for Portuguese Vinho Verde wines, with quality ratings from expert tasters.
data(wine_quality)
dim(wine_quality)
#> [1] 6497 13

| Topic | Source |
|---|---|
| Handedness by Sex | Freedman, Pisani, Purves (4e) |
| LearningGraph | IC DSci-CRG & Workera.ai |
| MNIST subsets | Yann LeCun's MNIST Database |
| NB10 Repeated Weighings | Freedman, Pisani, Purves (4e) |
| OECD Better Life Index | OECD Better Life Index (2015) |
| Olympics | Olympics.com via tsibbledata |
| Polling and Elections | Freedman, Pisani, Purves (4e) |
| Portacaval Shunt | Freedman, Pisani, Purves (4e) |
| Salk Vaccine Trial | Freedman, Pisani, Purves (4e) |
| UC Berkeley Admissions | Freedman, Pisani, Purves (4e) |
| Wine Quality | UCI Machine Learning Repository |
- EDA for Machine Learning — the companion textbook
If you use these datasets, please cite the original sources:
Handedness by Sex
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). W.W. Norton & Company.

LearningGraph
Office of the Director of National Intelligence (2023). Competency Resource Guide for Data Science (UNCLASSIFIED). Structure inspired by the Workera.ai skills intelligence platform: https://workera.ai/

MNIST subsets
LeCun, Y., Cortes, C., & Burges, C. J. C. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

NB10 Repeated Weighings
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 6, Sec. 1. W.W. Norton & Company.

OECD Better Life Index
OECD (2015). OECD Better Life Index. https://www.oecdbetterlifeindex.org/

Olympics
tsibbledata: Diverse Datasets for 'tsibble'. https://cran.r-project.org/package=tsibbledata

Polling and Elections
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 19. W.W. Norton & Company.

Portacaval Shunt
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 2. W.W. Norton & Company.

Salk Vaccine Trial
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 1. W.W. Norton & Company.

UC Berkeley Admissions
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 4. W.W. Norton & Company. Also: Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Science, 187(4175), 398-404.

Wine Quality
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016
MIT License