Companion Datasets for "Exploratory Data Analysis for Machine Learning"

eda4mlr

Companion R package for the textbook Exploratory Data Analysis for Machine Learning by Tony Thrall.

Installation

You can install the development version of eda4mlr from GitHub:

# install.packages("remotes")
remotes::install_github("tthrall/eda4mlr")

Datasets

The package provides the following datasets:

| Dataset | Observations | Variables | Description |
|---|---|---|---|
| `handedness` | 6 | 3 | Handedness counts by sex for chi-squared independence test |
| `lg_competencies` | 7 | 4 | Seven data science competency areas from the IC CRG framework |
| `lg_course_prereq` | 19 | 3 | Course prerequisite edges (course → course) with rationale |
| `lg_courses` | 15 | 3 | Fifteen courses providing complete coverage of all skills |
| `lg_has_skill` | 41 | 3 | Learner skill edges (learner → skill) with proficiency level |
| `lg_learners` | 6 | 5 | Six fictional learner profiles with varying backgrounds |
| `lg_prerequisite` | 22 | 2 | Skill prerequisite edges (skill → skill) for conceptual dependencies |
| `lg_proficiency_levels` | 5 | 4 | Five-level proficiency scale (None through Master) with guidance |
| `lg_requires_skill` | 25 | 3 | Work role skill requirements (role → skill) with minimum proficiency |
| `lg_schema` | 6 | 5 | Knowledge graph schema defining node types, edge types, and constraints |
| `lg_skills` | 18 | 6 | Eighteen knowledge and skill areas (KSAs) from the data science competency framework |
| `lg_teaches` | 19 | 3 | Course teaching edges (course → skill) with proficiency ceiling |
| `lg_work_roles` | 3 | 4 | Three work roles: Data Analyst, Data Scientist, AI/ML Specialist |
| `lit_digest` | 4 | 2 | 1936 Literary Digest poll predictions vs. actual result |
| `mnist_example` | 10 | 5 | Sample of 10 MNIST handwritten digit images, one per digit 0-9 (7,840 rows in long format) |
| `mnist_test` | 1,000 | 5 | Subset of the MNIST test database of handwritten digits (1,000 images; 784,000 rows in long format) |
| `mnist_train` | 1,000 | 5 | Subset of the MNIST training database of handwritten digits (1,000 images; 784,000 rows in long format) |
| `nb10` | 100 | 2 | Repeated weighings of a standard weight (deficit in micrograms below 10g) |
| `oecd_bli` | 36 | 26 | OECD Better Life indicators by country (2015) |
| `oecd_bli_indicators` | 24 | 5 | Metadata for OECD Better Life indicators |
| `olympic_running` | 312 | 4 | Olympic track event winning times (1896-2016) |
| `portacaval_studies` | 3 | 4 | Study counts by design type and reported improvement level |
| `portacaval_survival` | 2 | 3 | Survival rates comparing randomized vs. non-randomized designs |
| `salk_blind` | 3 | 3 | Randomized controlled double-blind trial results |
| `salk_nfip` | 3 | 4 | NFIP observed-control design results |
| `truman_dewey` | 4 | 5 | 1948 election polling predictions vs. actual result |
| `ucb_admissions` | 24 | 4 | Graduate admissions by department and sex (Simpson's paradox) |
| `us_elections` | 14 | 6 | Gallup poll accuracy for US presidential elections (1952-2004) |
| `wine_quality` | 6,497 | 13 | Wine physicochemical properties and quality ratings |

Handedness Data

Handedness counts by sex from Freedman, Pisani & Purves (2007), used to illustrate chi-squared tests for independence.

library(eda4mlr)
data(handedness)
handedness
#> # A tibble: 6 × 3
#>   sex    hnd   count
#>   <chr>  <chr> <int>
#> 1 male   right   934
#> 2 male   left    113
#> 3 male   ambi     20
#> 4 female right  1070
#> 5 female left     92
#> 6 female ambi      8
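
As a minimal sketch of the chi-squared independence test mentioned above, the counts can be cross-tabulated and passed to base R's `chisq.test()`. The data frame below is rebuilt by hand from the printed tibble so the example runs without the package installed; with the package loaded, the `handedness` object from `data(handedness)` could be used instead. This is one reasonable way to run the test, not necessarily the textbook's own code.

```r
# Counts copied from the printed tibble above, so the sketch runs on its own
handedness <- data.frame(
  sex   = rep(c("male", "female"), each = 3),
  hnd   = rep(c("right", "left", "ambi"), times = 2),
  count = c(934L, 113L, 20L, 1070L, 92L, 8L)
)

# Cross-tabulate into a 2 x 3 contingency table (sex by handedness)
tab <- xtabs(count ~ sex + hnd, data = handedness)

# Pearson chi-squared test of independence; df = (2 - 1) * (3 - 1) = 2
result <- chisq.test(tab)

print(tab)
print(result)
```

`xtabs()` is used here because the data arrive as one row per cell with a count column, which is the layout it is designed to tabulate.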

LearningGraph Knowledge Graph

The learning_graph object is a knowledge graph for skills-based learning in data science, based on the IC Data Science Competency Resource Guide (2023) with structure inspired by Workera.ai's skills intelligence platform. It demonstrates graph theory concepts including directed acyclic graphs, bipartite structures, weighted edges, and path algorithms.

The knowledge graph contains five node types (skills, courses, learners, work roles, competencies) and five edge types (has_skill, requires_skill, prerequisite, course_prereq, teaches). The complete object is available as learning_graph, with individual components exported as lg_* tibbles for convenience.

data(learning_graph)

# Structure overview
names(learning_graph)
#> [1] "metadata"           "proficiency_levels" "nodes"              "edges"

# Access nodes
names(learning_graph$nodes)
#> [1] "competencies" "skills"       "work_roles"   "courses"      "learners"

# Access edges
names(learning_graph$edges)
#> [1] "has_skill"      "requires_skill" "prerequisite"   "course_prereq"  "teaches"

# Example: view skills
learning_graph$nodes$skills
#> # A tibble: 18 × 6
#>    skill_id skill_tag          skill_name             cmp_id k_or_s description
#>       <int> <chr>              <chr>                   <int> <chr>  <chr>
#>  1        1 algorithms         Algorithms                  1 k      Knowledge of designing...
#>  2        2 programming        Programming                 1 s      Skill in programming...
#> ...

# Or use individual exports
data(lg_skills)
data(lg_prerequisite)
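
Because the prerequisite edges form a directed acyclic graph, one natural exercise is to derive a valid learning order with a topological sort. The sketch below uses Kahn's algorithm in base R on a small made-up edge list. The column names `from` and `to` and the skill names are hypothetical, chosen only for illustration; the actual columns of `lg_prerequisite` may differ and would need to be renamed accordingly.

```r
# Hypothetical toy edge list (prerequisite -> dependent skill);
# the real lg_prerequisite column names and values may differ.
edges <- data.frame(
  from = c("algebra", "algebra", "probability", "statistics"),
  to   = c("probability", "statistics", "statistics", "modeling"),
  stringsAsFactors = FALSE
)

# Kahn's algorithm: repeatedly emit a node with no unmet prerequisites
topo_order <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))

  # In-degree = number of prerequisites still unmet for each node
  indeg <- setNames(integer(length(nodes)), nodes)
  counts <- table(edges$to)
  indeg[names(counts)] <- as.integer(counts)

  ordered <- character(0)
  ready <- names(indeg)[indeg == 0]

  while (length(ready) > 0) {
    node <- ready[1]
    ready <- ready[-1]
    ordered <- c(ordered, node)

    # Removing `node` satisfies one prerequisite of each successor
    for (nxt in edges$to[edges$from == node]) {
      indeg[nxt] <- indeg[nxt] - 1L
      if (indeg[nxt] == 0) ready <- c(ready, nxt)
    }
  }

  # Any node left unprocessed sits on a cycle
  if (length(ordered) < length(nodes)) stop("cycle detected: not a DAG")
  ordered
}

skill_order <- topo_order(edges)
print(skill_order)
```

The cycle check doubles as a quick validation that an edge list really is acyclic.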

MNIST Handwritten Digits

Sample images from the MNIST database of handwritten digits, useful for demonstrating image data and dimension reduction.

data(mnist_example)
data(mnist_train)
data(mnist_test)

# mnist_example contains one image per digit (0-9)
# mnist_train and mnist_test contain 1000 images each
dim(mnist_train)
#> [1] 1000    5
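
Each MNIST image is a 28 × 28 grid, which is where the 784 rows per image in long format come from. The sketch below shows one way to turn long-format pixels back into a matrix using base R matrix indexing. It runs on a synthetic image, and the column names `row`, `col`, and `value` are assumptions for illustration; the package's actual column names are not shown in this README and may differ.

```r
# Synthetic stand-in for ONE image in long format: one row per pixel.
# Column names (row, col, value) are hypothetical.
px <- expand.grid(row = 1:28, col = 1:28)
px$value <- as.integer((px$row + px$col) %% 256)

# Fill a 28 x 28 matrix by (row, col) position using a two-column index matrix
img <- matrix(0L, nrow = 28, ncol = 28)
img[cbind(px$row, px$col)] <- px$value

dim(img)

# To display it, flip rows so the image is not drawn upside down:
# image(t(img[28:1, ]), col = gray.colors(256), axes = FALSE)
```

Indexing with `cbind(row, col)` does not depend on the order of the long-format rows, so it still works if the pixels have been shuffled or filtered.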

OECD Better Life Index

Well-being indicators for 36 countries across 11 dimensions (housing, income, jobs, education, environment, health, etc.).

data(oecd_bli)
data(oecd_bli_indicators)

# See available indicators
oecd_bli_indicators$indicator
#>  [1] "Stakeholder Engagement"    "Voter Turnout"
#>  [3] "Air Pollution"             "Water Quality"
#> ...

Olympic Running Data

Fastest running times for Olympic track events from 1896 to 2016, for men and women across seven distances.

data(olympic_running)
head(olympic_running)
#> # A tibble: 6 × 4
#>    year length sex   time
#>   <int>  <int> <chr> <dbl>
#> 1  1896    100 male   12
#> ...
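
A common first step with this table is to pull out a single event and look at how winning times change over the years. The helper below is a small sketch using the column names shown in the printed tibble (`year`, `length`, `sex`, `time`). The function name `select_event` is hypothetical and not part of the package. It drops missing times in case some years have no recorded result, and the plotting step only runs if the package is installed.

```r
# Hypothetical helper (not part of eda4mlr): keep one event, sorted by year
select_event <- function(data, event_length, event_sex) {
  keep <- data$length == event_length &
    data$sex == event_sex &
    !is.na(data$time)
  out <- data[keep, ]
  out[order(out$year), ]
}

# Only runs when the package is available
if (requireNamespace("eda4mlr", quietly = TRUE)) {
  data("olympic_running", package = "eda4mlr")
  men_100 <- select_event(olympic_running, 100, "male")
  plot(men_100$year, men_100$time, type = "b",
       xlab = "Year", ylab = "Winning time (s)",
       main = "Men's 100 m")
}
```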

Wine Quality Data

Physicochemical measurements for Portuguese Vinho Verde wines, with quality ratings from expert tasters.

data(wine_quality)
dim(wine_quality)
#> [1] 6497   13

Data Sources

| Topic | Source |
|---|---|
| Handedness by Sex | Freedman, Pisani, Purves (4e) |
| LearningGraph | IC DSci-CRG & Workera.ai |
| MNIST subsets | Yann LeCun's MNIST Database |
| NB10 Repeated Weighings | Freedman, Pisani, Purves (4e) |
| OECD Better Life Index | OECD Better Life Index (2015) |
| Olympics | Olympics.com via tsibbledata |
| Polling and Elections | Freedman, Pisani, Purves (4e) |
| Portacaval Shunt | Freedman, Pisani, Purves (4e) |
| Salk Vaccine Trial | Freedman, Pisani, Purves (4e) |
| UC Berkeley Admissions | Freedman, Pisani, Purves (4e) |
| Wine Quality | UCI Machine Learning Repository |

Related

Citations

If you use these datasets, please cite the original sources:

Handedness by Sex

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). W.W. Norton & Company.

LearningGraph

Office of the Director of National Intelligence (2023). Competency Resource Guide for Data Science (UNCLASSIFIED). Structure inspired by the Workera.ai skills intelligence platform: https://workera.ai/

MNIST subsets

LeCun, Y., Cortes, C., & Burges, C. J. C. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

NB10 Repeated Weighings

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 6, Sec. 1. W.W. Norton & Company.

OECD Better Life Index

OECD (2015). OECD Better Life Index. https://www.oecdbetterlifeindex.org/

Olympics

tsibbledata: Diverse Datasets for 'tsibble'. https://cran.r-project.org/package=tsibbledata

Polling and Elections

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 19. W.W. Norton & Company.

Portacaval Shunt

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 2. W.W. Norton & Company.

Salk Vaccine Trial

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 1. W.W. Norton & Company.

UC Berkeley Admissions

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 4. W.W. Norton & Company. Also: Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Science, 187(4175), 398-404.

Wine Quality

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

License

MIT License
