eda4mlr

Companion R package for the textbook Exploratory Data Analysis for Machine Learning by Tony Thrall.

Installation

You can install the development version of eda4mlr from GitHub:

# install.packages("remotes")
remotes::install_github("tthrall/eda4mlr")

Datasets

The package provides the following datasets:

Dataset	Observations	Variables	Description
`handedness`	6	3	Handedness counts by sex for chi-squared independence test
`lg_competencies`	7	4	Seven data science competency areas from the IC CRG framework
`lg_course_prereq`	19	3	Course prerequisite edges (course → course) with rationale
`lg_courses`	15	3	Fifteen courses providing complete coverage of all skills
`lg_has_skill`	41	3	Learner skill edges (learner → skill) with proficiency level
`lg_learners`	6	5	Six fictional learner profiles with varying backgrounds
`lg_prerequisite`	22	2	Skill prerequisite edges (skill → skill) for conceptual dependencies
`lg_proficiency_levels`	5	4	Five-level proficiency scale (None through Master) with guidance
`lg_requires_skill`	25	3	Work role skill requirements (role → skill) with minimum proficiency
`lg_schema`	6	5	Knowledge graph schema defining node types, edge types, and constraints
`lg_skills`	18	6	Eighteen knowledge and skill areas (KSAs) from the data science competency framework
`lg_teaches`	19	3	Course teaching edges (course → skill) with proficiency ceiling
`lg_work_roles`	3	4	Three work roles: Data Analyst, Data Scientist, AI/ML Specialist
`lit_digest`	4	2	1936 Literary Digest poll predictions vs. actual result
`mnist_example`	10	5	Ten sample MNIST handwritten digit images, one per digit 0-9 (7,840 rows in long format)
`mnist_test`	1,000	5	Subset of the MNIST test database of handwritten digits (1,000 images; 784,000 rows in long format)
`mnist_train`	1,000	5	Subset of the MNIST training database of handwritten digits (1,000 images; 784,000 rows in long format)
`nb10`	100	2	Repeated weighings of a standard weight (deficit in micrograms below 10g)
`oecd_bli`	36	26	OECD Better Life indicators by country (2015)
`oecd_bli_indicators`	24	5	Metadata for OECD Better Life indicators
`olympic_running`	312	4	Olympic track event winning times (1896-2016)
`portacaval_studies`	3	4	Study counts by design type and reported improvement level
`portacaval_survival`	2	3	Survival rates comparing randomized vs. non-randomized designs
`salk_blind`	3	3	Randomized controlled double-blind trial results
`salk_nfip`	3	4	NFIP observed-control design results
`truman_dewey`	4	5	1948 election polling predictions vs. actual result
`ucb_admissions`	24	4	Graduate admissions by department and sex (Simpson's paradox)
`us_elections`	14	6	Gallup poll accuracy for US presidential elections (1952-2004)
`wine_quality`	6,497	13	Wine physicochemical properties and quality ratings
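
The `ucb_admissions` entry in the table is flagged as an example of Simpson's paradox. As a rough illustration of what that means, the sketch below uses base R's built-in `UCBAdmissions` table, which is documented as coming from the same Bickel, Hammel & O'Connell (1975) study. It does not use the package's `ucb_admissions` tibble, because its column names are not shown in this README. The names `adm`, `by_gender`, `overall_rate`, and `dept_rate` are introduced here for illustration only.

```r
# Base R's UCBAdmissions: a 2 x 2 x 6 table (Admit x Gender x Dept).
# This is a stand-in for eda4mlr's ucb_admissions, not the package object.
adm <- UCBAdmissions

# Aggregate admission rate by gender, ignoring department
by_gender <- apply(adm, c(1, 2), sum)               # Admit x Gender
overall_rate <- by_gender["Admitted", ] / colSums(by_gender)

# Admission rate by gender within each department
dept_rate <- adm["Admitted", , ] / apply(adm, c(2, 3), sum)  # Gender x Dept

round(overall_rate, 2)
round(dept_rate, 2)
```

Comparing the two printed results shows the paradox: the aggregate rate favors men, while the department-level rates do not show the same pattern, because men and women applied to departments with very different admission rates.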

Handedness Data

Handedness counts by sex from Freedman, Pisani & Purves (2007), used to illustrate chi-squared tests for independence.

library(eda4mlr)
data(handedness)
handedness
#> # A tibble: 6 × 3
#>   sex    hnd   count
#>   <chr>  <chr> <int>
#> 1 male   right   934
#> 2 male   left    113
#> 3 male   ambi     20
#> 4 female right  1070
#> 5 female left     92
#> 6 female ambi      8
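
Since this dataset is intended for chi-squared tests of independence, a minimal sketch of such a test may help. To stay self-contained, it rebuilds the six counts printed above as a plain data frame instead of loading the package; the names `handedness_df`, `hand_tab`, and `hand_test` are illustrative, not part of eda4mlr.

```r
# Counts copied from the printed `handedness` tibble above
handedness_df <- data.frame(
  sex   = rep(c("male", "female"), each = 3),
  hnd   = rep(c("right", "left", "ambi"), times = 2),
  count = c(934L, 113L, 20L, 1070L, 92L, 8L),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a 2 x 3 contingency table (sex by handedness)
hand_tab <- xtabs(count ~ sex + hnd, data = handedness_df)

# Pearson chi-squared test of independence between sex and handedness
hand_test <- chisq.test(hand_tab)
hand_test$parameter  # degrees of freedom: (2 - 1) * (3 - 1) = 2
hand_test$p.value
```

With the package installed, the same call should work on `xtabs(count ~ sex + hnd, data = handedness)`, given the column names shown in the printed tibble.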

LearningGraph Knowledge Graph

The learning_graph object is a knowledge graph for skills-based learning in data science, based on the IC Data Science Competency Resource Guide (2023) with structure inspired by Workera.ai's skills intelligence platform. It demonstrates graph theory concepts including directed acyclic graphs, bipartite structures, weighted edges, and path algorithms.

The knowledge graph contains five node types (skills, courses, learners, work roles, competencies) and five edge types (has_skill, requires_skill, prerequisite, course_prereq, teaches). The complete object is available as learning_graph, with individual components exported as lg_* tibbles for convenience.

data(learning_graph)

# Structure overview
names(learning_graph)
#> [1] "metadata"           "proficiency_levels" "nodes"              "edges"

# Access nodes
names(learning_graph$nodes)
#> [1] "competencies" "skills"       "work_roles"   "courses"      "learners"

# Access edges
names(learning_graph$edges)
#> [1] "has_skill"      "requires_skill" "prerequisite"   "course_prereq"  "teaches"

# Example: view skills
learning_graph$nodes$skills
#> # A tibble: 18 × 6
#>    skill_id skill_tag          skill_name             cmp_id k_or_s description
#>       <int> <chr>              <chr>                   <int> <chr>  <chr>
#>  1        1 algorithms         Algorithms                  1 k      Knowledge of designing...
#>  2        2 programming        Programming                 1 s      Skill in programming...
#> ...

# Or use individual exports
data(lg_skills)
data(lg_prerequisite)
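
Because the prerequisite edges are meant to form a directed acyclic graph, one natural exercise is to derive a valid learning order with a topological sort. The sketch below implements Kahn's algorithm in base R on a small toy edge list. The edge data and the column names `from` and `to` are hypothetical: the actual columns of `lg_prerequisite` are not shown in this README, so they would need to be adapted.

```r
# Toy prerequisite edges (hypothetical; NOT the contents of lg_prerequisite).
# Each row reads: `from` must be learned before `to`.
edges <- data.frame(
  from = c("programming", "programming", "statistics", "algorithms"),
  to   = c("algorithms", "data_wrangling", "machine_learning", "machine_learning"),
  stringsAsFactors = FALSE
)

# Kahn's algorithm: repeatedly emit a node with no unmet prerequisites
topo_sort <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))
  # In-degree = number of unmet prerequisites for each node
  indeg <- setNames(integer(length(nodes)), nodes)
  counts <- table(edges$to)
  indeg[names(counts)] <- as.integer(counts)

  ready <- names(indeg)[indeg == 0]
  ordered <- character(0)
  while (length(ready) > 0) {
    node <- ready[1]
    ready <- ready[-1]
    ordered <- c(ordered, node)
    # Emitting `node` satisfies one prerequisite of each successor
    for (nxt in edges$to[edges$from == node]) {
      indeg[nxt] <- indeg[nxt] - 1L
      if (indeg[nxt] == 0) ready <- c(ready, nxt)
    }
  }
  # Any node never emitted sits on a cycle
  if (length(ordered) < length(nodes)) stop("cycle detected: not a DAG")
  ordered
}

learning_order <- topo_sort(edges)
learning_order
```

The function stops with an error when the edges contain a cycle, so it doubles as a simple check that a prerequisite structure really is acyclic.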

MNIST Handwritten Digits

Sample images from the MNIST database of handwritten digits, useful for demonstrating image data and dimension reduction.

data(mnist_example)
data(mnist_train)
data(mnist_test)

# mnist_example contains one image per digit (0-9)
# mnist_train and mnist_test contain 1000 images each
dim(mnist_train)
#> [1] 1000    5
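
The dataset table describes these images as stored in long format, which suggests one row per pixel: a 28 x 28 image gives 784 rows, so 10 images give 7,840 rows and 1,000 images give 784,000. The sketch below shows one way to turn such long-format pixels back into a matrix. It uses a synthetic image, and the column names `image_id`, `row`, `col`, and `value` are assumptions for illustration; the package's actual column names are not shown in this README.

```r
# Synthetic long-format image: one row per pixel.
# Column names here are hypothetical, not taken from eda4mlr.
n_side <- 28L
long_img <- expand.grid(row = seq_len(n_side), col = seq_len(n_side))
long_img$image_id <- 1L
long_img$value <- as.integer(long_img$row == long_img$col) * 255L  # diagonal stroke

nrow(long_img)  # 784 pixels = 28 * 28

# Reshape to a 28 x 28 matrix, using (row, col) pairs as matrix indices
img <- matrix(0L, nrow = n_side, ncol = n_side)
img[cbind(long_img$row, long_img$col)] <- long_img$value

dim(img)
```

A matrix in this form can be passed to `image()` for a quick visual check, though the orientation may need flipping depending on how rows and columns are encoded.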

OECD Better Life Index

Well-being indicators for 36 countries across 11 dimensions (housing, income, jobs, education, environment, health, etc.).

data(oecd_bli)
data(oecd_bli_indicators)

# See available indicators
oecd_bli_indicators$indicator
#>  [1] "Stakeholder Engagement"    "Voter Turnout"
#>  [3] "Air Pollution"             "Water Quality"
#> ...

Olympic Running Data

Fastest running times for Olympic track events from 1896 to 2016, for men and women across seven distances.

data(olympic_running)
head(olympic_running)
#> # A tibble: 6 × 4
#>    year length sex   time
#>   <int>  <int> <chr> <dbl>
#> 1  1896    100 male   12
#> ...

Wine Quality Data

Physicochemical measurements for Portuguese Vinho Verde wines, with quality ratings from expert tasters.

data(wine_quality)
dim(wine_quality)
#> [1] 6497   13

Data Sources

Topic	Source
Handedness by Sex	Freedman, Pisani, Purves (4e)
LearningGraph	IC DSci-CRG & Workera.ai
MNIST subsets	Yann LeCun's MNIST Database
NB10 Repeated Weighings	Freedman, Pisani, Purves (4e)
OECD Better Life Index	OECD Better Life Index (2015)
Olympics	Olympics.com via tsibbledata
Polling and Elections	Freedman, Pisani, Purves (4e)
Portacaval Shunt	Freedman, Pisani, Purves (4e)
Salk Vaccine Trial	Freedman, Pisani, Purves (4e)
UC Berkeley Admissions	Freedman, Pisani, Purves (4e)
Wine Quality	UCI Machine Learning Repository

Citations

If you use these datasets, please cite the original sources:

Handedness by Sex

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). W.W. Norton & Company.

LearningGraph

Office of the Director of National Intelligence (2023). Competency Resource Guide for Data Science (UNCLASSIFIED).

Structure inspired by the Workera.ai skills intelligence platform: https://workera.ai/

MNIST subsets

LeCun, Y., Cortes, C., & Burges, C. J. C. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/

NB10 Repeated Weighings

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 6, Sec. 1. W.W. Norton & Company.

OECD Better Life Index

OECD (2015). OECD Better Life Index. https://www.oecdbetterlifeindex.org/

Olympics

tsibbledata: Diverse Datasets for 'tsibble'. https://cran.r-project.org/package=tsibbledata

Polling and Elections

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 19. W.W. Norton & Company.

Portacaval Shunt

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 2. W.W. Norton & Company.

Salk Vaccine Trial

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 1. W.W. Norton & Company.

UC Berkeley Admissions

Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.), Ch. 1, Sec. 4. W.W. Norton & Company.

Also: Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404.

Wine Quality

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016

License

MIT License
