-
Notifications
You must be signed in to change notification settings - Fork 5
DBmaps R Package Development
The goal of this project is to develop a software package (tentatively named DBmaps) to facilitate the analyses of data from a relational database. Joining/merging tables is often fundamental to this work so that information from different tables can be linked in the same resulting data structure. Currently, this requires significant practical knowledge of the tables, including the ways in which information can be merged and the key variables that link pairs of tables together. Especially in a complex database, the relevant information may only be indirectly linked, potentially through a chain of intermediate tables.
To that end, this project seeks to construct a set of tools that will automate some of these tasks. Producing a map of the database will help to show the ways in which information can be joined/merged between pairs of tables. This map can be specified by the user, which serves as documentation about their knowledge of the database. Portions of the map can also be potentially filled in through an automatic identification of tables that can be joined and the best candidates for linking variables. A search algorithm can be used to explore the map. The eventual goal is for a user to select the variables from specific tables and to automate the production of the resulting data set.
Within R, there are efficient tools in place for joining/merging tables. However, no package exists to automatically explore the relations between tables in a database, to construct a map of these relations, or to automate joining/merging of indirectly linked variables through intermediate tables.
The package will include development of a number of important functions for database exploration, mapping, and merging.
-
create.DB.map: Construct a data.frame or data.table object. Each row includes tables x and y, an indicator of whether they can be joined/merged, the linking variables by.x and by.y. Optional columns will include the type of join/merge (left,right,inner,outer) and potential aggregations within each table (x and y), such as grouped sums. This is a new function that would be coded in R from scratch.
-
map.DB.pairs: Starting with a list of tables, search each pair to establish whether a) the two tables can be joined/merged, and b) the linking variables by.x and by.y that would facilitate the merge. Linking variables would be suggested based upon either the similarities of names or containing the same unique values. This is a new function that would be coded in R from scratch.
-
multi.merge: Starting with a list of tables, linking variables, and joining types, perform sequential merging of all of the tables. This would reduce the coding labor of calling nested merges, such as merge(a, merge(b, merge(c, merge(d, merge(e, f))))). This function would be coded in R based upon enhancing a working prototype.
-
materialized.view: Starting with inputs of a) the user's selected variables from specific tables in the database and b) a mapping of the database as constructed by create.DB.map(), automatically generate a materialized view. This involves using the database map to merge tables, potentially through intermediate links, and also potentially while performing aggregations such as grouped computations on some tables. This is a new function that would be coded in R from scratch.
-
visualize.DB.map: Given a mapping of the database as constructed by create.DB.map(), build a visualization that shows the relationships between tables, links, and potential aggregations. This is a new function that would be coded in R from scratch.
Producing an automated mapping of a relational database will help facilitate exploration of the data and documentation of the links. This will help analysts more quickly understand the structure of the database and the ways in which data from various tables can be brought together into materialized views. This will also reduce the coding labor and potential for mistakes in joining/merging multiple tables.
- EVALUATING MENTOR: David Shilane [email protected] is the author of 7 R packages (formulaic, getDTeval, DTwrappers, DTwrappers2, simitation, nRegression, and tvtools).
- Co-mentor Toby Dylan Hocking has experience in 10+ years of R-GSOC.
Contributors, please do one or more of the following tests before contacting the mentors above.
Easy: Consider the following data.table objects:
library(data.table)
students <- data.table(id = c("A", "B", "C", "D"), Birthdate = c("2001-08-04", "2002-04-28", "2002-06-13", "2002-02-09"))
scores <- data.table(id = c("B", "C", "E"), homework = c(87,94,92), quiz = c(91, 90, 87))
Use the merge() function to perform the four types of merges. Show the code and the outputs.
Medium: Write a function that will merge tables x and y and then merge this combined table with another table z. Make sure your function includes any relevant parameters.
Hard: Can the contributor write a package with Rd files, tests, and vignettes? Show examples of any work you've contributed to in building an R package.
Contributors, please post a link to your test results here.
EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
- Sagar Udasi, GitHub Profile, Test Results
- Divendra Yadav, GitHub Profile, Test Results
- Akshat Maurya, GitHub Profile, Test Results
- Disha Davey, GitHub Profile, Test Results
- Yue(Zelda) Zhang, GitHub Profile, Test Results
Please do not edit this footer! Instead click Edit button in upper right.