grepreaper R Package Development

ayraa.ai edited this page Feb 12, 2025 · 2 revisions

Background

This project aims to develop an R package, grepreaper, for advanced file reading. R's packages already offer many tools for basic file reading, with parameters for tasks such as skipping preliminary rows or limiting the read to a specified number of records. However, linking file reading to grep at the command line enables a number of additional capabilities, including:

  • Counting the number of records prior to reading the data.
  • Extracting only the records that match a specified pattern.
  • Reading and aggregating data from multiple files simultaneously (assuming identical structures).

Within the data.table package, the fread() function lets users supply their own command-line statements through its cmd parameter. This allows users to run their own grep commands or use other tools. However, command-line syntax is specialized and unfamiliar to many R users. To that end, the grepreaper package aims to develop wrapper functions that make these advanced capabilities available without requiring users to learn command-line tools. With grepreaper, everyone can reap the rewards of programming with grep.
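As a minimal sketch of the mechanism described above, the following example writes a small temporary CSV file and passes a grep statement to fread() through its cmd argument. The file contents, column names, and pattern are made up for illustration. The example assumes that grep is available on the system PATH (for example on Linux or macOS).

```r
library(data.table)

# Hypothetical example data written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,grade", "1,VS1", "2,SI2", "3,VS2", "4,I1"), tmp)

# grep keeps only the lines containing "VS". The header line does not match
# the pattern, so the column names are supplied explicitly.
cmd <- sprintf("grep 'VS' %s", shQuote(tmp))
matched <- fread(cmd = cmd, header = FALSE, col.names = c("id", "grade"))

print(matched)
```

Because grep filters the raw lines before R parses them, only the matching records are ever loaded into memory.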

Related work

To our knowledge, there are no existing packages that build grep wrappers for advanced file reading. The authors of this package are also working on a companion package for file reading with AWK at the command line.

Details of your coding project

The package will include development and revision of a number of functions:

  • grep.count: Count the number of relevant rows of data in one or more files, overall or matching search patterns.

  • grep.read: Read and aggregate data from one or more files while allowing for pattern matching.

The coding process will use R to a) construct the grep statement with string-formatting techniques, b) build variations of that statement for different types of pattern matching (such as full or partial patterns, along with regular or inverted matches), and c) read the files through data.table's fread() function. Ideally, these functions will also provide options to show a) the resulting counts or data, b) the command-line grep statement, or c) both. You will begin with some partially developed code that can be revised and expanded upon.
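The first two steps above could be sketched roughly as follows. The helper build_grep_cmd() and its arguments are hypothetical names chosen for illustration. They are not part of grepreaper or of the partially developed code mentioned above. The sketch only assembles the statement as a string, which could then be shown to the user or passed to fread() through its cmd parameter.

```r
# Hypothetical helper: assemble a grep statement from R arguments.
build_grep_cmd <- function(pattern, files, count = FALSE, invert = FALSE,
                           fixed = FALSE) {
  flags <- c(
    if (count) "-c",   # report the number of matching lines
    if (invert) "-v",  # select lines that do NOT match the pattern
    if (fixed) "-F"    # treat the pattern as a literal string, not a regex
  )
  # Quote the pattern and file names so the shell treats each as one token.
  quoted <- shQuote(c(pattern, files), type = "sh")
  paste(c("grep", flags, quoted), collapse = " ")
}

# File names below are placeholders; nothing is read from disk here.
cmd_count <- build_grep_cmd("VS", "diamonds.csv", count = TRUE)
cmd_invert <- build_grep_cmd("VS", c("a.csv", "b.csv"),
                             invert = TRUE, fixed = TRUE)

print(cmd_count)
print(cmd_invert)
```

Keeping the statement construction separate from its execution makes it straightforward to offer the "show the command" option. One detail a full implementation would need to handle: when grep is given several files, it prefixes each output line with the file name unless told otherwise (for example with -h in GNU and BSD grep).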

Expected impact

This package will create user-friendly wrapper functions that facilitate connections between R and command-line grep. These tools will enable R users to read and aggregate data from multiple files, search a range of files for matching patterns, and identify the size of the potential data prior to reading. This will give many R users more tools for file reading.

Mentors

  • EVALUATING MENTOR: David Shilane [email protected] is the author of 7 R packages (formulaic, getDTeval, DTwrappers, DTwrappers2, simitation, nRegression, and tvtools).
  • Co-mentor Toby Dylan Hocking has 10+ years of experience with R-GSOC.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

Easy: Consider the diamonds data from the ggplot2 package:

library(ggplot2)
library(data.table)
data(diamonds)

Within R, use a call to the grep() function to identify rows of data that match the pattern 'VS'. Count the number of qualifying rows.

Medium: Within the data.table package, write a call to fread() that feeds a command-line grep statement into the cmd parameter. Use this to count the number of rows that match the pattern 'VS'.

Hard: Demonstrate that you can write a package with Rd files, tests, and vignettes. Show examples of any work you have contributed to building an R package.

Solutions of tests

Contributors, please post a link to your test results here.

EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.
  1. Aarya Pandey, GitHub Profile, Test Results