Git Workflow for Large Drill Projects

Experience has shown that Apache Drill places unusual demands on the development of larger features.

All work must be presented as a complete pull request, but
Pull requests must be small.

Clearly, these two requirements are in conflict for other than the most trivial projects. Do too much work in a single PR, and folks complain bitterly about the amount of code to review, and reviews stretch to 6 or 8 weeks in length, greatly slowing work. On the other hand, divide a project up into smaller PRs and folks postpone doing the review until they can see the entire picture.

This page attempts to work out a process to handle large contributions.

First Attempt: Drill-on-YARN All-in-One

The middle of 2016 saw an attempt to contribute the Drill-on-YARN (DoY) project to Apache Drill. DoY is about 14K lines of new code. This turned out to be far too large for the Drill committers to review.

We started to do a process of walking people through each file. But, with a firm understanding of YARN and DoY, the reviewers rapidly got lost and the process bogged down. Eventually, we elected to keep DoY as closed source until we determined an alternative solution.

Moving forward, we will likely submit DoY as a "contrib" project: it still needs review, but is not considered part of the core Drill and so will not need the Drill committers to become experts on YARN.

Second Attempt: External Sort Refactoring

2017 saw the release of a refactoring of the external sort to better manage memory and spilling. This project required a large refactoring of existing code. Work was broken into independent PRs for some tasks, with a larger PR for the external sort work. The small PRs took only about a month each to commit, but the bit PR took far longer. A key factor was that GitHub provides no good tools to show where code was simply moved from one file to another; reviewers are forced to review the code as if it was new.

Third Attempt: External Sort Unit Testing

2017 Q2 saw another large chunk of work for the external sort: a second refactoring to allow unit testing, along with the unit test framework. Learning from previous attempts, this cycle did the following:

Released the work as a series of PRs, each dependent on previous PRs.
Released the external sort work as the "tip" PR that can be reviewed in parallel, but run only when the previous PRs are committed.

The idea is to have a series of, say, four PRs: A, B, C, and D. A provided the foundation for unit testing, B built on A to add a second layer of framework. C built on B to add the actual test classes, and D build on C to refactor and test the external sort. In this model, A is a "prerequisite" for B, C and D.

In this model, A builds on Master. B also builds on Master, but since it needs A, will not compile until A is committed. And so on. The advantage is that the work is broken down into four smaller chunks (overcoming the DoY issue.)

In practice, however, reviews still were sequential: most reviewers would start only when the prerequisite PRs were committed. The final PR (the external sort work) will take about 8 weeks from issuing the PR to commit.

Late in the process we identified a slightly different structure that may work better. A is based on master, B on A, C on B and so on. This allows each PR to be built and run (unlike the previous structure.) This is the structure to be used in the next attempt.

To support this, a personal "Dev" branch existed in my private repo. All work was done on that branch, then files were pulled from that branch to create the PR branches above. Doing so is tedious: over 100 files changed and each had to be mapped to a PR. In some cases, different changes within the same file had to be mapped to different PRs. Two tools help:

For whole files, create the target branch, say branch A. Use git checkout Dev <file path> to copy the target file from the Dev branch to the PR branch. Repeat for each of the four PR branches.
When a single file has changes that must be split across PR branches, use the Eclipse compare tool to compare the (unchanged) branch A file with the Dev branch version. Merge only the changes needed for A.

The key cost with the above approach is that, during the 6-8 weeks of review, code changes. The process used:

Make the change in the Dev branch and test.
Use the above techniques to pull changes (entire files or just changed lines) from Dev to the proper PR branch.

Needless to say, doing the above turned out to be very tedious and error-prone. Many, many hours were spent jockeying code from one place to another -- time that should have been spend adding value to Drill itself.

Fourth Attempt: Fixing Memory Fragmentation

The memory fragmentation fix project will submit a number of PRs as we work on various layers of the system. As usual, we anticipate 3-4 weeks duration for each PR. But, the project must move fast. We will divide the work into smallish PRs, but they will depend on one another as in the external sort unit test work. We will leverage the idea of layered branches as done at the end of the previous cycle.

This project depends on PRs done previously that have been outstanding for 3-4 weeks, and will likely continue to be outstanding for up to another month.

For development:

Create a "Dev3" branch to hold the work.
Create a "Dev3-Baseline" branch to hold work that Dev3 depends upon. This branch will be created by applying the prequsite PRs as patches.
Dev3 will be rebased onto Dev3-Baseline, and rebased again each time Dev3-Baseline changes.

For PRs:

As before, files will be pulled from Dev3 to create PRs assigned to specific Drill JIRA tickets.
Ideally, each PR will be independent, based only on master.
Upper layers will have prerequisites. If branch B depends on A, it will be created as a branch from A to allow reviewers to build the code.
Work will be ongoing. Likely, during the 3-4 weeks of review, code will change. We will incur the overhead cost of merging changes from Dev3 into each PR branch, then retesting each PR branch.

Problems may occur if we end up with a diamond: B and C depend on A, and D depends on B and C. A diamond may force a "hard sync": D cannot be created until either B or C are committed to reduce the perquisite branches to just one.

Details

To create Dev3-Baseline

Create the Dev3-Baseline branch from master.
Use these instructions to create a patch:

git checkout DRILL-5423
git format-patch master --stdout > /tmp/5423.patch
git checkout Dev3-Baseline
git git am --signoff <  /tmp/5423.patch

Push the new branch to origin.
Rebase Dev3 onto Dev3-Baseline:

git rebase --onto Dev3-Baseline master Dev3

Results

The above did not work as intended. Instead of having Dev3 be based on Dev3-Baseline, the Dev3-Baseline commits showed up in the Dev3 history. Not sure if I did something wrong or if the concept is flawed...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly