
[GOBBLIN-2159] Adding support for partition level copy in Iceberg distcp #4058

Merged
33 commits merged into apache:master on Oct 23, 2024

Conversation

Contributor

@Blazer-007 Blazer-007 commented Sep 22, 2024

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • [✅] Here are some details about my PR, including screenshots (if applicable):
    • Currently, Iceberg distcp provides no way to specify which partitions to copy. This PR adds support for partition-level copy in Iceberg distcp.
    • It supports partition copy between two different Iceberg tables, i.e., tables with different UUIDs.
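Conceptually, partition-level copy narrows the set of data files to those whose partition values match a configured predicate before any copy entities are generated. A toy, self-contained illustration of that selection step; the `DataFile` record and the predicate here are hypothetical stand-ins for Iceberg's `DataFile` and the PR's partition filter predicates, not the actual Gobblin API:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Toy model of partition-level selection: keep only data files whose
 *  partition values satisfy a predicate. Hypothetical stand-in for filtering
 *  Iceberg DataFiles with the new partition filter predicates. */
public class PartitionFilterDemo {

  /** Hypothetical minimal data-file record: path plus partition values. */
  static final class DataFile {
    final String path;
    final Map<String, String> partition;
    DataFile(String path, Map<String, String> partition) {
      this.path = path;
      this.partition = partition;
    }
  }

  /** Returns the paths of the data files whose partition matches the filter. */
  static List<String> selectPaths(List<DataFile> files,
      Predicate<Map<String, String>> partitionFilter) {
    return files.stream()
        .filter(f -> partitionFilter.test(f.partition))
        .map(f -> f.path)
        .collect(Collectors.toList());
  }
}
```

Only the files surviving the filter would then be turned into copy entities, which is what keeps the copy scoped to the requested partitions.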

Tests

  • [✅] My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • IcebergPartitionDatasetTest
    • IcebergOverwritePartitionsStepTest
    • IcebergTableTest [ Updated ]
      - testGetPartitionSpecificDataFiles()
      - testReplacePartitions()
    • IcebergMatchesAnyPropNamePartitionFilterPredicateTest
    • IcebergPartitionFilterPredicateUtilTest

Commits

  • [✅] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@Blazer-007 Blazer-007 force-pushed the iceberg_distcp_partition_copy_0 branch from b4f6369 to d8356e1 Compare September 24, 2024 12:32
Contributor

@phet phet left a comment

this is a great start! mostly suggestions to leverage a bit more of the existing classes (rather than creating near clones) and also to simplify some interfaces (esp. for the partition filter predicates) to take in specific params, rather than Properties. given the latter may hold just about anything, the API "contract" they define is weaker than we'd want.

Comment on lines +119 to +123
CopyableFile fileEntity = CopyableFile.fromOriginAndDestination(
actualSourceFs, srcFileStatus, targetFs.makeQualified(destPath), copyConfig)
.fileSet(fileSet)
.datasetOutputPath(targetFs.getUri().getPath())
.build();
Contributor

you skip first doing this, like in IcebergDataset:

      // preserving ancestor permissions till root path's child between src and dest
      List<OwnerAndPermission> ancestorOwnerAndPermissionList =
          CopyableFile.resolveReplicatedOwnerAndPermissionsRecursively(actualSourceFs,
              srcPath.getParent(), greatestAncestorPath, copyConfig);

is that intentional? do you feel it's not necessary or actually contra-indicated?

Contributor Author

In IcebergDataset the table paths are exactly the same, since the table UUIDs match between source and destination; here they can differ, so copying permissions, at least in this first draft, is not necessary I believe.

Even if there is a need, we have to make sure the ancestor path and parent path are the ones we want; that's why I have removed it for now.

Comment on lines 130 to 133
// Adding this check to avoid adding post publish step when there are no files to copy.
if (CollectionUtils.isNotEmpty(destDataFiles)) {
copyEntities.add(createPostPublishStep(destDataFiles));
}
Contributor

I agree this is one difference with IcebergDataset::generateCopyEntities, which always wants to add its post-publish step. (but it shouldn't be hard to refactor to isolate this difference)

* @throws IOException if an I/O error occurs
*/
@Override
Collection<CopyEntity> generateCopyEntities(FileSystem targetFs, CopyConfiguration copyConfig) throws IOException {
Contributor

@phet phet Sep 24, 2024

this impl is really, really similar to the one it's based on in its base class. deriving from a class and then overriding methods w/ only small changes is pretty nearly cut-and-paste code. sometimes it's inevitable, but let's avoid when we can. in this case, could we NOT override this method, but only GetFilePathsToFileStatusResult getFilePathsToFileStatus(...) so this derived class's version runs the new code instead:

    IcebergTable srcIcebergTable = getSrcIcebergTable();
    List<DataFile> srcDataFiles = srcIcebergTable.getPartitionSpecificDataFiles(this.partitionFilterPredicate);
    List<DataFile> destDataFiles = getDestDataFiles(srcDataFiles);
    Configuration defaultHadoopConfiguration = new Configuration();

    for (FilePathsWithStatus filePathsWithStatus : getFilePathsStatus(srcDataFiles, destDataFiles, this.sourceFs)) {
...

Contributor Author

I will list my reasons here -

  1. The IcebergDataset implementation assumes that srcPath and destPath are the same, which is not the case here. If you look at the code, we are using srcPath and srcFileStatus, but here those need to change to destPath & srcFileStatus for readability and maintainability.
  2. Currently I have added just ReplacePartitionStep as the post-publish step, but IcebergRegisterStep also needs to be added based on the schema-validation scenario, which I will raise as a different PR because it needs proper validation so that we are not corrupting data files on the dest table.
  3. I am not fully convinced about copying ancestor permissions, whether it is even required or not; although I did try making it work by changing the ancestor path and parent path, it wasn't working, so removing it is a must for now.
  4. If I try to override just GetFilePathsToFileStatusResult getFilePathsToFileStatus(...), then we need to override the data class GetFilePathsToFileStatusResult too, as we need the data files along with destPath and srcFileStatus.

To conclude -
the reader should understand whether it is actually srcPath or destPath while creating the copyable file,
the replace-partitions commit step needs to be added along with the register step (based on a condition),
and copying permissions should be removed for now.
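The refactor under discussion is the classic template-method shape: the base class keeps the copy-entity loop and the subclass overrides only the file-listing hook. A minimal stand-alone sketch of that shape, with simplified stand-in types rather than the real IcebergDataset classes:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Template-method sketch: generateCopyEntities stays in the base class and
 *  calls an overridable hook, so a subclass can swap the file-listing logic
 *  without cloning the surrounding loop. All names here are simplified
 *  stand-ins for the IcebergDataset hierarchy. */
class BaseDataset {
  /** The shared algorithm: list files, then turn each into a "copy entity". */
  final List<String> generateCopyEntities() {
    return getFilePaths().stream()
        .map(path -> "copy:" + path)
        .collect(Collectors.toList());
  }

  /** Hook point; the base behavior lists the whole table. */
  List<String> getFilePaths() {
    return List.of("table/file-a", "table/file-b");
  }
}

class PartitionDataset extends BaseDataset {
  /** The partition-level variant overrides only the hook. */
  @Override
  List<String> getFilePaths() {
    return List.of("table/part=2024/file-a");
  }
}
```

The trade-off raised above still applies: the hook's return type must carry everything the shared loop needs (here a path, in the real code also data files and dest paths), which is why the author found the override alone insufficient.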

Contributor

@phet phet left a comment

overall looking good. part 1 of 2 done on this re-review... will return

Comment on lines 254 to 256
} catch (IOException e) {
log.warn("Failed to read manifest file: {} " , manifestFile.path(), e);
}
Contributor

iceberg is atomic/transactional, so I really don't agree w/ swallowing exceptions and still proceeding onward when the table is corrupted. that has the potential for us to lay even more corruption on top of that...

please explain if you see a genuine argument for ignoring errors.

Contributor Author

Yeah, completely agree with your suggestion; somehow I missed it. Let me correct it by failing the copy with proper logging.
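The agreed fix amounts to failing fast with context instead of warn-and-continue, since a partial file listing could lead to corrupting the destination table. A self-contained sketch of the pattern; the reader function and types are simplified stand-ins for the Iceberg manifest-reading code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: propagate a manifest read failure with context rather than
 *  swallowing it and proceeding with an incomplete file list. The manifest
 *  reader here is a hypothetical stand-in. */
public class ManifestDemo {

  static List<String> collectDataFilePaths(List<String> manifestPaths) throws IOException {
    List<String> dataFiles = new ArrayList<>();
    for (String manifest : manifestPaths) {
      try {
        dataFiles.addAll(readManifest(manifest));
      } catch (IOException e) {
        // fail the whole copy with a descriptive message, never continue partially
        throw new IOException("Failed to read manifest file: " + manifest, e);
      }
    }
    return dataFiles;
  }

  // Hypothetical manifest reader.
  static List<String> readManifest(String path) throws IOException {
    if (path.startsWith("corrupt")) {
      throw new IOException("unreadable manifest");
    }
    return List.of(path + "/file-0.parquet");
  }
}
```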

* @return the index of the partition column if found, otherwise -1
* @throws IllegalArgumentException if the partition transform is not supported
*/
public static int getPartitionColumnIndex(
Contributor

this single static seems closely related enough to IcebergMatchesAnyPropNamePartitionFilterPredicate that it could reasonably live there as a public static (eliminating the need for an additional separate class).

Contributor Author

Currently it looks like that, but in the future we will need more filter predicates, and every filter will need the partition column index, so I believe keeping it separate for now should be fine. Maybe we can put this in the factory class itself, or convert this class to a factory class.

Contributor

@phet phet left a comment

looks close

Comment on lines 167 to 168
Assert.assertEquals(copyEntities.size(), 2);
verifyCopyEntities(copyEntities, true);
Contributor

the path of every copy entity needs validation. pass that in the way we did in IcebergDatasetTest::verifyCopyEntities(Collection<CopyEntity> copyEntities, List<String> expected)

once you do, copyEntities.size() validation can and should be encapsulated within "verify"

Contributor Author

@Blazer-007 Blazer-007 Oct 22, 2024

I am doing that, but just in a different way: since we are adding the UUID at runtime, we can't have the expected path beforehand. Please have a look at the function -

  private static void verifyCopyEntities(Collection<CopyEntity> copyEntities, int expectedCopyEntitiesSize,
      boolean sameSrcAndDestWriteLocation) {
    Assert.assertEquals(copyEntities.size(), expectedCopyEntitiesSize);
    String srcWriteLocationStart = SRC_FS_URI + SRC_WRITE_LOCATION;
    String destWriteLocationStart = DEST_FS_URI + (sameSrcAndDestWriteLocation ? SRC_WRITE_LOCATION : DEST_WRITE_LOCATION);
    String srcErrorMsg = String.format("Source Location should start with %s", srcWriteLocationStart);
    String destErrorMsg = String.format("Destination Location should start with %s", destWriteLocationStart);
    for (CopyEntity copyEntity : copyEntities) {
      String json = copyEntity.toString();
      if (IcebergDatasetTest.isCopyableFile(json)) {
        String originFilepath = IcebergDatasetTest.CopyEntityDeserializer.getOriginFilePathAsStringFromJson(json);
        String destFilepath = IcebergDatasetTest.CopyEntityDeserializer.getDestinationFilePathAsStringFromJson(json);
        Assert.assertTrue(originFilepath.startsWith(srcWriteLocationStart), srcErrorMsg);
        Assert.assertTrue(destFilepath.startsWith(destWriteLocationStart), destErrorMsg);
        String originFileName = originFilepath.substring(srcWriteLocationStart.length() + 1);
        String destFileName = destFilepath.substring(destWriteLocationStart.length() + 1);
        Assert.assertTrue(destFileName.endsWith(originFileName), "Incorrect file name in destination path");
        Assert.assertTrue(destFileName.length() > originFileName.length() + 1,
            "Destination file name should be longer than source file name as UUID is appended");
      } else{
        IcebergDatasetTest.verifyPostPublishStep(json, OVERWRITE_COMMIT_STEP);
      }
    }
  }

Contributor

@phet phet Oct 23, 2024

I understand what you're doing w/ the UUID - which is a good thing to validate - but the difference in your verifyCopyEntities method def is that it blindly verifies the length of the list without knowing specifically which paths should be there (i.e. it's missing a List<String> expected parameter).

why not take List<String> expectedSrcFilePaths to verify against each copyable file's getOriginFilePathAsStringFromJson? then in addition continue to validate the relationship between each origin file path and its getDestinationFilePathAsStringFromJson.

Contributor Author

Done
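The resolved shape of the suggestion, reduced to its essentials: verify against an explicit list of expected source paths (so the size check falls out of the comparison), and still check the origin-to-destination relationship per file. A self-contained sketch using plain `origin|destination` strings as hypothetical stand-ins for the CopyEntity JSON helpers:

```java
import java.util.List;

/** Sketch of verifying copy entities against expected source paths.
 *  Each "copy entity" here is just the string "origin|destination";
 *  the real code extracts these from CopyEntity JSON. */
public class VerifyDemo {

  static void verifyCopyEntities(List<String> copyEntities, List<String> expectedSrcFilePaths) {
    // size check is implied by matching every entity against the expected list
    if (copyEntities.size() != expectedSrcFilePaths.size()) {
      throw new AssertionError("unexpected number of copy entities");
    }
    for (String entity : copyEntities) {
      String[] parts = entity.split("\\|", 2);
      String origin = parts[0];
      String dest = parts[1];
      if (!expectedSrcFilePaths.contains(origin)) {
        throw new AssertionError("unexpected origin path: " + origin);
      }
      // destination keeps the origin file name; a UUID is prepended upstream
      String originName = origin.substring(origin.lastIndexOf('/') + 1);
      if (!dest.endsWith(originName)) {
        throw new AssertionError("destination does not preserve file name: " + dest);
      }
    }
  }
}
```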

@@ -67,4 +71,6 @@ protected String getDatasetDescriptorPlatform() {
}

protected abstract TableOperations createTableOperations(TableIdentifier tableId);

protected abstract Table loadTableInstance(TableIdentifier tableId);
Contributor

for good measure you could also make IcebergTable.TableNotFoundException a declared/checked exception here.

I'm tempted to re-situate the exception as IcebergCatalog.TableNotFoundException, but I don't want two classes w/ the same semantics - and renaming public interfaces is probably too late... so I'll make peace with the current name

Contributor Author

As discussed, not throwing here; instead, catching NoSuchTableException in BaseIcebergCatalog::openTable and throwing IcebergTable.TableNotFoundException from there.
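This is exception translation at the catalog boundary: a library-specific unchecked exception becomes the project's own checked one. A self-contained sketch with stub exception classes standing in for Iceberg's NoSuchTableException and Gobblin's IcebergTable.TableNotFoundException (the loader and its signature are hypothetical):

```java
/** Sketch of translating a third-party "not found" exception into the
 *  project's own checked exception at the catalog boundary. All types
 *  here are simplified stand-ins, not the actual Gobblin/Iceberg classes. */
public class CatalogDemo {

  /** Stand-in for org.apache.iceberg.exceptions.NoSuchTableException (unchecked). */
  static class NoSuchTableException extends RuntimeException {
    NoSuchTableException(String msg) { super(msg); }
  }

  /** Stand-in for IcebergTable.TableNotFoundException (checked). */
  static class TableNotFoundException extends Exception {
    TableNotFoundException(String tableId) { super("table not found: " + tableId); }
  }

  /** Callers now handle one well-defined checked exception instead of a
   *  library-specific unchecked one. */
  static String openTable(String tableId) throws TableNotFoundException {
    try {
      return loadTableInstance(tableId);
    } catch (NoSuchTableException e) {
      throw new TableNotFoundException(tableId);
    }
  }

  // Hypothetical loader; throws when the table does not exist.
  static String loadTableInstance(String tableId) {
    if (!"db.known_table".equals(tableId)) {
      throw new NoSuchTableException(tableId);
    }
    return "table:" + tableId;
  }
}
```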

Comment on lines 196 to 200
} catch (IOException e) {
String errMsg = String.format("~%s~ Failed to get file status for path : %s", this.getFileSetId(), srcPath);
log.error(errMsg);
throw new RuntimeException(errMsg, e);
}
Contributor

I really wish java.util.function.* played along better w/ checked exceptions... but that's clearly not the case... *sigh*

throwing IOException is actually a key part of the FileSet "contract", so substituting an unchecked RuntimeException (that no caller expects and would NOT be looking out for) is not something we ought to do at this late stage.

instead, either write this iteratively (using for-each loop) or follow IcebergDataset's use of CheckedExceptionFunction.wrapToTunneled

try {
  ...
} catch (CheckedExceptionFunction.WrappedIOException wrapper) {
  wrapper.rethrowWrapped();
}

the code there actually uses:

copyConfig.getCopyContext().getFileStatus(targetFs, new Path(pathStr)).isPresent()

for caching, which shouldn't be necessary here, given IcebergTable::getPartitionSpecificDataFiles examines only a single snapshot.
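The tunneling idiom referenced above can be shown self-contained with only the JDK: wrap the checked IOException in an unchecked carrier so it can cross a `java.util.function` boundary, then unwrap and rethrow it as a checked exception on the other side. The names here are illustrative stand-ins, not Gobblin's CheckedExceptionFunction:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Minimal stand-in for the "wrap to tunneled" idiom: a checked IOException
 *  is smuggled through a stream pipeline inside UncheckedIOException, then
 *  rethrown as the original checked exception, preserving the caller's
 *  `throws IOException` contract. */
public class TunnelDemo {

  interface IOFunction<T, R> {
    R apply(T t) throws IOException;
  }

  /** Adapts a throwing function to java.util.function.Function. */
  static <T, R> Function<T, R> wrapToTunneled(IOFunction<T, R> f) {
    return t -> {
      try {
        return f.apply(t);
      } catch (IOException e) {
        throw new UncheckedIOException(e); // tunnel the checked exception
      }
    };
  }

  /** Caller keeps its checked `throws IOException` contract intact. */
  static List<String> statusAll(List<String> paths) throws IOException {
    try {
      return paths.stream()
          .map(wrapToTunneled(TunnelDemo::fakeFileStatus))
          .collect(Collectors.toList());
    } catch (UncheckedIOException wrapper) {
      throw wrapper.getCause(); // rethrow the original IOException
    }
  }

  // Hypothetical stand-in for FileSystem::getFileStatus.
  static String fakeFileStatus(String path) throws IOException {
    if (path.isEmpty()) {
      throw new IOException("no such path");
    }
    return "status:" + path;
  }
}
```

This is why the RuntimeException in the quoted snippet is avoidable: the wrapper never escapes the method, so callers still see only the IOException the FileSet contract promises.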


Comment on lines 250 to 257
for (ManifestFile manifestFile : dataManifestFiles) {
if (growthMilestoneTracker.isAnotherMilestone(knownDataFiles.size())) {
log.info("~{}~ for snapshot '{}' - before manifest-file '{}' '{}' total known iceberg datafiles", tableId,
currentSnapshotId,
manifestFile.path(),
knownDataFiles.size()
);
}
Contributor

I agree this makes more sense here, given the synchronous reading of every manifest files happens within this method, rather than in the style of the Iterator<IcebergSnapshotInfo> returned by IcebergTable::getIncrementalSnapshotInfosIterator.

that said, I doubt we should still log tracked growth as this very same list is later transformed in IcebergPartitionDataset::calcDestDataFileBySrcPath. all the network calls are in this method, rather than over there, so the in-process transformation into CopyEntities should be quite fast. maybe just log once at the end of calcDestDataFileBySrcPath

Contributor Author

Yes, that seems a valid approach; let me remove the growthMilestoneTracker from that function.

@@ -224,23 +230,23 @@ private static void setupDestFileSystem() throws IOException {
Mockito.when(targetFs.getFileStatus(any(Path.class))).thenThrow(new FileNotFoundException());
}

private static List<DataFile> createDataFileMocks() throws IOException {
List<DataFile> dataFiles = new ArrayList<>();
private static Map<String, DataFile> createDataFileMocksBySrcPath(List<String> srcFilePaths) throws IOException {
Contributor

I really like how returning this Map allows you to be so succinct at every point of use:

Map<String, DataFile> mockDataFilesBySrcPath = createDataFileMocksBySrcPath(srcFilePaths);
Mockito.when(srcIcebergTable.getPartitionSpecificDataFiles(Mockito.any())).
    thenReturn(new ArrayList<>(mockDataFilesBySrcPath.values()));

... // (above just a `.values()` and simply a `.keySet()` below)

verifyCopyEntities(copyEntities, new ArrayList<>(mockDataFilesBySrcPath.keySet()), false);

nice work!

copyEntities.add(createOverwritePostPublishStep(destDataFiles));
}

log.info("~{}~ generated {} copy--entities", fileSet, copyEntities.size());
Contributor

the two dashes between copy--entities seems like a typo

Contributor

@phet phet left a comment

excellent work here!

@phet phet merged commit 4b639f6 into apache:master Oct 23, 2024
6 checks passed
@Blazer-007 Blazer-007 deleted the iceberg_distcp_partition_copy_0 branch October 24, 2024 19:37