Flink: Maintenance - TableManager + ExpireSnapshots #11144
base: main
Conversation
@stevenzwu: This PR has become quite sizeable. I still think that it is better to keep it as one to provide context for some of the decisions made during the definition. If you have time we could discuss offline the review strategy, and whether to split this PR into smaller ones. Thanks,
Force-pushed from a1dabe5 to 96322c5
import org.apache.iceberg.flink.maintenance.operator.TriggerEvaluator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

public abstract class MaintenanceTaskBuilder<T extends MaintenanceTaskBuilder> {
should this be marked as @Internal or even package private?
Do we want to allow the users to create their own maintenance tasks?
Let's keep this package private first so that it is easier to evolve the class, especially during the early stages.
When there is a real need in the future, we can always make it public.
Let's mark it experimental then. I definitely would like to provide a way for the users to extend the maintenance tasks.
I am not saying we shouldn't, but it is usually good to keep them private first so that we are free to evolve the class. Maybe wait until the need is clear.
Take Spark as an example: BaseSparkAction is package private.
This API is a single abstract method: DataStream<TaskResult> append(DataStream<Trigger> sourceStream);
I don't expect big changes here.
Not just append. There are a lot of public methods in this class.
If we have any doubt whether users would implement extensions of this class, we can delay the decision until a real ask comes forward. It is trivial to make a private class public, but once a class is public, it is more difficult to change/evolve the contract.
The public methods are public API anyways...
These are the ones which will be called by the users when they are scheduling the MaintenanceTasks...
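The thread above centers on how much of MaintenanceTaskBuilder to expose. A minimal sketch of the shape being discussed, using hypothetical stand-in types (Trigger, TaskResult, and DataStream here are placeholders defined locally, not the real Flink/Iceberg classes):

```java
// Stand-ins for the real Flink/Iceberg types referenced in the thread.
class Trigger {}
class TaskResult {}
class DataStream<T> {}

// Sketch of the builder shape under discussion: fluent public setters
// plus a single abstract append() hook that wires the trigger stream
// into a result stream.
abstract class MaintenanceTaskBuilderSketch<T extends MaintenanceTaskBuilderSketch<T>> {
  private String uidSuffix;

  @SuppressWarnings("unchecked")
  public T uidSuffix(String suffix) { // called by users when scheduling tasks
    this.uidSuffix = suffix;
    return (T) this;
  }

  public String uidSuffix() {
    return uidSuffix;
  }

  // The one abstract method subclasses implement.
  abstract DataStream<TaskResult> append(DataStream<Trigger> sourceStream);
}

class ExpireSnapshotsSketch extends MaintenanceTaskBuilderSketch<ExpireSnapshotsSketch> {
  @Override
  DataStream<TaskResult> append(DataStream<Trigger> sourceStream) {
    // The real implementation would build the operator chain here.
    return new DataStream<>();
  }
}
```

With this shape, the abstract hook is small, but the fluent setters form the user-facing surface either way, which is the crux of the public-vs-package-private debate above.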
* @param newPlanningWorkerPoolSize for planning files to delete
* @return for chained calls
*/
public Builder planningWorkerPoolSize(int newPlanningWorkerPoolSize) {
Should we design it like RemoveSnapshots? One benefit is to use the default ThreadPools.getWorkerPool()/getDeleteWorkerPool() to reuse thread pools in the JVM.
public ExpireSnapshots executeDeleteWith(ExecutorService executorService)
public ExpireSnapshots planWith(ExecutorService executorService)
Are you suggesting to do everything in the operator instead of separating out the delete to another operator?
The Spark implementation even splits the expired-file calculations across multiple operators for performance reasons. In the long run we might go down that road... WDYT?
Are you suggesting to do everything in the operator instead of separating out the delete to another operator?
Nope, that is not what I meant.
I was wondering if ExpireSnapshotsProcessor and AsyncDeleteFiles should use the default shared thread pools from ThreadPools.getWorkerPool()/getDeleteWorkerPool(), instead of creating new pools.
I don't like the idea of shared pools. If some users create multiple maintenance flows in a single job, they will end up using the same pool and will block each other.
Also, if you have multiple slots on a single TaskManager, then these pools are shared between the subtasks. Which is again not something we want
Sharing a thread pool is not necessarily a bad thing: it can limit the concurrent I/O. E.g., we may not want to have too many threads performing scan planning, which can be memory intensive.
Deletes have a low memory footprint, hence it is probably less of a concern to have separate pools. But it is probably good to keep an eye on the number of HTTP connections.
We create our own pool for the source planner too, and it is not configurable there.
Do you think this risk is high enough to merit a new configuration value for this?
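To make the trade-off in this thread concrete, here is a small sketch using plain java.util.concurrent (not the Iceberg ThreadPools API; class and method names are illustrative) contrasting one JVM-wide shared pool with a dedicated pool per maintenance flow:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the trade-off discussed above. A JVM-wide shared pool caps
// total concurrent I/O but lets flows block each other when saturated;
// per-flow pools isolate flows at the cost of more threads/connections.
public class PoolSharingSketch {
  // One bounded pool shared by every flow in the JVM (the
  // ThreadPools.getWorkerPool() style suggested in the review).
  static final ExecutorService SHARED = Executors.newFixedThreadPool(4);

  // A dedicated pool per maintenance flow (the PR's current approach):
  // no cross-flow blocking, but pool count grows with the number of flows.
  static ExecutorService dedicatedPoolFor(String flowName) {
    return Executors.newFixedThreadPool(
        2, r -> new Thread(r, "maintenance-" + flowName));
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService flowA = dedicatedPoolFor("expire-snapshots");
    flowA.submit(() -> System.out.println(Thread.currentThread().getName()));
    flowA.shutdown();
    flowA.awaitTermination(5, TimeUnit.SECONDS);
    SHARED.shutdown();
  }
}
```

The named thread factory also makes it easy to spot, in a thread dump, which flow owns which worker threads.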
int maintenanceTaskIndex,
String maintainanceTaskName,
TableLoader newTableLoader,
String mainUidSuffix,
what does main mean here?
There is a possibility to inherit the suffix and the slotSharingGroup from the TableMaintenance.Builder. It could be overwritten on a task-by-task basis.
The main here is the value inherited from the TableMaintenance.Builder.
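The inheritance described here boils down to a simple resolution rule: the per-task setting, if present, overrides the "main" value inherited from TableMaintenance.Builder. A hypothetical sketch (not the PR's actual code):

```java
// Hypothetical sketch of the override rule described above: a per-task
// setting (e.g. uidSuffix or slotSharingGroup) wins over the value
// inherited from TableMaintenance.Builder, which acts as the default.
class MaintenanceDefaultsSketch {
  static String resolve(String taskValue, String mainValue) {
    // Task-level value takes precedence; otherwise inherit the main one.
    return taskValue != null ? taskValue : mainValue;
  }
}
```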
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg.flink.maintenance.stream;
I don't know if users would interpret the stream sub-package as public APIs. It is better to use proper Java class scope for that purpose: public classes are public, and non-public classes can be package private.
ctx.output(DELETE_STREAM, file);
deleteFileCounter.incrementAndGet();
})
.cleanExpiredFiles(true)
Maybe we should add Javadoc to the ExpireSnapshots class that expired files are always deleted.
Added this:
/** Deletes expired snapshots and the corresponding files. */
} catch (Exception e) {
  LOG.info("Exception expiring snapshots for {} at {}", table, ctx.timestamp(), e);
  out.collect(
      new TaskResult(trigger.taskId(), trigger.timestamp(), false, Lists.newArrayList(e)));
TaskResult has List<Exception> exceptions. Wondering in what scenario would we have a list of exceptions to propagate?
Other maintenance tasks might have multiple errors from multiple operators/subtasks
Hmm, I am still not quite following. Each operator subtask emits a TaskResult. Each TaskResult should only contain one exception, right?
I didn't see the exceptions used by downstream. If the success boolean flag is good enough for downstream, maybe we can remove the exceptions from TaskResult, as a stack trace can be non-trivial.
BTW, TaskResult is not marked as Serializable.
When we do compaction, then we have multiple subtasks running parallel. If more than one of them fails, then we will have multiple exception messages to report, but we don't want to fail the job (especially in PostCommitTopology).
Also, I think it is very rare that we have an exception, and it is good to have a single place where we can collect/handle those. So while serializing a stack trace is non-trivial, I think it is worth the cost in the long run.
Fixed the TaskResult serialization.
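A sketch of what such a serialization fix might look like: a Serializable result type carrying a list of exceptions (field names are assumptions inferred from the constructor call quoted in the diff above), plus a round-trip helper that proves Java serialization actually succeeds:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.List;

// Sketch of a Serializable TaskResult (field names inferred from the
// constructor call in the diff: taskId, timestamp, success, exceptions).
class TaskResultSketch implements Serializable {
  private static final long serialVersionUID = 1L;

  final int taskId;
  final long timestamp;
  final boolean success;
  // A list, because parallel subtasks (e.g. during compaction) can each
  // fail with their own exception that is reported without failing the job.
  final List<Exception> exceptions;

  TaskResultSketch(int taskId, long timestamp, boolean success, List<Exception> exceptions) {
    this.taskId = taskId;
    this.timestamp = timestamp;
    this.success = success;
    this.exceptions = exceptions;
  }

  // Java-serializes the result; throws NotSerializableException if any
  // field (including the exception list implementation) is not serializable.
  static byte[] roundTrip(TaskResultSketch result) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(result);
    }
    return bos.toByteArray();
  }
}
```

Throwable itself implements Serializable, so a list of exceptions survives the round trip as long as a serializable List implementation (e.g. ArrayList) is used.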
private String uidSuffix = "TableMaintenance-" + UUID.randomUUID();
private String slotSharingGroup = StreamGraphGenerator.DEFAULT_SLOT_SHARING_GROUP;
private Duration rateLimit = Duration.ofMillis(1);
is 1 ms a good default?
Good catch - it remained from a different config.
Set it to 1 min. WDYT?
private String uidSuffix = "TableMaintenance-" + UUID.randomUUID();
private String slotSharingGroup = StreamGraphGenerator.DEFAULT_SLOT_SHARING_GROUP;
private Duration rateLimit = Duration.ofMillis(1);
private Duration lockCheckDelay = Duration.ofSeconds(30);
is 30s a good default? Is that based on the estimated average of task run time?
I would not set it based on the estimated average task run time: if the actual run time is longer by only a bit, then we will wait for twice the required time.
30s seems reasonable for the JDBC lock manager.
}

@Test
void testMetrics() throws Exception {
similarly, can metrics assertion be added to one of earlier methods?
I prefer to separate out testing the different features
.add(
    new MaintenanceTaskBuilderForTest(true)
        .scheduleOnCommitCount(1)
        .uidSuffix(anotherUid)
do we really need the flexibility of overwriting uidSuffix?
If we want to keep the MonitorSource state but drop the state of one or more of the maintenance tasks, then we need a different uid.
Force-pushed from fa56618 to 2403f44
TableManager builder implementation along with the first maintenance task to provide context.
https://docs.google.com/document/d/16g3vR18mVBy8jbFaLjf2JwAANuYOmIwr15yDDxovdnA/edit#heading=h.yd2vbtnf7z6w