[Managed Iceberg] unbounded source #33504

Open

ahmedabu98 wants to merge 36 commits into master
Conversation

@ahmedabu98 (Contributor) commented Jan 6, 2025

Unbounded (streaming) source for Managed Iceberg.

See design doc for high level overview: https://s.apache.org/beam-iceberg-incremental-source

Fixes #33092

ahmedabu98 marked this pull request as draft January 6, 2025 18:16
ahmedabu98 marked this pull request as ready for review January 30, 2025 21:09

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@ahmedabu98 (Contributor, Author) commented:

R: @kennknowles
R: @regadas

Can y'all take a look? I still have to write some tests, but it's at a good spot for a first round of reviews. I ran a bunch of pipelines (w/Legacy DataflowRunner) at different scales and the throughput/scalability looks good.


github-actions bot commented Feb 3, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

@kennknowles (Member) left a comment

Overall, I think all the pieces are in the right place. Just a question about why the SDF is structured the way it is, and a couple of code-level comments.

This seems like something you want to test a lot of different ways before it gets into a release. Maybe get another set of eyes like @chamikaramj or @Abacn too. But I'm approving and leaving to your judgment.

@kennknowles (Member) left a comment

Wait, actually I forgot that I want to have the discussion about the high-level toggle between the incremental scan source and the bounded source.

github-actions bot added the build label Feb 13, 2025
…rk progress; convert GiB output iterable to list because of RunnerV2 bug
github-actions bot added the model label Mar 3, 2025
@chamikaramj (Contributor) left a comment

Thanks!

* <tr>
* <td> {@code to_timestamp} </td>
* <td> {@code long} </td>
* <td> Reads up to the latest snapshot (inclusive) created before this timestamp (in milliseconds).
Contributor commented:

Is this also optional (similar to to_snapshot)?

ahmedabu98 (Author) replied:

Yes, all the new configuration parameters are optional
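
For illustration, a minimal sketch of what an append-only CDC read could look like through the Managed API with the optional end-point parameters simply left out. The table and catalog values below are placeholders, and the PCollectionRowTuple wiring is assumed from the usual SchemaTransform pattern, not taken from this PR:

import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.managed.Managed;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionRowTuple;
import org.apache.beam.sdk.values.Row;

public class IcebergCdcReadSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // Only the table/catalog settings are specified; to_snapshot, to_timestamp, and the other
    // new parameters are optional and omitted here.
    Map<String, Object> config =
        Map.of(
            "table", "db.my_table",            // placeholder
            "catalog_name", "my_catalog",      // placeholder
            "catalog_properties",
                Map.of("type", "hadoop", "warehouse", "gs://my-bucket/warehouse"));

    PCollection<Row> cdcRows =
        PCollectionRowTuple.empty(pipeline)
            .apply(Managed.read(Managed.ICEBERG_CDC).withConfig(config))
            .getSinglePCollection();

    pipeline.run();
  }
}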

* <li>{@code earliest}: starts reading from the earliest snapshot</li>
* <li>{@code latest}: starts reading from the latest snapshot</li>
* </ul>
* <p>Defaults to {@code earliest} for batch, and {@code latest} for streaming.
Contributor commented:

By "streaming" do you mean the PipelineOption [1] the Iceberg config (defined below) or both ?

[1]

ahmedabu98 (Author) replied:

Streaming in the context of IcebergIO, so the config. I'll move the streaming row up so people see it first and can reference it.

* <td>
* The column used to derive event time to track progress. Must be of type:
* <ul>
* <li>{@code timestamp}</li>
Contributor commented:

Could you elaborate on what you mean here by the timestamp and timestamptz types?

ahmedabu98 (Author) replied:

These are Iceberg types: https://iceberg.apache.org/spec/#primitive-types. Will include this link.
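
For context, a small sketch of an Iceberg schema declaring both of those column types, using the Iceberg Java API (column names here are made up):

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// "timestamp" (no time zone) vs. "timestamptz" (with time zone), per the Iceberg spec's
// primitive types; either could serve as the watermark column.
Schema schema =
    new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "event_time", Types.TimestampType.withoutZone()), // timestamp
        Types.NestedField.optional(3, "ingested_at", Types.TimestampType.withZone()));  // timestamptz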

* </td>
* </tr>
* <tr>
* <td> {@code streaming} </td>
Contributor commented:

What if both this and to_snapshot or to_timestamp are set?

ahmedabu98 (Author) replied:

Mentioned below in the "Choosing an End Point (CDC only)" section.

It will still be a streaming pipeline, which will stop by itself after processing the end snapshot. Similar to how PeriodicImpulse behaves.
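
As a quick illustration of that combination (key names as discussed in this thread; the table, catalog, and snapshot ID values are placeholders), the config would look something like:

// Streaming read that stops by itself once the snapshot given by to_snapshot has been processed.
Map<String, Object> config =
    Map.of(
        "table", "db.my_table",
        "catalog_name", "my_catalog",
        "streaming", true,
        "to_snapshot", 1234567890L); // placeholder snapshot ID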

@ahmedabu98 (Contributor, Author) commented:

@chamikaramj this is ready for another review

.discardingFiredPanes())
.apply(
GroupIntoBatches.<ReadTaskDescriptor, ReadTask>ofByteSize(
MAX_FILES_BATCH_BYTE_SIZE, ReadTask::getByteSize)
Contributor commented:

We don't really want these batches; we just want the read tasks distributed to workers without causing worker OOMs. Otherwise we're just adding latency on top of the poll latency and not really benefiting from the batch.

Ideally we could change Redistribute to autoshard, but since it is tied to GroupIntoBatches currently, what about just doing GroupIntoBatches.ofSize(1).withShardedKey()?
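
For reference, assuming readTasks is the PCollection<KV<ReadTaskDescriptor, ReadTask>> from the hunk above, the suggested alternative would look roughly like this (a sketch, not a drop-in change from the PR):

// Emit each read task as its own "batch", but let the runner autoshard the keys so the
// tasks still spread across workers instead of piling up behind one key.
readTasks.apply(
    "DistributeReadTasks",
    GroupIntoBatches.<ReadTaskDescriptor, ReadTask>ofSize(1).withShardedKey());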

ahmedabu98 (Author) replied:

I initially figured that GroupIntoBatches.ofSize(1).withShardedKey() would give us the same problem of having too many concurrent shards, but when I ran it I found it actually produces only 1 shard, processing everything sequentially. Same thing when I tried .ofByteSize(1)

.setCoder(KvCoder.of(ReadTaskDescriptor.getCoder(), ReadTask.getCoder()))
.apply(
Window.<KV<ReadTaskDescriptor, ReadTask>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
Contributor commented:

Can this trigger be removed? It seems like the GiB does the triggering, so I'm not sure this has an effect (or, if it does, whether it is intended).

String tableIdentifier = element.getKey().getTableIdentifierString();
ReadTask readTask = element.getValue();
Table table = TableCache.get(tableIdentifier, scanConfig.getCatalogConfig().catalog());
Schema dataSchema = IcebergUtils.icebergSchemaToBeamSchema(table.schema());
Contributor commented:

Seems like some additional things here could be cached.

Or better yet, can the schemas be built at pipeline construction time? Having well-defined schemas seems like it will help with pipeline update compatibility. Each one now is going to get a unique UUID.
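
A hedged sketch of that second idea (the class name ReadTaskFn and the empty process body are illustrative, not the PR's actual code): derive the Beam schema once when the transform is expanded and hand it to the DoFn, instead of rebuilding it per element:

// Schema is computed once at pipeline construction time (e.g. in expand()) and passed in,
// so every worker sees the same well-defined schema; this should also help with
// pipeline update compatibility.
class ReadTaskFn extends DoFn<KV<ReadTaskDescriptor, ReadTask>, Row> {
  private final Schema outputCdcSchema;

  ReadTaskFn(Schema outputCdcSchema) {
    this.outputCdcSchema = outputCdcSchema;
  }

  @ProcessElement
  public void process(
      @Element KV<ReadTaskDescriptor, ReadTask> element, OutputReceiver<Row> out) {
    // ... read the task's files and build Rows against outputCdcSchema,
    // rather than re-deriving the schema from the table here ...
  }
}

// At construction time (reusing the PR's own helpers):
// Schema dataSchema = IcebergUtils.icebergSchemaToBeamSchema(table.schema());
// new ReadTaskFn(ReadUtils.outputCdcSchema(dataSchema));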

scanTasksCompleted.inc();
}

// infinite skew in case we encounter some files that don't support watermark column statistics,
Contributor commented:

it seems like this will either:

  • hold the watermark
  • output late stuff that will be dropped

Contributor commented:

I don't think this would be needed (at least in this transform) if the read tasks had the right event timestamp coming in, since we would just assign that timestamp to all records within the snapshot.
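
In other words, something along these lines (a sketch, assuming the element timestamp was already set to the snapshot's commit timestamp upstream; readRecords is a hypothetical helper):

@ProcessElement
public void process(
    @Element KV<ReadTaskDescriptor, ReadTask> element,
    @Timestamp Instant snapshotTimestamp,  // event timestamp assigned when the task was created
    OutputReceiver<Row> out) {
  for (Row record : readRecords(element.getValue())) {
    // Every record in the snapshot gets the snapshot's timestamp, so no timestamp skew
    // allowance is needed in this transform.
    out.outputWithTimestamp(record, snapshotTimestamp);
  }
}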

Schema dataSchema = IcebergUtils.icebergSchemaToBeamSchema(table.schema());
Schema outputCdcSchema = ReadUtils.outputCdcSchema(dataSchema);

Instant outputTimestamp = ReadUtils.getReadTaskTimestamp(readTask, scanConfig);
Contributor commented:

It seems like this should be done when creating the read task; that way it will hold the watermark appropriately while the task is being shuffled, etc.


return isComplete
? PollResult.complete(timestampedSnapshots) // stop at specified snapshot
: PollResult.incomplete(timestampedSnapshots); // continue forever
Contributor commented:

I think we want to generate a correct watermark here using
PollResult.withWatermark
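
Something like the following, perhaps (a sketch, assuming timestampedSnapshots is ordered oldest-to-newest and that PollResult.withWatermark is the mechanism settled on):

if (timestampedSnapshots.isEmpty()) {
  // Nothing new in this poll; leave the watermark where it is.
  return PollResult.incomplete(timestampedSnapshots);
}
// Advance the watermark to the newest snapshot seen in this poll.
Instant watermark =
    timestampedSnapshots.get(timestampedSnapshots.size() - 1).getTimestamp();
return isComplete
    ? PollResult.complete(timestampedSnapshots) // stop at specified snapshot
    : PollResult.incomplete(timestampedSnapshots).withWatermark(watermark);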

@chamikaramj (Contributor) left a comment

Thanks. LGTM.

* <td> {@code operation} </td>
* <td> {@code string} </td>
* <td>
* The snapshot <a href="https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/DataOperations">operation</a> associated with this record. For now, only "append" is supported.
Contributor commented:

Maybe change to "APPEND" to be consistent with Iceberg.

*
* <p><b>Note</b>: This reads <b>append-only</b> snapshots. Full CDC is not supported yet.
*
* <p>The CDC <b>streaming</b> source (enabled with {@code streaming=true}) continuously polls the
Contributor commented:

We should validate (and fail) somewhere if the "streaming" flag is set here and the streaming PipelineOption [1] is not set.

[1]
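
As a rough sketch of such a check (where exactly it should live is an open question; input stands for the transform's input and streaming for the parsed Iceberg config flag, both assumed here):

// Fail fast at expansion time if the Iceberg streaming read is requested
// but the pipeline itself is not a streaming pipeline.
boolean pipelineIsStreaming =
    input.getPipeline().getOptions().as(StreamingOptions.class).isStreaming();
if (Boolean.TRUE.equals(streaming) && !pipelineIsStreaming) {
  throw new IllegalArgumentException(
      "Iceberg CDC read was configured with streaming=true, but the pipeline is not "
          + "running in streaming mode. Please also set the streaming PipelineOption.");
}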

@@ -108,6 +110,7 @@ public class Managed {
*
* <ul>
* <li>{@link Managed#ICEBERG} : Read from Apache Iceberg tables
* <li>{@link Managed#ICEBERG_CDC} : CDC Read from Apache Iceberg tables
Contributor commented:

We should link to locations where users can find additional Javadocs related to each of these options (also for write).


Successfully merging this pull request may close these issues.

[Feature Request]: {Managed IO Iceberg} - Allow users to run streaming reads