[Task]: Add Python AfterSynchronizedProcessingTime trigger and add an Iceberg CDC streaming read test #34212

ahmedabu98 · 2025-03-07T15:31:20Z

What needs to happen?

I was trying to add a Python test for the new Iceberg streaming source (#33504), but ran into the following error in `from_runner_api` (apache_beam/transforms/trigger.py):

    @staticmethod
    def from_runner_api(proto, context):
>     return {
          'after_all': AfterAll,
          'after_any': AfterAny,
          'after_each': AfterEach,
          'after_end_of_window': AfterWatermark,
          'after_processing_time': AfterProcessingTime,
          # 'after_synchronized_processing_time': _AfterSynchronizedProcessingTime,
          'always': Always,
          'default': DefaultTrigger,
          'element_count': AfterCount,
          'never': _Never,
          'or_finally': OrFinally,
          'repeat': Repeatedly,
      }[proto.WhichOneof('trigger')].from_runner_api(proto, context)
E     KeyError: 'after_synchronized_processing_time'

apache_beam/transforms/trigger.py:301: KeyError

An old PR (#14060) was going to add this trigger, but it went stale.
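For anyone skimming, the failure mode in the traceback can be reproduced without Beam. The following is a minimal standalone sketch, not the actual trigger.py code: `dispatch`, `registry`, and `SyncProcessingTimeStub` are hypothetical names made up for illustration, and the `from_runner_api` bodies are stubs. It only shows that a lookup keyed on the proto's oneof name raises `KeyError` until the missing key is registered.

```python
# Standalone sketch of the dict-based dispatch seen in the traceback.
# These classes are stand-ins, NOT Beam's real trigger classes.


class AfterProcessingTime:
    @staticmethod
    def from_runner_api(proto, context):
        return "AfterProcessingTime"


class SyncProcessingTimeStub:  # hypothetical placeholder for the missing trigger
    @staticmethod
    def from_runner_api(proto, context):
        return "SyncProcessingTimeStub"


def dispatch(which_oneof, registry):
    """Look up the handler by oneof name, as the traceback's dict lookup does."""
    return registry[which_oneof].from_runner_api(None, None)


# The mapping lacks the synchronized-processing-time entry (it is commented
# out in the traceback), so the lookup fails.
registry = {'after_processing_time': AfterProcessingTime}

try:
    dispatch('after_synchronized_processing_time', registry)
    missing_key = None
except KeyError as e:
    missing_key = e.args[0]

# Registering a handler for the key makes the same lookup succeed.
registry['after_synchronized_processing_time'] = SyncProcessingTimeStub
result = dispatch('after_synchronized_processing_time', registry)
```

In this sketch the fix is a single extra mapping entry plus a class that knows how to decode itself; whether the real change needs more than that depends on what #14060 contains.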

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components


ahmedabu98 · 2025-03-07T15:33:15Z

Iceberg source test

def test_streaming_write_read_cdc_pipeline_using_bqms(self):
    runner = self.test_pipeline.get_option('runner')
    if not runner or "dataflowrunner" not in runner.lower():
      self.skipTest(
        "CDC streaming source requires "
        "`beam:requirement:pardo:on_window_expiration:v1`, "
        "which is currently only supported by the Dataflow runner")

    bigquery_client = BigQueryWrapper()
    dataset_id = "py_managed_iceberg_bqms_test_" + str(int(time.time()))
    project = self.test_pipeline.get_option('project')
    bigquery_client.get_or_create_dataset(project, dataset_id)
    _LOGGER.info(
      "Created dataset %s in project %s", dataset_id, project)

    catalog_props = {
      "warehouse": self.WAREHOUSE,
      "gcp_project": project,
      "gcp_location": "us-central1",
      "catalog-impl": "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
      "io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO"}
    first_table = dataset_id + ".first"

    rows = [self._create_row(i) for i in range(3)]
    expected_dicts = [row.as_dict() for row in rows]

    # Reduce test time by forcing direct runner on pipelines that don't need Dataflow
    args_with_directrunner = PipelineOptions(self.args).get_all_options()
    args_with_directrunner['runner'] = 'DirectRunner'
    # first, prepare an initial table (uses DirectRunner)
    with beam.Pipeline(options=PipelineOptions(**args_with_directrunner)) as write_pipeline:
      first_write_config = {
        "table": first_table,
        "catalog_properties": catalog_props
      }
      _ = (
              write_pipeline
              | beam.Create(rows)
              | beam.managed.Write(beam.managed.ICEBERG, config=first_write_config))

    # The intended test: stream a CDC read from the first table, validate it,
    # and stream the records to a second table (uses DataflowRunner).
    second_table = dataset_id + ".second"
    with beam.Pipeline(argv=self.args) as read_pipeline:
      first_read_config = {
        "table": first_table,
        "catalog_properties": catalog_props,
        "streaming": True,
        "to_timestamp": int((time.time() * 1000))
      }
      second_write_config = {
        "table": second_table,
        "catalog_properties": catalog_props,
        "triggering_frequency_seconds": 5
      }

      output_cdc_rows = (
              read_pipeline
              | beam.managed.Read(beam.managed.ICEBERG_CDC, config=first_read_config))
      output_cdc_dicts = output_cdc_rows | beam.Map(lambda row: {"operation": row.operation, "record": row.record._asdict()})

      _ = (output_cdc_rows
           | "Extract records" >> beam.Map(lambda row: row.record)
           .with_output_types(
                RowTypeConstraint.from_fields([
                  ("int_", int),
                  ("str_", str),
                  ("bytes_", bytes),
                  ("bool_", bool),
                  ("float_", float)
                ]))
              | beam.managed.Write(beam.managed.ICEBERG, config=second_write_config))

      expected_cdc_dicts = [{"record": record, "operation": "append"} for record in expected_dicts]
      assert_that(output_cdc_dicts, equal_to(expected_cdc_dicts))

    # batch read from the second table and validate records (uses DirectRunner)
    with beam.Pipeline(options=PipelineOptions(**args_with_directrunner)) as read_pipeline:
      second_read_config = {
        "table": second_table,
        "catalog_properties": catalog_props
      }

      output_dicts = (
              read_pipeline
              | beam.managed.Read(beam.managed.ICEBERG, config=second_read_config)
              | beam.Map(lambda row: row._asdict()))

      assert_that(output_dicts, equal_to(expected_dicts))

    bigquery_client._delete_dataset(project, dataset_id)

The test passes once the trigger.py changes from #14060 are applied.

ahmedabu98 · 2025-03-07T15:33:44Z

cc @kennknowles

ahmedabu98 added the `awaiting triage` and `task` labels on Mar 7, 2025

github-actions bot added the `python`, `io`, and `P2` labels on Mar 7, 2025
