FluxSieve Replication Package

This repository is the replication package for the paper FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability.

It contains:

- implementation artifacts for the FluxSieve data-plane mechanisms,
- two experimental tracks (RTOLAP with Apache Pinot and data-lake analytics with DuckDB/Parquet),
- raw and processed experimental results, and
- plotting scripts used to generate the figures.

1) Package Overview

streaming-data-plane/
- Java/Kafka Streams implementation of FluxSieve core mechanisms:
  - in-stream multi-pattern matching and enrichment,
  - dynamic pattern-matching engine updates via object storage and notifications,
  - direct Parquet writers used by the data-lake experiments.
streaming-data-lake/
- Reproduction assets for the DuckDB/Parquet evaluation track.
- Includes local load generation, benchmark scripts, result folders, and plotting scripts.
RTOLAP/
- Reproduction assets for the Apache Pinot (RTOLAP) evaluation track.
- Includes Kubernetes deployment assets, load generation, Pinot query client, schemas/tables, and plotting scripts.
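
As a rough illustration of the "in-stream multi-pattern matching and enrichment" mechanism listed under streaming-data-plane/, the sketch below tags each record with the IDs of the patterns it matches. It is a conceptual sketch only: the repository implements this in Java/Kafka Streams (the class name suggests Hyperscan), whereas this example uses Python's standard `re` module. The pattern IDs, the regexes, and the `content` and `matched_patterns` field names are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical pattern set: IDs and regexes are illustrative, not taken from the repository.
PATTERNS: Dict[str, str] = {
    "p_error": r"\bERROR\b",
    "p_timeout": r"timed? ?out",
    "p_http_5xx": r"\bHTTP 5\d\d\b",
}


def compile_patterns(patterns: Dict[str, str]) -> Dict[str, re.Pattern]:
    """Compile every pattern once, up front, so per-record work is matching only."""
    return {pid: re.compile(rx, re.IGNORECASE) for pid, rx in patterns.items()}


def enrich(record: dict, compiled: Dict[str, re.Pattern]) -> dict:
    """Return a copy of the record with the IDs of all matching patterns attached.

    The 'content' and 'matched_patterns' field names are assumptions.
    """
    content = record.get("content", "")
    matched: List[str] = sorted(
        pid for pid, rx in compiled.items() if rx.search(content)
    )
    return {**record, "matched_patterns": matched}


if __name__ == "__main__":
    compiled = compile_patterns(PATTERNS)
    print(enrich({"content": "ERROR: request timed out"}, compiled))
```

Once the match result is stored with the record at ingestion time, a later query can filter on the enrichment field instead of rescanning the raw text.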

2) Paper-to-Repository Map

Use this section as an index from paper concepts/results to concrete artifacts.

| Paper topic | Repository artifacts |
| --- | --- |
| FluxSieve architecture and core stream-processing approach | `streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecords/KStreamsMatchingHyperscanEnrich5Dbs.java` |
| Dynamic adaptation and matcher hot-swap protocol | `streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/dynamicPatternMatchingEngine/PatternMatchingEngineUpdater.java`, `streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/dynamicPatternMatchingEngine/KStreamsDynamicPatternFiltering.java` |
| Kafka Streams integration details and tests | `streaming-data-plane/README.md`, `streaming-data-plane/kstreams/src/test/` |
| Data-lake benchmark implementation (baseline vs FluxSieve-style enriched data) | `streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecordsWriteParquet/KafkaStreamsBaselineWriteParquetDirect.java`, `streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecordsWriteParquet/KafkaStreamsMatchToParquetDirect.java` |
| Data-lake evaluation methodology (cold/hot, count vs return rows, file-layout/parallelism sweep) | `streaming-data-lake/scripts/queries-comparison.py`, `streaming-data-lake/README.md` |
| Data-lake result artifacts (raw benchmark metrics) | `streaming-data-lake/experimental-results/final-10M-records_ultra-high-selectivity_parquet_zstd-10reps_parallelism1_~1k-parquet-files_~10k-records-each/results/10M-unified_benchmark_parquet_zstd/`, `streaming-data-lake/experimental-results/final-10M-records_ultra-high-selectivity_parquet_zstd-10reps_parallelism1_~5k-parquet-files_~2k-records-each/results/10M-unified_benchmark_parquet_zstd/`, `streaming-data-lake/experimental-results/final-10M-records_ultra-high-selectivity_parquet_zstd-10reps_parallelism4_~1k-parquet-files_~10k-records-each/results/10M-unified_benchmark_parquet_zstd/`, `streaming-data-lake/experimental-results/final-10M-records_ultra-high-selectivity_parquet_zstd-10reps_parallelism4_~5k-parquet-files_~2k-records-each/results/10M-unified_benchmark_parquet_zstd/` |
| Data-lake figure generation | `streaming-data-lake/experimental-results/plot-general-scaling-speedup.py` |
| RTOLAP setup and distributed-system assumptions | `RTOLAP/README.md`, `RTOLAP/tables/` |
| RTOLAP ingestion/query workload execution | `RTOLAP/load-generator/`, `RTOLAP/pinot-client/` |
| RTOLAP result artifacts (high-selectivity and ultra-high-selectivity runs) | `RTOLAP/experimental-results/High-Selectivity/`, `RTOLAP/experimental-results/Ultra-High-Selectivity/` |
| RTOLAP figure generation | `RTOLAP/experimental-results/High-Selectivity/plot-general-scaling-speedup.py`, `RTOLAP/experimental-results/Ultra-High-Selectivity/plot-general-scaling-speedup_high-selectivity.py` |

3) Fast Reproduction Paths

Choose one of the tracks below depending on what you want to replicate.

A. DuckDB/Data-Lake Track (local-friendly)

Primary docs:

- streaming-data-lake/README.md
- Concurrency experiments: streaming-data-lake/experimental-results/concurrent-queries/ (see streaming-data-lake/experimental-results/concurrent-queries/README.md)

Typical flow:

1. Build the streaming module once:

       (cd streaming-data-plane && ./gradlew :kstreams:compileJava)

2. Generate Kafka input records:

       (cd streaming-data-lake/load-generator && podman-compose -f docker-compose.yaml up --build)

3. Run the baseline and enriched Parquet writers (see streaming-data-lake/README.md for the commands).

4. Run the benchmark and analysis scripts:

       cd streaming-data-lake
       python3 -m venv .venv
       source .venv/bin/activate
       pip install -r scripts/requirements-parquet-tools.txt
       python3 scripts/queries-comparison.py

5. Inspect generated and archived metrics in streaming-data-lake/experimental-results/.
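
For the final inspection step, a small helper can list where the metric files ended up. The sketch below uses only the Python standard library and the two CSV file names mentioned in this README. The function name `find_metric_files` is hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path
from typing import Dict, List

# File names as listed in this README; everything else here is illustrative.
METRIC_FILES = ("cold_metrics_unified.csv", "hot_metrics_unified.csv")


def find_metric_files(results_root: str) -> Dict[str, List[Path]]:
    """Recursively collect every cold/hot metrics CSV below results_root."""
    root = Path(results_root)
    return {name: sorted(root.rglob(name)) for name in METRIC_FILES}


if __name__ == "__main__":
    # Assumes the current working directory is the repository root.
    found = find_metric_files("streaming-data-lake/experimental-results")
    for name, paths in found.items():
        print(f"{name}: {len(paths)} file(s)")
        for path in paths:
            print(f"  {path}")
```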

B. RTOLAP/Pinot Track (distributed/Kubernetes)

Primary docs:

- RTOLAP/README.md
- RTOLAP/load-generator/README.md
- RTOLAP/pinot-client/README.md

Typical flow:

1. Deploy the Kafka + Pinot cluster (as described in RTOLAP/README.md).
2. Produce workload data with RTOLAP/load-generator/.
3. Configure Pinot schema/table definitions from RTOLAP/tables/.
4. Execute query scenarios with RTOLAP/pinot-client/.
5. Analyze archived CSVs and regenerate plots from:
   - RTOLAP/experimental-results/High-Selectivity/
   - RTOLAP/experimental-results/Ultra-High-Selectivity/

4) Where to Find Implementations Quickly

Core FluxSieve stream matching + enrichment:
- streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecords/KStreamsMatchingHyperscanEnrich5Dbs.java
Dynamic matcher update path:
- streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/dynamicPatternMatchingEngine/PatternMatchingEngineUpdater.java
- streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/dynamicPatternMatchingEngine/KStreamsDynamicPatternFiltering.java
Parquet writer variants used by data-lake experiments:
- streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecordsWriteParquet/KafkaStreamsBaselineWriteParquetDirect.java
- streaming-data-plane/kstreams/src/main/java/com/dynatrace/research/matchAndEnrichRecordsWriteParquet/KafkaStreamsMatchToParquetDirect.java
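
To give a feel for what a matcher hot-swap involves before opening the Java sources, the following is a minimal conceptual sketch in Python. It is not the protocol implemented in PatternMatchingEngineUpdater.java: the class `SwappableMatcher` and its methods are hypothetical, and the object-storage download and notification handling are left out. It shows one common approach, which is to build the new matcher off to the side and then replace the active reference in a single step, so that in-flight matching never sees a half-built engine.

```python
import re
import threading
from typing import Dict, List


class SwappableMatcher:
    """Hypothetical matcher whose pattern set can be replaced while in use."""

    def __init__(self, patterns: Dict[str, str]) -> None:
        self._lock = threading.Lock()
        self._compiled = self._compile(patterns)

    @staticmethod
    def _compile(patterns: Dict[str, str]) -> Dict[str, re.Pattern]:
        return {pid: re.compile(rx) for pid, rx in patterns.items()}

    def update(self, patterns: Dict[str, str]) -> None:
        """Compile the new set first; swap only if compilation succeeded.

        If any pattern is invalid, re.error propagates and the old
        matcher stays active.
        """
        new_compiled = self._compile(patterns)
        with self._lock:
            self._compiled = new_compiled

    def match(self, text: str) -> List[str]:
        """Match against a snapshot of the currently active pattern set."""
        with self._lock:
            compiled = self._compiled
        return sorted(pid for pid, rx in compiled.items() if rx.search(text))


if __name__ == "__main__":
    matcher = SwappableMatcher({"a": r"foo"})
    print(matcher.match("foo bar"))
    matcher.update({"b": r"bar"})
    print(matcher.match("foo bar"))
```

Compiling outside the lock keeps the critical section to a single reference assignment, so a slow compilation does not stall record processing.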

5) Where to Find Experimental Results Quickly

DuckDB/data-lake track:
- streaming-data-lake/experimental-results/
- Key outputs per scenario:
  - cold_metrics_unified.csv
  - hot_metrics_unified.csv
  - 10M-unified_benchmark_parquet_zstd.log
RTOLAP/Pinot track:
- RTOLAP/experimental-results/High-Selectivity/
- RTOLAP/experimental-results/Ultra-High-Selectivity/
- Includes per-dataset-size CSVs (5M/10M/20M/40M), with multiple repetitions and query variants.
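
When reading these CSVs, the quantity of interest is typically a speedup, that is, baseline latency divided by FluxSieve latency. The sketch below shows that arithmetic using the standard library. The column names `variant` and `latency_ms`, the variant labels, and every number in `SAMPLE` are made up for illustration. Check the header row of the actual CSVs and adapt the names before using this on real results.

```python
import csv
import io
import statistics
from collections import defaultdict
from typing import Dict

# Synthetic sample data: NOT taken from the repository's result files.
SAMPLE = """variant,latency_ms
baseline,100
baseline,120
baseline,110
fluxsieve,10
fluxsieve,12
fluxsieve,11
"""


def median_latency_by_variant(
    csv_text: str,
    variant_col: str = "variant",      # assumed column name
    latency_col: str = "latency_ms",   # assumed column name
) -> Dict[str, float]:
    """Group latency samples by variant and return the median of each group."""
    samples = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples[row[variant_col]].append(float(row[latency_col]))
    return {variant: statistics.median(xs) for variant, xs in samples.items()}


def speedup(
    medians: Dict[str, float],
    baseline: str = "baseline",
    candidate: str = "fluxsieve",
) -> float:
    """Speedup = baseline median latency / candidate median latency."""
    return medians[baseline] / medians[candidate]


if __name__ == "__main__":
    medians = median_latency_by_variant(SAMPLE)
    print(medians)
    print(f"speedup: {speedup(medians):.1f}x")
```

The median is used here because repeated latency measurements often contain outliers. The repository's plotting scripts may aggregate differently.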

6) Suggested Reading Order

For a quick but complete understanding of the replication package:

1. Read this file (README.md).
2. Read the FluxSieve paper for the full methodology and findings.
3. Follow one track end-to-end:
   - local: streaming-data-lake/README.md
   - distributed: RTOLAP/README.md
4. Use the plotting scripts in each track to regenerate figures from the raw result CSVs.

7) Notes on Reproducibility

- The RTOLAP track depends on distributed infrastructure (Kubernetes, Kafka, Pinot, object storage).
- The data-lake track is easier to run locally and includes reproducible benchmark scripts and archived outputs.
- Existing result folders in this repository can be used directly to reproduce plots even if you do not rerun full ingestion workloads.

License

This repository is available under the terms in LICENSE. Use is limited to internal, non-production, non-commercial purposes. Redistribution, sublicensing, modification, sale, and use for training or improving AI/ML models are not permitted.
