datafusion-contrib/datafusion-distributed

DataFusion Distributed

Library that brings distributed execution capabilities to Apache DataFusion.

Note

This project is not part of Apache DataFusion.

What can you do with this crate?

This crate is a toolkit that extends Apache DataFusion with distributed capabilities, providing a developer experience as close as possible to vanilla DataFusion while being unopinionated about the networking stack used for hosting the different workers involved in a query.

Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed capabilities with minimal changes.

Core tenets of the project

  • Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and a familiar API for building applications.
  • Unopinionated about networking. This crate takes no position on the networking stack; users are expected to leverage their own infrastructure for hosting DataFusion nodes.
  • No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
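
To make the last tenet concrete, here is a minimal in-memory sketch of what "any node can act as a coordinator or a worker" means. It is not this crate's API: `Node`, `QueryResult`, `execute_fragment` and `coordinate` are hypothetical names made up for illustration, there is no networking involved, and a "query" is reduced to summing a list of integers.

```rust
/// Hypothetical node type for illustration only; not part of this crate.
#[derive(Debug, Clone)]
struct Node {
    id: usize,
}

/// Hypothetical result type: who coordinated, who computed what, and the total.
#[derive(Debug, PartialEq)]
struct QueryResult {
    coordinator: usize,
    partials: Vec<(usize, i64)>,
    total: i64,
}

impl Node {
    /// Worker role: execute one fragment of the work (here, sum some rows).
    fn execute_fragment(&self, rows: &[i64]) -> (usize, i64) {
        (self.id, rows.iter().sum())
    }

    /// Coordinator role: split the rows across all peers (which may include
    /// this node), let each peer execute its fragment, then merge the results.
    fn coordinate(&self, peers: &[Node], rows: &[i64]) -> QueryResult {
        // With no peers available, the coordinator does all the work itself.
        if peers.is_empty() {
            let (id, sum) = self.execute_fragment(rows);
            return QueryResult { coordinator: self.id, partials: vec![(id, sum)], total: sum };
        }
        // Ceiling division, so there are never more fragments than peers.
        let chunk = ((rows.len() + peers.len() - 1) / peers.len()).max(1);
        let partials: Vec<(usize, i64)> = rows
            .chunks(chunk)
            .zip(peers)
            .map(|(fragment, peer)| peer.execute_fragment(fragment))
            .collect();
        let total = partials.iter().map(|(_, sum)| sum).sum();
        QueryResult { coordinator: self.id, partials, total }
    }
}
```

Because every `Node` carries both methods, the caller picks the coordinator per query: `nodes[0].coordinate(&nodes, &rows)` and `nodes[2].coordinate(&nodes, &rows)` produce the same total, and no node has to be deployed in a dedicated coordinator role.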

Benchmarks

The benchmarking code is public and open, and it uses AWS CDK to automate the creation of the benchmarking cluster so that anyone can reproduce the same results in their own AWS account. The code can be found in the benchmarks/cdk directory.

TPC-H SF1

benchmarks_sf1.png

TPC-H SF10

benchmarks_sf10.png

Docs

The user and contributor guide can be found here:

https://datafusion-contrib.github.io/datafusion-distributed

Getting familiar with distributed DataFusion

The examples/ directory contains runnable examples showing a localhost setup for DataFusion Distributed:

  • localhost_worker.rs: code that spawns a Worker listening for physical plans over the network.
  • localhost_run.rs: code that distributes a query across the spawned Workers and executes it.

The integration tests also provide an idea about how to use the library and what can be achieved with it: