datafusion-contrib/datafusion-distributed

DataFusion Distributed

Library that brings distributed execution capabilities to Apache DataFusion.

What can you do with this crate?

This crate is a toolkit that extends Apache DataFusion with distributed capabilities. It aims for a developer experience as close as possible to vanilla DataFusion, and it is unopinionated about the networking stack used to host the workers involved in a query.

Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed capabilities with minimal changes.

Core tenets of the project

  • Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and a familiar API for building applications.
  • Unopinionated about networking. This crate takes no position on the networking stack; users are expected to leverage their own infrastructure for hosting DataFusion nodes.
  • No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
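The third tenet can be pictured with a toy model. The sketch below is not this crate's API: `Node`, `Role`, and `roles_for_query` are hypothetical names, and the rule "the node that receives the query coordinates it" is an assumption made only for illustration. It shows that roles can be assigned per query rather than fixed per node.

```rust
/// Hypothetical role a node plays for ONE query; not a type from this crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Coordinator,
    Worker,
}

/// Hypothetical node description: just a network address.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Node {
    url: String,
}

/// Assign roles for a single query. Assumption for this sketch: the node that
/// received the query coordinates it, and every other node acts as a worker.
fn roles_for_query(cluster: &[Node], entry_point: usize) -> Vec<(Node, Role)> {
    cluster
        .iter()
        .enumerate()
        .map(|(i, node)| {
            let role = if i == entry_point { Role::Coordinator } else { Role::Worker };
            (node.clone(), role)
        })
        .collect()
}

fn main() {
    let cluster: Vec<Node> = (0..3)
        .map(|i| Node { url: format!("http://localhost:{}", 8080 + i) })
        .collect();

    // Two queries arriving at different nodes: the roles differ per query.
    for entry in [0, 2] {
        println!("query received by node {entry}:");
        for (node, role) in roles_for_query(&cluster, entry) {
            println!("  {} -> {:?}", node.url, role);
        }
    }
}
```

Because no node is special, the same binary can be deployed everywhere, which is what keeps the infrastructure simple.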

Benchmarks

[Figure: dist-df-vs-df-vs-trino.png (benchmark chart; per the file name, Distributed DataFusion vs. DataFusion vs. Trino)]

Docs

The user and contributor guide can be found here:

https://datafusion-contrib.github.io/datafusion-distributed

Getting familiar with distributed DataFusion

The examples/ directory contains runnable examples showing how to provide a localhost implementation of Distributed DataFusion:

  • localhost_worker.rs: code that spawns an Arrow Flight Endpoint listening for physical plans over the network.
  • localhost_run.rs: code that distributes a query across the spawned Arrow Flight Endpoints and executes it.
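The division of labor between these two examples can be pictured without any networking. The sketch below uses only the Rust standard library and is not taken from examples/: channels stand in for Arrow Flight endpoints, and the hypothetical `Fragment` type (sum a range of integers) stands in for a physical plan. It only illustrates the scatter-gather shape: workers wait for work, a runner splits the work, sends it out, and merges the partial results.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a physical plan fragment: sum the integers in a range.
#[derive(Debug, Clone)]
struct Fragment {
    start: u64,
    end: u64, // exclusive
}

/// Stand-in for a worker endpoint: waits for fragments, executes each one,
/// and sends back the partial result.
fn spawn_worker(results: mpsc::Sender<u64>) -> mpsc::Sender<Fragment> {
    let (tx, rx) = mpsc::channel::<Fragment>();
    thread::spawn(move || {
        for fragment in rx {
            let partial: u64 = (fragment.start..fragment.end).sum();
            // Ignore send errors: the runner may have stopped listening.
            let _ = results.send(partial);
        }
    });
    tx
}

/// Stand-in for the runner: splits the work, sends one fragment per worker,
/// then merges the partial results.
fn run_distributed(total: u64, workers: u64) -> u64 {
    assert!(workers > 0, "at least one worker is required");
    let (result_tx, result_rx) = mpsc::channel();
    let chunk = (total + workers - 1) / workers; // ceiling division
    for i in 0..workers {
        let worker = spawn_worker(result_tx.clone());
        let fragment = Fragment {
            start: (i * chunk).min(total),
            end: ((i + 1) * chunk).min(total),
        };
        worker.send(fragment).expect("worker thread is alive");
        // `worker` is dropped here, so the worker thread exits after its fragment.
    }
    drop(result_tx); // the iterator below ends once every worker has finished
    result_rx.iter().sum()
}

fn main() {
    let total = run_distributed(1_000, 4);
    println!("sum of 0..1000 = {total}"); // prints "sum of 0..1000 = 499500"
}
```

In the real examples the workers are separate processes reached over the network; the in-process threads here only keep the sketch self-contained.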

The integration tests also give an idea of how to use the library and what can be achieved with it.
