Custom Apache Spark data sources using the Python Data Source API (Spark 4.0+). Learn by example and build your own data sources.
```bash
pip install pyspark-data-sources

# Install with specific extras
pip install pyspark-data-sources[faker]  # For FakeDataSource
pip install pyspark-data-sources[all]    # All optional dependencies
```

Requirements:

- Apache Spark 4.0+ or Databricks Runtime 15.4 LTS+
- Python 3.9-3.12
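If you want to confirm which PySpark version is on your path before starting, a quick sanity check (assuming a local `pip`-installed PySpark) is:

```python
import pyspark

# The Python Data Source API requires Spark 4.0 or newer
print(pyspark.__version__)
```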
```python
from pyspark.sql import SparkSession

from pyspark_datasources import FakeDataSource

# Create Spark session
spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

# Register the data source
spark.dataSource.register(FakeDataSource)

# Read batch data
df = spark.read.format("fake").option("numRows", 5).load()
df.show()
# +--------------+----------+-------+------------+
# |          name|      date|zipcode|       state|
# +--------------+----------+-------+------------+
# |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
# |Melissa Turner|1996-06-14|  30851|      Nevada|
# |  Brian Ramsey|2021-08-21|  55277|  Washington|
# |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
# | Douglas James|2007-01-18|  46226|     Alabama|
# +--------------+----------+-------+------------+

# Stream data
stream = spark.readStream.format("fake").load()
query = stream.writeStream.format("console").start()
```
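The console sink keeps printing micro-batches until the query is stopped, so in an interactive session you may want to stop it once you have seen some output:

```python
# Stop the streaming query started above
query.stop()
```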
| Data Source | Type | Description | Install |
|---|---|---|---|
| `fake` | Batch/Stream | Generate synthetic test data using Faker | `pip install pyspark-data-sources[faker]` |
| `github` | Batch | Read GitHub pull requests | Built-in |
| `googlesheets` | Batch | Read public Google Sheets | Built-in |
| `huggingface` | Batch | Load Hugging Face datasets | `[huggingface]` |
| `stock` | Batch | Fetch stock market data (Alpha Vantage) | Built-in |
| `opensky` | Batch/Stream | Live flight tracking data | Built-in |
| `kaggle` | Batch | Load Kaggle datasets | `[kaggle]` |
| `arrow` | Batch | Read Apache Arrow files | `[arrow]` |
| `lance` | Batch Write | Write Lance vector format | `[lance]` |
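Each source is registered and read the same way as the fake source. The snippet below is an illustrative sketch only: it assumes the package exports a `GithubDataSource` class and that the repository slug is passed as the load path, so check the detailed examples below for the exact usage and options.

```python
from pyspark_datasources import GithubDataSource

spark.dataSource.register(GithubDataSource)

# Read pull requests from a public repository (assumed usage; see the docs)
df = spark.read.format("github").load("apache/spark")
df.show()
```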
📚 See detailed examples for all data sources →
```python
from pyspark_datasources import FakeDataSource

spark.dataSource.register(FakeDataSource)

# Generate synthetic data with custom schema
df = spark.read.format("fake") \
    .schema("name string, email string, company string") \
    .option("numRows", 5) \
    .load()
df.show(truncate=False)
# +------------------+-------------------------+-----------------+
# |name              |email                    |company          |
# +------------------+-------------------------+-----------------+
# |Christine Sampson |[email protected]        |Hernandez-Nguyen |
# |Yolanda Brown     |[email protected]        |Miller-Hernandez |
# +------------------+-------------------------+-----------------+
```
Here's a minimal example to get started building your own data source:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


class MyCustomDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "mycustom"

    def schema(self):
        return StructType([
            StructField("id", IntegerType()),
            StructField("name", StringType())
        ])

    def reader(self, schema):
        return MyCustomReader(self.options, schema)


class MyCustomReader(DataSourceReader):
    def __init__(self, options, schema):
        self.options = options
        self.schema = schema

    def read(self, partition):
        # Your data reading logic here; yield tuples that match the schema
        for i in range(10):
            yield (i, f"name_{i}")


# Register and use
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("mycustom").load()
```
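Streaming sources follow the same pattern with a stream reader. The sketch below is not taken from this project: it assumes the `DataSourceStreamReader` interface in `pyspark.sql.datasource` (offsets as plain dicts, `partitions()` returning `InputPartition` objects) and uses a made-up counter source, so treat it as a starting point rather than a reference implementation.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class CounterStreamReader(DataSourceStreamReader):
    """Hypothetical reader that emits a fixed batch of rows per micro-batch."""

    def __init__(self):
        self.current = 0

    def initialOffset(self):
        # Offsets are plain dicts that Spark checkpoints between batches
        return {"offset": 0}

    def latestOffset(self):
        # Pretend ten new rows arrived since the last micro-batch
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the whole offset range
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition):
        first, last = partition.value
        for i in range(first, last):
            yield (i, f"name_{i}")


class MyStreamingDataSource(DataSource):
    @classmethod
    def name(cls):
        return "mystream"

    def schema(self):
        return "id int, name string"

    def streamReader(self, schema):
        return CounterStreamReader()
```

Registration is identical (`spark.dataSource.register(MyStreamingDataSource)`), after which `spark.readStream.format("mystream").load()` produces a streaming DataFrame.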
📖 Complete guide with advanced patterns →

- 📚 Data Sources Guide - Detailed examples for each data source
- 🔧 Building Data Sources - Complete tutorial with advanced patterns
- 📖 API Reference - Full API specification and method signatures
- 💻 Development Guide - Contributing and development setup
We welcome contributions! See our Development Guide for details.