
PySpark Data Sources


Custom Apache Spark data sources using the Python Data Source API (Spark 4.0+). Learn by example and build your own data sources.

Quick Start

Installation

pip install pyspark-data-sources

# Install with specific extras
pip install pyspark-data-sources[faker]   # for FakeDataSource
pip install pyspark-data-sources[all]     # all optional dependencies
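The Python Data Source API requires Spark 4.0+ (see Requirements below), which you can confirm before going further:

import pyspark
print(pyspark.__version__)  # should print 4.0.0 or newer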


Basic Usage

from pyspark.sql import SparkSession
from pyspark_datasources import FakeDataSource

# Create Spark session
spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

# Register the data source
spark.dataSource.register(FakeDataSource)

# Read batch data
df = spark.read.format("fake").option("numRows", 5).load()
df.show()
# +--------------+----------+-------+------------+
# |          name|      date|zipcode|       state|
# +--------------+----------+-------+------------+
# |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
# |Melissa Turner|1996-06-14|  30851|      Nevada|
# |  Brian Ramsey|2021-08-21|  55277|  Washington|
# |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
# | Douglas James|2007-01-18|  46226|     Alabama|
# +--------------+----------+-------+------------+

# Stream data
stream = spark.readStream.format("fake").load()
query = stream.writeStream.format("console").start()
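The streaming query runs until stopped; managing it uses the standard Structured Streaming API, nothing specific to FakeDataSource:

# Let the stream produce a few micro-batches, then shut it down
query.awaitTermination(10)  # block for up to 10 seconds
query.stop()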

Available Data Sources

Data Source    Type          Description                                 Install
fake           Batch/Stream  Generate synthetic test data using Faker    pip install pyspark-data-sources[faker]
github         Batch         Read GitHub pull requests                   Built-in
googlesheets   Batch         Read public Google Sheets                   Built-in
huggingface    Batch         Load Hugging Face datasets                  pip install pyspark-data-sources[huggingface]
stock          Batch         Fetch stock market data (Alpha Vantage)     Built-in
opensky        Batch/Stream  Live flight tracking data                   Built-in
kaggle         Batch         Load Kaggle datasets                        pip install pyspark-data-sources[kaggle]
arrow          Batch         Read Apache Arrow files                     pip install pyspark-data-sources[arrow]
lance          Batch Write   Write Lance vector format                   pip install pyspark-data-sources[lance]

📚 See detailed examples for all data sources →
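The built-in sources are registered the same way as FakeDataSource. Below is a hedged sketch for the GitHub source; the GithubDataSource class name and the "owner/repo" load() path are assumptions based on the package's naming pattern, so check the detailed examples for the exact usage:

from pyspark_datasources import GithubDataSource  # class name assumed

spark.dataSource.register(GithubDataSource)
# Path format "owner/repo" is an assumption
df = spark.read.format("github").load("apache/spark")
df.show()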

Example: Generate Fake Data

from pyspark_datasources import FakeDataSource

spark.dataSource.register(FakeDataSource)

# Generate synthetic data with custom schema
df = spark.read.format("fake") \
    .schema("name string, email string, company string") \
    .option("numRows", 5) \
    .load()

df.show(truncate=False)
# +------------------+-------------------------+-----------------+
# |name              |email                    |company          |
# +------------------+-------------------------+-----------------+
# |Christine Sampson |[email protected]|Hernandez-Nguyen |
# |Yolanda Brown     |[email protected]  |Miller-Hernandez |
# +------------------+-------------------------+-----------------+
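The two schemas above suggest how FakeDataSource resolves columns: each field name appears to map to the Faker generator of the same name. Assuming that holds, any string column named after a Faker provider should work:

# Field names assumed to map to Faker generators
# (faker.address(), faker.phone_number())
df = spark.read.format("fake") \
    .schema("address string, phone_number string") \
    .option("numRows", 3) \
    .load()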

Building Your Own Data Source

Here's a minimal example to get started:

from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

class MyCustomDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "mycustom"

    def schema(self):
        # Default schema, used when the caller does not supply one
        return StructType([
            StructField("id", IntegerType()),
            StructField("name", StringType())
        ])

    def reader(self, schema):
        return MyCustomReader(self.options, schema)

class MyCustomReader(DataSourceReader):
    def __init__(self, options, schema):
        self.options = options
        self.schema = schema

    def read(self, partition):
        # Your data reading logic here; yield one tuple per row,
        # matching the schema's column order
        for i in range(10):
            yield (i, f"name_{i}")

# Register and use (assumes an active SparkSession named `spark`)
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("mycustom").load()
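To parallelize reads, a reader can also implement partitions(); Spark then calls read() once per returned partition. A minimal sketch using InputPartition from the same module; the fixed four-way range split is illustrative, not part of the API:

from pyspark.sql.datasource import InputPartition

class MyPartitionedReader(DataSourceReader):
    def __init__(self, options, schema):
        self.options = options
        self.schema = schema

    def partitions(self):
        # One entry per parallel task; each value is handed back to read()
        return [InputPartition(i) for i in range(4)]

    def read(self, partition):
        # Each task yields only its own slice of rows
        start = partition.value * 10
        for i in range(start, start + 10):
            yield (i, f"name_{i}")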

📖 Complete guide with advanced patterns →

Documentation

Requirements

  • Apache Spark 4.0+ or Databricks Runtime 15.4 LTS+
  • Python 3.9-3.12

Contributing

We welcome contributions! See our Development Guide for details.

Resources