Custom Apache Spark data sources using the Python Data Source API (Spark 4.0+). Learn by example and build your own data sources.
```bash
pip install pyspark-data-sources

# Install with specific extras
pip install pyspark-data-sources[faker]  # For FakeDataSource
pip install pyspark-data-sources[all]    # All optional dependencies
```

Requirements:

- Apache Spark 4.0+ or Databricks Runtime 15.4 LTS+
- Python 3.9-3.12
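If you want to confirm which PySpark version is on your path before starting, a quick sanity check (assuming a local `pip`-installed PySpark) is:

```python
import pyspark

# The Python Data Source API requires Spark 4.0 or newer
print(pyspark.__version__)
```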
```python
from pyspark.sql import SparkSession

from pyspark_datasources import FakeDataSource

# Create Spark session
spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

# Register the data source
spark.dataSource.register(FakeDataSource)

# Read batch data
df = spark.read.format("fake").option("numRows", 5).load()
df.show()
# +--------------+----------+-------+------------+
# |          name|      date|zipcode|       state|
# +--------------+----------+-------+------------+
# |  Pam Mitchell|1988-10-20|  23788|   Tennessee|
# |Melissa Turner|1996-06-14|  30851|      Nevada|
# |  Brian Ramsey|2021-08-21|  55277|  Washington|
# |  Caitlin Reed|1983-06-22|  89813|Pennsylvania|
# | Douglas James|2007-01-18|  46226|     Alabama|
# +--------------+----------+-------+------------+

# Stream data
stream = spark.readStream.format("fake").load()
query = stream.writeStream.format("console").start()
```
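The console sink keeps printing micro-batches until the query is stopped, so in an interactive session you may want to stop it once you have seen some output:

```python
# Stop the streaming query started above
query.stop()
```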
| Data Source | Type | Description | Install |
|---|---|---|---|
| `fake` | Batch/Stream | Generate synthetic test data using Faker | `pip install pyspark-data-sources[faker]` |
| `github` | Batch | Read GitHub pull requests | Built-in |
| `googlesheets` | Batch | Read public Google Sheets | Built-in |
| `huggingface` | Batch | Load Hugging Face datasets | `[huggingface]` |
| `stock` | Batch | Fetch stock market data (Alpha Vantage) | Built-in |
| `opensky` | Batch/Stream | Live flight tracking data | Built-in |
| `kaggle` | Batch | Load Kaggle datasets | `[kaggle]` |
| `arrow` | Batch | Read Apache Arrow files | `[arrow]` |
| `lance` | Batch Write | Write Lance vector format | `[lance]` |
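Each source is registered and read the same way as the fake source. The snippet below is an illustrative sketch only: it assumes the package exports a `GithubDataSource` class and that the repository slug is passed as the load path, so check the detailed examples below for the exact usage and options.

```python
from pyspark_datasources import GithubDataSource

spark.dataSource.register(GithubDataSource)

# Read pull requests from a public repository (assumed usage; see the docs)
df = spark.read.format("github").load("apache/spark")
df.show()
```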
📚 See detailed examples for all data sources →
```python
from pyspark_datasources import FakeDataSource

spark.dataSource.register(FakeDataSource)

# Generate synthetic data with custom schema
df = spark.read.format("fake") \
    .schema("name string, email string, company string") \
    .option("numRows", 5) \
    .load()
df.show(truncate=False)
# +------------------+-------------------------+-----------------+
# |name              |email                    |company          |
# +------------------+-------------------------+-----------------+
# |Christine Sampson |[email protected]        |Hernandez-Nguyen |
# |Yolanda Brown     |[email protected]        |Miller-Hernandez |
# +------------------+-------------------------+-----------------+
```
Here's a minimal example to get started building your own data source:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


class MyCustomDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "mycustom"

    def schema(self):
        return StructType([
            StructField("id", IntegerType()),
            StructField("name", StringType())
        ])

    def reader(self, schema):
        return MyCustomReader(self.options, schema)


class MyCustomReader(DataSourceReader):
    def __init__(self, options, schema):
        self.options = options
        self.schema = schema

    def read(self, partition):
        # Your data reading logic here; yield tuples that match the schema
        for i in range(10):
            yield (i, f"name_{i}")


# Register and use
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("mycustom").load()
```
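Streaming sources follow the same pattern with a stream reader. The sketch below is not taken from this project: it assumes the `DataSourceStreamReader` interface in `pyspark.sql.datasource` (offsets as plain dicts, `partitions()` returning `InputPartition` objects) and uses a made-up counter source, so treat it as a starting point rather than a reference implementation.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class CounterStreamReader(DataSourceStreamReader):
    """Hypothetical reader that emits a fixed batch of rows per micro-batch."""

    def __init__(self):
        self.current = 0

    def initialOffset(self):
        # Offsets are plain dicts that Spark checkpoints between batches
        return {"offset": 0}

    def latestOffset(self):
        # Pretend ten new rows arrived since the last micro-batch
        self.current += 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the whole offset range
        return [InputPartition((start["offset"], end["offset"]))]

    def read(self, partition):
        first, last = partition.value
        for i in range(first, last):
            yield (i, f"name_{i}")


class MyStreamingDataSource(DataSource):
    @classmethod
    def name(cls):
        return "mystream"

    def schema(self):
        return "id int, name string"

    def streamReader(self, schema):
        return CounterStreamReader()
```

Registration is identical (`spark.dataSource.register(MyStreamingDataSource)`), after which `spark.readStream.format("mystream").load()` produces a streaming DataFrame.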
📖 Complete guide with advanced patterns →

- 📚 Data Sources Guide - Detailed examples for each data source
- 🔧 Building Data Sources - Complete tutorial with advanced patterns
- 📖 API Reference - Full API specification and method signatures
- 💻 Development Guide - Contributing and development setup
We welcome contributions! See our Development Guide for details.