A Python library for validating pandas DataFrames using schemas, with support for type checking, custom validators, and function input/output validation.
- Define schemas for pandas DataFrames with type checking and validation
- Support for custom validators (e.g., IsPositive, IsNonEmptyString, Range, etc.)
- Function decorator for validating input and output DataFrames
- PyArrow type integration for efficient type checking
- Schema inference from existing DataFrames
- Nullable column support
- Comprehensive type mapping between Python, pandas, and PyArrow types
pip install pdschema
poetry add pdschema
import pandas as pd
from pdschema import Column, IsNonEmptyString, IsPositive, Range, Schema
# Create a DataFrame
df = pd.DataFrame(
    {
        "idx": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85.5, 92.0, 78.5],
    }
)
# Define a schema using programmatic syntax
schema = Schema(
    [
        Column("idx", int, nullable=False),
        Column("name", str, nullable=False, validators=[IsNonEmptyString()]),
        Column("age", int, validators=[IsPositive()]),
        Column("score", float, validators=[Range(0, 100)]),
    ]
)
# Validate the DataFrame: raises ValueError if validation fails
schema.validate(df)
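Because `validate` raises `ValueError` on failure, callers will usually wrap it in `try`/`except`. The sketch below shows that pattern without importing pdschema. `check_column` and `is_valid` are hypothetical stand-ins that apply a nullability check, a type check, and an optional predicate to a plain list. They are assumptions for illustration, not the library's implementation.

```python
from typing import Any, Callable, Optional, Sequence


def check_column(
    name: str,
    values: Sequence[Any],
    dtype: type,
    nullable: bool = True,
    predicate: Optional[Callable[[Any], bool]] = None,
) -> None:
    """Hypothetical stand-in for a column check; raises ValueError on the first problem."""
    for i, value in enumerate(values):
        if value is None:
            if not nullable:
                raise ValueError(f"column {name!r}: null at row {i}")
            continue  # nulls in a nullable column skip the remaining checks
        if not isinstance(value, dtype):
            raise ValueError(f"column {name!r}: row {i} is not {dtype.__name__}")
        if predicate is not None and not predicate(value):
            raise ValueError(f"column {name!r}: row {i} failed validation")


def is_valid(name: str, values: Sequence[Any], dtype: type, **kwargs: Any) -> bool:
    """Turn the raise-on-failure contract into a boolean for callers that prefer one."""
    try:
        check_column(name, values, dtype, **kwargs)
    except ValueError:
        return False
    return True
```

With the real library, the same `try`/`except ValueError` wrapper would go around `schema.validate(df)`.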
# Declarative Schema Definition
class MySchema(Schema):
    idx = Column(dtype=int, nullable=False)
    name = Column(dtype=str, nullable=False, validators=[IsNonEmptyString()])
    age = Column(dtype=int, nullable=False, validators=[IsPositive()])
    score = Column(dtype=float, nullable=False, validators=[Range(0, 100)])
MySchema().validate(df)
Use the @pdfunction decorator to validate function inputs and outputs:
from pdschema import pdfunction
@pdfunction(
    arguments={
        "df": Schema([Column("id", int), Column("value", float)]),
        "threshold": float,
    },
    outputs={
        "result": Schema([Column("id", int), Column("filtered_value", float)]),
    },
)
def filter_values(df, threshold):
    result = df[df["value"] > threshold]
    return {"result": result}
The package includes many built-in validators:
- IsPositive: Ensures numeric values are positive
- IsNonEmptyString: Ensures strings are non-empty
- Max: Ensures values are less than or equal to a maximum
- Min: Ensures values are greater than or equal to a minimum
- GreaterThan: Ensures values are greater than a threshold
- GreaterThanOrEqual: Ensures values are greater than or equal to a threshold
- LessThan: Ensures values are less than a threshold
- LessThanOrEqual: Ensures values are less than or equal to a threshold
- Choice: Ensures values are in a list of allowed choices
- Length: Ensures values have a specific length or length range
- Range: Ensures values are within a range
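Several of these validators differ only in whether the bound is inclusive or strict. As a rough illustration of those semantics, the plain-Python predicates below mirror a few of the descriptions. The function names are hypothetical, and this is not how pdschema implements its validators. `in_range` assumes both ends are inclusive, which the descriptions here do not state.

```python
from typing import Any, Callable, Iterable

Predicate = Callable[[Any], bool]


def at_most(maximum: Any) -> Predicate:
    """Mirrors the Max description: value <= maximum (inclusive bound)."""
    return lambda value: value <= maximum


def less_than(threshold: Any) -> Predicate:
    """Mirrors the LessThan description: value < threshold (strict bound)."""
    return lambda value: value < threshold


def choice(allowed: Iterable[Any]) -> Predicate:
    """Mirrors the Choice description: value must be one of the allowed items."""
    allowed_set = set(allowed)
    return lambda value: value in allowed_set


def in_range(low: Any, high: Any) -> Predicate:
    """Mirrors the Range description; inclusive ends are an assumption."""
    return lambda value: low <= value <= high
```

The practical difference shows up at the boundary: `at_most(10)` accepts 10, while `less_than(10)` rejects it.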
You can infer a schema from an existing DataFrame:
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})
schema = Schema.infer_schema(df)
This project is licensed under the MIT License - see the LICENSE file for details.