# Beginner's Guide to Writing a Production PySpark ETL Pipeline
```
.
|-- job_config.json   # contains the source and sink information
|-- main.py           # driver program from where the execution starts
|-- src               # module containing the actual ETL code
|   |-- __init__.py
|   |-- transform.py  # place to write your transformations (business logic)
|   `-- utils.py      # place to write utility and helper functions for reading, writing, etc.
|-- tests             # module to write test cases
|   |-- __init__.py
|   |-- conftest.py
|   |-- resources
|   |   |-- expected
|   |   |   `-- fact_joined_with_lookup.csv
|   |   `-- input
|   |       |-- citytier_pincode.csv
|   |       `-- orders_data.csv
|   `-- test_src
|       |-- test_transform.py
|       `-- test_utils.py
|-- Makefile          # used to build the egg file
|-- setup.py          # used to build the egg file
`-- README.md
```
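The driver accepts a `--config_file_name` and `--env` flag (as shown in the submit commands below). A minimal sketch of what `main.py`'s entry point might look like; the flag names match this README, but the function names and config schema are assumptions, not the repo's actual code:

```python
import argparse
import json


def parse_args(argv=None):
    # Parse the flags passed after "--" in the spark-submit / gcloud command.
    parser = argparse.ArgumentParser(description="PySpark ETL driver")
    parser.add_argument("--config_file_name", required=True,
                        help="Path to the job config JSON (e.g. job_config.json)")
    parser.add_argument("--env", default="dev",
                        help="Environment key to pick from the config (e.g. dev)")
    return parser.parse_args(argv)


def load_config(path):
    # Read the source/sink definitions from the JSON config file.
    with open(path) as f:
        return json.load(f)


def main(argv=None):
    args = parse_args(argv)
    config = load_config(args.config_file_name)
    # ...create a SparkSession, read the sources, apply transforms from
    # src/transform.py, and write to the sink defined in the config...
```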
## Prerequisites to run the job
- Upload the files under `tests/resources/input/` to a GCS bucket.
- Update `job_config.json` with the source and sink locations.
Note: If you are testing
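The exact schema of `job_config.json` is defined by the repo itself; a hypothetical shape consistent with the input/expected files above (the keys and bucket path are assumptions):

```json
{
  "dev": {
    "source": {
      "orders_data": "gs://<YOUR_BUCKET>/input/orders_data.csv",
      "citytier_pincode": "gs://<YOUR_BUCKET>/input/citytier_pincode.csv"
    },
    "sink": {
      "fact_joined_with_lookup": "gs://<YOUR_BUCKET>/output/"
    }
  }
}
```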
## Package code into a .egg file

```shell
$ cd ETLInPySpark
$ make build
```
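The `build` target most likely wraps `setup.py`; a minimal sketch of what the Makefile target could contain, assuming a standard setuptools project (the repo's actual Makefile may differ):

```make
# Hypothetical target: builds dist/etlinpyspark-0.0.1-py3.9.egg
build:
	python setup.py bdist_egg
```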
## Submit job: GCP

```shell
$ cd ETLInPySpark
$ gcloud dataproc jobs submit pyspark --cluster=<CLUSTER_NAME> --region=<REGION> \
    --py-files ./dist/etlinpyspark-0.0.1-py3.9.egg \
    --files job_config.json \
    --properties=spark.submit.deployMode=cluster main.py \
    -- --config_file_name job_config.json --env dev
```
## Submit job: Local

```shell
$ cd ETLInPySpark
$ spark-submit main.py --config_file_name job_config.json --env dev
```
## Run tests: Local

```shell
$ cd ETLInPySpark
$ pytest
```
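The test suite presumably builds on a shared SparkSession fixture in `tests/conftest.py`. A minimal sketch of that pattern; the fixture name and the `rows_equal` helper are illustrative assumptions, not the repo's actual code:

```python
import pytest


@pytest.fixture(scope="session")
def spark():
    # Import inside the fixture so test collection still works
    # on machines without PySpark installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master("local[2]")
        .appName("etl-tests")
        .getOrCreate()
    )


def rows_equal(actual_rows, expected_rows):
    # Order-insensitive comparison of collected rows, useful when asserting
    # a transformed DataFrame against an expected CSV such as
    # tests/resources/expected/fact_joined_with_lookup.csv.
    return sorted(map(tuple, actual_rows)) == sorted(map(tuple, expected_rows))
```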
## About

Sample project showing how to get started writing production ETL pipelines in PySpark.