This guide provides detailed steps to set up and automate the data processing pipeline using Apache Airflow, Docker, and AWS services. Follow the instructions carefully to ensure a smooth setup.
- Configure the `nse_data_dumper.py` file.
- Check `config.json` and `output_data/symbol_data.parquet` (if present) for the required configuration and the data already available (see the inspection sketch below).
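If you want to verify the configuration and dumped data programmatically, a minimal sketch could look like the following. It assumes `pandas` and `pyarrow` are installed and that the files live at the paths named above:

```python
import json

import pandas as pd  # parquet support also requires pyarrow (or fastparquet)

# Load the pipeline configuration
with open("config.json") as f:
    config = json.load(f)
print(config)

# Peek at the previously dumped symbol data, if it exists
try:
    df = pd.read_parquet("output_data/symbol_data.parquet")
    print(df.head())
except FileNotFoundError:
    print("No symbol data has been dumped yet")
```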
- Edit the `docker-compose.yml` file to ensure volumes correctly point to the scripts and data.
- Modify the DAG script to reference the correct location of your Python script within the Docker container.
- Add a `REQUIREMENT.TXT` to the Docker Compose file if needed; this approach is feasible for small projects.
- Create a `.env` file with the following content:

  ```
  AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
  AIRFLOW_UID=50000
  AWS_ACCESS_KEY_ID=
  AWS_SECRET_ACCESS_KEY=
  AWS_DEFAULT_REGION=ap-south-1
  ```

- Reference these variables in the Docker Compose file for environment setup.
- Install `boto3` or add it to the requirements file for AWS interaction (a minimal usage sketch follows).
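For example, a minimal `boto3` sketch that uploads the dumped parquet file to S3. The bucket and key here are assumptions taken from the `s3a://symboldatabucket86/symboldata/all_symbols_data.parquet` path used later in this guide; adjust them to your own bucket:

```python
import boto3

# Credentials are picked up from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY /
# AWS_DEFAULT_REGION defined in the .env file (or from an IAM role on EC2).
s3 = boto3.client("s3")

# Upload the locally dumped symbol data to the bucket read by Spark/Athena later.
s3.upload_file(
    Filename="output_data/symbol_data.parquet",
    Bucket="symboldatabucket86",
    Key="symboldata/all_symbols_data.parquet",
)
print("Upload complete")
```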
- To use Spark instead of Athena, add the Spark image to your `docker-compose.yml`.
- Ensure AWS access keys are added to the environment variables.
- Create a network to allow communication between containers.
- If errors occur while reading data from S3 using PySpark, verify your AWS configuration.
- Consider adding the necessary JAR files or editing `spark-defaults.conf` (a code-level alternative is sketched below).
- Use the `s3a://` prefix as required by Hadoop AWS:

  ```python
  # Define the S3 path
  s3_path = "s3a://symboldatabucket86/symboldata/all_symbols_data.parquet"

  # Read data from S3
  df = spark.read.parquet(s3_path)
  ```
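One common alternative to editing `spark-defaults.conf` is to configure the S3A connector when building the SparkSession. This is only a sketch, not the repo's actual setup: the `hadoop-aws` and `aws-java-sdk-bundle` versions are illustrative and must match the Hadoop version bundled with your Spark image, and credentials come from the environment variables defined in `.env`:

```python
import os

from pyspark.sql import SparkSession

# Pull the AWS connector JARs at startup instead of editing spark-defaults.conf.
# NOTE: the package versions below are assumptions; match them to your Spark/Hadoop build.
spark = (
    SparkSession.builder.appName("nse-symbol-data")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.2,com.amazonaws:aws-java-sdk-bundle:1.11.1026",
    )
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

df = spark.read.parquet("s3a://symboldatabucket86/symboldata/all_symbols_data.parquet")
df.printSchema()
```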
- If deploying on EC2, leverage IAM roles for permissions and avoid hardcoding keys.
- Update `scripts/nse_data_dumper.py`, replacing `symbols[:5]` with `symbol` on line 182.
- Set up Athena and configure the query output bucket.
- Create the necessary databases and tables using the script `scripts/data_processing/athena_tablecreation.sql`.
- Create an Athena DAG using a `PythonOperator` (see the sketch below).
- Use `scripts/data_processing/athena_proccessing.sql` for the query.
- Set the query output bucket to `s3://athena-result-bucket-ankit86/codeoutput/`.
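A minimal sketch of such a DAG is shown below. It assumes the SQL file is available inside the container at `scripts/data_processing/athena_proccessing.sql`, and the database name `nse_data` is a placeholder you should replace with the one created by the table-creation script:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def run_athena_query():
    # Read the processing query shipped with the repo.
    with open("scripts/data_processing/athena_proccessing.sql") as f:
        query = f.read()

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=query,
        # "nse_data" is a hypothetical database name; use the one created by
        # athena_tablecreation.sql.
        QueryExecutionContext={"Database": "nse_data"},
        ResultConfiguration={
            "OutputLocation": "s3://athena-result-bucket-ankit86/codeoutput/"
        },
    )


with DAG(
    dag_id="athena_processing",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_athena_query", python_callable=run_athena_query)
```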
- Create an SNS topic and subscribe emails for alerts (a boto3 sketch follows). Don't forget that each recipient must confirm the subscription via the verification email.
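If you prefer to script this step, a minimal sketch could look like the following; the topic name and email address are placeholders:

```python
import boto3

sns = boto3.client("sns")

# Create the alert topic; "nse-pipeline-alerts" is a placeholder name.
topic = sns.create_topic(Name="nse-pipeline-alerts")

# Subscribe an email address; the recipient must confirm the subscription
# via the verification email before messages are delivered.
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="you@example.com",
)
```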
- Create a Lambda function using the Python runtime.
- Copy the code from `scripts/data_processing/lambda_message.py` into the Lambda function editor.
- Attach a role to your Lambda function granting necessary permissions like S3 read and SNS publish.
- Configure the S3 bucket (`s3://athena-result-bucket-ankit86/codeoutput/`) to trigger the Lambda function on new object creation events (a minimal handler sketch is shown below).
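For orientation only (the actual code lives in `scripts/data_processing/lambda_message.py`), a handler along these lines would publish an SNS alert whenever a new Athena result object lands in the bucket; the topic ARN is a placeholder:

```python
import boto3

sns = boto3.client("sns")

# Placeholder ARN: replace with the topic created in the SNS step.
TOPIC_ARN = "arn:aws:sns:ap-south-1:123456789012:nse-pipeline-alerts"


def lambda_handler(event, context):
    # The S3 "ObjectCreated" trigger delivers one or more records per event.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Athena query results available",
            Message=f"New object created: s3://{bucket}/{key}",
        )
    return {"statusCode": 200}
```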
Ensure all steps are followed correctly to set up a robust data processing pipeline. Happy Automating!



