- Stripe: Stripe's API is used as the source for generating realistic transaction data (see the generator sketch after this list).
- Apache Kafka: Stripe transaction data is streamed into Apache Kafka, which handles the data streams and ensures they are processed in real time (see the producer sketch after this list).
- Apache ZooKeeper: ZooKeeper is used alongside Kafka to manage and coordinate the Kafka brokers. It helps maintain configuration information, naming, synchronization, and group services.
- PySpark: The data from Kafka is processed with PySpark Structured Streaming, which transforms individual transaction messages into rows fit for a database (see the streaming sketch after this list).
- Forex API: GBP/x exchange rates are fetched from an online API and refreshed every 24 hours (see the rate-fetch sketch after this list).
- PostgreSQL: After processing, the data is stored in PostgreSQL (see the JDBC write sketch after this list).
- dbt (Data Build Tool): dbt is used to manage and transform the data within PostgreSQL, splitting it into dimension and fact tables.
- Grafana: Finally, the data stored in PostgreSQL is visualized with Grafana.
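
Below is a minimal generator sketch of how such test transactions could be created with Stripe's Python SDK; the amount range, currency, and the `STRIPE_API_KEY` environment variable name are illustrative assumptions rather than the project's actual settings.

```python
# Sketch: generate a synthetic test-mode transaction with Stripe's Python SDK.
import os
import random

import stripe

# Test-mode secret key; the STRIPE_API_KEY variable name is an assumption.
stripe.api_key = os.environ["STRIPE_API_KEY"]

def create_test_transaction() -> dict:
    """Create a test-mode PaymentIntent and return it as a plain dict."""
    intent = stripe.PaymentIntent.create(
        amount=random.randint(100, 10_000),  # smallest currency unit, e.g. pence
        currency="gbp",
        payment_method_types=["card"],
    )
    return dict(intent)

if __name__ == "__main__":
    print(create_test_transaction()["id"])
```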
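
A producer sketch of how those transactions might be pushed into Kafka using the `kafka-python` client; the broker address (`localhost:9092`) and topic name (`transactions`) are assumptions.

```python
# Sketch: publish transaction payloads to a Kafka topic with kafka-python.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)

def publish_transaction(transaction: dict, topic: str = "transactions") -> None:
    """Send one transaction to Kafka and block until the broker acknowledges it."""
    producer.send(topic, value=transaction)
    producer.flush()
```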
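
A streaming sketch of the PySpark step, reading the assumed `transactions` topic and flattening each JSON payload into typed columns; the simplified schema is an assumption, and running it requires the `spark-sql-kafka` package on the Spark classpath.

```python
# Sketch: read the Kafka topic with Spark Structured Streaming and
# flatten each JSON transaction into typed columns.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("transaction-stream").getOrCreate()

# Illustrative subset of a transaction payload; the real schema has more fields.
transaction_schema = StructType([
    StructField("id", StringType()),
    StructField("amount", LongType()),
    StructField("currency", StringType()),
    StructField("created", LongType()),
])

raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka values arrive as bytes; cast to string and parse the JSON payload into columns.
transactions = (
    raw_stream
    .select(from_json(col("value").cast("string"), transaction_schema).alias("tx"))
    .select("tx.*")
)
```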
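
A rate-fetch sketch of pulling GBP/x rates every 24 hours with `requests`; the endpoint URL and response shape are hypothetical placeholders, since the specific Forex API is not named here.

```python
# Sketch: fetch GBP/x exchange rates once every 24 hours.
import time

import requests

FOREX_URL = "https://example-forex-api.com/latest?base=GBP"  # placeholder endpoint

def fetch_gbp_rates() -> dict:
    """Return a mapping like {"USD": 1.27, "EUR": 1.17, ...} (shape assumed)."""
    response = requests.get(FOREX_URL, timeout=10)
    response.raise_for_status()
    return response.json()["rates"]

if __name__ == "__main__":
    while True:
        rates = fetch_gbp_rates()
        print(f"Fetched {len(rates)} GBP rates")
        time.sleep(24 * 60 * 60)  # refresh every 24 hours
```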
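
A JDBC write sketch of persisting the parsed stream into PostgreSQL from Spark via `foreachBatch`; the connection URL, credentials, table name, and checkpoint path are assumptions, and `transactions` refers to the DataFrame from the streaming sketch above.

```python
# Sketch: write each micro-batch of the parsed stream to PostgreSQL
# via Spark's JDBC writer.
def write_to_postgres(batch_df, batch_id: int) -> None:
    (
        batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/transactions_db")
        .option("dbtable", "raw_transactions")
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = (
    transactions.writeStream          # `transactions` from the streaming sketch above
    .foreachBatch(write_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/transactions")
    .start()
)
query.awaitTermination()
```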
 
 
- add PySpark tests
- use an orchestrator
- use more data
- Spark might be overkill given the current data volume limits, but I wanted to learn how to set up Spark Streaming in case the data grows much larger
- more data would also make for a better Grafana visualisation
- maybe find an alternative transaction data source, because the Stripe API has a rate limit of 25 requests per second
- also, many of the generated values in a transaction from the Stripe API are null
 
- `git clone https://github.com/divakaivan/transaction-stream-data-pipeline.git`
- Rename `sample.env` to `.env` and fill in the necessary environment variables
- Type `make` in the terminal to see the setup options
```
Usage: make [option]
Options:
  help                 Show this help message
  build                Build docker services
  start                Start docker services (detached mode)
  stop                 Stop docker services
  dbt-test             Run dbt with LIMIT 100
  dbt-full             Run dbt with full data
```
