The "Ecommerce Website Transaction" project is a collaborative effort between two teams to simulate and analyze ecommerce transaction data. Both teams generate a vast amount of simulated data based on a predefined schema, stream it to each other, clean and transform the data, and finally perform analytical queries. The results are visualized using Zeppelin and presented to an audience with diverse backgrounds.
Our team's primary objective is to analyze the other team's data and identify trends and patterns in it. Our simulation also deliberately injects "bad data" by replacing the values of selected columns with unrelated data, which tests the receiving team's cleaning and transformation process.
The schema used to generate the transaction data includes:
- order_id: Order ID
- customer_id: Customer ID
- customer_name: Customer Name
- product_id: Product ID
- product_name: Product Name
- product_category: Product Category
- payment_type: Payment Type
- qty: Quantity Ordered
- price: Price of Product
- datetime: Date & Time when Order was Placed
- country: Customer Country
- city: Customer City
- ecommerce_website_name: Site where Order was Placed
- payment_txn_id: Payment Transaction ID
- payment_txn_success: Payment Success/Failure
- failure_reason: Reason for Payment Failure
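As a rough illustration, the schema above could be modeled in Scala as a case class. This is only a sketch: the names `Transaction` and `TransactionSchema` are hypothetical, and every field type (for example `Boolean` for `payment_txn_success` or `Option[String]` for `failure_reason`) is an assumption, since the schema lists only column names and descriptions.

```scala
// Hypothetical model of one transaction row.
// All field types are assumptions; the schema only names the columns.
final case class Transaction(
  orderId: String,
  customerId: String,
  customerName: String,
  productId: String,
  productName: String,
  productCategory: String,
  paymentType: String,
  qty: Int,
  price: BigDecimal,
  datetime: String,              // kept as text; the timestamp format is not specified
  country: String,
  city: String,
  ecommerceWebsiteName: String,
  paymentTxnId: String,
  paymentTxnSuccess: Boolean,
  failureReason: Option[String]  // assumed empty when the payment succeeded
)

object TransactionSchema {
  // Column names in schema order, e.g. for a CSV header line.
  val columns: Vector[String] = Vector(
    "order_id", "customer_id", "customer_name", "product_id",
    "product_name", "product_category", "payment_type", "qty",
    "price", "datetime", "country", "city",
    "ecommerce_website_name", "payment_txn_id",
    "payment_txn_success", "failure_reason"
  )

  // Header line for a comma-separated file.
  val csvHeader: String = columns.mkString(",")
}
```

Keeping the column names in one place means the generator and the cleaning step can agree on column order without repeating the list.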
The data generator has the following features:
- Generates over 2 million rows of transaction data spanning 10 years.
- Uses base data on products, companies, and customers stored in files for transaction generation.
- Highly customizable data generation.
- Generates customers from over 20 different countries with region-accurate names.
- Converts transaction prices from USD to the customer's local currency.
- Introduces bad data at a rate of 3% for testing and validation.
- Simulates logistic growth for each company at different rates.
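Two of these features, logistic growth and the 3% bad-data rate, can be sketched in a few lines of plain Scala. This is not the project's generator: the object `GeneratorSketch`, its parameter names, the placeholder value `###BAD###`, and the choice to corrupt one randomly chosen column are all assumptions made for illustration.

```scala
import scala.util.Random

object GeneratorSketch {
  // Standard logistic curve: expected order volume at time t.
  // cap  = upper limit the company's volume approaches (assumed parameter)
  // rate = growth rate, which would differ per company
  // mid  = time at which volume reaches half of cap
  def logistic(t: Double, cap: Double, rate: Double, mid: Double): Double =
    cap / (1.0 + math.exp(-rate * (t - mid)))

  // With probability badRate, overwrite one column of the row with a
  // value that does not belong there; otherwise return the row unchanged.
  // Picking the column at random is an assumption of this sketch.
  def maybeCorrupt(
      row: Vector[String],
      rng: Random,
      badRate: Double = 0.03
  ): Vector[String] =
    if (row.nonEmpty && rng.nextDouble() < badRate)
      row.updated(rng.nextInt(row.length), "###BAD###")
    else row
}
```

For example, `GeneratorSketch.logistic(t, cap = 1000.0, rate = 1.2, mid = 5.0)` stays close to zero in the early years, reaches half of `cap` at year 5, and levels off towards `cap` afterwards. Passing a seeded `Random` to `maybeCorrupt` keeps a generated data set reproducible.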
The project workflow is as follows:
- Both teams generate transaction data based on the schema.
- Data is streamed to the opposite team via Kafka and stored in a CSV file.
- Each team cleans and transforms the received data.
- Analytical queries are performed on the cleaned data.
- Results are visualized using Tableau and Zeppelin.
- Teams come together to share findings and present to a mixed audience.
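The cleaning step in this workflow could look roughly like the following check on one received CSV line. It is a minimal sketch in plain Scala rather than the project's Spark job: `CleaningSketch`, the specific validation rules, and the column positions (taken from the schema order, with `qty` at index 7 and `price` at index 8) are assumptions, and the naive comma split does not handle quoted fields that contain commas.

```scala
import scala.util.Try

object CleaningSketch {
  // Number of columns in the transaction schema.
  val ExpectedColumns = 16

  // Returns Some(fields) when the line passes basic sanity checks,
  // None when it looks like bad data and should be dropped.
  def clean(line: String): Option[Vector[String]] = {
    // limit -1 keeps trailing empty fields such as a blank failure_reason
    val fields = line.split(",", -1).map(_.trim).toVector
    val ok =
      fields.length == ExpectedColumns &&
      fields(0).nonEmpty &&                                     // order_id is present
      Try(fields(7).toInt).toOption.exists(_ > 0) &&            // qty is a positive integer
      Try(BigDecimal(fields(8))).toOption.exists(_.signum >= 0) // price is a non-negative number
    if (ok) Some(fields) else None
  }

  // Keeps only the rows that pass the checks.
  def cleanAll(lines: Seq[String]): Seq[Vector[String]] =
    lines.flatMap(line => clean(line))
}
```

Returning `Option` makes it easy to count how many rows were rejected, which is one way to compare the observed bad-data rate against the rate the sending team reports.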
Technologies used:
- Apache Spark
- Spark SQL
- Kafka
- Scala 2.12.11
- Zeppelin
A big thank you to all our contributors who made this project possible: