
Trendyol Group Data Streaming Case Study

For Turkish documentation, see: README.tr.md


Overview

This repository provides a production-grade, scalable, and observable Kafka platform using modern Platform Engineering practices. Infrastructure automation, secure Kafka cluster setup, observability, REST API management, and distributed Kafka Connect operations are fully automated end-to-end. All components are designed for reliability, maintainability, and self-service management.


Case Study Summary

You are expected to design and manage a Kafka infrastructure focused on high availability, observability, automation, and reproducibility for production environments. All operations, tests, and scenarios are performed on a single central Kafka cluster.

Duration: 7 days

Scenario:

  • A single Kafka cluster is set up as the foundation for the entire platform.
  • All code must be production quality, well-documented, and version-controlled.
  • Best practices for infrastructure automation, containerization, and monitoring are applied.

Architecture Overview

The following diagram shows all components of the Trendyol Kafka Platform architecture:

```mermaid
graph TB
    subgraph AUTO["Automation"]
        GHA[GitHub Actions]
        TF[Terraform]
        ANS[Ansible]
    end

    subgraph CLUSTER["Kafka Cluster"]
        subgraph BROKERS["Brokers (4 nodes)"]
            B1[Broker 1 - AZ1]
            B2[Broker 2 - AZ1]
            B3[Broker 3 - AZ2]
            B4[Broker 4 - AZ2]
        end
        
        subgraph CONTROLLERS["Controllers (3 nodes)"]
            C1[Controller 1 - AZ1]
            C2[Controller 2 - AZ2]
            C3[Controller 3 - AZ3]
        end
        
        SSL[SSL/TLS + SASL/SCRAM]
        SSL -.secure.-> BROKERS
        SSL -.secure.-> CONTROLLERS
        BROKERS <-.cluster.-> CONTROLLERS
    end

    subgraph SERVICES["Services"]
        CN1[Kafka Connect 1<br/>AZ1]
        CN2[Kafka Connect 2<br/>AZ1]
        API[REST API<br/>JWT Protected]
        MON[Monitoring<br/>Prometheus + Grafana]
    end

    GHA --> TF
    GHA --> ANS
    TF --> CLUSTER
    TF --> SERVICES
    ANS --> CLUSTER
    ANS --> SERVICES
    
    CN1 --> SSL --> BROKERS
    CN2 --> SSL
    API --> SSL
    MON -.metrics.-> CLUSTER

    style BROKERS fill:#E3F2FD,stroke:#1976D2,stroke-width:3px
    style CONTROLLERS fill:#F3E5F5,stroke:#7B1FA2,stroke-width:3px
    style B1 fill:#4A90E2,stroke:#333,stroke-width:2px
    style B2 fill:#4A90E2,stroke:#333,stroke-width:2px
    style B3 fill:#4A90E2,stroke:#333,stroke-width:2px
    style B4 fill:#4A90E2,stroke:#333,stroke-width:2px
    style C1 fill:#7B68EE,stroke:#333,stroke-width:2px
    style C2 fill:#7B68EE,stroke:#333,stroke-width:2px
    style C3 fill:#7B68EE,stroke:#333,stroke-width:2px
    style CN1 fill:#FF8C42,stroke:#333,stroke-width:2px
    style CN2 fill:#FF8C42,stroke:#333,stroke-width:2px
    style API fill:#F39C12,stroke:#333,stroke-width:2px
    style MON fill:#50C878,stroke:#333,stroke-width:2px
    style SSL fill:#C0392B,stroke:#333,stroke-width:3px
    style TF fill:#7B42BC,stroke:#333,stroke-width:2px
    style ANS fill:#E74C3C,stroke:#333,stroke-width:2px
    style GHA fill:#2C3E50,stroke:#333,stroke-width:2px
```

Infrastructure (Terraform)

  • Modular, reusable, and scalable AWS infrastructure.
  • Modules: network, compute, security.
  • Resources:
    • Kafka Broker: 4 nodes (2 in AZ1, 2 in AZ2)
    • Kafka Controller: 3 nodes (one each in AZ1, AZ2, AZ3)
    • Kafka Connect Cluster: 2 nodes (both in AZ1)
    • Observability Node: 1 node (AZ1)
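
As a hedged illustration, the resulting layout can be sanity-checked from the AWS API after `terraform apply`. The region and the `Role` tag convention below are assumptions for the sketch, not necessarily what terraform/ actually uses:

```python
# Count running instances per (role, availability zone) to verify the
# 4-broker / 3-controller multi-AZ layout. "Role" is a hypothetical tag.
import boto3
from collections import Counter

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)

layout = Counter()
for page in pages:
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            role = tags.get("Role", "untagged")  # hypothetical tag convention
            layout[(role, inst["Placement"]["AvailabilityZone"])] += 1

for (role, az), count in sorted(layout.items()):
    print(f"{role:<12} {az}: {count} node(s)")
```
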
Kafka Cluster Provisioning (Ansible)

  • Automated Kafka broker and controller setup with the Confluent Platform Ansible Collection.
  • Production-grade security: SSL/TLS encryption and SASL/SCRAM authentication.
  • Rack awareness and multi-AZ distribution: brokers and controllers are spread across availability zones, so replicas land in different zones and the cluster tolerates the loss of a zone without data loss.
  • JMX metrics enabled for monitoring.
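
Clients must therefore connect over the secured listener. A minimal connectivity sketch with confluent-kafka-python, where the hosts, credentials, CA path, and SCRAM variant are placeholder assumptions:

```python
# Produce a single message over the SASL_SSL listener to confirm that
# TLS and SCRAM authentication are wired up correctly.
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "broker-1:9092,broker-2:9092",  # placeholder hosts
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-512",   # assumed SCRAM variant
    "sasl.username": "kafka-client",      # placeholder credential
    "sasl.password": "change-me",         # placeholder credential
    "ssl.ca.location": "/etc/kafka/secrets/ca.crt",  # assumed CA path
}

producer = Producer(conf)
producer.produce("healthcheck", value=b"ping")  # assumes the topic exists
producer.flush(10)
```
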
Observability

  • Prometheus: collects metrics from all Kafka and Connect nodes (JMX exporter and node_exporter).
  • Alertmanager: centralized alert management for infrastructure and Kafka events.
  • Grafana: ready-made dashboards for Kafka Broker, Controller, and Connect.
  • Node Exporter: installed on all nodes for system metrics.
  • All observability components are installed and managed automatically with Ansible.
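
Once Prometheus is scraping the JMX exporters, cluster health can also be checked programmatically over the standard Prometheus HTTP API. A minimal sketch; the host and the metric name (which depends on the JMX exporter's relabeling rules) are assumptions:

```python
# Query Prometheus for under-replicated partitions per broker.
import requests

PROM_URL = "http://observability-node:9090"  # placeholder host

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "kafka_server_replicamanager_underreplicatedpartitions"},
    timeout=5,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    broker = result["metric"].get("instance", "unknown")
    print(broker, "->", result["value"][1])  # value[1] is the sample as a string
```
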
Management REST API

  • FastAPI-based REST API using the Kafka AdminClient for cluster management.
  • Endpoints for broker, topic, and consumer group management, and for topic configuration.
  • All API operations are protected with JWT.
  • Dockerized; runs on the Observability node.
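
The pattern looks roughly like the sketch below: a bearer token is verified before any AdminClient call is made. The secret, broker address, and route are illustrative placeholders, not the repository's actual values:

```python
# Minimal JWT-protected topic-listing endpoint (FastAPI + AdminClient).
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from confluent_kafka.admin import AdminClient

app = FastAPI()
bearer = HTTPBearer()
JWT_SECRET = "change-me"  # placeholder; load from a secret store in practice

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})  # placeholder host

def verify_token(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    """Reject requests whose bearer token does not verify."""
    try:
        return jwt.decode(creds.credentials, JWT_SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

@app.get("/topics")
def list_topics(_claims: dict = Depends(verify_token)) -> list[str]:
    """Return all topic names visible to the AdminClient."""
    metadata = admin.list_topics(timeout=10)
    return sorted(metadata.topics)
```
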
Kafka Connect

  • Distributed Kafka Connect cluster set up with Docker Compose.
  • Secure integration with the main Kafka cluster.
  • An HTTP Source Connector fetches data from the REST API and streams it into Kafka topics.
  • Full connector lifecycle management via the Kafka Connect REST API.
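
Lifecycle operations go through the standard Kafka Connect REST API. A hedged sketch; the Connect host, connector class, and source URL are placeholders (the real connector config lives in kafka_connect/):

```python
# Create, inspect, and remove an HTTP source connector via the Connect REST API.
import time
import requests

CONNECT_URL = "http://connect-1:8083"  # placeholder host

connector = {
    "name": "http-source-demo",
    "config": {
        # Connector class is an assumption; use the one shipped in your image.
        "connector.class": "com.github.castorm.kafka.connect.http.HttpSourceConnector",
        "http.request.url": "http://rest-api:8000/data",  # placeholder endpoint
        "kafka.topic": "http-source-demo",
        "tasks.max": "1",
    },
}

requests.post(f"{CONNECT_URL}/connectors", json=connector, timeout=10).raise_for_status()
time.sleep(2)  # give the worker a moment to start the task
status = requests.get(f"{CONNECT_URL}/connectors/http-source-demo/status", timeout=10).json()
print(status["connector"]["state"])  # e.g. RUNNING
requests.delete(f"{CONNECT_URL}/connectors/http-source-demo", timeout=10).raise_for_status()
```
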

Key Features

  • Single Cluster Architecture: All modules operate on the same Kafka cluster.
  • Infrastructure as Code: All infrastructure is managed with Terraform.
  • Automated Deployment: All setup and configuration steps are automated with Ansible and GitHub Actions.
  • Production-Grade Security: End-to-end encryption, strong authentication, and best practices.
  • Comprehensive Observability: Metrics, dashboards, and alerts for all critical components.
  • Self-Service Platform: REST API and automation scripts for easy management and extensibility.

Directory Structure


Getting Started

  1. Configure GitHub Secrets:
    • To ensure all automation (CI/CD, Terraform, Ansible, etc.) works seamlessly, add all required secrets and credentials to the GitHub Secrets section of your repository.
    • For example: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, any required SSH private keys, DockerHub tokens, etc.
    • To add secrets: Go to your GitHub repo > Settings > Secrets and variables > Actions > New repository secret.
    • Missing or incorrect secrets will cause automated deployment and teardown operations to fail.
  2. Provision Infrastructure:
    • Set up your AWS credentials and run the modules in terraform/.
  3. Deploy Kafka Cluster:
    • Run the Ansible playbooks (Confluent Platform Ansible Collection) to install and configure the brokers and controllers.
  4. Set Up Observability:
    • Deploy Prometheus, Alertmanager, and Grafana using scripts in observability/.
  5. Deploy REST API:
    • Build and start the Dockerized FastAPI service on the Observability node.
  6. Deploy Kafka Connect:
    • Start the Connect cluster and manage connectors with Docker Compose in kafka_connect/.

Automated Deployment and Teardown

  • To deploy all infrastructure and application components with a single command, use the deploy-all.yaml workflow. It provides fully automated deployment with no manual changes required; all modules are brought up in order, respecting their dependencies.
  • To shut down and delete the cluster and all resources, use the destroy.yaml workflow (or the relevant destroy workflow). This process safely terminates the Kafka cluster and all related infrastructure.
  • Workflow runs: https://github.com/AhmetFurkanDEMIR/Trendyol-Data-Streaming-Case-Study/actions

Note: You do not need to make any manual changes to files or configurations during these operations. The entire process is automatic and idempotent.


Amazon CloudWatch (Logs)

Amazon CloudWatch is a centralized observability service used to monitor AWS resources, collect logs, generate alarms, and trigger automated actions.

In this project, Amazon CloudWatch was used to centrally collect system and Kafka logs located under the /var/log directory, store them securely, and make them available for aggregated search and analysis.

This was implemented by installing the CloudWatch Agent automatically during instance initialization in the Terraform step; a configuration file tells the agent which log files under /var/log to monitor.
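
Once logs are flowing, they can be searched programmatically. A minimal sketch with boto3; the region and log group name are placeholders, since in practice the group name comes from the agent configuration file:

```python
# Search the shipped logs for ERROR lines via the CloudWatch Logs API.
import boto3

logs = boto3.client("logs", region_name="eu-west-1")  # assumed region

response = logs.filter_log_events(
    logGroupName="/var/log/kafka/server.log",  # placeholder log group
    filterPattern="ERROR",                     # CloudWatch Logs filter syntax
    limit=20,
)
for event in response["events"]:
    print(event["timestamp"], event["message"].rstrip())
```
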

Configuration file




Documentation