
JoelKong/system-design-paypal


Designing a global payment system focused on reliability, compliance, and scale

A PayPal/Stripe-style payment platform built to practice and showcase system design concepts from my internship experience. Money movement stays strongly consistent in PostgreSQL, while everything async (fraud, notifications, analytics) goes through Kafka so checkout stays fast.


Requirements

Features

  • One-time payments across currencies with live FX conversion
  • Recurring billing (subscriptions) triggered by a Kubernetes CronJob
  • Refunds
  • JWT auth for merchants and consumers
  • Async fraud scoring with post-capture reversal
  • Append-only payment status audit trail on every state change

Deferred (not implemented but mocked)

  • KYC, RBAC, MFA, disputes, real Visa/Mastercard processors — hardcoded JWT roles and a mock gateway stand in for now

Realistic scale targets

  • 1,000+ payments per second (designed for; load-tested locally at ~100–500 TPS with k6)
  • Multi-currency — USD, EUR, etc. via fx-service + Redis cache
  • 99.9%+ availability target via multi-AZ EKS, RDS, MSK in prod

Tech Stack

| Layer | Tech |
| --- | --- |
| Frontend | React, Tailwind, Vite |
| Backend | Spring Boot 3, Maven, Hibernate / Spring Data JPA |
| Transactional DB | PostgreSQL (schema-per-service locally, Aurora in prod) |
| NoSQL | MongoDB (sessions, fraud signals, notification logs) |
| Cache | Redis (idempotency, FX rates, balance cache) |
| Queue / streams | Kafka + Kafka Streams (MSK in prod) |
| Batch | Spring Batch (FX rate import, reconciliation) |
| Orchestration | Kubernetes — Minikube locally, EKS in prod (Helm charts) |
| Backend patterns | Controller → Service → Repository; saga orchestrator in payment-service |
| Resilience | Resilience4j circuit breaker on mock gateway |
| Metrics / traces | Prometheus, Grafana, Jaeger (local); Datadog (prod) |
| Tests | JUnit 5, Cucumber BDD, Testcontainers |
| CI | GitHub Actions → ECR → EKS |

Architecture (local)

Browser → ingress (payment.local)
              ↓
    user · ledger · payment · transaction · fx · fraud · gateway-mock
    notification · recurring · stream-processing · batch-jobs
              ↓
    PostgreSQL  (payments, ledger, accounts, subscriptions — ACID)
    MongoDB     (sessions, fraud docs, notification log)
    Redis       (idempotency, FX cache, hot balances)
    Kafka       (payment.events, payment.captured, fraud.*, recurring.*)
              ↑
    Docker Compose on host (Postgres, Mongo, Redis, Kafka)
    Pods reach host via host.minikube.internal

Postgres, MongoDB, Redis, and Kafka run in Docker Compose on the host — not inside Minikube pods. Databases need stable disks; pods are ephemeral. Minikube runs the 11 Spring Boot services, ingress, billing CronJob, and observability stack. Same Helm chart deploys to EKS in prod (RDS, DocumentDB/Atlas, ElastiCache, MSK replace Compose).

Without K8s: scripts/start-local.sh runs JARs directly on localhost ports for fast iteration.


How the main flows work

One-time payment

The hot path does not wait for fraud. Fraud scoring runs asynchronously after capture; if it rejects the payment, we reverse it.

  1. Client sends POST /api/payments with an Idempotency-Key header.
  2. payment-service checks idempotency (Redis lock + Postgres unique key) and creates PENDING.
  3. fx-service returns rate if settlement currency differs (cached in Redis).
  4. account-ledger-service reserves funds (SELECT FOR UPDATE on account row).
  5. payment-gateway-mock authorizes card (Resilience4j circuit breaker; tok_decline / tok_timeout for failure testing).
  6. Ledger captures: consumer pending → merchant available.
  7. Append row to payment_status_history; write outbox event in same DB transaction.
  8. Return 201 CAPTURED to client — fast checkout.
  9. Outbox relay publishes payment.captured to Kafka.
  10. stream-processing-service counts velocity per user in a 5-min window → fraud.enriched.events.
  11. fraud-service scores rules (amount > 10k, velocity > 5). Publishes payment.fraud_rejected or payment.fraud_cleared.
  12. On reject → payment-service runs compensation saga (reverse ledger) → status REVERSED.

Why async fraud?

Sync fraud adds 50–300 ms on every checkout. At high TPS that kills p99 latency. Trade-off: there's a brief window where a bad payment is captured before reversal — real systems mitigate with holds, amount limits, or sync fraud for high-risk merchants only.
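The velocity count and the two rules from steps 10 and 11 (amount > 10k, velocity > 5) can be illustrated in plain Java. This sketch makes several assumptions: it uses an in-memory sliding window rather than Kafka Streams, it assumes events arrive in timestamp order, and `FraudRulesSketch` and its method names are hypothetical.

```java
import java.math.BigDecimal;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for the streaming velocity count and the rule check.
final class FraudRulesSketch {
    static final BigDecimal AMOUNT_LIMIT = new BigDecimal("10000");
    static final int VELOCITY_LIMIT = 5;
    static final Duration WINDOW = Duration.ofMinutes(5);

    private final Map<String, Deque<Instant>> recentByUser = new HashMap<>();

    /** Records a payment and returns how many payments the user made within the window. */
    int recordAndCount(String userId, Instant at) {
        Deque<Instant> times = recentByUser.computeIfAbsent(userId, k -> new ArrayDeque<>());
        times.addLast(at);
        Instant cutoff = at.minus(WINDOW);
        // Drop timestamps that fell out of the window (assumes in-order arrival).
        while (!times.isEmpty() && !times.peekFirst().isAfter(cutoff)) {
            times.removeFirst();
        }
        return times.size();
    }

    /** True means the payment should be rejected, which triggers the reversal. */
    static boolean isRejected(BigDecimal amount, int velocity) {
        return amount.compareTo(AMOUNT_LIMIT) > 0 || velocity > VELOCITY_LIMIT;
    }
}
```

Because this runs after capture, a slow or unavailable scorer delays the reversal decision but never the checkout response.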

What's the Idempotency-Key header?

Without it, a double-submit or a network retry could charge the customer twice.

  1. First request with key pay-001 → process payment, store key in Postgres (UNIQUE) + Redis.
  2. Same key again → return existing payment, don't charge again.
  3. Redis lock prevents two in-flight requests with the same key.
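The three behaviours above can be sketched with in-memory structures. This is an illustration under stated assumptions: a `ConcurrentHashMap` stands in for the Postgres table with the UNIQUE key, a concurrent set stands in for the Redis lock, and `IdempotencySketch` and `handle` are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: in-memory stand-ins for the Postgres unique key and the Redis lock.
final class IdempotencySketch {
    // Stand-in for the Postgres rows keyed by a UNIQUE idempotency key.
    private final Map<String, String> completed = new ConcurrentHashMap<>();
    // Stand-in for the Redis in-flight lock (idempotency:{key}:lock).
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Runs the charge at most once per key and returns the payment id for that key. */
    String handle(String key, Supplier<String> charge) {
        String existing = completed.get(key);
        if (existing != null) {
            return existing; // replayed request: return the earlier result, no new charge
        }
        if (!inFlight.add(key)) {
            throw new IllegalStateException("request with key " + key + " is already in flight");
        }
        try {
            // Re-check after taking the lock in case another request just finished.
            existing = completed.get(key);
            if (existing != null) {
                return existing;
            }
            String paymentId = charge.get();
            completed.put(key, paymentId);
            return paymentId;
        } finally {
            inFlight.remove(key);
        }
    }
}
```

A real endpoint would likely map the in-flight case to an HTTP conflict response instead of an exception; that mapping is left out here.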

Refund

  1. POST /api/payments/{id}/refund only if status is CAPTURED or FRAUD_CLEARED.
  2. Ledger posts balanced debit/credit: merchant → consumer.
  3. Status → REFUNDED; outbox → Kafka for transaction-service and notifications.
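The status precondition in step 1 amounts to a small guard on the state machine. A minimal sketch, assuming an enum of the statuses named in this document; `RefundGuardSketch` and `refund` are hypothetical names.

```java
// Hypothetical sketch of the refund precondition; the enum lists only statuses named in this README.
final class RefundGuardSketch {
    enum PaymentStatus { PENDING, CAPTURED, FRAUD_CLEARED, REVERSED, REFUNDED }

    /** Returns the next status for a refund, or fails if the payment is not refundable. */
    static PaymentStatus refund(PaymentStatus current) {
        return switch (current) {
            case CAPTURED, FRAUD_CLEARED -> PaymentStatus.REFUNDED;
            default -> throw new IllegalStateException("cannot refund a payment in status " + current);
        };
    }
}
```

Rejecting REFUNDED as a starting state also keeps a repeated refund request from moving money twice.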

Recurring billing

  1. K8s CronJob (every 5 min locally, hourly in prod) POSTs /internal/billing/run-due.
  2. recurring-service queries Postgres for subscriptions where next_billing_at <= now.
  3. Publishes payment.recurring.charge per subscription (idempotency key = subscriptionId + billingPeriod).
  4. payment-service consumes and runs the same payment saga as one-time pay.
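Steps 2 and 3 can be sketched as a due-subscription filter plus a deterministic idempotency key. The key format is an assumption: the document only says the key combines subscriptionId and billingPeriod, so this sketch uses the UTC date of `nextBillingAt` as the period. `RecurringKeySketch` and `Subscription` are hypothetical names.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.List;

// Hypothetical sketch: the key format "<subscriptionId>:<UTC billing date>" is an assumption.
final class RecurringKeySketch {
    record Subscription(String id, Instant nextBillingAt) {}

    /** In-memory equivalent of "next_billing_at <= now". */
    static List<Subscription> due(List<Subscription> all, Instant now) {
        return all.stream().filter(s -> !s.nextBillingAt().isAfter(now)).toList();
    }

    /** The same subscription and period always yield the same key, so a re-run cannot double-charge. */
    static String idempotencyKey(Subscription s) {
        return s.id() + ":" + s.nextBillingAt().atZone(ZoneOffset.UTC).toLocalDate();
    }
}
```

If the CronJob fires twice for the same period, both runs produce the same key and the second charge is treated as a replay.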

Spring Batch in batch-jobs-service handles bulk FX import (POST /api/batch/fx-import), not the hot payment path.


Double-entry ledger

Every money movement = balanced debit + credit rows in ledger_entries. Account balances updated in the same transaction with pessimistic row locks. PostgreSQL is the only source of truth for money — never write balances to MongoDB.
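A balanced posting can be illustrated in memory. This is a simplified sketch, not the service's schema: `LedgerSketch`, `Entry`, and `transfer` are hypothetical, and a `synchronized` method stands in for the database transaction and row locks.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: synchronized stands in for the DB transaction with pessimistic row locks.
final class LedgerSketch {
    enum Side { DEBIT, CREDIT }

    record Entry(String paymentId, String accountId, Side side, BigDecimal amount) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Map<String, BigDecimal> balances = new HashMap<>();

    LedgerSketch(Map<String, BigDecimal> openingBalances) {
        balances.putAll(openingBalances);
    }

    /** Writes one debit row and one credit row, and updates both balances together. */
    synchronized void transfer(String paymentId, String from, String to, BigDecimal amount) {
        if (amount.signum() <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        BigDecimal available = balances.getOrDefault(from, BigDecimal.ZERO);
        if (available.compareTo(amount) < 0) {
            throw new IllegalStateException("insufficient funds");
        }
        entries.add(new Entry(paymentId, from, Side.DEBIT, amount));
        entries.add(new Entry(paymentId, to, Side.CREDIT, amount));
        balances.put(from, available.subtract(amount));
        balances.merge(to, amount, BigDecimal::add);
    }

    synchronized BigDecimal balance(String accountId) {
        return balances.getOrDefault(accountId, BigDecimal.ZERO);
    }

    /** Credits minus debits across all rows; zero whenever every posting is balanced. */
    synchronized BigDecimal net() {
        BigDecimal net = BigDecimal.ZERO;
        for (Entry e : entries) {
            net = e.side() == Side.CREDIT ? net.add(e.amount()) : net.subtract(e.amount());
        }
        return net;
    }

    synchronized int entryCount() {
        return entries.size();
    }
}
```

Checking that `net()` is zero is the kind of invariant a reconciliation job could assert over the real ledger_entries table.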


System design concepts and technologies applied

Backend structure (Controller → Service → Repository)

HTTP request
  ↓
Controller   — routing, validation (@RestController)
  ↓
Service      — saga steps, idempotency, status machine
  ↓
Repository   — JPA for Postgres; MongoRepository for logs/sessions
  ↓
Postgres / MongoDB / Redis / Kafka

Infrastructure & data & concepts applied

  • Microservices — bounded contexts per service; shared payment-common for outbox, idempotency, events
  • Kubernetes — 11 services as deployments; CronJob for billing; HPA-ready; ingress routes traffic. Data stays outside cluster (Compose locally, managed AWS services in prod)
  • Saga orchestration — payment-service coordinates reserve → gateway → capture; compensation on failure or fraud reject
  • Event-driven audit — append-only payment_status_history + Kafka payment.events for replay
  • Kafka Streams — velocity windows enrich fraud decisions without blocking checkout
  • Circuit breaker — gateway-mock wraps external auth calls (simulates bank/card network flakiness)
  • Idempotency — header + Redis + Postgres unique constraint
  • Exponential backoff - failed calls are retried with exponentially increasing delays, alongside the circuit breaker
  • Observability — Prometheus/Grafana/Jaeger locally; Datadog APM + logs in prod
  • BDD tests — Cucumber features for payment, refund, fraud reversal, recurring
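The exponential backoff item above can be illustrated with a hand-rolled helper. This is a sketch only: the project uses Resilience4j, whose configuration is not shown in this document, so `BackoffSketch`, its parameters, and the omission of jitter are assumptions made for clarity.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical sketch of capped exponential backoff; not the project's Resilience4j configuration.
final class BackoffSketch {
    /** Delay before the retry that follows a zero-based attempt: base, 2*base, 4*base, ... up to cap. */
    static Duration delay(int attempt, Duration base, Duration cap) {
        long factor = 1L << Math.min(attempt, 30); // bound the shift so the factor cannot overflow
        Duration d = base.multipliedBy(factor);
        return d.compareTo(cap) > 0 ? cap : d;
    }

    /** Calls the supplier up to maxAttempts times, sleeping between failed attempts. */
    static <T> T retry(int maxAttempts, Duration base, Duration cap, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts - 1) {
                    sleep(delay(attempt, base, cap));
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

Production retry policies usually add random jitter to the delay so that many clients do not retry in lockstep.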

Kafka topics

| Topic | Purpose |
| --- | --- |
| payment.events | Domain events → transaction-service, notifications |
| payment.captured | Async fraud entry point |
| payment.fraud_rejected | Triggers reversal saga |
| payment.fraud_cleared | Fraud passed → notify |
| fraud.enriched.events | Velocity-enriched payload from Kafka Streams |
| payment.recurring.charge | Billing cycle charges |
| payment.reversed / payment.refunded | Compensation / refund events |
| notification.commands | Email/webhook mock |

Redis keys

| Key | Purpose |
| --- | --- |
| idempotency:{key} | Cached payment response (24h TTL) |
| idempotency:{key}:lock | In-flight request lock |
| balance:{accountId} | Cached available:pending balance |
| fx:{from}:{to} | FX rate cache |

Service ports (JAR / port-forward mode)

| Service | Port |
| --- | --- |
| user-service | 8081 |
| account-ledger-service | 8082 |
| payment-service | 8083 |
| transaction-service | 8084 |
| fx-service | 8085 |
| stream-processing-service | 8086 |
| fraud-service | 8087 |
| payment-gateway-mock | 8088 |
| notification-service | 8089 |
| recurring-service | 8090 |
| batch-jobs-service | 8091 |

Production scale-up

Local stack runs on Minikube + Compose. For production we deploy via Terraform (infra/terraform/) and the same Helm chart.

Production architecture

Users / Merchants
  ↓
CloudFront (React SPA)
  ↓
Route 53 → ALB + WAF
  ↓
┌─────────────────────────────────────────────────────────────┐
│  EKS (multi-AZ)                                             │
│  user · ledger · payment · transaction · fx · fraud         │
│  gateway-mock · notification · recurring · streams · batch  │
│  HPA on payment-service / api paths                         │
│  CronJob → recurring billing                                │
└─────────────────────────────────────────────────────────────┘
  ↓              ↓              ↓              ↓
Aurora Postgres  DocumentDB/    ElastiCache    Amazon MSK
(multi-AZ,       Atlas          Redis          (3 brokers)
 read replicas)  (MongoDB)

CI: GitHub Actions → ECR → Helm deploy to EKS
Secrets: AWS Secrets Manager
Observability: Datadog (APM, logs, infra metrics)
               CloudWatch for raw AWS logs backup

Sharding path (not implemented locally): shard Postgres by merchant_id hash when single primary exceeds ~5–10k write TPS. Kafka partitions already keyed by payment/merchant ID.
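Routing by merchant_id hash could look like the sketch below. It is not implemented in the project; `ShardRouterSketch`, `shardFor`, and the choice of `String.hashCode` with a plain modulo are assumptions made for illustration.

```java
// Hypothetical sketch: the project does not implement sharding; the hash choice is an assumption.
final class ShardRouterSketch {
    /** Maps a merchant id to a shard index in [0, shardCount). */
    static int shardFor(String merchantId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(merchantId.hashCode(), shardCount);
    }
}
```

Plain modulo routing moves most merchants to a different shard when the shard count changes, so a real rollout would more likely use a lookup table or consistent hashing.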


IMPORTANT considerations

Temporal (future) — right now the saga lives as Java code inside payment-service. That works for an MVP, but long-running flows (recurring billing + dunning + retry + partial refund + dispute) get messy: we need to remember state across crashes, retries, and days-long waits. Temporal gives us durable workflows — each step is recorded, survives pod restarts, and retries/compensations are built in. It is worth considering if the number or complexity of workflows grows.

Disputes / chargebacks — not built. Would be a separate lifecycle: DISPUTED → EVIDENCE → WON/LOST with ledger hold on disputed amount.

KYC / RBAC / MFA — JWT carries hardcoded ROLE_MERCHANT / ROLE_CONSUMER. Real system would integrate identity providers, step-up auth for large transfers, and block payouts until KYC APPROVED.

If Redis dies — idempotency falls back to Postgres unique constraint; balance cache misses go to Postgres. Hot path still works, just slower.

If Kafka dies — payments still commit (outbox rows queue up); fraud/notifications catch up when relay resumes. Ledger never depends on Kafka for correctness.
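The outbox behaviour during a Kafka outage can be sketched with an in-memory queue. This illustration rests on assumptions: a list stands in for the outbox table, a `Predicate` stands in for the Kafka producer call, and `OutboxSketch`, `enqueue`, and `relay` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: a list stands in for the outbox table, a Predicate for the Kafka publish.
final class OutboxSketch {
    record Event(long id, String topic, String payload) {}

    private final List<Event> pending = new ArrayList<>();
    private long nextId = 1;

    /** In the real service this insert would share a DB transaction with the payment write. */
    synchronized void enqueue(String topic, String payload) {
        pending.add(new Event(nextId++, topic, payload));
    }

    /** Publishes in order and stops at the first failure so unsent rows stay queued. */
    synchronized int relay(Predicate<Event> publish) {
        int sent = 0;
        while (!pending.isEmpty()) {
            boolean ok;
            try {
                ok = publish.test(pending.get(0));
            } catch (RuntimeException e) {
                ok = false; // treat a broker error like a failed publish
            }
            if (!ok) {
                break;
            }
            pending.remove(0);
            sent++;
        }
        return sent;
    }

    synchronized int backlog() {
        return pending.size();
    }
}
```

A relay of this shape delivers at least once, since a crash between publishing and removing the row causes a resend, so consumers need to tolerate duplicates.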


How to Run

Prerequisites: Java 21, Maven, Docker, Minikube, Helm, Node 18+

export JAVA_HOME=/opt/homebrew/opt/openjdk@21/libexec/openjdk.jdk/Contents/Home

# 1. Data layer (host)
docker compose -f infra/docker/docker-compose.yml up -d

# 2. Kubernetes
./infra/k8s/minikube/setup.sh
eval $(minikube docker-env)
mvn package -DskipTests
# build images (example)
docker build -t payment-system/payment-service:latest services/payment-service
# ... repeat for other services, or use CI pipeline

helm upgrade --install payment-system infra/k8s/helm/payment-system \
  --namespace payment-system --create-namespace

# 3. Ingress
echo "$(minikube ip) payment.local" | sudo tee -a /etc/hosts
minikube tunnel   # separate terminal

# 4. Frontend
cd frontend && npm install && npm run dev

App: http://localhost:5173 (frontend) · API via ingress http://payment.local or port-forward individual services

Fast dev without K8s:

docker compose -f infra/docker/docker-compose.yml up -d
./scripts/start-local.sh
cd frontend && npm run dev

Tests:

mvn test
mvn test -pl services/payment-service -Dtest=RunCucumberTest -am   # needs Docker
k6 run infra/loadtest/payment-load.js

Teardown:

helm uninstall payment-system -n payment-system
minikube stop
docker compose -f infra/docker/docker-compose.yml down

About

A rough system design outline for a payment app that emphasizes reliability, scalability, and compliance
