
JoelKong/system-design-instagram


Designing a social media system focused on high availability and scalability

A social media style app built to practice and showcase system design concepts I have learnt throughout my internships. The focus is on scalability and availability, since the system needs to deliver a good user experience to millions of concurrent users across the world.


Requirements

Features

  • Image upload — support different formats and sizes
  • Like and comment on posts; view likes and comments; get notifications when posts are liked or commented on
  • Follow other users; get notified when someone follows you
  • Analytics

Realistic Scale targets

  • 100,000 concurrent uploads
  • 1,000,000 concurrent likes and comments
  • 1,000,000,000 registered users

Tech Stack

Layer             Tech
Frontend          React, Tailwind
API               NestJS monorepo: api-gateway (HTTP) → gRPC microservices
Database          PostgreSQL
Cache             Redis
Queue             Kafka
Orchestration     Kubernetes (Minikube locally, Kustomize manifests)
Backend patterns  NestJS DI, Controller → Service → Repository (TypeORM)
Docs              Swagger at /api/docs
Metrics / traces  Prometheus, Grafana, Jaeger
CI                GitHub Actions

Architecture (local)

Browser → ingress (/ → web, /api → gateway, /media → media-service)
              ↓
         api-gateway (JWT, rate limit, Swagger)
              ↓ gRPC
    auth · user · media · post · feed · like · comment · fanout · notification
              ↓
    Postgres (users, posts, likes, comments, follows)
    Redis    (feeds, like/comment counts)
    Kafka    (new posts, likes, comments, notifications)

Postgres runs in Docker locally and would run on RDS in prod. The database should not run in a pod: it is stateful, and pods are ephemeral, which makes it complex to manage on Kubernetes.


How the main flows work

Feed

When the feed is opened, the app does not scan Postgres for every follower/post combination.

  1. feed-service reads a Redis sorted set feed:{userId} — post IDs were added when I followed someone or when people I follow posted.
  2. For each post ID, post-service loads the post from Postgres (caption, media URL, username) and pulls like/comment counts from Redis.

Reads stay fast because the fanout work happened at write time.
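
The two read steps above can be sketched roughly as follows. This is a minimal in-memory sketch, not the actual feed-service code: the arrays and maps stand in for the Redis sorted set, the Postgres posts table, and the Redis counters, and the names readFeed, FeedEntry, and FeedItem are hypothetical. A real service would make these calls asynchronously.

```typescript
// Hypothetical shapes; field names follow the README (caption, media URL, username).
type Post = { id: string; caption: string; mediaUrl: string; username: string };
type FeedItem = Post & { likeCount: number; commentCount: number };

// feed:{userId} modelled as (postId, score) pairs; score assumed to be publish time in ms.
type FeedEntry = { postId: string; score: number };

function readFeed(
  feed: FeedEntry[],                  // stand-in for the Redis sorted set
  posts: Map<string, Post>,           // stand-in for Postgres
  likeCounts: Map<string, number>,    // stand-in for like_count:{postId}
  commentCounts: Map<string, number>, // stand-in for comment_count:{postId}
  limit: number,
): FeedItem[] {
  return [...feed]
    .sort((a, b) => b.score - a.score) // newest first
    .slice(0, limit)
    .map((entry) => posts.get(entry.postId))
    .filter((post): post is Post => post !== undefined) // drop IDs whose post no longer exists
    .map((post) => ({
      ...post,
      likeCount: likeCounts.get(post.id) ?? 0,
      commentCount: commentCounts.get(post.id) ?? 0,
    }));
}
```

Note that the page can come back shorter than limit when some IDs in the feed point at deleted posts; that trade-off keeps the sketch simple.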

New post

  1. Post saved to Postgres.
  2. Post ID pushed into the author's feed and followers' feeds in Redis — but only if the author has ≤10k followers (see below).
  3. post.published on Kafka; fanout-service does the same Redis push for anything the sync step missed.

Why skip fanout above 10k followers?

Two ways to build a feed:

  • Push (what I use for normal accounts): when you post, copy the post ID into every follower's Redis feed up front. Opening the feed is a fast Redis read.
  • Pull (for big accounts): don't write to millions of feeds on every post. When someone opens their feed, query recent posts from the people they follow. With a threshold of around 10k followers, this could be a normal Postgres query for recent posts, probably limited to the accounts the user interacts with most (a ranking algorithm could go here, similar to how TikTok and Reels build their For You feeds), so that it isn't too intensive on the server. Pagination and lazy loading apply as well.

Imagine a celebrity with 1M followers: one post would mean 1M Redis writes before the upload even finishes. That blocks the request and hammers Redis for no good reason.

So the rule is: at or below a threshold such as 10k followers → push fanout on write. Above it → skip the push and use the pull model on read (the follower loads posts from followed users via Postgres when they open the feed). The threshold is configurable via FANOUT_FOLLOWER_THRESHOLD.
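
The threshold rule can be sketched as a small decision function. This is an illustration under assumptions, not the repo's code: parseThreshold, chooseFanout, and feedKeysToWrite are hypothetical names, and the sketch assumes the author's own feed is always written even when the push to followers is skipped.

```typescript
type FanoutStrategy = "push" | "pull";

const DEFAULT_FOLLOWER_THRESHOLD = 10000;

// Parses the raw FANOUT_FOLLOWER_THRESHOLD value; falls back to the default on bad input.
function parseThreshold(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_FOLLOWER_THRESHOLD;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : DEFAULT_FOLLOWER_THRESHOLD;
}

// At or below the threshold: push on write. Above it: pull on read.
function chooseFanout(followerCount: number, threshold: number): FanoutStrategy {
  return followerCount <= threshold ? "push" : "pull";
}

// Returns the Redis feed keys that would receive the new post ID.
// Assumption: the author's own feed is written in both cases.
function feedKeysToWrite(authorId: string, followerIds: string[], threshold: number): string[] {
  const ownFeed = `feed:${authorId}`;
  return chooseFanout(followerIds.length, threshold) === "push"
    ? [ownFeed, ...followerIds.map((id) => `feed:${id}`)]
    : [ownFeed];
}
```

Keeping the decision in one pure function makes it easy for the sync path and fanout-service to agree on which accounts get push fanout.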


Likes & comments

Eventual consistency is chosen here to handle many concurrent writes. Strong consistency on every like would mean locking the same post row under heavy traffic, which is slower and makes for bad UX. The exact count being off by one for a second doesn't matter much at scale.

  1. Tap like → UI updates immediately (optimistic).
  2. like-service writes the like to Postgres, bumps Redis like_count:{postId}, returns the new count.
  3. Kafka event for notifications. If Kafka is down, the like still sticks — the HTTP response doesn't depend on it.
  4. Idempotency — see below.

Comments: same pattern (Postgres row + Redis counter + Kafka).
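
To illustrate step 3, here is a rough sketch of a like handler whose response does not depend on the event publish succeeding. It is a simplified, synchronous stand-in: handleLike and LikeEvent are hypothetical names, the event type "post.liked" is an assumed topic name, the Postgres write is omitted, and a Map stands in for the Redis counter.

```typescript
// Hypothetical event shape; the topic name "post.liked" is an assumption.
type LikeEvent = { type: "post.liked"; userId: string; postId: string };
type LikeResult = { likeCount: number; eventPublished: boolean };

function handleLike(
  userId: string,
  postId: string,
  likeCounts: Map<string, number>,     // stand-in for like_count:{postId}
  publish: (event: LikeEvent) => void, // stand-in for the Kafka producer
): LikeResult {
  // Durable write to Postgres would happen here (omitted), then the counter bump.
  const likeCount = (likeCounts.get(postId) ?? 0) + 1;
  likeCounts.set(postId, likeCount);

  // Publishing is best effort: a broker outage must not undo or fail the like.
  let eventPublished = true;
  try {
    publish({ type: "post.liked", userId, postId });
  } catch {
    eventPublished = false;
  }
  return { likeCount, eventPublished };
}
```

In this sketch a failed publish only means the notification is lost or delayed; whether to retry it later is a separate design choice.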

What's the Idempotency-Key header?

When you like a post, the frontend sends a random UUID in the Idempotency-Key header.

Problem it solves: you double-tap the heart, or the request times out and the browser retries — without this, the server might count two likes.

How it works:

  1. First request with key abc-123 → process the like, store the response in Redis under idempotency:{userId}:abc-123 (24h TTL).
  2. Same key again → return the cached response, don't increment again.
  3. Plus a like_dedup row in Postgres with a unique constraint on (userId, postId), so the same user can't like the same post twice even with different keys.

So it's retry-safe and double-tap-safe.
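
The three steps above can be sketched like this. It is an in-memory illustration, assuming a Map for the Redis idempotency cache (the 24h TTL is omitted) and a Set for the Postgres unique constraint; the class name IdempotentLikeService is hypothetical.

```typescript
type LikeResponse = { postId: string; likeCount: number };

class IdempotentLikeService {
  // Stand-in for Redis keys idempotency:{userId}:{key}; TTL handling omitted.
  private idempotencyCache = new Map<string, LikeResponse>();
  // Stand-in for the (userId, postId) unique constraint in Postgres.
  private likedPairs = new Set<string>();
  // Stand-in for like_count:{postId}.
  private likeCounts = new Map<string, number>();

  likePost(userId: string, postId: string, idempotencyKey: string): LikeResponse {
    const cacheKey = `idempotency:${userId}:${idempotencyKey}`;
    const cached = this.idempotencyCache.get(cacheKey);
    if (cached !== undefined) return cached; // same key again: replay the stored response

    const pair = `${userId}:${postId}`;
    if (!this.likedPairs.has(pair)) {
      // First like by this user on this post: record it and bump the counter.
      this.likedPairs.add(pair);
      this.likeCounts.set(postId, (this.likeCounts.get(postId) ?? 0) + 1);
    }
    const response: LikeResponse = { postId, likeCount: this.likeCounts.get(postId) ?? 0 };
    this.idempotencyCache.set(cacheKey, response);
    return response;
  }
}
```

A replayed key returns the response stored at the time of the first request, so the count it carries may be slightly stale, which fits the eventual-consistency choice for likes.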


Follow

Saving the follow in Postgres isn't enough — I also copy the followee's recent posts into the follower's Redis feed so the timeline isn't empty.
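
A minimal sketch of that backfill, assuming the feed is a sorted set scored by publish time. backfillOnFollow, maxBackfill, and maxFeedSize are hypothetical names and limits, and arrays stand in for Redis and Postgres.

```typescript
type FeedEntry = { postId: string; score: number }; // score assumed to be publish time in ms

function backfillOnFollow(
  followerFeed: FeedEntry[],   // current contents of feed:{followerId}
  followeeRecent: FeedEntry[], // followee's recent posts, e.g. loaded from Postgres
  maxBackfill: number,         // how many of the followee's posts to copy
  maxFeedSize: number,         // cap on the follower's feed length
): FeedEntry[] {
  const byId = new Map<string, FeedEntry>();
  for (const entry of followerFeed) byId.set(entry.postId, entry);

  const newest = [...followeeRecent].sort((a, b) => b.score - a.score).slice(0, maxBackfill);
  // Like a sorted-set add, re-adding an existing post ID does not create a duplicate.
  for (const entry of newest) byId.set(entry.postId, entry);

  return Array.from(byId.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, maxFeedSize);
}
```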


System design concepts and technologies applied

Backend structure (Controller → Service → Repository)

Each microservice follows a layered NestJS layout:

HTTP/gRPC request
  ↓
Controller   — routing only (api-gateway HTTP controllers, or @GrpcMethod in services)
  ↓
Service      — business logic (likePost, createPost, follow, idempotency checks)
  ↓
Repository   — data access (TypeORM Repository<T> for Postgres reads/writes)
  ↓
Postgres / Redis / Kafka
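
As a rough illustration of the layering, here is a framework-free sketch. It deliberately leaves out the NestJS decorators and the TypeORM Repository<T>; the in-memory repository and all class names are stand-ins I made up for the example, not the repo's actual classes.

```typescript
type StoredPost = { id: string; authorId: string; caption: string };

// Repository: data access only.
interface PostRepository {
  save(authorId: string, caption: string): StoredPost;
}

class InMemoryPostRepository implements PostRepository {
  private rows: StoredPost[] = [];
  save(authorId: string, caption: string): StoredPost {
    const row = { id: String(this.rows.length + 1), authorId, caption };
    this.rows.push(row);
    return row;
  }
}

// Service: business rules, no knowledge of transport or storage details.
class PostService {
  private readonly repo: PostRepository;
  constructor(repo: PostRepository) {
    this.repo = repo;
  }
  createPost(authorId: string, caption: string): StoredPost {
    const trimmed = caption.trim();
    if (trimmed === "") throw new Error("caption is required");
    return this.repo.save(authorId, trimmed);
  }
}

// Controller: routing only, delegates straight to the service.
class PostController {
  private readonly service: PostService;
  constructor(service: PostService) {
    this.service = service;
  }
  create(request: { userId: string; caption: string }): StoredPost {
    return this.service.createPost(request.userId, request.caption);
  }
}
```

Because the service depends on the PostRepository interface, a test can pass in the in-memory version while production wiring passes in the real one.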

Infrastructure & data & concepts applied

  • Microservices + BFF — browser talks HTTP; services talk gRPC for faster internal calls
  • Kubernetes — 10 services + web + Redis/Kafka/observability; K8s runs each as its own deployment, restarts crashed pods, routes traffic through ingress, and matches how this would run on EKS in prod. Postgres stays outside the cluster (docker-compose locally, RDS in prod) because databases need stable disks, not ephemeral pods
  • Kafka — decouple likes/comments/notifications from the HTTP path so writes stay fast
  • Eventual consistency — like/comment counts in Redis; notifications catch up via Kafka
  • Strong consistency where it matters — auth, follows, ownership in Postgres transactions
  • Cache for reads — feeds and counts in Redis; Postgres holds durable metadata
  • Hybrid fanout — push to Redis when ≤10k followers; pull on read for bigger accounts
  • Indexes: (followeeId) on follows for fanout queries, (userId, createdAt) on posts for profile/feed pulls, (postId) on likes/comments; engagement tables partitioned by time for the 1B-user / high-write story
  • Idempotency: Idempotency-Key header + Redis cache + dedup tables
  • Optimistic UI — React updates before the server responds
  • State machines — media upload and post publish statuses
  • Observability — Prometheus, Grafana, Jaeger
  • Swagger: /api/docs
  • CICD — GitHub Actions / Jenkins
  • Test Cases - WIP

Redis keys

Key                     Purpose
feed:{userId}           Sorted set of post IDs for home feed
like_count:{postId}     Like count
comment_count:{postId}  Comment count
user_likes:{userId}     Set of posts this user liked
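
One way to keep these key formats consistent across services is a small shared helper. The redisKeys object below is a hypothetical sketch; the key patterns come from the table above plus the idempotency key described in the likes section.

```typescript
// Central place for Redis key formats so services don't hand-build strings.
const redisKeys = {
  feed: (userId: string): string => `feed:${userId}`,
  likeCount: (postId: string): string => `like_count:${postId}`,
  commentCount: (postId: string): string => `comment_count:${postId}`,
  userLikes: (userId: string): string => `user_likes:${userId}`,
  idempotency: (userId: string, key: string): string => `idempotency:${userId}:${key}`,
};
```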

API docs

Swagger UI: http://localhost:8080/api/docs


Production scale-up

The local stack runs only on Minikube. For production we deploy via infrastructure as code (IaC); Terraform sketches live in infra/terraform/. I am still learning more about IaC.

Production architecture

Users
  ↓
CloudFront (static web assets + cached media)
  ↓
Route 53 → ALB (HTTPS, /api routing)
  ↓
┌─────────────────────────────────────────────────────────────┐
│  EKS (multi-AZ)                                             │
│                                                             │
│  web · api-gateway · auth · user · media · post · feed      │
│  like · comment · fanout · notification                     │
│                                                             │
│  HPA on api-gateway / web (CPU, RPS)                        │
│  KEDA on like / comment / fanout consumers (Kafka lag)      │
└─────────────────────────────────────────────────────────────┘
  ↓              ↓              ↓              ↓
RDS Postgres    ElastiCache     Amazon MSK     S3 (+ CloudFront)
(multi-AZ,      Redis           (3 brokers,    pre-signed uploads,
 read replicas) (cluster mode)   same topics)  media storage

CI: GitHub Actions → ECR → Argo CD / Flux deploy to EKS
Secrets: AWS Secrets Manager → External Secrets Operator → K8s
Observability: OpenTelemetry → managed metrics/traces/logs
               (e.g. Datadog)

Important considerations

Right now celebrity posts don't appear in a normal user's feed because they are never injected into Redis. A hybrid approach can generate any user's feed, covering posts from both celebrity and non-celebrity accounts: take the pre-generated feed from Redis, combine it with recent posts from the followees whose follower count exceeds the threshold, and sort the result by time. Also, when a user has just followed someone, a backfill is needed so that person's recent posts show up in the feed.
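
A rough sketch of that hybrid merge, assuming both sources return post IDs with a creation timestamp. mergeHybridFeed and the field names are hypothetical; the pushed list stands in for the Redis feed and the pulled list for the Postgres query over high-follower followees.

```typescript
type TimedPost = { postId: string; createdAt: number }; // createdAt assumed to be ms since epoch

function mergeHybridFeed(pushed: TimedPost[], pulled: TimedPost[], limit: number): TimedPost[] {
  const seen = new Set<string>();
  const merged: TimedPost[] = [];
  const newestFirst = [...pushed, ...pulled].sort((a, b) => b.createdAt - a.createdAt);
  for (const post of newestFirst) {
    if (seen.has(post.postId)) continue; // the same post may arrive from both sources
    seen.add(post.postId);
    merged.push(post);
    if (merged.length === limit) break;
  }
  return merged;
}
```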

Right now if Redis dies, the home feed dies, which is not what we want in an application centered on availability and scalability. We should fall back to Postgres when that happens. We should also replicate the in-memory store (Redis at the moment): if the primary fails, we promote a replica and fail over. We can shard Redis by user ID, with replicas for each shard, so that if one shard's primary goes down we can still serve other users. Replication is asynchronous by default, so data is not strongly consistent across nodes; if a primary goes down and a replica takes over, we keep serving its possibly stale data while rebuilding what is missing in that shard from Postgres.

Another way is lazy rebuilding on read: if the cache is empty, we serve from Postgres and trigger a job that repopulates only that user's feed, so we don't touch inactive users.
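
The fallback and lazy rebuild could look roughly like this. It is a synchronous sketch under assumptions: loadFeed and its callback parameters are hypothetical, readCache is expected to throw when Redis is unreachable, and enqueueRebuild stands in for whatever job mechanism repopulates the feed.

```typescript
type FeedResult = { postIds: string[]; source: "cache" | "database" };

function loadFeed(
  userId: string,
  readCache: (key: string) => string[] | undefined, // throws if Redis is unreachable
  queryDatabase: (userId: string) => string[],      // slower Postgres path
  enqueueRebuild: (userId: string) => void,         // schedules a rebuild for this user only
): FeedResult {
  let cached: string[] | undefined;
  try {
    cached = readCache(`feed:${userId}`);
  } catch {
    cached = undefined; // treat a Redis outage like a cache miss
  }
  if (cached !== undefined && cached.length > 0) {
    return { postIds: cached, source: "cache" };
  }
  // Miss or outage: serve from Postgres and rebuild only the feed that was actually requested.
  const postIds = queryDatabase(userId);
  enqueueRebuild(userId);
  return { postIds, source: "database" };
}
```

One caveat of this sketch: a user whose feed is legitimately empty triggers a rebuild on every read, so a real version would probably need a marker for "rebuilt but empty".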


How to Run

npm ci
make up

App: http://localhost:8080

./scripts/obs-port-forward.sh   # Grafana :3001, Prometheus :9090, Jaeger :16686
open http://localhost:8080/api/docs
make down                       # teardown

Without K8s: make up-dev then make dev → http://localhost:4200

About

A rough system design outline for a social media app that emphasizes high availability and scalability for millions of concurrent users
