Designing a social media system focused on high availability and scalability

A social-media-style app built to practice and showcase the system design concepts I have learnt throughout my internships. The focus is on scalability and availability, because the design has to deliver a good user experience to a very large user base spread across the world (see the scale targets below).

Requirements

Features

Image upload — support different formats and sizes
Like and comment on posts; view likes and comments; get notifications when posts are liked or commented on
Follow other users; get notified when someone follows you
Analytics

Realistic Scale targets

100,000 concurrent uploads
1,000,000 concurrent likes and comments
1,000,000,000 registered users

Tech Stack

Layer	Tech
Frontend	React, Tailwind
API	NestJS monorepo api-gateway (HTTP) → gRPC microservices
Database	PostgreSQL
Cache	Redis
Queue	Kafka
Orchestration	Kubernetes (Minikube locally, Kustomize manifests)
Backend patterns	NestJS DI, Controller → Service → Repository (TypeORM)
Docs	Swagger at `/api/docs`
Metrics / traces	Prometheus, Grafana, Jaeger
CI	GitHub Actions

Architecture (local)

Browser → ingress (/ → web, /api → gateway, /media → media-service)
              ↓
         api-gateway (JWT, rate limit, Swagger)
              ↓ gRPC
    auth · user · media · post · feed · like · comment · fanout · notification
              ↓
    Postgres (users, posts, likes, comments, follows)
    Redis    (feeds, like/comment counts)
    Kafka    (new posts, likes, comments, notifications)

Postgres runs in Docker locally and on RDS in prod. The database should not run in a pod: it is stateful and pods are ephemeral, so managing it inside Kubernetes adds a lot of complexity.

How the main flows work

Feed

When a user opens the feed, the app does not scan Postgres for every followee/post combination.

feed-service reads a Redis sorted set feed:{userId} — post IDs were added when I followed someone or when people I follow posted.
For each post ID, post-service loads the post from Postgres (caption, media URL, username) and pulls like/comment counts from Redis.

Reads stay fast because the fanout work happened at write time.
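
As a rough illustration of the read path above, here is a minimal sketch that pages through a feed stored as a sorted set. SortedSetStore is a hypothetical in-memory stand-in for Redis (a real feed-service would issue ZREVRANGE through an async Redis client), and using the post creation time as the score is an assumption, not something the code in this repo is confirmed to do.

```typescript
// In-memory stand-in for a Redis sorted set; a real service would call
// ZADD / ZREVRANGE through a Redis client instead.
class SortedSetStore {
  private sets = new Map<string, Map<string, number>>();

  zadd(key: string, score: number, member: string): void {
    const set = this.sets.get(key) ?? new Map<string, number>();
    set.set(member, score);
    this.sets.set(key, set);
  }

  // Members ordered by score, highest first, for the inclusive index range.
  zrevrange(key: string, start: number, stop: number): string[] {
    const set = this.sets.get(key) ?? new Map<string, number>();
    return Array.from(set.entries())
      .sort((a, b) => b[1] - a[1])
      .slice(start, stop + 1)
      .map((entry) => entry[0]);
  }
}

const feedKey = (userId: string): string => `feed:${userId}`;

// Read one page of the home feed, newest post IDs first.
function readFeedPage(
  store: SortedSetStore,
  userId: string,
  page: number,
  pageSize: number,
): string[] {
  const start = page * pageSize;
  return store.zrevrange(feedKey(userId), start, start + pageSize - 1);
}

const demoStore = new SortedSetStore();
demoStore.zadd(feedKey("u1"), 1000, "post-a"); // score = assumed creation time (ms)
demoStore.zadd(feedKey("u1"), 3000, "post-c");
demoStore.zadd(feedKey("u1"), 2000, "post-b");

console.log(readFeedPage(demoStore, "u1", 0, 2)); // [ 'post-c', 'post-b' ]
```

The page only contains post IDs; hydrating each ID into a full post is the post-service step described above.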

New post

Post saved to Postgres.
Post ID pushed into the author's feed and followers' feeds in Redis — but only if the author has ≤10k followers (see below).
post.published on Kafka; fanout-service does the same Redis push for anything the sync step missed.

Why skip fanout above 10k followers?

Two ways to build a feed:

Push (what I use for normal accounts): when you post, copy the post ID into every follower's Redis feed up front. Opening the feed is a fast Redis read.
Pull (for big accounts): don't write to millions of feeds on every post. When someone opens their feed, query recent posts from the big accounts they follow. The threshold is tentatively 10k followers. The read can be a normal Postgres query for recent posts, possibly limited to the accounts the user interacts with most (a ranking algorithm along the lines of the TikTok or Reels recommendation feeds could slot in here), so that it is not too intensive on the server. Pagination and lazy loading apply as usual.

Imagine a celebrity with 1M followers: one post would mean 1M Redis writes before the upload even finishes. That blocks the request and hammers Redis for no good reason.

So the rule is: at or below a threshold such as 10k followers → push fanout on write. Above it → skip the push and use the pull model on read (the follower loads posts from followed users via Postgres when they open the feed). The threshold is configurable via FANOUT_FOLLOWER_THRESHOLD.
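
The rule can be sketched as a small decision function plus the push path. This is a sketch under assumptions: the function names, the default of 10000, and the Map standing in for the Redis feeds are all hypothetical, and a real fanout would run against Redis rather than in memory.

```typescript
// Assumed default; the real value would come from FANOUT_FOLLOWER_THRESHOLD.
const DEFAULT_FANOUT_FOLLOWER_THRESHOLD = 10000;

type FanoutStrategy = "push" | "pull";

function chooseFanoutStrategy(
  followerCount: number,
  threshold: number = DEFAULT_FANOUT_FOLLOWER_THRESHOLD,
): FanoutStrategy {
  return followerCount <= threshold ? "push" : "pull";
}

// feeds maps userId -> post IDs, newest first (stand-in for the Redis feeds).
// Returns how many feeds were written.
function fanoutOnWrite(
  feeds: Map<string, string[]>,
  authorId: string,
  followerIds: string[],
  postId: string,
  threshold: number = DEFAULT_FANOUT_FOLLOWER_THRESHOLD,
): number {
  if (chooseFanoutStrategy(followerIds.length, threshold) === "pull") {
    return 0; // big account: followers pick this post up at read time instead
  }
  let writes = 0;
  for (const userId of [authorId].concat(followerIds)) {
    const feed = feeds.get(userId) ?? [];
    feed.unshift(postId);
    feeds.set(userId, feed);
    writes++;
  }
  return writes;
}

const demoFeeds = new Map<string, string[]>();
console.log(fanoutOnWrite(demoFeeds, "alice", ["bob", "carol"], "post-1")); // 3
console.log(chooseFanoutStrategy(1000000)); // pull
```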

Likes & comments

Eventual consistency is chosen here to handle many concurrent writes. Strong consistency on every like would mean locking the same post row under heavy traffic, which is slower and makes for bad UX. The exact count being off by one for a second doesn't matter much at scale.

Tap like → UI updates immediately (optimistic).
like-service writes the like to Postgres, bumps Redis like_count:{postId}, returns the new count.
Kafka event for notifications. If Kafka is down, the like still sticks — the HTTP response doesn't depend on it.
Idempotency — see below.

Comments: same pattern (Postgres row + Redis counter + Kafka).
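
A minimal sketch of the like path, to show why a broker outage does not fail the request. Everything here is a stand-in: the Set plays the Postgres likes table, the Map plays the Redis counter, the Publish callback plays the Kafka producer, and the topic name post.liked is an assumption made for illustration.

```typescript
type Publish = (topic: string, payload: Record<string, string>) => void;

class LikeService {
  private likes = new Set<string>(); // stand-in for the Postgres likes table
  private counts = new Map<string, number>(); // stand-in for like_count:{postId}
  private publish: Publish;

  constructor(publish: Publish) {
    this.publish = publish;
  }

  // Stores the like, bumps the counter, then publishes best-effort.
  likePost(userId: string, postId: string): number {
    const rowKey = `${userId}:${postId}`;
    const countKey = `like_count:${postId}`;
    if (!this.likes.has(rowKey)) {
      this.likes.add(rowKey);
      this.counts.set(countKey, (this.counts.get(countKey) ?? 0) + 1);
      try {
        this.publish("post.liked", { userId, postId });
      } catch {
        // Broker unavailable: the like is already stored, so the request
        // still succeeds. A real service would log this and retry later.
      }
    }
    return this.counts.get(countKey) ?? 0;
  }
}

const brokerDown: Publish = () => {
  throw new Error("Kafka unavailable");
};
const demoLikes = new LikeService(brokerDown);
console.log(demoLikes.likePost("u1", "p1")); // 1, even though publishing failed
```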

What's the Idempotency-Key header?

When you like a post, the frontend sends a random UUID in the Idempotency-Key header.

Problem it solves: you double-tap the heart, or the request times out and the browser retries — without this, the server might count two likes.

How it works:

First request with key abc-123 → process the like, store the response in Redis under idempotency:{userId}:abc-123 (24h TTL).
Same key again → return the cached response, don't increment again.
Plus a like_dedup table in Postgres with a unique constraint on (userId, postId), so the same user can't like the same post twice even with different keys.

So it's retry-safe and double-tap-safe.
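
The cache-and-replay part of this can be sketched as follows. IdempotencyCache is a hypothetical in-memory stand-in for the Redis cache; the key format and 24h TTL follow the description above, while the class and method names are assumptions.

```typescript
interface CachedResponse {
  body: unknown;
  expiresAt: number;
}

class IdempotencyCache {
  private entries = new Map<string, CachedResponse>();
  private ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Runs handler once per (userId, key); later calls replay the stored response.
  run<T>(userId: string, key: string, handler: () => T, now: number = Date.now()): T {
    const cacheKey = `idempotency:${userId}:${key}`;
    const hit = this.entries.get(cacheKey);
    if (hit !== undefined && hit.expiresAt > now) {
      return hit.body as T;
    }
    const body = handler();
    this.entries.set(cacheKey, { body, expiresAt: now + this.ttlMs });
    return body;
  }
}

let storedLikes = 0;
const demoIdempotency = new IdempotencyCache();
const firstTry = demoIdempotency.run("u1", "abc-123", () => ({ likeCount: ++storedLikes }));
const retried = demoIdempotency.run("u1", "abc-123", () => ({ likeCount: ++storedLikes }));
console.log(firstTry.likeCount, retried.likeCount, storedLikes); // 1 1 1
```

This in-memory version has no race between the lookup and the store. With Redis, two concurrent requests carrying the same key could both miss the cache, which is one reason the unique constraint in Postgres is still worth having; claiming the key atomically with SET and the NX option is another way to narrow that window.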

Follow

Saving the follow in Postgres isn't enough — I also copy the followee's recent posts into the follower's Redis feed so the timeline isn't empty.
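
A minimal sketch of that backfill step, assuming the follower's feed is a sorted set scored by creation time (modelled here as a Map from post ID to score) and that only a capped number of recent posts is copied. The limit parameter and the PostRef shape are assumptions for illustration.

```typescript
interface PostRef {
  id: string;
  createdAt: number; // epoch milliseconds
}

// feed maps postId -> score, standing in for the follower's feed:{userId} set.
// Returns how many posts were copied in.
function backfillFeed(
  feed: Map<string, number>,
  followeePosts: PostRef[],
  limit: number,
): number {
  const recent = followeePosts
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
  for (const post of recent) {
    feed.set(post.id, post.createdAt);
  }
  return recent.length;
}

const followerFeed = new Map<string, number>([["existing-post", 500]]);
const followeePosts: PostRef[] = [
  { id: "old", createdAt: 100 },
  { id: "newest", createdAt: 900 },
  { id: "newer", createdAt: 800 },
];
console.log(backfillFeed(followerFeed, followeePosts, 2)); // 2
console.log(Array.from(followerFeed.keys())); // [ 'existing-post', 'newest', 'newer' ]
```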

System design concepts and technologies applied

Backend structure (Controller → Service → Repository)

Each microservice follows a layered NestJS layout:

HTTP/gRPC request
  ↓
Controller   — routing only (api-gateway HTTP controllers, or @GrpcMethod in services)
  ↓
Service      — business logic (likePost, createPost, follow, idempotency checks)
  ↓
Repository   — data access (TypeORM Repository<T> for Postgres reads/writes)
  ↓
Postgres / Redis / Kafka

Infrastructure & data & concepts applied

Microservices + BFF — browser talks HTTP; services talk gRPC for faster internal calls
Kubernetes — 10 services + web + Redis/Kafka/observability; K8s runs each as its own deployment, restarts crashed pods, routes traffic through ingress, and matches how this would run on EKS in prod. Postgres stays outside the cluster (docker-compose locally, RDS in prod) because databases need stable disks, not ephemeral pods
Kafka — decouple likes/comments/notifications from the HTTP path so writes stay fast
Eventual consistency — like/comment counts in Redis; notifications catch up via Kafka
Strong consistency where it matters — auth, follows, ownership in Postgres transactions
Cache for reads — feeds and counts in Redis; Postgres holds durable metadata
Hybrid fanout — push to Redis when ≤10k followers; pull on read for bigger accounts
Indexes — (followeeId) on follows for fanout queries, (userId, createdAt) on posts for profile/feed pulls, (postId) on likes/comments; engagement tables partitioned by time for the 1B-user / high-write story
Idempotency — Idempotency-Key header + Redis cache + dedup tables
Optimistic UI — React updates before the server responds
State machines — media upload and post publish statuses
Observability — Prometheus, Grafana, Jaeger
Swagger — /api/docs
CICD — GitHub Actions / Jenkins
Test Cases - WIP

Redis keys

Key	Purpose
`feed:{userId}`	Sorted set of post IDs for home feed
`like_count:{postId}`	Like count
`comment_count:{postId}`	Comment count
`user_likes:{userId}`	Set of posts this user liked

API docs

Swagger UI: http://localhost:8080/api/docs

Production scale-up

The local stack runs only on Minikube. For production, the plan is to deploy via infrastructure as code; Terraform sketches live in infra/terraform/. This part is still a work in progress while I learn more about IaC.

Production architecture

Users
  ↓
CloudFront (static web assets + cached media)
  ↓
Route 53 → ALB (HTTPS, /api routing)
  ↓
┌─────────────────────────────────────────────────────────────┐
│  EKS (multi-AZ)                                             │
│                                                             │
│  web · api-gateway · auth · user · media · post · feed      │
│  like · comment · fanout · notification                     │
│                                                             │
│  HPA on api-gateway / web (CPU, RPS)                        │
│  KEDA on like / comment / fanout consumers (Kafka lag)      │
└─────────────────────────────────────────────────────────────┘
  ↓              ↓              ↓              ↓
RDS Postgres   ElastiCache    Amazon MSK     S3 (+ CloudFront)
(multi-AZ,     Redis          (3 brokers,    pre-signed uploads,
 read replicas) (cluster mode) same topics)   media storage

CI: GitHub Actions → ECR → Argo CD / Flux deploy to EKS
Secrets: AWS Secrets Manager → External Secrets Operator → K8s
Observability: OpenTelemetry → managed metrics/traces/logs
               (e.g. Datadog)

IMPORTANT Considerations

Right now, posts from celebrity accounts don't appear in a normal user's feed because they are never injected into Redis. A hybrid approach fixes this for every user, celebrity or not: take the pre-generated feed from Redis, combine it with recent posts from the user's followees whose follower count exceeds the threshold, and sort the result by time. Also, when a user has just followed someone, a backfill is needed so that person's recent posts show up in the feed.
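
The merge step of that hybrid approach could look roughly like this. It is a sketch, not the implemented behaviour: FeedItem and mergeFeeds are hypothetical names, and the two input lists stand for the Redis feed and the result of a Postgres query over big-account followees.

```typescript
interface FeedItem {
  postId: string;
  createdAt: number; // epoch milliseconds
}

// Combine the pre-generated (pushed) feed with posts pulled at read time,
// drop duplicates by post ID, and return the newest items first.
function mergeFeeds(pushed: FeedItem[], pulled: FeedItem[], limit: number): FeedItem[] {
  const byId = new Map<string, FeedItem>();
  for (const item of pushed.concat(pulled)) {
    byId.set(item.postId, item);
  }
  return Array.from(byId.values())
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
}

const pushedItems: FeedItem[] = [
  { postId: "friend-1", createdAt: 300 },
  { postId: "friend-2", createdAt: 100 },
];
const pulledItems: FeedItem[] = [
  { postId: "celebrity-1", createdAt: 200 },
  { postId: "friend-1", createdAt: 300 }, // duplicate, kept once
];
console.log(mergeFeeds(pushedItems, pulledItems, 10).map((item) => item.postId));
// [ 'friend-1', 'celebrity-1', 'friend-2' ]
```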

Right now, if Redis dies, the home feed dies, which is not what we want in an application centred on availability and scalability. The feed should fall back to Postgres when that happens. The in-memory store (Redis at the moment) should also be replicated, so that if a primary fails, a replica is promoted and takes over. Redis can be sharded by user ID with replicas for each shard, so that if one shard's primary goes down, the other users are still served. Replication is asynchronous by default, so data is not strongly consistent across nodes; after a failover, the promoted replica may serve slightly stale data while whatever is missing in that shard is rebuilt from Postgres.

Another option is lazy rebuilding on read: if the cache is empty, serve from Postgres and trigger a job that repopulates only that user's feed, so inactive users are never touched.
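
Both ideas (fall back to Postgres, then rebuild lazily) can be sketched together. FeedCache, FeedReader and runRebuilds are hypothetical names; the cache interface stands in for Redis, the loader callback stands in for the Postgres query, and the rebuild runs inline here where a real system would use a background job.

```typescript
interface FeedCache {
  get(userId: string): string[] | undefined; // may throw if the cache is unreachable
  set(userId: string, postIds: string[]): void;
}

class FeedReader {
  private cache: FeedCache;
  private loadFromDb: (userId: string) => string[];
  private pendingRebuilds = new Set<string>();

  constructor(cache: FeedCache, loadFromDb: (userId: string) => string[]) {
    this.cache = cache;
    this.loadFromDb = loadFromDb;
  }

  getFeed(userId: string): string[] {
    try {
      const cached = this.cache.get(userId);
      if (cached !== undefined && cached.length > 0) {
        return cached;
      }
    } catch {
      // Cache unreachable: fall through and serve from the database.
    }
    this.pendingRebuilds.add(userId); // only users who actually asked get rebuilt
    return this.loadFromDb(userId);
  }

  // Repopulates the cache for queued users; returns how many were rebuilt.
  runRebuilds(): number {
    let rebuilt = 0;
    this.pendingRebuilds.forEach((userId) => {
      try {
        this.cache.set(userId, this.loadFromDb(userId));
        this.pendingRebuilds.delete(userId);
        rebuilt++;
      } catch {
        // Cache still down: keep the user queued for the next run.
      }
    });
    return rebuilt;
  }
}

const unreachableCache: FeedCache = {
  get: () => {
    throw new Error("Redis unavailable");
  },
  set: () => {
    throw new Error("Redis unavailable");
  },
};
const demoReader = new FeedReader(unreachableCache, (userId) => [`${userId}-latest`]);
console.log(demoReader.getFeed("u1")); // [ 'u1-latest' ], served from the fallback
console.log(demoReader.runRebuilds()); // 0, because the cache is still down
```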

How to Run

npm ci
make up

App: http://localhost:8080

./scripts/obs-port-forward.sh   # Grafana :3001, Prometheus :9090, Jaeger :16686
open http://localhost:8080/api/docs
make down                       # teardown

Without K8s: make up-dev then make dev → http://localhost:4200
