---
title: "System Design 101: How to Build Systems That Scale"
date: "2026-03-10"
description: "A practical introduction to system design fundamentals — scalability, load balancing, caching, databases, and the trade-offs that define real-world architecture decisions."
tags: ["System Design", "Architecture"]
---
System Design 101: How to Build Systems That Scale
System design is the art of building software systems that are reliable, scalable, and maintainable. Whether you're preparing for a senior engineering interview or designing your next product, these fundamentals apply everywhere.
The Core Problem: Scale
A system that works for 100 users will often break at 10,000. The challenges change:
- Traffic — more requests per second
- Data — more records to store and query
- Geography — users in multiple regions
- Reliability — more nodes means more failure surfaces
Good system design anticipates these challenges before they become crises.
1. Horizontal vs Vertical Scaling
Vertical scaling (scale up) — buy a bigger server. Add more CPU, RAM, faster disk. Simple, but has limits and creates a single point of failure.
Horizontal scaling (scale out) — add more servers. Distribute load across many machines. This is how modern internet-scale systems work.
```
Vertical:             Horizontal:

[Big Server]          [Server] [Server] [Server]
     |                     \      |      /
  [Users]                  [Load Balancer]
                                  |
                               [Users]
```

Most production systems use a combination: moderately powerful instances, but many of them.
2. Load Balancing
When you have multiple servers, you need something to distribute incoming traffic evenly. That's a load balancer.
Common load-balancing strategies:
| Strategy | How it works | Best for |
|---|---|---|
| Round Robin | Requests go to servers in rotation | Stateless apps with similar workloads |
| Least Connections | Route to the server with the fewest active connections | Variable request durations |
| IP Hash | Hash client IP → consistent server | Session persistence |
| Weighted | Some servers get more traffic | Mixed-capacity servers |
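The first two strategies can be sketched in a few lines of plain Python. The `servers` list is hypothetical; real load balancers implement this in the proxy layer, and a real least-connections balancer would decrement the count when each request completes.

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round robin: hand out servers in a fixed rotation.
_rotation = itertools.cycle(servers)

def round_robin() -> str:
    return next(_rotation)

# Least connections: track active connections per server
# and route each new request to the least-loaded one.
active = {s: 0 for s in servers}

def least_connections() -> str:
    target = min(active, key=active.get)
    active[target] += 1  # caller decrements when the request finishes
    return target
```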
In AWS, this is ALB (Application Load Balancer). In Kubernetes, it's a Service with multiple pods. Nginx and HAProxy are popular self-hosted options.
Health Checks
Load balancers continuously health-check backend servers. If a server stops responding, the load balancer stops routing traffic to it. This is a core mechanism for high availability.
3. Caching
Cache = store frequently-accessed data in fast memory to avoid expensive recomputation or database queries.
Levels of Caching
- Browser cache — HTTP cache headers (Cache-Control, ETag)
- CDN — cache static assets geographically close to users
- Application cache — in-process cache (e.g., Python dict, Guava cache in Java)
- Distributed cache — Redis, Memcached — shared across all app servers
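At the application-cache level, Python's standard library already gives you an in-process cache via `functools.lru_cache` (the `expensive_lookup` function here is a stand-in for a slow computation or database query):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_lookup(key: str) -> str:
    # Imagine a slow computation or DB query here.
    return key.upper()

expensive_lookup("a")  # computed
expensive_lookup("a")  # served from the in-process cache
print(expensive_lookup.cache_info().hits)  # 1
```

Note the trade-off versus a distributed cache: each process has its own copy, so hit rates drop as you scale out, and there is no cross-server invalidation.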
Cache Invalidation
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
When underlying data changes, you need to update or invalidate cached copies. Common strategies:
- TTL (Time-to-Live) — expire cache entries after N seconds
- Write-through — update cache on every write
- Write-behind — queue cache writes asynchronously
- Cache-aside (lazy loading) — only populate cache on cache miss
```python
import redis
import json

cache = redis.Redis(host="localhost", port=6379, db=0)

def get_user(user_id: str) -> dict:
    # Try cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # Cache miss — query the database
    # (db is an application-level database handle)
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)

    # Populate cache with a 5-minute TTL
    cache.setex(f"user:{user_id}", 300, json.dumps(user))
    return user
```

4. Databases
Choosing the right database is one of the most consequential architecture decisions.
Relational Databases (SQL)
PostgreSQL, MySQL, SQLite. Data is structured in tables with defined schemas. ACID transactions guarantee consistency.
Best for: user data, financial records, anything with complex relationships and joins.
```sql
-- Efficient query with proper indexing
SELECT p.title, u.name, COUNT(c.id) AS comment_count
FROM posts p
JOIN users u ON p.user_id = u.id
LEFT JOIN comments c ON c.post_id = p.id
WHERE p.published_at > NOW() - INTERVAL '7 days'
GROUP BY p.id, u.name
ORDER BY comment_count DESC
LIMIT 10;
```

NoSQL Databases
Different data models for different access patterns:
| Type | Example | Best for |
|---|---|---|
| Document | MongoDB, Firestore | Flexible schemas, JSON-like data |
| Key-Value | Redis, DynamoDB | Fast lookups by key, sessions, caches |
| Wide-Column | Cassandra, HBase | High-write, time-series, massive scale |
| Graph | Neo4j | Relationships (social networks, fraud detection) |
Database Scaling Techniques
Read replicas — one primary handles writes; multiple replicas handle reads. Great when your workload is read-heavy (most apps are).
Sharding — partition data across multiple database instances by a shard key (e.g., user_id % N). Enables horizontal scaling but adds query complexity.
Connection pooling — reuse database connections rather than opening a new one per request. PgBouncer for Postgres, HikariCP for Java.
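Shard-key routing as described above can be sketched like this (the connection URLs are hypothetical; note the use of a stable hash, since Python's built-in `hash()` is salted per process):

```python
import hashlib

SHARDS = [
    "postgres://db-0.internal/app",
    "postgres://db-1.internal/app",
    "postgres://db-2.internal/app",
    "postgres://db-3.internal/app",
]

def shard_for(user_id: str) -> str:
    # Stable hash so the same user always maps to the same shard.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

One caveat with plain hash-mod: adding a shard changes almost every key's mapping, which is why large systems often reach for consistent hashing instead.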
5. Message Queues & Async Processing
Not all work needs to happen synchronously in the request-response cycle. Message queues decouple producers from consumers.
```
[Web Server] → [Queue: email_jobs] → [Email Worker]
                                   → [Email Worker]
                                   → [Email Worker]
```

Benefits:
- Resilience — if email service is down, jobs wait in queue
- Throughput — workers process at their own pace
- Retry — failed jobs can be retried automatically
Popular choices: Kafka (high-throughput streaming), RabbitMQ (rich routing), SQS (AWS managed), Bull/BullMQ (Redis-backed, great for Node.js).
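The producer/queue/worker shape can be sketched with the standard library alone; here `queue.Queue` stands in for the broker, and appending to a list stands in for actually sending an email:

```python
import queue
import threading

email_jobs: "queue.Queue[dict | None]" = queue.Queue()
sent = []

def email_worker() -> None:
    while True:
        job = email_jobs.get()
        if job is None:            # sentinel: shut down this worker
            break
        sent.append(job["to"])     # stand-in for sending the email
        email_jobs.task_done()

# Workers process at their own pace, independent of the producer.
workers = [threading.Thread(target=email_worker) for _ in range(3)]
for w in workers:
    w.start()

# The "web server" enqueues jobs and returns immediately.
for i in range(5):
    email_jobs.put({"to": f"user{i}@example.com"})

email_jobs.join()                  # wait until every job is processed
for _ in workers:
    email_jobs.put(None)
for w in workers:
    w.join()
```

A real broker adds what this sketch lacks: durability across restarts, acknowledgements, and automatic retries.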
6. CAP Theorem
In a distributed system, you can only guarantee two of three:
- Consistency — every read gets the most recent write
- Availability — every request gets a (non-error) response
- Partition tolerance — system works even if nodes can't communicate
Since network partitions are unavoidable in practice, you're really choosing between CP (consistent but may be unavailable during partition) and AP (available but may return stale data).
Examples:
- CP: Zookeeper, HBase, MongoDB (with strong consistency settings)
- AP: Cassandra, DynamoDB (with eventual consistency), CouchDB
7. Designing for Failure
Every component in a distributed system will eventually fail. Good systems assume this and design accordingly.
Circuit breakers — if a downstream service is failing, stop calling it for a while. Prevents cascading failures.
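A minimal circuit-breaker sketch, with illustrative thresholds (production libraries add half-open probing, per-endpoint state, and metrics):

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # cooled off: allow one attempt
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                  # success resets the count
        return result
```

While the breaker is open, callers get an immediate error instead of waiting on a dying service, which is what stops the failure from cascading.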
Retries with exponential backoff — don't hammer a struggling service.
```typescript
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      // Exponential backoff: 100ms, 200ms, 400ms...
      await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}
```

Timeouts — always set timeouts on external calls. Hanging requests exhaust your connection pool.
Bulkheads — isolate failures to a part of the system (like bulkheads in a ship).
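One common way to implement a bulkhead is a bounded concurrency limit per downstream dependency, sketched here with a semaphore (the limit and timeout are illustrative):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so a slow downstream
    can't exhaust the whole server's threads and connections."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, timeout: float = 1.0, **kwargs):
        # Fail fast instead of queueing forever when saturated.
        if not self._sem.acquire(timeout=timeout):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()
```

With one bulkhead per dependency, a hung payments API can saturate only its own slots; the rest of the server keeps serving.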
Putting It Together: A Simple Feed Service
Imagine designing Twitter's feed. The requirements:
- Users can post tweets
- Users follow other users
- Timeline shows latest tweets from followed users
A reasonable starting architecture:
- API servers behind a load balancer, horizontally scaled
- PostgreSQL for users, follows, tweets (with read replicas)
- Redis to cache pre-computed timelines per user
- Kafka for fan-out — when user A posts, emit a message; consumers update followers' cached timelines
- CDN for media (images, videos)
As scale increases, you'd shard the tweets table, add more Kafka consumers, and possibly split the "celebrity accounts with millions of followers" into a separate pull-based timeline path.
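The fan-out-on-write step above can be sketched in memory; plain dicts stand in for the follows table and the Redis timelines, and in production the `fan_out` call would run in a Kafka consumer rather than in the request path:

```python
from collections import defaultdict

# Who follows whom (stand-in for the follows table).
followers = {"alice": ["bob", "carol"]}

# Cached timelines per user (stand-in for Redis lists).
timelines: "defaultdict[str, list[str]]" = defaultdict(list)
TIMELINE_CAP = 800  # keep cached timelines bounded

def fan_out(author: str, tweet: str) -> None:
    # Push the new tweet onto every follower's cached timeline.
    for follower in followers.get(author, []):
        timeline = timelines[follower]
        timeline.insert(0, f"{author}: {tweet}")
        del timeline[TIMELINE_CAP:]  # trim, like Redis LTRIM
```

This also shows why celebrity accounts need the separate pull-based path: fanning one post out to millions of follower timelines is millions of writes.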
Summary
System design is about trade-offs. There's rarely a single right answer — only answers appropriate for your scale, team size, and reliability requirements.
Key principles to internalize:
- Design for failure — assume components will fail
- Scale out over scale up for internet-scale systems
- Cache aggressively — but have a cache invalidation strategy
- Async by default for work that doesn't need to be synchronous
- Pick databases based on access patterns, not familiarity
- Measure everything — you can't optimize what you can't observe