Back-of-the-Envelope Estimation & System Design Basics
Last Updated: June 27, 2026
By: javahandson
Series
Learn Java in an easy way
Learn back-of-the-envelope estimation step by step, plus scalability (vertical vs horizontal scaling), latency vs throughput, latency numbers every programmer should know, and availability (the nines) — a beginner-friendly system design guide with examples.
Every backend developer eventually hits a wall. The app that ran perfectly on your laptop starts to stutter once real users show up. Pages load slowly, the database groans, and someone asks the dreaded question: “Will this scale?” That question sits at the heart of system design basics, and it is the reason this series exists.
This article is Day 1 of a hands-on journey into system design, written specifically for Java developers who are comfortable with Spring Boot but new to thinking about systems at scale. We will not jump into Kafka, sharding, or distributed consensus yet. Instead, we start with the three foundations that all later topics depend on: how systems grow (scalability), how we measure their speed (latency versus throughput), and how we estimate capacity before writing a single line of code (back-of-the-envelope estimation).
By the end, you will be able to reason about a system the way senior engineers do in design discussions and interviews. You will know when to buy a bigger server and when to add more servers, why a fast system can still feel slow, and how to estimate whether your idea needs one machine or a thousand. We will keep things simple, use everyday analogies, and back each concept with small Java examples so the ideas stick.
It is tempting to skip theory and just write code. After all, Spring Boot makes it easy to ship an API in minutes. The problem is that early decisions quietly decide your ceiling. A service that stores everything in memory works beautifully with ten users and collapses with ten thousand. A database query that takes 50 milliseconds feels instant until a million people run it at once.
System design basics provide a mental model of these trade-offs. They help you answer practical questions before they become production incidents. How many requests can one instance handle? What happens when traffic doubles overnight? Where will the bottleneck appear first? These are not abstract puzzles; they are the daily reality of building software that lasts.
There is also a career angle. System design rounds are now a standard part of interviews for mid-level and senior engineers. Interviewers rarely want a perfect answer. They want to see whether you can think out loud, make reasonable assumptions, and estimate numbers without panicking. The skills in this article are exactly what those rounds test.
Scalability is the ability of a system to handle more work by adding resources. The keyword is “more work”: more users, more requests, more data, or all three. A scalable system stays healthy as demand rises; an unscalable one degrades or falls over.
Picture a small restaurant with a single chef. At first, orders trickle in, and the chef keeps up easily. As the restaurant grows in popularity, tickets pile up on the rail. The owner now faces a choice that mirrors exactly the decision every backend team faces. You can make the one chef more capable, or you can hire more chefs. These two paths are the two kinds of scaling.
Vertical scaling means making a single machine more powerful. In the restaurant, this is giving your one chef a bigger stove, sharper knives, and more counter space. In computing, it means adding more CPU cores, more RAM, or faster disks to the same server.
The appeal of vertical scaling lies in its simplicity. Your application does not change at all. You do not need a load balancer, you do not need to worry about coordinating multiple machines, and your code keeps running exactly as before. For many applications, upgrading the server is the fastest and cheapest way to buy more headroom.
Vertical scaling has a limit. A single machine can only get bigger up to a certain point, and high-end machines are very costly. Worse, a single powerful server is still a single point of failure. If it goes down, your entire system goes down with it. Vertical scaling buys time, not infinity.
Horizontal scaling means adding more machines and spreading the work across them. Instead of one super-chef, you hire five ordinary chefs who cook in parallel. In computing, you run several copies of your application behind a load balancer that distributes incoming requests among them.
This path scales almost without limit. Need more capacity? Add another instance. It also brings resilience: if one machine dies, the load balancer simply routes around it, and the others keep serving traffic. This is why large-scale systems are almost always horizontally scaled.
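To make "routes around it" concrete, here is a minimal sketch of round-robin routing that skips instances marked as down. The class name `RoundRobinBalancer` and the instance names are hypothetical, and a production load balancer (NGINX, a cloud load balancer, and so on) does far more than this; the sketch only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified load balancer: rotates through instances
// and skips any that have been marked unhealthy.
class RoundRobinBalancer {
    private final List<String> instances;
    private final Set<String> unhealthy = ConcurrentHashMap.newKeySet();
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinBalancer(List<String> instances) {
        this.instances = List.copyOf(instances);
    }

    public void markDown(String instance) { unhealthy.add(instance); }

    public void markUp(String instance) { unhealthy.remove(instance); }

    // Returns the next healthy instance in rotation.
    public String next() {
        for (int attempts = 0; attempts < instances.size(); attempts++) {
            int index = Math.floorMod(counter.getAndIncrement(), instances.size());
            String candidate = instances.get(index);
            if (!unhealthy.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy instances available");
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of("app-1", "app-2", "app-3"));
        List<String> picks = new ArrayList<>();
        for (int i = 0; i < 3; i++) picks.add(lb.next());
        System.out.println(picks); // [app-1, app-2, app-3]

        lb.markDown("app-2"); // simulate a failed machine
        picks.clear();
        for (int i = 0; i < 4; i++) picks.add(lb.next());
        System.out.println(picks); // [app-1, app-3, app-1, app-3]
    }
}
```

Once `app-2` is marked down, traffic keeps flowing to the remaining two instances, which is the resilience benefit described above.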
The trade-off is increased complexity. Multiple servers must work together as a single system. If any server can handle a request, where should user sessions be stored? How do you keep data consistent across all servers? Problems that were simple on a single machine, such as managing shared state, become much harder in a distributed system. In fact, many distributed system techniques exist primarily to solve the challenges introduced by horizontal scaling.
Horizontal scaling becomes far simpler when your service is stateless, meaning each request carries everything needed to process it, and the server stores nothing between requests. Spring Boot REST controllers are stateless by default, as long as you keep per-user data out of their fields, which is one reason they scale so well. The example below shows a stateless endpoint: any instance can serve any request because nothing is held in memory between calls.
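import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    // The only field is an injected, shared dependency; no per-user data lives here.
    private final OrderService orderService;

    public OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    // Stateless: the request carries the orderId,
    // and the result comes from a shared database.
    // Any instance behind the load balancer can serve it.
    @GetMapping("/{orderId}")
    public OrderResponse getOrder(@PathVariable Long orderId) {
        return orderService.findById(orderId);
    }
}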
Notice that the controller does not store any instance-specific data. Instead, it fetches the order from a shared database for every request. This makes the service stateless, allowing you to run multiple instances behind a load balancer. Since every instance accesses the same shared data, any request can be handled by any server. In contrast, storing data such as a shopping cart in a local HashMap inside the controller would tie users to a specific server. If a request were routed to a different instance, the cart data would be unavailable, so horizontal scaling would break unless every request from that user were pinned to the same server.
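The difference is easy to see in a small simulation. The sketch below is plain Java with no Spring involved, and all class names are hypothetical. It contrasts two instances that keep carts in their own memory with two instances that share one external store. A plain `HashMap` stands in for the shared database or cache here, which is an assumption made purely to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

class CartStateDemo {

    // Stateful: each instance keeps carts in its own memory.
    public static class StatefulInstance {
        private final Map<String, Integer> localCarts = new HashMap<>();

        public void addItem(String userId) { localCarts.merge(userId, 1, Integer::sum); }

        public int itemCount(String userId) { return localCarts.getOrDefault(userId, 0); }
    }

    // Stateless: instances hold no cart data of their own; they all read and
    // write one shared store (a map standing in for a database or Redis).
    public static class StatelessInstance {
        private final Map<String, Integer> sharedStore;

        public StatelessInstance(Map<String, Integer> sharedStore) { this.sharedStore = sharedStore; }

        public void addItem(String userId) { sharedStore.merge(userId, 1, Integer::sum); }

        public int itemCount(String userId) { return sharedStore.getOrDefault(userId, 0); }
    }

    public static void main(String[] args) {
        StatefulInstance a = new StatefulInstance();
        StatefulInstance b = new StatefulInstance();
        a.addItem("alice"); // first request lands on instance a
        System.out.println("Stateful, second request on b sees: " + b.itemCount("alice")); // 0

        Map<String, Integer> store = new HashMap<>();
        StatelessInstance c = new StatelessInstance(store);
        StatelessInstance d = new StatelessInstance(store);
        c.addItem("alice"); // first request lands on instance c
        System.out.println("Stateless, second request on d sees: " + d.itemCount("alice")); // 1
    }
}
```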
Both approaches have their place. Teams often start vertical for simplicity and shift to horizontal as traffic grows. The table below summarizes the trade-offs.
| Aspect | Vertical Scaling (Up) | Horizontal Scaling (Out) |
| Method | Bigger single machine | More machines in parallel |
| Complexity | Low — no code changes | High — need coordination |
| Limit | Hard ceiling per machine | Near-unlimited |
| Failure impact | Single point of failure | Survives node loss |
| Cost curve | Expensive at the top end | Cheaper commodity nodes |
| Best for | Quick wins, simple apps | Large-scale, resilient systems |
As a system grows, measuring its performance becomes essential. Two metrics are used in almost every performance discussion: latency and throughput. Although they are closely related, they measure different things and are often confused. Understanding the difference between them is one of the most important fundamentals of system design.
Imagine a highway. Latency is how long a single car takes to drive from one city to the next. Throughput is how many cars pass a checkpoint each hour. They describe different things, and improving one does not automatically improve the other.
Latency is the time it takes for one request to complete, from the moment it is sent to the moment the response arrives. It is usually measured in milliseconds, and lower is better. When a user clicks a button and waits, they are experiencing latency directly.
In a Java backend, latency is the total time taken to process a single request. It includes the time spent sending data over the network, querying the database, executing business logic, and preparing the response. If any of these steps becomes slower, users will notice the delay. Reducing latency means making each request complete faster, often by using caching, minimizing database queries, optimizing algorithms, or simplifying application logic.
Throughput measures how many requests a system can process in a given amount of time, usually expressed as requests per second (RPS). Unlike latency, which focuses on the response time of a single request, throughput measures the overall capacity of the system. A higher throughput means the system can serve more users at the same time.
You can improve throughput by increasing parallelism—for example, by adding more threads, expanding connection pools, or running additional server instances. However, high throughput does not always mean a fast user experience. A system may handle thousands of requests per second while each request still takes a long time to complete. Likewise, a system can have low latency for individual users but struggle to handle many requests simultaneously. Understanding this difference is essential when designing scalable systems.
A wide ten-lane highway has high throughput because many cars can travel at the same time. However, in a traffic jam, each car may still move very slowly, which means high latency. On the other hand, a single-lane racetrack can give very low latency because one car can move very fast, but it has low throughput since only one car can go at a time.
In simple terms, latency is about the experience of one unit (one car or one request), while throughput is about the system’s overall capacity.
This is also why improving one can sometimes hurt the other. A common example is batching. When many small operations are grouped into one large batch, throughput increases because fixed costs are shared. But latency increases too, because each item must wait until the batch is full before being processed.
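The batching trade-off can be put into rough numbers. The sketch below uses a deliberately simplified model in which every call pays a fixed overhead plus a per-item cost, and items arrive at a steady gap. The figures (10 ms overhead, 1 ms per item, 2 ms between arrivals) are invented for illustration, not measurements.

```java
class BatchingTradeoff {

    // Items processed per second when each call pays fixedMs of overhead
    // plus perItemMs for every item in the batch.
    public static double itemsPerSecond(int batchSize, double fixedMs, double perItemMs) {
        double batchMs = fixedMs + batchSize * perItemMs;
        return batchSize * 1000.0 / batchMs;
    }

    // Worst-case latency for the first item in a batch: it waits for the
    // remaining items to arrive, then for the whole batch to be processed.
    public static double worstCaseLatencyMs(int batchSize, double fixedMs,
                                            double perItemMs, double arrivalGapMs) {
        double fillWaitMs = (batchSize - 1) * arrivalGapMs;
        return fillWaitMs + fixedMs + batchSize * perItemMs;
    }

    public static void main(String[] args) {
        double fixedMs = 10, perItemMs = 1, arrivalGapMs = 2; // assumed values
        for (int size : new int[] {1, 10, 100}) {
            System.out.printf("batch=%d throughput=%.0f items/s worst latency=%.0f ms%n",
                    size,
                    itemsPerSecond(size, fixedMs, perItemMs),
                    worstCaseLatencyMs(size, fixedMs, perItemMs, arrivalGapMs));
        }
    }
}
```

Under these assumptions, going from a batch of 1 to a batch of 10 lifts throughput from about 91 to 500 items per second, while the worst-case latency grows from 11 ms to 38 ms.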
// A simple example showing latency vs throughput in a Java backend.
//
// Each request simulates 200ms of work.
// So the latency of one request is ~200ms.
public OrderResponse process(Long orderId) {
    // System.nanoTime() is monotonic, which makes it the safer choice for measuring elapsed time.
    long start = System.nanoTime();
    simulateWork(200); // simulate DB call / computation / I/O delay
    OrderResponse response = build(orderId);
    long latencyMs = (System.nanoTime() - start) / 1_000_000;
    log.info("Request latency: {} ms", latencyMs); // ~200ms per request
    return response;
}
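In the above example, if a system has one thread:
Latency → each request takes ~200ms
Throughput → about 5 requests per second (1,000ms ÷ 200ms)
Now, if we increase to 10 threads:
Latency stays the same → each request still takes ~200ms
Throughput increases → about 50 requests per second, because 10 requests are handled in parallel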
Throughput improves when you add parallel workers; latency remains unchanged because each request still takes the same amount of time.
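The arithmetic behind this can be written as a one-line formula. The sketch below is an idealized upper bound: it assumes each thread handles one request at a time and that threads never contend for CPU, locks, or database connections, which real systems rarely achieve. The class and method names are hypothetical.

```java
class ThroughputMath {

    // Idealized upper bound: each thread completes (1000 / latencyMs) requests per second.
    public static double maxRequestsPerSecond(int threads, double latencyMs) {
        return threads * (1000.0 / latencyMs);
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(1, 200));  // 5.0
        System.out.println(maxRequestsPerSecond(10, 200)); // 50.0
    }
}
```

Latency appears in the formula as a fixed input. Adding threads changes only the multiplier, which is why throughput rises while each request still takes about 200 ms.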
| Aspect | Latency | Throughput |
| Question it answers | How long does a request take? | How many requests per second? |
| Unit | Milliseconds (ms) | Requests per second |
| Better when | Lower | Higher |
| Improved by | Caching, fewer DB calls | More threads, more servers |
| Who feels it | The individual user | The system as a whole |
To reason about latency in real systems, it helps to have a rough sense of how long common operations take. Google’s Jeff Dean popularized a now-famous set of approximate timings, and while the exact figures have shifted as hardware improves, their relative scale has stayed remarkably stable. The lesson is not the precise nanoseconds but the orders of magnitude between them.
The table below lists representative numbers. The original values come from Jeff Dean’s work as popularized by the ByteByteGo system design course; treat them as ballpark figures for reasoning, not exact benchmarks for your hardware.
Note: ns = Nanoseconds; µs = Microseconds; ms = Milliseconds
1,000 ns = 1 µs
1,000 µs = 1 ms
1,000 ms = 1 second
| Operation | Approximate Time |
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 100 ns |
| Main memory reference | 100 ns |
| Compress 1 KB with a fast algorithm | ~10 µs |
| Send 2 KB over a 1 Gbps network | ~20 µs |
| Read 1 MB sequentially from memory | ~250 µs |
| Round-trip within the same datacenter | ~500 µs |
| Disk seek | ~10 ms |
| Read 1 MB sequentially from disk | ~30 ms |
| Round-trip California → Netherlands → California | ~150 ms |
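A few practical conclusions fall out of these numbers, and they quietly drive a lot of architecture decisions:
Memory is fast and disk is slow: a main memory reference (~100 ns) is roughly 100,000 times quicker than a disk seek (~10 ms), which is why caches sit in front of databases.
Compression is cheap: compressing 1 KB (~10 µs) costs far less than a long-distance round trip (~150 ms), so shrinking data before sending it across regions usually pays off.
Distance is expensive: a round trip inside one datacenter (~500 µs) is about 300 times faster than a California-to-Netherlands round trip (~150 ms), so keep chatty services close together and serve users from nearby regions.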
Keep these magnitudes in mind whenever you read your own latency logs. If an endpoint that only touches memory and a local cache is taking 80 milliseconds, the numbers above tell you something is wrong — probably a hidden disk read, a network call, or lock contention.
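One way to use the table is to add up a rough latency budget for a request before you build it. The sketch below copies a few of the ballpark figures from the table into constants and sums them for a hypothetical endpoint that makes one same-datacenter round trip, one disk seek, and one 1 MB sequential disk read. The endpoint and the class name are invented for illustration.

```java
class LatencyBudget {
    // Ballpark figures from the table above, in nanoseconds.
    public static final long MAIN_MEMORY_REF_NS = 100;
    public static final long SAME_DC_ROUND_TRIP_NS = 500_000;
    public static final long DISK_SEEK_NS = 10_000_000;
    public static final long READ_1MB_DISK_NS = 30_000_000;
    public static final long CROSS_CONTINENT_ROUND_TRIP_NS = 150_000_000;

    public static double toMillis(long nanos) {
        return nanos / 1_000_000.0;
    }

    // Hypothetical request: one internal network hop, one seek, one 1 MB read.
    public static long hypotheticalRequestNs() {
        return SAME_DC_ROUND_TRIP_NS + DISK_SEEK_NS + READ_1MB_DISK_NS;
    }

    public static void main(String[] args) {
        System.out.println("Disk seek vs memory reference: "
                + (DISK_SEEK_NS / MAIN_MEMORY_REF_NS) + "x");           // 100000x
        System.out.println("Cross-continent vs same datacenter: "
                + (CROSS_CONTINENT_ROUND_TRIP_NS / SAME_DC_ROUND_TRIP_NS) + "x"); // 300x
        System.out.println("Hypothetical request budget: "
                + toMillis(hypotheticalRequestNs()) + " ms");           // 40.5 ms
    }
}
```

If the measured latency of such an endpoint is far above its budget, the gap is a hint about where to look first.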
Before building anything, good engineers do a rough calculation to understand the scale of the problem. This is back-of-the-envelope estimation: getting an approximate answer in a minute using round numbers. As Jeff Dean described it, these are estimates built from thought experiments and a few common performance numbers, just enough to feel out which designs can meet your requirements. The goal is not precision; it is knowing whether you need one server or a thousand, gigabytes or petabytes.
These estimates shape every design decision that follows. If your calculation says ten requests per second, a single modest instance will do. If it says fifty thousand, you are immediately in distributed-systems territory. Doing this math early prevents both over-engineering a toy and under-building something that will buckle.
Storage math relies on knowing your data units, and those units are powers of two. A byte is eight bits, and one ASCII character fits in a single byte. From there, each unit is 1,024 times the previous one, which you can safely round to a thousand for estimation. Memorizing this table makes storage estimation almost instant.
| Power of 2 | Approximate Value | Unit |
| 2^10 | 1 Thousand | 1 KB (Kilobyte) |
| 2^20 | 1 Million | 1 MB (Megabyte) |
| 2^30 | 1 Billion | 1 GB (Gigabyte) |
| 2^40 | 1 Trillion | 1 TB (Terabyte) |
| 2^50 | 1 Quadrillion | 1 PB (Petabyte) |
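The table maps directly onto bit shifts in Java. The helper below is a small sketch (the class name `DataUnits` is hypothetical) that defines each unit as a power of two and formats a raw byte count into the largest unit that fits.

```java
import java.util.Locale;

class DataUnits {
    public static final long KB = 1L << 10;
    public static final long MB = 1L << 20;
    public static final long GB = 1L << 30;
    public static final long TB = 1L << 40;
    public static final long PB = 1L << 50;

    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB", "PB"};

    // Formats a byte count using the largest unit that keeps the value at or above 1.
    public static String humanReadable(long bytes) {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f %s", value, UNITS[unit]);
    }

    public static void main(String[] args) {
        System.out.println(humanReadable(300 * KB)); // 300.0 KB
        System.out.println(humanReadable(GB));       // 1.0 GB
        System.out.println(humanReadable(PB));       // 1.0 PB
    }
}
```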
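Beyond data units, a couple of time tricks make division painless under pressure. A day has 86,400 seconds, which rounds to 100,000 for mental math, so dividing a daily count by 100,000 gives a quick per-second figure. Traffic is also uneven across the day, so multiply the average by a peak factor of 2 to 3 to size for busy hours.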
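A common shortcut of this kind is rounding the 86,400 seconds in a day up to 100,000. It is worth knowing how much error that introduces. The sketch below (class name hypothetical) compares the exact and rounded results for 100 million events per day.

```java
class TimeShortcuts {
    public static final long EXACT_SECONDS_PER_DAY = 24L * 60 * 60; // 86,400
    public static final long ROUNDED_SECONDS_PER_DAY = 100_000;

    public static double exactPerSecond(long perDay) {
        return perDay / (double) EXACT_SECONDS_PER_DAY;
    }

    public static long roundedPerSecond(long perDay) {
        return perDay / ROUNDED_SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long perDay = 100_000_000L;
        double exact = exactPerSecond(perDay);   // about 1,157 per second
        long rounded = roundedPerSecond(perDay); // 1,000 per second
        double errorPercent = (exact - rounded) / exact * 100;
        System.out.printf("exact=%.0f/s rounded=%d/s underestimate=%.1f%%%n",
                exact, rounded, errorPercent);
    }
}
```

The rounded figure underestimates by roughly 14 percent, which is small next to a 2 to 3 times peak multiplier, so the shortcut is safe for sizing purposes.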
Let us estimate a substantial, media-heavy system: a photo-sharing service in the spirit of Instagram, where users upload photos and scroll a feed of images from people they follow. This example is a good teacher because it forces us to estimate not just queries and storage, but also media storage and network bandwidth — the dimensions that actually determine the architecture of an image-heavy system. The numbers below are invented for the exercise, not real figures from any company.
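Good estimation always starts by writing down assumptions, because every later number depends on them. Stating them out loud is also exactly what interviewers want to see. For this exercise, assume 100 million daily active users who together upload 100 million photos per day (one each on average) and who each view 100 photos per day; an average photo size of 300 KB; about 1 KB of metadata per photo; a peak multiplier of 2 to 3; and a 5-year retention window.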
Start with writes, which are the uploads. One hundred million uploads per day divided by our rounded 100,000 seconds per day gives about 1,000 uploads per second on average. Applying the peak multiplier of 2 to 3, plan for roughly 3,000 uploads per second when traffic clusters in busy evening hours.
Now the reads, which are the feed views. Each daily active user views 100 photos, so 100 million users times 100 views is 10 billion views per day. Divided by 100,000 seconds, that is about 100,000 reads per second on average, and around 300,000 reads per second at peak. The read-to-write ratio is therefore roughly 100 to 1, which is the single most important insight from this estimate: this is an overwhelmingly read-dominated system, so the architecture must pour its effort into making reads fast and cheap.
Each day brings 100 million new photos at 300 KB each. Multiplying the two gives 30 million MB, or about 30 TB of new photo data every single day. That number alone tells you that a single database or disk is hopeless; you need distributed object storage, such as S3, from day one.
Project that forward. Over a year, 30 TB per day times 365 is roughly 11 PB (petabytes) per year. Across the 5-year retention window, you are storing on the order of 55 PB of photos. The power-of-two table from earlier lets you carry these units confidently: thousands of gigabytes become terabytes, and thousands of terabytes become petabytes.
Metadata is comparatively tiny. If each photo record — photo ID, user ID, caption, timestamps — is about 1 KB, then 100 million records per day is only 100 GB, trivial compared to the 30 TB of pixels. This contrast matters architecturally: metadata fits comfortably in a database, while the photos themselves belong in object storage with a CDN in front.
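The storage arithmetic above can be checked in a few lines. The sketch below uses decimal rounding (1 TB taken as a billion KB, 1 PB as a thousand TB), matching the round-number style of the estimate rather than exact powers of two. The class and method names are hypothetical.

```java
class StorageEstimate {

    // Daily media volume in TB, treating 1 TB as 1,000,000,000 KB (decimal rounding).
    public static double dailyTb(long photosPerDay, long photoKb) {
        return photosPerDay * (double) photoKb / 1_000_000_000.0;
    }

    // Total volume in PB over a retention period, treating 1 PB as 1,000 TB.
    public static double totalPb(double dailyTb, int years) {
        return dailyTb * 365 * years / 1000.0;
    }

    // Daily metadata volume in GB, treating 1 GB as 1,000,000 KB.
    public static double metadataGbPerDay(long recordsPerDay, long recordKb) {
        return recordsPerDay * (double) recordKb / 1_000_000.0;
    }

    public static void main(String[] args) {
        double daily = dailyTb(100_000_000L, 300);
        System.out.println("Photos per day:   " + daily + " TB");               // 30.0 TB
        System.out.println("Photos per year:  " + totalPb(daily, 1) + " PB");   // 10.95 PB
        System.out.println("Photos, 5 years:  " + totalPb(daily, 5) + " PB");   // 54.75 PB
        System.out.println("Metadata per day: " + metadataGbPerDay(100_000_000L, 1) + " GB"); // 100.0 GB
    }
}
```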
Bandwidth is a dimension lighter systems never make you think about, and for media services it is often the real cost driver. Incoming (write) bandwidth comes from uploads: about 1,000 uploads per second times 300 KB each is roughly 300 MB per second of ingress.
Outgoing (read) bandwidth dwarfs it, because reads outnumber writes 100 to 1. About 100,000 feed views per second times 300 KB each is roughly 30 GB per second of egress on average, and triple that at peak. That single number — tens of gigabytes per second leaving your system — is why image-heavy services lean so heavily on a content delivery network. Serving every one of those bytes from your own servers would be ruinously expensive and slow; a CDN caches photos close to users and absorbs the bulk of that egress.
Because reads dominate, caching is not optional, so it is worth a quick estimate too. A well-known rule of thumb is that roughly 20 percent of content drives about 80 percent of traffic — recent and popular photos are viewed far more than old ones. So rather than caching everything, we size the cache to hold the hot 20 percent of a day’s reads.
One day sees 10 billion photo views. Twenty percent of that is 2 billion views, but many are repeats of the same popular photos, so the number of distinct photos to cache is far smaller. If we aim to keep, say, the day’s 100 million newest photos hot in memory at 300 KB each, that is 30 TB of cache — clearly too much for one machine, confirming we need a distributed cache like Redis spread across many nodes. Even this rough pass tells us the cache tier is a first-class part of the design, not an afterthought.
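To turn the 30 TB figure into a node count, you need one more assumption: how much usable memory each cache node offers. The sketch below assumes 64 GB per node, a number chosen purely for illustration, and again uses decimal rounding. The class name is hypothetical.

```java
class CacheSizing {

    // Size of the hot set in TB, treating 1 TB as 1,000,000,000 KB.
    public static double hotSetTb(long hotPhotos, long photoKb) {
        return hotPhotos * (double) photoKb / 1_000_000_000.0;
    }

    // Number of cache nodes needed, rounding up, treating 1 TB as 1,000 GB.
    public static long nodesNeeded(double hotSetTb, double usableGbPerNode) {
        return (long) Math.ceil(hotSetTb * 1000.0 / usableGbPerNode);
    }

    public static void main(String[] args) {
        double hotSet = hotSetTb(100_000_000L, 300);
        long nodes = nodesNeeded(hotSet, 64.0); // 64 GB per node is an assumed figure
        System.out.println("Hot set: " + hotSet + " TB, nodes needed: " + nodes); // 30.0 TB, 469 nodes
    }
}
```

Several hundred nodes is a large fleet, so in practice you would also look at caching smaller resized images or a narrower hot set; the point of the estimate is that a single machine is clearly out of the question.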
The Java helper below generalizes the method for QPS, peak, and bandwidth into a small, reusable estimator you can keep in your notes.
// Back-of-the-envelope helper for a media-heavy service.
public class CapacityEstimator {

    private static final long SECONDS_PER_DAY = 100_000; // 86,400 rounded up for easy division

    // Average events per second from a daily total.
    public static long perSecond(long perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    // Peak load from the average and a peak multiplier (typically 2 to 3).
    public static long peak(long average, int multiplier) {
        return average * multiplier;
    }

    // Bandwidth in megabytes per second given QPS and average payload size in KB.
    public static double bandwidthMbPerSec(long qps, double payloadKb) {
        return (qps * payloadKb) / 1024.0;
    }

    public static void main(String[] args) {
        long uploads = perSecond(100_000_000L);  // ~1,000/s
        long views = perSecond(10_000_000_000L); // ~100,000/s
        System.out.println("Avg uploads/s: " + uploads);
        System.out.println("Avg views/s: " + views);
        System.out.println("Peak views/s: " + peak(views, 3));
        double egress = bandwidthMbPerSec(views, 300); // ~29,300 MB/s, roughly 30 GB/s
        System.out.printf("Read egress: %.0f MB/s%n", egress);
    }
}
The takeaway is how much these four numbers reveal in under two minutes. A 100-to-1 read-to-write ratio, petabytes of media storage, tens of gigabytes per second of egress, and a multi-node cache requirement together sketch the entire shape of the system: object storage for photos, a database for metadata, a CDN for egress, and a distributed cache for hot reads. We have not designed anything yet, but estimation has already ruled out every single-machine approach and pointed straight at the right building blocks. The process matters more than the exact figures, and stating each assumption out loud is what lets others challenge and refine it.
Estimating load tells you how much traffic a system must handle. Availability tells you how reliably it must stay up while doing so. Availability is the percentage of time a system is operational, and it is one of the most important non-functional requirements in any real design discussion.
A service level agreement, or SLA, is a formal promise from a provider about uptime. Major cloud providers such as Amazon, Google, and Microsoft typically commit to 99.9 percent or higher for their core services. Uptime is traditionally counted in “nines,” and each additional nine is dramatically harder and more expensive to achieve than the last.
The table below shows how little downtime each level actually permits. The jump from three nines to five nines looks small on paper but represents an enormous engineering difference.
| Availability | Downtime per Day | Downtime per Year |
| 99% (two nines) | ~14.4 minutes | ~3.65 days |
| 99.9% (three nines) | ~1.44 minutes | ~8.77 hours |
| 99.99% (four nines) | ~8.6 seconds | ~52.6 minutes |
| 99.999% (five nines) | ~864 milliseconds | ~5.26 minutes |
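The figures in the table come from simple multiplication, so you can reproduce them for any target. The sketch below (class name hypothetical) assumes a 365-day year, which gives 8.76 hours for three nines; the table's 8.77 hours corresponds to a 365.25-day year.

```java
class DowntimeCalculator {

    // Allowed downtime per day, in seconds, for an availability given as a percentage.
    public static double downtimeSecondsPerDay(double availabilityPercent) {
        return (100.0 - availabilityPercent) / 100.0 * 86_400;
    }

    // Allowed downtime per 365-day year, in minutes.
    public static double downtimeMinutesPerYear(double availabilityPercent) {
        return (100.0 - availabilityPercent) / 100.0 * 365 * 24 * 60;
    }

    public static void main(String[] args) {
        for (double target : new double[] {99.0, 99.9, 99.99, 99.999}) {
            System.out.printf("%.3f%% -> %.2f s/day, %.1f min/year%n",
                    target,
                    downtimeSecondsPerDay(target),
                    downtimeMinutesPerYear(target));
        }
    }
}
```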
Why does this matter for a Java developer? Because the availability target dictates architecture. Two nines might be fine for an internal reporting tool and achievable with a single well-monitored instance. Four or five nines demand redundancy at every layer: multiple instances, automatic failover, replicated databases, and health checks. This is exactly where the horizontal scaling from earlier becomes non-negotiable, since a design with a single point of failure cannot realistically sustain four nines.
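A short calculation shows why redundancy helps so much. The sketch below uses the standard probability formulas for components in parallel and in series, under the strong assumption that failures are independent; real outages are often correlated (shared network, shared deploy, shared region), so treat the results as an optimistic bound. The class name is hypothetical.

```java
class CombinedAvailability {

    // Availability of n redundant copies where any one copy can serve traffic.
    // Assumes independent failures: the system is down only if all copies are down.
    public static double redundant(double single, int copies) {
        return 1.0 - Math.pow(1.0 - single, copies);
    }

    // Availability of components that must all be up at once (for example app plus database).
    public static double inSeries(double... parts) {
        double result = 1.0;
        for (double part : parts) {
            result *= part;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.printf("One 99%% instance:   %.4f%%%n", 0.99 * 100);
        System.out.printf("Two 99%% instances:  %.4f%%%n", redundant(0.99, 2) * 100);
        System.out.printf("App 99.99%% + DB 99.9%% in series: %.4f%%%n",
                inSeries(0.9999, 0.999) * 100);
    }
}
```

Two modest 99 percent instances reach about four nines together, while a chain of dependencies is always a little worse than its weakest link, which is why every layer needs its own redundancy.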
In Spring Boot, the building blocks for high availability are practical and familiar. Health-check endpoints let a load balancer detect and route around a sick instance. The example below shows a simple custom health indicator using Spring Boot Actuator.
// A custom health check so the load balancer can detect
// an unhealthy instance and stop sending it traffic.
@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    private final JdbcTemplate jdbcTemplate;

    public DatabaseHealthIndicator(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public Health health() {
        try {
            jdbcTemplate.queryForObject("SELECT 1", Integer.class);
            return Health.up().build();
        } catch (Exception ex) {
            // Reporting DOWN lets the load balancer skip this node,
            // protecting overall availability.
            return Health.down(ex).build();
        }
    }
}
When the Actuator health endpoint reports DOWN because of this indicator, an upstream load balancer can stop routing requests to the failing instance while the others continue serving. That simple mechanism is a cornerstone of maintaining availability in a horizontally scaled system, even when individual machines fail.
These ideas are not separate; they feed into one another in a natural order. Estimation comes first and tells you the scale. Scalability decisions follow, because the numbers reveal whether one beefy server suffices or whether you must scale out. Latency and throughput then become the metrics you watch to confirm your design actually holds up under load.
Return to the photo-sharing feed. Estimation revealed about 100,000 feed reads per second on average, triple that at peak, and tens of gigabytes per second of egress. Those numbers immediately rule out a single instance and point toward horizontal scaling, a CDN, and a distributed cache. Once running, you would monitor read latency to keep each feed scroll snappy, and read throughput to confirm the fleet and cache keep up with demand. The concepts form a loop: estimate, scale, measure, repeat.
Thinking in this loop is what separates guessing from engineering. Availability sits alongside it as the reliability target that the whole loop must respect. Every later topic in this series — caching, load balancing, databases, message queues — plugs into the same loop. They are simply tools for meeting your latency target while maintaining your throughput at the scale your estimate predicted, without breaking your availability promise.
Beginners tend to stumble over the same points, and being aware of them early can save a lot of pain. Here are the ones that come up most often, along with the reasoning behind each.
System design interviews reward structured thinking more than memorized solutions. When you’re asked to design a system, start by estimating its scale. Clearly state your assumptions about the number of users, daily traffic, and request volume. Convert these estimates into queries per second (QPS) and approximate storage requirements using the power-of-two units from earlier (KB, MB, GB, and so on). This demonstrates that you understand the expected scale before making architectural decisions.
Interviewers also expect you to justify your scaling strategy. A strong answer is to begin with vertical scaling because it is simpler and easier to manage for small systems. As traffic grows, transition to horizontal scaling to increase capacity and improve reliability. Be sure to explain the trade-off: vertical scaling offers simplicity but creates a single point of failure, while horizontal scaling provides better fault tolerance and scalability at the cost of increased system complexity. Mentioning that high-availability systems require redundancy—which horizontal scaling enables—shows a solid understanding of real-world architecture.
When discussing performance, always distinguish between latency and throughput. If someone asks how you would make a system faster, first clarify whether the goal is to reduce the response time for each request (latency) or to increase the number of requests the system can handle (throughput). Since these goals often require different optimization strategies, asking this clarifying question demonstrates careful reasoning and is often more impressive than immediately suggesting a particular technology.
Translating theory into daily practice is what makes these concepts valuable. A few habits will carry you a long way as you build Spring Boot services with scale in mind.
Here is a compact glossary of every core term from this article, so you can revisit them quickly.
| Term | Meaning in One Line |
| Scalability | Handling more work by adding resources |
| Vertical scaling | A bigger, more powerful single machine |
| Horizontal scaling | More machines working in parallel |
| Stateless service | Each request is self-contained; no server memory between calls |
| Latency | Time for one request to complete (lower is better) |
| Throughput | Requests handled per second (higher is better) |
| QPS | Queries per second; the standard load measure |
| Estimation | Rough capacity math using round numbers |
| Peak multiplier | Average load times 2–3 to size for busy hours |
| Availability | Percentage of time a system is operational, counted in nines |
System design basics begin with three ideas that everything else builds on. Scalability tells you how a system grows, through a bigger machine or more machines, each with its own trade-offs. Latency and throughput give you the two numbers to measure performance, one for the individual and one for the crowd. Back-of-the-envelope estimation lets you size a system in under a minute, turning vague ambition into concrete numbers.
None of this requires advanced math or exotic tools. It requires a habit of thinking in scale and trade-offs before you write code. Carry that habit into your next Spring Boot project: estimate first, choose your scaling path with eyes open, and watch latency and throughput to confirm reality matches the plan. Master these foundations now, and every advanced topic that follows will feel like a natural extension rather than a leap.