Load Balancing and Types of Load Balancers Explained
Last Updated: July 1, 2026
By: javahandson
A clear guide to load balancing and the types of load balancers: server-side, DNS-based, and client-side load balancing, how they work together, L4 vs L7 load balancers, common load balancing algorithms, and sticky sessions.
Running a service on a single server only gets you so far. Once traffic grows, most teams respond by running multiple copies of the same service rather than moving to one ever-larger server. This approach is called horizontal scaling, and it solves the capacity problem well. But it immediately raises a new question: with five servers instead of one, something has to decide which server handles each incoming request.
That something is a load balancer, and understanding load balancing in system design is one of the first skills every backend engineer needs. Load balancing is the practice of spreading incoming traffic across multiple servers so that no single machine gets overwhelmed. It sounds simple on the surface, but the details shape almost every large system you will ever design.
This article covers load balancing in system design end-to-end: what a load balancer actually does, the difference between Layer 4 and Layer 7 load balancers, the common algorithms used to select a server, and a topic that trips up many developers and interview candidates: sticky sessions. We will keep the language simple and back the ideas with small, practical examples.
You do not need to run a company the size of Amazon to hit this problem. Even a modest application can outgrow a single instance the moment it gets featured somewhere, goes viral on social media, or simply grows its user base over a few quarters. The moment a second server enters the picture, load balancing stops being optional and becomes a core part of the architecture, not an afterthought bolted on later.
Imagine a website running on three servers behind a single public address. Users do not know or care that three servers exist. They just type a URL and expect a response. Somebody has to receive that request and hand it to one of the three servers. Without that layer, you would need to give every user a different address for every server, which defeats the entire purpose of scaling out.
Load balancing solves this invisibly. It gives users a single address to talk to while quietly distributing the actual work across many machines behind the scenes. Horizontal scaling gives you more servers, and load balancing makes those servers act like a single reliable service.
There is also a resilience angle. If a load balancer knows that a server is unhealthy, it can simply stop sending traffic to it. Users never notice the failure because their requests are quietly routed to a healthy server instead. This is one of the reasons load balancing is treated as a foundational topic in system design interviews: it touches scalability, availability, and performance all at once.
There is a cost angle too, which often gets overlooked. Without load balancing, teams tend to over-provision a single giant server to provide headroom for traffic spikes, and that server sits mostly idle outside peak hours. Spreading traffic across several right-sized servers and scaling that fleet up or down with demand is usually far cheaper than keeping one oversized machine running around the clock.
From a career standpoint, load-balancing questions frequently appear in system design interviews because they require candidates to reason through trade-offs rather than recite a definition. An interviewer can ask a dozen small follow-up questions about a single load balancer diagram: what happens if a server dies, what happens if the load balancer itself dies, how do you handle a user’s session, and how do you route traffic to a new version of the service. Being comfortable with the fundamentals in this article covers most of those follow-ups.
A load balancer, in the most general sense, is anything that decides which backend server handles an incoming request instead of leaving that choice to the client. That decision can be made in a few genuinely different places, and it helps to name each one explicitly rather than picturing only the classic single box in the middle. Keep in mind that there are really two separate questions here: where the routing decision is made, which the three subsections below cover, and how a given load balancer is actually built, which is the hardware, software, or cloud distinction discussed under server-side load balancing.
This is the model most people picture by default. A dedicated component sits between clients and a group of backend servers. It first receives every incoming request, decides which backend server should handle it, and forwards the request to that server. The response usually flows back through the same path, so every request and response passes through this one component.
Picture a busy restaurant with several chefs working in the kitchen. Even with all that cooking power, the restaurant still needs a host at the entrance. The host greets each guest, checks which tables are available, seats them efficiently, and keeps the dining room running smoothly. A load balancer plays the same role in a distributed system. Instead of guests, it receives incoming requests; instead of tables, it directs those requests to available server instances, so that the workload is spread evenly and no single server becomes overwhelmed.
Server-side load balancers come in a few flavors, but it is worth being clear that these are just different implementations of the same server-side model described above, not a separate category of load balancing. All of them sit in the request path and forward traffic; they only differ in how they are built and operated. Hardware load balancers are dedicated physical appliances, common in older data centers. Software load balancers such as NGINX and HAProxy run as regular processes on ordinary servers and are far more common today. Cloud-managed load balancers, such as AWS Elastic Load Balancing with its Application Load Balancer and Network Load Balancer, are software load balancers run and scaled for you by the cloud provider, so you configure them instead of installing and maintaining them yourself.
Regardless of which of these three you use, the underlying job is identical: sit between clients and servers, pick a healthy backend, and forward the request. Choosing which backend to forward to still requires a rule, such as sending requests in a fixed rotation or to the server that is least busy; these rules are the load-balancing algorithms covered later in this article.
DNS-based load balancing works at a completely different point in the request’s journey: before any connection is even opened, and before a server-side load balancer ever sees the request. When a client looks up a domain name, the DNS server can return a different IP address to different clients for the exact same domain, using techniques such as round-robin DNS, which simply rotates through a list of IPs, or GeoDNS, which returns the IP address of the data center closest to the client’s location.
This gives DNS-based load balancing a very different character from a server-side load balancer. Nothing proxies the actual request; DNS only hands out an address and then steps out of the picture entirely. It is also coarse-grained, since it can only route a client to an entire data center or region, not to an individual server, and it is slow to react to failures, because DNS answers are cached by browsers, operating systems, and intermediate resolvers for a duration controlled by the record’s time-to-live, or TTL. Lowering the TTL helps failover occur faster, but it also means every client re-queries DNS more often, which adds load on the DNS infrastructure.
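The effect of TTL on failover speed can be illustrated with a toy cache. This is only a sketch: `TtlDnsCache`, the domain, and the IP strings are hypothetical, the clock is passed in by hand, and a real resolver is far more involved.

```java
import java.util.HashMap;
import java.util.Map;

// A toy client-side DNS cache. Hypothetical class, for illustration only.
final class TtlDnsCache {

    private static final class Entry {
        final String ip;
        final long expiresAtMillis;

        Entry(String ip, long expiresAtMillis) {
            this.ip = ip;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;

    TtlDnsCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // authoritativeIp stands in for whatever the DNS server would answer right now.
    String resolve(String domain, String authoritativeIp, long nowMillis) {
        Entry cached = cache.get(domain);
        if (cached != null && nowMillis < cached.expiresAtMillis) {
            // Still within the TTL: the client keeps using the old answer.
            return cached.ip;
        }
        cache.put(domain, new Entry(authoritativeIp, nowMillis + ttlMillis));
        return authoritativeIp;
    }
}
```

With a 60-second TTL, a client that resolved the name just before a failure keeps receiving the old address for up to a minute, even if DNS has already been updated. A shorter TTL narrows that window at the price of more lookups.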
Despite these limitations, DNS-based load balancing is often the very first layer of load balancing in a global system, deciding which region or data center a user’s traffic goes to before a server-side Layer 4 or Layer 7 load balancer inside that region takes over and spreads the request across individual servers.
Client-side load balancing flips the model around. Instead of every request going through a shared load balancer that then selects a server, the caller itself maintains a list of available instances and selects one directly, often with help from a service registry such as Eureka or Consul. There is no separate load balancer box in the middle at all.
First, though, a word of caution about the name, because it trips up almost everyone the first time they meet it. In web development, the term “client” usually refers to the browser or the end user’s device, so client-side load balancing sounds like the browser choosing which server to talk to. That is not what it means here, and browsers essentially never do this.
The word client is relative: it simply means whoever is making a given request. Most of the time, that caller is another backend service. If an Order Service calls an Inventory Service, then within that call, the Order Service is the client, even though the Order Service is itself a backend server sitting in your data center. This service-to-service case is where client-side load balancing most often appears.
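But the caller need not be a backend service. Two other examples are worth knowing, because they show the same idea outside the data center: a native mobile app that ships with a short list of API addresses and picks one itself, and an SDK or client library handed to partners that makes the same choice on their behalf.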
What these cases share is that the caller is trusted code you control or ship, whether that is a backend service, your own mobile app, or your own SDK. That is exactly why a plain web browser is the one caller that essentially never does client-side load balancing, which is worth spelling out next, since the name misleads almost everyone at first.
There are good reasons browsers are kept out of this. A browser would have to know the private addresses of all your internal instances, which is a security problem; it would have to talk directly to your service registry; and you would be shipping load-balancing logic into untrusted code you do not control. So the browser always talks to a single stable public endpoint handled by a server-side load balancer, and client-side load balancing is reserved for callers you trust.
Spring Cloud LoadBalancer is the standard tool for client-side load balancing in the Spring ecosystem. It lets one microservice call another by a logical service name, while the client library resolves that name to a real instance and picks a healthy one, as shown below.
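```java
import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// An Order Service calling an Inventory Service by name.
// Here the Order Service is the "client", not any browser.
@Service
public class InventoryClient {

    private final WebClient webClient;

    // Assumes the injected builder is a @LoadBalanced WebClient.Builder bean;
    // without that, "inventory-service" is treated as an ordinary hostname.
    public InventoryClient(WebClient.Builder builder) {
        // "inventory-service" is resolved via the service registry,
        // and Spring Cloud LoadBalancer picks a healthy instance.
        this.webClient = builder.baseUrl("http://inventory-service").build();
    }

    public Mono<Integer> getStockLevel(String sku) {
        return webClient.get()
                .uri("/stock/{sku}", sku)
                .retrieve()
                .bodyToMono(Integer.class);
    }
}
```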
The distinction matters in interviews.
A traditional server-side load balancer sits between the client and the servers. It makes routing decisions for everyone, which keeps clients simple, but it also adds an extra network hop and another component that can fail.
Client-side load balancing removes that extra hop. The caller talks directly to a server instance, but now the caller must know how to discover available instances and choose one. This can be faster, but it makes the client a little smarter and more complex.
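Stripped of any framework, the caller's extra responsibility comes down to two steps: look up the live instances, then choose one. The sketch below shows that in plain Java. `InMemoryRegistry` and `ClientSideBalancer` are hypothetical names, and the in-memory map only stands in for a real registry such as Eureka or Consul.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory stand-in for a service registry.
final class InMemoryRegistry {

    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    void register(String serviceName, String address) {
        instances.computeIfAbsent(serviceName, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    void deregister(String serviceName, String address) {
        List<String> list = instances.get(serviceName);
        if (list != null) {
            list.remove(address);
        }
    }

    List<String> lookup(String serviceName) {
        return List.copyOf(instances.getOrDefault(serviceName, List.of()));
    }
}

// The caller owns the choice: discover live instances, then rotate through them.
final class ClientSideBalancer {

    private final InMemoryRegistry registry;
    private final AtomicInteger counter = new AtomicInteger();

    ClientSideBalancer(InMemoryRegistry registry) {
        this.registry = registry;
    }

    String choose(String serviceName) {
        List<String> live = registry.lookup(serviceName);
        if (live.isEmpty()) {
            throw new IllegalStateException("No instances for " + serviceName);
        }
        return live.get(Math.floorMod(counter.getAndIncrement(), live.size()));
    }
}
```

There is no proxy in the path here. Once an instance deregisters, the next lookup no longer returns it, so the caller stops choosing it.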
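One important caveat: client-side load balancing is not the only way to handle internal traffic. Internal calls can also pass through an internal server-side load balancer, or through a service mesh that moves load balancing into sidecar proxies, as Istio and Linkerd do.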
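A simple way to remember the big picture: DNS-based load balancing picks a region, server-side load balancing picks a server, and client-side load balancing lets the caller pick an instance itself.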
Interview takeaway: There is no single “correct” approach. The trade-off is usually among simplicity, flexibility, and performance. Many modern systems use a load balancer at the edge and one of several mechanisms for balancing internal service-to-service traffic.
These three types are not rivals. You do not pick one and drop the others. In a big system, they usually work as a team. Each one handles a different step of the journey a request takes. Once you can see how they chain together, the whole topic stops feeling like a pile of separate tricks and starts feeling like one pipeline.
The clearest way to understand the chain is to walk through real requests. We will follow two different ones. The first is a normal web request from a browser, in which all three types appear in their most common forms. The second is a request from a mobile app that shows client-side load balancing occurring outside the backend, so you do not walk away thinking client-side always means service-to-service.
Imagine a user in London opens a shopping app in their browser. The app runs in two regions: one in Europe and one in North America. Here is what happens, step by step.
Each step makes a smaller choice than the one before. DNS picks a region. The edge load balancer picks a public-facing server in that region. The internal step picks one backend instance for a service-to-service call. No single step could do the others’ jobs. DNS cannot read a URL path. The internal balancer has no idea which region the user is in. They only work well as a chain.
Now change one thing. Instead of a browser, the user is on the company’s own native mobile app. This matters because a mobile app is trusted code the company wrote and shipped, unlike a random web browser. That trust lets the app take on a job a browser never could: doing its own client-side load balancing.
Suppose the app’s backend exposes two or three public API gateway addresses, one per region or one per availability zone. The company builds a short list of these addresses right into the app. Here is how a request plays out.
The key takeaway is in Step 1. The same idea, a caller holding a list of targets and choosing one themselves, is happening on a phone rather than on a server. gRPC clients, desktop apps, and SDKs handed to partners can all do this, too. So client-side does not mean inside the backend; it means the caller chooses, wherever that caller happens to run. What still holds is that the caller is trusted code that the company controls, which is why a plain web browser stays out of it and instead leans on the server-side load balancer.
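A caller that holds its own list of targets and chooses among them can be sketched in a few lines. Everything here is an assumption made for illustration: the `EndpointChooser` name, the gateway URLs, and the `isReachable` check, which stands in for a real connectivity probe.

```java
import java.util.List;
import java.util.function.Predicate;

// A trusted client (for example a mobile app) that ships with a short list
// of gateway addresses and picks one itself. Hypothetical sketch.
final class EndpointChooser {

    private final List<String> gateways;

    EndpointChooser(List<String> gateways) {
        if (gateways.isEmpty()) {
            throw new IllegalArgumentException("Need at least one gateway");
        }
        this.gateways = List.copyOf(gateways);
    }

    // preferredIndex could come from a latency probe or a random pick;
    // isReachable stands in for a real connectivity check.
    String choose(int preferredIndex, Predicate<String> isReachable) {
        for (int offset = 0; offset < gateways.size(); offset++) {
            String candidate = gateways.get(Math.floorMod(preferredIndex + offset, gateways.size()));
            if (isReachable.test(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No gateway reachable");
    }
}
```

If the preferred gateway is unreachable, the app simply moves on to the next address in its list, with no DNS change involved.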
Seeing the chain also makes failures easier to reason about, because each layer fails in its own way and recovers on its own timeline. This is a favorite area for interview follow-up questions.
The pattern is that lower, finer-grained layers recover quickly, while higher, coarser layers recover slowly but cover bigger failures. A healthy design relies on each layer for the kinds of failures it handles best, rather than expecting any single layer to catch everything.
The table below lines up the three types so the split of work is easy to remember.
| Type | Where in the Journey | Granularity of Choice | Recovers From |
| --- | --- | --- | --- |
| DNS-based | Before the connection opens, during name lookup | Region or data center | A whole region going down (slowly) |
| Server-side (edge) | At the entry point of a region, in the request path | One user-facing server in a fleet | A single server failing (fast) |
| Client-side | In the caller, whether a backend service, phone, or SDK | One instance of a called service | A single instance failing (fast) |
This whole chain is a strong thing to sketch early in a system design interview. Opening with traffic hitting DNS first, then an edge load balancer, then internal balancing for service calls, signals that you see load balancing as a set of stages rather than one magic box, and it sets up the failure discussion above as a natural next step.
Before going further, it helps to connect this section back to the three types of load balancing covered earlier. The L4 versus L7 distinction is not a fourth type of load balancer alongside server-side, DNS-based, and client-side. It is a property of server-side load balancers, specifically: it describes which network layer the in-path load balancer inspects when deciding where to send a request. DNS-based load balancing happens before a connection even exists, so it has no L4 or L7 to speak of, and client-side load balancing keeps this decision inside the calling service rather than in a separate box.
With that framing in place, the most important distinction among server-side load balancers is which layer of the network they operate at. This single detail determines what the load balancer can and cannot see about a request, and therefore what routing decisions it can make.
A Layer 4 load balancer works at the transport layer. It looks only at IP addresses and TCP or UDP ports. It has no idea whether the traffic inside the connection is HTTP, a database protocol, or anything else. It simply forwards packets to a chosen backend and keeps the connection open.
Because it does so little inspection, an L4 load balancer is extremely fast and uses very little CPU. It is a good fit whenever you need raw speed, and the routing decision does not depend on the request’s content. AWS Network Load Balancer and simple TCP load balancers, such as IPVS, are common examples.
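One common way to make a decision from connection information alone is to hash the address and port tuple. The sketch below assumes that approach; `Layer4Balancer` is a hypothetical name, and real L4 balancers work on packets rather than Java strings.

```java
import java.util.List;
import java.util.Objects;

// A Layer 4 style decision: only addresses and ports are visible.
final class Layer4Balancer {

    private final List<String> backends;

    Layer4Balancer(List<String> backends) {
        this.backends = List.copyOf(backends);
    }

    // Hashing the connection tuple keeps every packet of one connection
    // on the same backend without reading any of the payload.
    String pick(String srcIp, int srcPort, String dstIp, int dstPort) {
        int hash = Objects.hash(srcIp, srcPort, dstIp, dstPort);
        return backends.get(Math.floorMod(hash, backends.size()));
    }
}
```

Nothing in `pick` knows whether the connection carries HTTP or a database protocol, which is the limitation and the speed advantage described above.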
A Layer 7 load balancer works at the application layer. For web traffic, that means it can read the full HTTP request: the URL path, headers, cookies, and even the request body. This opens the door to smart, content-aware routing decisions.
For example, an L7 load balancer can send every request that starts with /api/orders to the order service and every request that starts with /api/users to the user service. It can also terminate SSL, rewrite headers, or route a small percentage of traffic to a new version of a service for A/B testing. NGINX, HAProxy in Layer 7 mode, AWS Application Load Balancer, and Spring Cloud Gateway are common examples.
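Path-based routing of this kind reduces to a prefix lookup on the request path. The following is a minimal sketch under that assumption. `PathRouter`, the service names, and the default route are hypothetical, and real proxies support much richer matching rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A Layer 7 style decision: the request path is visible and drives routing.
final class PathRouter {

    // Insertion order matters: the first matching prefix wins.
    private final Map<String, String> routes = new LinkedHashMap<>();
    private final String defaultService;

    PathRouter(String defaultService) {
        this.defaultService = defaultService;
    }

    PathRouter route(String pathPrefix, String serviceName) {
        routes.put(pathPrefix, serviceName);
        return this;
    }

    String resolve(String requestPath) {
        for (Map.Entry<String, String> entry : routes.entrySet()) {
            if (requestPath.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return defaultService;
    }
}
```

A router built with `new PathRouter("web-frontend").route("/api/orders", "order-service").route("/api/users", "user-service")` would send `/api/orders/42` to the order service and any unmatched path to the default.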
The extra intelligence comes at a cost. Parsing HTTP requests takes more CPU than blindly forwarding packets, so L7 load balancers are typically slower per request than L4 ones. In practice, most modern microservice architectures still choose L7, because path-based routing and SSL termination are usually worth the small performance cost.
| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
| --- | --- | --- |
| Sees | IP address and port only | Full HTTP request: path, headers, cookies |
| Routing decisions | Based on connection info | Based on the request content |
| Speed | Very fast, low CPU cost | Slower, higher CPU cost |
| SSL termination | Not aware of it | Can terminate SSL itself |
| Path-based routing | Not possible | Possible, e.g., /orders vs /users |
| Examples | AWS NLB, IPVS | NGINX, HAProxy, AWS ALB, Spring Cloud Gateway |
A simple rule of thumb for interviews: if the requirement mentions routing based on URL path, hostname, or request headers, the answer is Layer 7. If the requirement is just about spreading raw TCP connections as fast as possible, Layer 4 is enough.
In practice, large systems often do not pick just one. A common pattern is to place a Layer 4 load balancer at the very edge of the network, since it is cheap and extremely fast at handling huge volumes of raw traffic, and then place Layer 7 load balancers behind it to handle smarter, content-aware routing to individual services.
Cloud providers reflect this layering directly in their product names. AWS offers a Network Load Balancer for the L4 tier and an Application Load Balancer for the L7 tier; it is common to see an NLB in front of a fleet of ALBs, or an NLB in front of a Kubernetes ingress controller that performs Layer 7 routing. Recognizing this two-tier pattern is a good way to demonstrate depth in a system design discussion, as it shows that L4 and L7 are complementary tools rather than competing choices.
Once a load balancer receives a request, it still needs a rule to decide which backend server to route it to. Here is a quick tour of the most common ones. Each deserves a deeper look on its own, but this overview is enough to reason about trade-offs in a design discussion.
Each of these algorithms selects only from servers currently marked healthy. The algorithm and the health check work as a pair: health checks determine who is eligible, and the algorithm decides which eligible server receives the next request. A perfectly tuned algorithm cannot help if it keeps sending traffic to a server that health checks failed to catch in time.
| Algorithm | Best For |
| --- | --- |
| Round Robin | Servers with equal capacity |
| Weighted Round Robin | Mixed or uneven hardware |
| Least Connections | Long-lived or variable-duration requests |
| Least Response Time | Latency-sensitive services |
| IP Hash | Simple session affinity |
| Consistent Hashing | Caching and sharded systems, minimal reshuffling |
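The "minimal reshuffling" property of consistent hashing is easiest to see in code. The sketch below is one possible hash ring, not a reference implementation: `ConsistentHashRing` is a hypothetical name, CRC32 is used only because it ships with the JDK, and the number of virtual nodes is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.CRC32;

// A small hash ring with virtual nodes. Hypothetical sketch.
final class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            // Remove the point only if it still belongs to this server.
            ring.remove(hash(server + "#" + i), server);
        }
    }

    // Walk clockwise from the key's position to the next server point.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("Ring is empty");
        }
        Long point = ring.ceilingKey(hash(key));
        return ring.get(point != null ? point : ring.firstKey());
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

When a server is removed, only the keys that pointed at it move to a neighbor on the ring. Keys owned by the other servers stay where they were, which is why this approach suits caches and sharded stores.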
A minimal round-robin selector in Java illustrates the core idea: keep a list of servers and a pointer, and advance the pointer on every request.
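```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// A minimal round robin server selector.
public class RoundRobinBalancer {

    private final List<String> servers;
    private final AtomicInteger index = new AtomicInteger(0);

    public RoundRobinBalancer(List<String> servers) {
        // Copy defensively so later changes to the caller's list cannot affect rotation.
        this.servers = List.copyOf(servers);
    }

    public String nextServer() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(index.getAndIncrement(), servers.size());
        return servers.get(i);
    }
}
```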
A production load balancer builds on this same loop, adding health checks, weights, and support for servers joining or leaving at runtime.
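As a rough illustration of two of those additions, the sketch below layers weights and a health flag on top of the same rotating pointer. `WeightedHealthAwareBalancer` and its methods are hypothetical, and expanding weights into repeated slots is just one simple way to implement weighting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Round robin with weights and a health filter. Hypothetical sketch.
final class WeightedHealthAwareBalancer {

    private final List<String> rotation = new ArrayList<>();
    private final Set<String> unhealthy = ConcurrentHashMap.newKeySet();
    private final AtomicInteger index = new AtomicInteger();

    // A server with weight 3 appears three times in the rotation.
    WeightedHealthAwareBalancer(Map<String, Integer> weights) {
        weights.forEach((server, weight) -> {
            for (int i = 0; i < weight; i++) {
                rotation.add(server);
            }
        });
        if (rotation.isEmpty()) {
            throw new IllegalArgumentException("No servers configured");
        }
    }

    void markUnhealthy(String server) {
        unhealthy.add(server);
    }

    void markHealthy(String server) {
        unhealthy.remove(server);
    }

    String nextServer() {
        // Try each slot at most once so a fully unhealthy fleet fails fast.
        for (int attempt = 0; attempt < rotation.size(); attempt++) {
            String candidate = rotation.get(Math.floorMod(index.getAndIncrement(), rotation.size()));
            if (!unhealthy.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy servers available");
    }
}
```

Over a full cycle, a server with weight 3 receives three times as many requests as a server with weight 1, and a server marked unhealthy is skipped until it is marked healthy again.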
Load balancing assumes that any server can handle any request equally well. That assumption breaks the moment a server stores something in memory about a specific user, such as their shopping cart or login session. If the next request from that user lands on a different server, the new server has no idea who the user is.
Picture a user who logs in, and the server stores their session in local memory. Their next click gets routed to a different server. That server never saw the login, so it treats the user as logged out. This is a classic bug caused by mixing stateful servers with load balancing.
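Sticky sessions, also called session affinity, solve this by pinning a client to the same backend server for the life of their session. Two common mechanisms are used: a cookie issued by the load balancer that identifies the chosen backend, and hashing the client’s IP address so that the same address always maps to the same server.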
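The pinning behavior itself is simple to sketch. In the version below, `sessionId` stands in for the value of an affinity cookie, and the affinity table lives inside the router. Both are assumptions made for illustration, as are the `StickyRouter` name and the server labels; real load balancers often encode the backend in the cookie instead.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Pins each session to the server chosen on its first request. Hypothetical sketch.
final class StickyRouter {

    private final List<String> servers;
    private final Map<String, String> affinity = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    StickyRouter(List<String> servers) {
        this.servers = List.copyOf(servers);
    }

    // The first request picks a server in rotation; later requests reuse that choice.
    String route(String sessionId) {
        return affinity.computeIfAbsent(sessionId,
                id -> servers.get(Math.floorMod(next.getAndIncrement(), servers.size())));
    }
}
```

Every request carrying the same session identifier lands on the same server, no matter how many other sessions arrive in between.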
Sticky sessions bring back the exact problem that horizontal scaling was meant to remove: a server that a specific user now depends on. If that server crashes, every user pinned to it loses their session. Sticky sessions can also create uneven load, since a handful of very active users might pile extra work onto one server while others sit idle.
For this reason, sticky sessions are best thought of as a short-term patch rather than a long-term design choice. They work, but they quietly reintroduce a single point of failure at the level of individual user sessions.
Sticky sessions also complicate autoscaling. When a new server joins the fleet to absorb extra load, it starts with zero pinned users, since affinity only forms as new sessions begin. Existing users stay glued to the older, already busy servers, so the new capacity helps less than expected during exactly the moment it is needed most. Removing a server is just as awkward because every user pinned to it must be migrated or will simply lose their session when the instance is terminated.
The better long-term answer to session affinity is to remove the need for it entirely. Instead of storing session data on a specific server, store it in a shared external store that every server instance can read from. This keeps servers stateless, which is what makes horizontal scaling and load balancing work smoothly together in the first place.
In the Spring Boot world, the standard tool for this is Spring Session backed by Redis. Every server instance still creates and reads sessions the normal way through the Servlet API, but under the hood, Spring Session stores the session data in Redis rather than in local memory.
Adding this to a Spring Boot project is mostly configuration, not code changes. The dependency and a small properties block are usually enough.
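```groovy
// build.gradle
implementation 'org.springframework.session:spring-session-data-redis'
implementation 'org.springframework.boot:spring-boot-starter-data-redis'
```

```properties
# application.properties
# Property names as used in Spring Boot 2.x. Spring Boot 3.x moved the Redis
# settings to spring.data.redis.* and no longer uses spring.session.store-type.
spring.session.store-type=redis
spring.redis.host=localhost
spring.redis.port=6379
```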
Once this is in place, a user can log in through one server, and their very next request can land on any other instance. Every server reads the same Redis session, so the user experience remains seamless, with no sticky sessions required.
This is the pattern worth remembering: sticky sessions solve the symptom by pinning users to a server, while externalized session storage solves the root cause by removing server-side state altogether.
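The difference is easy to demonstrate without any framework. In the sketch below, a shared in-memory map stands in for Redis, and two hypothetical `AppServer` instances read from it. None of these class names come from Spring Session. They exist only to show the idea.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a shared store such as Redis: one map that every server can reach.
final class SharedSessionStore {

    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    void save(String sessionId, String userName) {
        sessions.put(sessionId, userName);
    }

    Optional<String> find(String sessionId) {
        return Optional.ofNullable(sessions.get(sessionId));
    }
}

// A stateless app server: it keeps no session data of its own.
final class AppServer {

    private final String name;
    private final SharedSessionStore store;

    AppServer(String name, SharedSessionStore store) {
        this.name = name;
        this.store = store;
    }

    void login(String sessionId, String userName) {
        store.save(sessionId, userName);
    }

    String handle(String sessionId) {
        return store.find(sessionId)
                .map(user -> name + " serves " + user)
                .orElse(name + " sees an anonymous user");
    }
}
```

A login handled by one server is immediately visible to the other, because neither holds the session itself.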
Externalizing session state is not entirely free. Every session read now involves a network call to Redis instead of a local memory lookup, which adds a small amount of latency, typically well under a millisecond within the same data center. It also makes Redis itself a critical dependency, so it needs to be deployed with its own replication and failover strategy. In almost every case, this small added complexity is a worthwhile trade for servers that can be added, removed, or replaced without any user noticing.
There is one more single point of failure hiding in this whole design: the load balancer itself. If every request passes through one load balancer instance and that instance goes down, it does not matter how many healthy backend servers are waiting behind it.
Production systems solve this by running load balancers in a highly available pair or cluster, often in an active-passive setup with a floating IP address that moves to the standby instance if the primary fails. Cloud-managed load balancers, such as AWS ALB and NLB, handle this redundancy automatically across multiple availability zones.
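The active-passive idea can be sketched as a heartbeat with a timeout. The class below is a simplified model, not how any particular product implements it: `FloatingIpFailover`, the instance names, and the timeout value are all hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
// Decides which load balancer instance should hold the floating IP. Hypothetical sketch.
final class FloatingIpFailover {

    private final String primary;
    private final String standby;
    private final long heartbeatTimeoutMillis;
    private long lastPrimaryHeartbeatMillis;

    FloatingIpFailover(String primary, String standby, long heartbeatTimeoutMillis, long nowMillis) {
        this.primary = primary;
        this.standby = standby;
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
        this.lastPrimaryHeartbeatMillis = nowMillis;
    }

    // Called whenever the primary reports that it is alive.
    void primaryHeartbeat(long nowMillis) {
        lastPrimaryHeartbeatMillis = nowMillis;
    }

    // The standby takes over once the primary has been silent for too long.
    String floatingIpOwner(long nowMillis) {
        boolean primaryAlive = nowMillis - lastPrimaryHeartbeatMillis <= heartbeatTimeoutMillis;
        return primaryAlive ? primary : standby;
    }
}
```

Clients keep using the same address throughout. Only the instance answering on that address changes.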
Reaching four or five nines of availability requires redundancy at every layer, and the load balancer layer is no exception. A fleet that sits behind a single load balancer instance can never be more available than that one instance.
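Some rough arithmetic makes the point concrete. The figures below are assumptions chosen for illustration: a single load balancer instance that is available 99.9% of the time, failures that are independent of each other, and failover that is instantaneous.

```latex
% Components in series: every request must pass through the load balancer.
A_{\text{system}} = A_{\text{LB}} \times A_{\text{fleet}} \le A_{\text{LB}}

% One instance at 99.9%: about 8.76 hours of downtime per year at best.
A_{\text{LB}} = 0.999 \;\Rightarrow\; (1 - 0.999) \times 8760\,\text{h} = 8.76\,\text{h}

% A redundant pair: the layer is down only if both instances are down at once.
A_{\text{pair}} = 1 - (1 - 0.999)^2 = 0.999999 \;\Rightarrow\; 10^{-6} \times 8760\,\text{h} \approx 31.5\,\text{s}
```

Real failures are rarely fully independent and failover is never instant, so actual numbers will be worse. The gap between hours and seconds still shows why the load balancer layer needs a second instance.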
Health checks are what make all of this self-healing. The load balancer periodically checks each backend, and if one starts failing checks, traffic simply stops flowing to it. A Spring Boot Actuator health endpoint plugs directly into this: a load balancer can poll /actuator/health and automatically pull an unhealthy instance out of rotation.
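The polling logic behind this can be sketched without any HTTP code. In the example below, `probe` stands in for a real call to a health endpoint such as /actuator/health. The `HealthChecker` name and the idea of requiring several consecutive failures before removal are assumptions for illustration, although thresholds like this are a common way to avoid reacting to a single blip.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Tracks consecutive probe failures per server. Hypothetical sketch.
final class HealthChecker {

    private final Predicate<String> probe; // stand-in for an HTTP call to a health endpoint
    private final int failureThreshold;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    HealthChecker(Predicate<String> probe, int failureThreshold) {
        this.probe = probe;
        this.failureThreshold = failureThreshold;
    }

    // Runs one round of probes; a real system would call this on a timer.
    void checkOnce(Iterable<String> servers) {
        for (String server : servers) {
            if (probe.test(server)) {
                consecutiveFailures.put(server, 0);
            } else {
                consecutiveFailures.merge(server, 1, Integer::sum);
            }
        }
    }

    // A server stays in rotation until it fails the configured number of checks in a row.
    boolean isInRotation(String server) {
        return consecutiveFailures.getOrDefault(server, 0) < failureThreshold;
    }
}
```

A single successful probe resets the counter, so a recovered server rejoins the rotation on the next round.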
Two more mechanisms commonly reinforce load balancer redundancy at a global scale. DNS failover monitors a load balancer’s health from outside the network and updates DNS records to point elsewhere if it stops responding, though it reacts more slowly due to DNS caching. Anycast routing, used by many large content delivery networks, advertises the same IP address from multiple physical locations and lets network routing itself send traffic to the nearest healthy location, which sidesteps DNS caching delays entirely.
None of these mechanisms is something a typical application team builds from scratch. The practical takeaway is simply to know they exist and to ask, whenever a design includes a load balancer, what happens if that specific box disappears. Cloud-managed load balancers handle most of this automatically, but self-hosted setups need it designed in deliberately.
A few misunderstandings around load balancing come up again and again, both in real production incidents and in interview rooms.
When a system design interview reaches the load balancing stage, interviewers usually want to hear you explicitly name Layer 4 or Layer 7 and justify which one fits the scenario. If the design involves multiple microservices reachable through different URL paths, say so out loud: that is an L7 requirement.
Expect a follow-up question about session handling if your design involves logged-in users. Naming sticky sessions is fine as a first answer, but naming externalized session storage as the preferred long-term solution shows a deeper understanding.
A strong closing point in any load-balancing discussion is to note that the load balancer itself requires redundancy. Many candidates design a beautiful fleet of stateless servers and forget that the single load balancer in front of them is still a single point of failure.
It also helps to mention client-side load balancing when the design involves multiple internal microservices communicating with each other, rather than assuming that every internal call passes through a shared load balancer. Naming both server-side and client-side load balancing in the same answer, and explaining when each fits, is usually enough to signal senior-level thinking on this topic.
A short list of habits carries most of the value from this article into real projects.
| Term | Meaning in One Line |
| --- | --- |
| Load balancer | Distributes incoming traffic across multiple servers |
| Layer 4 (L4) | Routes based on IP and port only, very fast |
| Layer 7 (L7) | Routes based on full request content, more flexible |
| Round robin | Cycles requests through servers in order |
| Least connections | Sends the next request to the least busy server |
| Consistent hashing | Hash ring routing that minimizes reshuffling when servers change |
| Client-side load balancing | The calling service picks a healthy instance itself; no shared load balancer is needed |
| Service registry | A directory of live service instances, used to discover where to send a call (e.g., Eureka, Consul) |
| Service mesh | Infrastructure that moves service-to-service concerns like load balancing into sidecar proxies (e.g., Istio, Linkerd) |
| Sticky session | Pins a client to the same backend server |
| Session affinity | Another name for sticky sessions |
| Externalized session | Session data is stored in a shared store like Redis, not on the server |
| Health check | Periodic probe used to detect and remove unhealthy servers |
| Anycast | Same IP advertised from multiple locations, routed to the nearest one |
Load balancing is what turns a pile of independent servers into a single reliable service. Layer 4 gives you speed; Layer 7 gives you intelligence; and choosing between them depends on whether your routing decisions need to look inside the request.
Sticky sessions solve an immediate problem, but bring back the fragility that horizontal scaling was meant to remove. Externalizing session state with a shared store like Redis is the more durable fix, and it keeps every server instance interchangeable.
Carry one more habit forward from this article: whenever you draw a load balancer in a design, immediately ask whether it is redundant itself. A system is only as available as its weakest single point of failure, and it is very easy to accidentally leave the load balancer as that weak point.
None of these ideas needs to be memorized in isolation. Load balancing, algorithms, sticky sessions, and high availability all answer the same underlying question: when a request arrives, who handles it, and what happens if that choice goes wrong. Keep that question in mind, and the rest of the details in this article fall into place naturally.