Load Balancing and Types of Load Balancers Explained

  • Last Updated: July 1, 2026
  • By: javahandson

A clear guide to load balancing and the types of load balancers: server-side, DNS-based, and client-side load balancing, how they work together, L4 vs L7 load balancers, common load balancing algorithms, and sticky sessions.

1. Introduction

Running a service on a single server only gets you so far. Once traffic grows, most teams respond by running multiple copies of the same service rather than moving to one ever-larger server. This approach is called horizontal scaling, and it solves the capacity problem well. But it immediately raises a new question: with five servers instead of one, something has to decide which server handles each incoming request.

That something is a load balancer, and understanding load balancing in system design is one of the first skills every backend engineer needs. Load balancing is the practice of spreading incoming traffic across multiple servers so that no single machine gets overwhelmed. It sounds simple on the surface, but the details shape almost every large system you will ever design.

This article covers load balancing in system design end-to-end: what a load balancer actually does, the difference between Layer 4 and Layer 7 load balancers, the common algorithms used to select a server, and a topic that trips up many developers and interview candidates: sticky sessions. We will keep the language simple and back the ideas with small, practical examples.

You do not need to run a company the size of Amazon to hit this problem. Even a modest application can outgrow a single instance the moment it gets featured somewhere, goes viral on social media, or simply grows its user base over a few quarters. The moment a second server enters the picture, load balancing stops being optional and becomes a core part of the architecture, not an afterthought bolted on later.

2. Why Load Balancing Matters

Imagine a website running on three servers behind a single public address. Users do not know or care that three servers exist. They just type a URL and expect a response. Somebody has to receive that request and hand it to one of the three servers. Without that layer, you would need to give every user a different address for every server, which defeats the entire purpose of scaling out.

Load balancing solves this invisibly. It gives users a single address to talk to while quietly distributing the actual work across many machines behind the scenes. Horizontal scaling gives you more servers, and load balancing makes those servers act like a single reliable service.

There is also a resilience angle. If a load balancer knows that a server is unhealthy, it can simply stop sending traffic to it. Users never notice the failure because their requests are quietly routed to a healthy server instead. This is one of the reasons load balancing is treated as a foundational topic in system design interviews: it touches scalability, availability, and performance all at once.

There is a cost angle too, which often gets overlooked. Without load balancing, teams tend to over-provision a single giant server to provide headroom for traffic spikes, and that server sits mostly idle outside peak hours. Spreading traffic across several right-sized servers and scaling that fleet up or down with demand is usually far cheaper than keeping one oversized machine running around the clock.

From a career standpoint, load-balancing questions frequently appear in system design interviews because they require candidates to reason through trade-offs rather than recite a definition. An interviewer can ask a dozen small follow-up questions about a single load balancer diagram: what happens if a server dies, what happens if the load balancer itself dies, how do you handle a user’s session, and how do you route traffic to a new version of the service. Being comfortable with the fundamentals in this article covers most of those follow-ups.

3. What Is a Load Balancer, and the Types of Load Balancers

A load balancer, in the most general sense, is anything that decides which backend server handles an incoming request instead of leaving that choice to the client. That decision can be made in a few genuinely different places, and it helps to name each one explicitly rather than picturing only the classic single box in the middle. Keep in mind that there are really two separate questions here: where the routing decision is made, which the three subsections below cover, and how a given load balancer is actually built, which is the hardware, software, or cloud distinction discussed under server-side load balancing.

3.1. Server-Side Load Balancing

This is the model most people picture by default. A dedicated component sits between clients and a group of backend servers. It first receives every incoming request, decides which backend server should handle it, and forwards the request to that server. The response usually flows back through the same path, so every request and response passes through this one component.

Picture a busy restaurant with several chefs working in the kitchen. Even with all that cooking power, the restaurant still needs a host at the entrance. The host greets each guest, checks which tables are available, seats them efficiently, and keeps the dining room running smoothly. A load balancer plays the same role in a distributed system. Instead of guests, it receives incoming requests; instead of tables, it directs those requests to available server instances, so the workload is spread evenly and no single server becomes overwhelmed.

Server-side load balancers come in a few flavors, but it is worth being clear that these are just different implementations of the same server-side model described above, not a separate category of load balancing. All of them sit in the request path and forward traffic; they only differ in how they are built and operated. Hardware load balancers are dedicated physical appliances, common in older data centers. Software load balancers such as NGINX and HAProxy run as regular processes on ordinary servers and are far more common today. Cloud-managed load balancers, such as AWS Elastic Load Balancing with its Application Load Balancer and Network Load Balancer, are software load balancers run and scaled for you by the cloud provider, so you configure them instead of installing and maintaining them yourself.

Regardless of which of these three you use, the underlying job is identical: sit between clients and servers, pick a healthy backend, and forward the request. Choosing which backend to forward to still requires a rule, such as sending requests in a fixed rotation or to the server that is least busy; these rules are the load-balancing algorithms covered later in this article.

3.2. DNS-Based Load Balancing

DNS-based load balancing works at a completely different point in the request’s journey: before any connection is even opened, and before a server-side load balancer ever sees the request. When a client looks up a domain name, the DNS server can return a different IP address to different clients for the exact same domain, using techniques such as round-robin DNS, which simply rotates through a list of IPs, or GeoDNS, which returns the IP address of the data center closest to the client’s location.

This gives DNS-based load balancing a very different character from a server-side load balancer. Nothing proxies the actual request; DNS only hands out an address and then steps out of the picture entirely. It is also coarse-grained, since it can only route a client to an entire data center or region, not to an individual server, and it is slow to react to failures, because DNS answers are cached by browsers, operating systems, and intermediate resolvers for a duration controlled by the record’s time-to-live, or TTL. Lowering the TTL helps failover occur faster, but it also means every client re-queries DNS more often, which adds load on the DNS infrastructure.

Despite these limitations, DNS-based load balancing is often the very first layer of load balancing in a global system, deciding which region or data center a user’s traffic goes to before a server-side Layer 4 or Layer 7 load balancer inside that region takes over and spreads the request across individual servers.
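
As a rough illustration of the two techniques named above, the sketch below simulates the answers a DNS server might hand out: it first picks a region (the GeoDNS step) and then rotates through that region's addresses (the round-robin step). It is a plain in-memory simulation, not a real DNS server. The class name GeoDnsSketch, the region codes, and the documentation-range IP addresses are all assumptions made up for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simulates the answers a GeoDNS server might give; it does not speak the DNS protocol.
final class GeoDnsSketch {

    // Hypothetical region codes mapped to the public addresses of that region's edge load balancers.
    private final Map<String, List<String>> addressesByRegion;
    private final String defaultRegion;
    // One rotation counter per region, so each region cycles through its own list.
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    GeoDnsSketch(Map<String, List<String>> addressesByRegion, String defaultRegion) {
        if (!addressesByRegion.containsKey(defaultRegion)) {
            throw new IllegalArgumentException("default region has no addresses");
        }
        this.addressesByRegion = new HashMap<>(addressesByRegion);
        this.defaultRegion = defaultRegion;
    }

    // GeoDNS step: choose the region. Round-robin step: rotate within that region's list.
    String resolve(String clientRegion) {
        String region = addressesByRegion.containsKey(clientRegion) ? clientRegion : defaultRegion;
        List<String> addresses = addressesByRegion.get(region);
        int turn = counters.computeIfAbsent(region, r -> new AtomicInteger()).getAndIncrement();
        return addresses.get(Math.floorMod(turn, addresses.size()));
    }
}
```

Repeated lookups from a configured region cycle through that region's addresses, while an unknown region falls back to the default. Real clients would also cache each answer for the record's TTL, which this sketch leaves out.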

3.3. Client-Side Load Balancing

Client-side load balancing flips the model around. Instead of every request going through a shared load balancer that then selects a server, the caller itself maintains a list of available instances and selects one directly, often with help from a service registry such as Eureka or Consul. There is no separate load balancer box in the middle at all.

First, though, a word of caution about the name, because it trips up almost everyone the first time they meet it. In web development, the term “client” usually refers to the browser or the end user’s device, so client-side load balancing sounds like the browser choosing which server to talk to. That is not what it means here, and browsers essentially never do this.

The word client is relative: it simply means whoever is making a given request. Most of the time, that caller is another backend service. If an Order Service calls an Inventory Service, then within that call, the Order Service is the client, even though the Order Service is itself a backend server sitting in your data center. This service-to-service case is where client-side load balancing most often appears.

But the caller need not be a backend service. Two other examples are worth knowing, because they show the same idea outside the data center:

  • A mobile app: a native phone app can be shipped with a list of API endpoints and pick among them itself, sometimes failing over to a backup endpoint if one is slow. Here, the client is the user’s phone, not a backend server, yet it performs client-side load balancing.
  • A gRPC client or SDK: gRPC has client-side load balancing built into its libraries. A desktop app, a command-line tool, or an SDK you hand to partners can resolve a service name to several addresses and spread calls across them on its own, with no separate load balancer in between.

What these cases share is that the caller is trusted code you control or ship, whether that is a backend service, your own mobile app, or your own SDK. A plain web browser is the one caller that essentially never does client-side load balancing.

There are good reasons browsers are kept out of this. A browser would have to know the private addresses of all your internal instances, which is a security problem; it would have to talk directly to your service registry, and you would be shipping load-balancing logic into untrusted code you do not control. So the browser always talks to a single stable public endpoint handled by a server-side load balancer, and client-side load balancing is reserved for callers you trust.

Spring Cloud LoadBalancer is the standard tool for client-side load balancing in the Spring ecosystem. It lets one microservice call another by a logical service name, while the client library resolves that name to a real instance and picks a healthy one, as shown below.

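// An Order Service calling an Inventory Service by name.
// Here the Order Service is the "client", not any browser.
import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Service
public class InventoryClient {
 
    private final WebClient webClient;
 
    // Assumes the injected builder is load-balancer-aware, for example a
    // WebClient.Builder bean annotated with @LoadBalanced. A plain builder
    // would treat "inventory-service" as an ordinary hostname.
    public InventoryClient(WebClient.Builder builder) {
        // "inventory-service" is resolved via the service registry,
        // and Spring Cloud LoadBalancer picks a healthy instance.
        this.webClient = builder.baseUrl("http://inventory-service").build();
    }
 
    // Asks whichever instance was chosen for the stock level of one SKU.
    public Mono<Integer> getStockLevel(String sku) {
        return webClient.get()
                .uri("/stock/{sku}", sku)
                .retrieve()
                .bodyToMono(Integer.class);
    }
}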

The distinction matters in interviews.

A traditional server-side load balancer sits between the client and the servers. It makes routing decisions for everyone, which keeps clients simple, but it also adds an extra network hop and another component that can fail.

Client-side load balancing removes that extra hop. The caller talks directly to a server instance, but now the caller must know how to discover available instances and choose one. This can be faster, but it makes the client a little smarter and more complex.
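
To make "the caller must know how to discover available instances and choose one" concrete, here is a minimal plain-Java sketch of the two jobs a client-side balancing library takes on. It is only an illustration, not how Spring Cloud LoadBalancer or any other library is implemented. The class name ClientSideBalancer is hypothetical, and the Supplier stands in for a real registry lookup against something like Eureka or Consul.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// A caller-side balancer: the calling service holds the instance list and picks for itself.
final class ClientSideBalancer {

    // Stand-in for a service registry lookup; hypothetical, not a real registry client API.
    private final Supplier<List<String>> registryLookup;
    private final AtomicInteger position = new AtomicInteger();

    ClientSideBalancer(Supplier<List<String>> registryLookup) {
        this.registryLookup = registryLookup;
    }

    // Discovery: ask the registry for the current instances.
    // Selection: rotate through whatever list came back.
    String chooseInstance() {
        List<String> instances = registryLookup.get();
        if (instances.isEmpty()) {
            throw new IllegalStateException("no instances registered");
        }
        return instances.get(Math.floorMod(position.getAndIncrement(), instances.size()));
    }
}
```

Because the list is fetched on every call, an instance that drops out of the registry stops receiving traffic on the next request. Real libraries usually cache the list and refresh it in the background rather than querying the registry each time.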

One important caveat: client-side load balancing is not the only way to handle internal traffic.

  • In Kubernetes, service-to-service calls are often routed through a Kubernetes Service, which acts like a built-in server-side load balancer.
  • With a service mesh such as Istio or Linkerd, the load-balancing logic is handled by a proxy next to each service.

A simple way to remember the big picture:

  • DNS → decides which region or data center to use.
  • Edge Layer 7 load balancer → handles external traffic coming from users.
  • Internal service calls → are balanced using client-side libraries, Kubernetes Services, or a service mesh, depending on the system.

Interview takeaway: There is no single “correct” approach. The trade-off is usually among simplicity, flexibility, and performance. Many modern systems use a load balancer at the edge and one of several mechanisms for balancing internal service-to-service traffic.

4. How the Three Types Work Together

These three types are not rivals. You do not pick one and drop the others. In a big system, they usually work as a team. Each one handles a different step of the journey a request takes. Once you can see how they chain together, the whole topic stops feeling like a pile of separate tricks and starts feeling like one pipeline.

The clearest way to understand the chain is to walk through real requests. We will follow two different ones. The first is a normal web request from a browser, in which all three types appear in their most common forms. The second is a request from a mobile app that shows client-side load balancing occurring outside the backend, so you do not walk away thinking client-side always means service-to-service.

4.1. Example One: A Web Request from a Browser

Imagine a user in London opens a shopping app in their browser. The app runs in two regions: one in Europe and one in North America. Here is what happens, step by step.

  • Step 1 is DNS-based load balancing. The browser looks up the app’s web address. A GeoDNS server notices that the user is in Europe and returns the address for the European region. This picks a region, nothing smaller, and it happens before any connection is made.
  • Step 2 is server-side load balancing. The browser connects to that address. Waiting there is a server-side load balancer at the edge of the European region, typically a Layer 7 load balancer. It looks at the request, sees that it wants the storefront page, and forwards it to one healthy storefront server from among many.
  • Step 3 is client-side load balancing. To build the page, the storefront server needs stock numbers. So it calls the inventory service. Now the storefront server is the client. It picks one healthy inventory instance itself and sends the call straight there.

Each step makes a smaller choice than the one before. DNS picks a region. The edge load balancer picks a public-facing server in that region. The internal step picks one backend instance for a service-to-service call. No single step could do the others’ jobs. DNS cannot read a URL path. The internal balancer has no idea which region the user is in. They only work well as a chain.

4.2. Example Two: A Request from a Mobile App

Now change one thing. Instead of a browser, the user is on the company’s own native mobile app. This matters because a mobile app is trusted code the company wrote and shipped, unlike a random web browser. That trust lets the app take on a job a browser never could: doing its own client-side load balancing.

Suppose the app’s backend exposes two or three public API gateway addresses, one per region or one per availability zone. The company builds a short list of these addresses right into the app. Here is how a request plays out.

  • Step 1 is client-side load balancing, on the phone. The app looks at its built-in list of gateway addresses and picks one, often the closest or the one that responded fastest last time. If that gateway is slow or unreachable, the app quietly retries the next address on its list. This is client-side load balancing, but the client is a phone in someone’s hand, not a backend service.
  • Step 2 is server-side load balancing. The chosen gateway address points to a server-side load balancer for that region. It receives the call and forwards it to one of the many healthy API servers, exactly like the edge step in the browser example.
  • Step 3 is client-side load balancing again, this time inside the backend. That API server calls other internal services to do its work, picking healthy instances the same way the storefront did in the first example.

The key takeaway is in Step 1. The same idea, a caller holding a list of targets and choosing one itself, is happening on a phone rather than on a server. gRPC clients, desktop apps, and SDKs handed to partners can all do this, too. So client-side does not mean inside the backend; it means the caller chooses, wherever that caller happens to run. What still holds is that the caller is trusted code that the company controls, which is why a plain web browser stays out of it and instead leans on the server-side load balancer.
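
The retry behavior in Step 1 can be sketched in a few lines. This is a simplified illustration under stated assumptions: the class name GatewayFailoverClient and the gateway URLs are hypothetical, and the transport function stands in for a real HTTP call, returning an empty Optional when a gateway is unreachable or too slow.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Client-side failover as a mobile app might do it: walk a built-in list of gateways.
final class GatewayFailoverClient {

    // Gateway addresses shipped with the app, in order of preference (hypothetical values).
    private final List<String> gateways;
    // Stand-in for the real network call; empty means "this gateway did not answer in time".
    private final Function<String, Optional<String>> transport;

    GatewayFailoverClient(List<String> gateways, Function<String, Optional<String>> transport) {
        this.gateways = List.copyOf(gateways);
        this.transport = transport;
    }

    // Try each gateway in turn and return the first response that arrives.
    Optional<String> call() {
        for (String gateway : gateways) {
            Optional<String> response = transport.apply(gateway);
            if (response.isPresent()) {
                return response;
            }
        }
        return Optional.empty(); // every gateway on the list failed
    }
}
```

A real app would typically also remember which gateway answered fastest and try it first next time, which this sketch omits.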

4.3. What Happens When a Layer Fails

Seeing the chain also makes failures easier to reason about, because each layer fails in its own way and recovers on its own timeline. This is a favorite area for interview follow-up questions.

  • If a backend instance fails, the layer above it stops sending traffic to it. A server-side load balancer drops the instance after failed health checks, and a client-side balancer skips it in its own instance list. Recovery here is usually fast, often within seconds.
  • If a whole region’s edge load balancer fails, DNS-based load balancing steers users elsewhere by handing out the other region’s address instead. Because DNS answers are cached, this switch is slower and can take minutes, which is the price of operating at the DNS layer.
  • If the DNS layer itself has trouble, there is not much below it to catch the fall, which is why DNS providers are run with heavy redundancy and why some large systems use more than one DNS provider.

The pattern is that lower, finer-grained layers recover quickly, while higher, coarser layers recover slowly but cover bigger failures. A healthy design relies on each layer for the kinds of failures it handles best, rather than expecting any single layer to catch everything.

The table below lines up the three types so the split of work is easy to remember.

| Type | Where in the Journey | Granularity of Choice | Recovers From |
|---|---|---|---|
| DNS-based | Before the connection opens, during name lookup | Region or data center | A whole region going down (slow) |
| Server-side (edge) | At the entry point of a region, in the request path | One user-facing server in a fleet | A single server failing (fast) |
| Client-side | In the caller, whether a backend service, phone, or SDK | One instance of a called service | A single instance failing (fast) |

This whole chain is a strong thing to sketch early in a system design interview. Opening with traffic hitting DNS first, then an edge load balancer, then internal balancing for service calls, signals that you see load balancing as a set of stages rather than one magic box, and it sets up the failure discussion above as a natural next step.

5. L4 vs L7 Load Balancing

Before going further, it helps to connect this section back to the three types of load balancing covered earlier. The L4 versus L7 distinction is not a fourth type of load balancer alongside server-side, DNS-based, and client-side. It is a property of server-side load balancers, specifically: it describes which network layer the in-path load balancer inspects when deciding where to send a request. DNS-based load balancing happens before a connection even exists, so it has no L4 or L7 to speak of, and client-side load balancing keeps this decision inside the calling service rather than in a separate box.

With that framing in place, the most important distinction among server-side load balancers is which layer of the network they operate at. This single detail determines what the load balancer can and cannot see about a request, and therefore what routing decisions it can make.

5.1. Layer 4 Load Balancing

A Layer 4 load balancer works at the transport layer. It looks only at IP addresses and TCP or UDP ports. It has no idea whether the traffic inside the connection is HTTP, a database protocol, or anything else. It simply forwards packets to a chosen backend and keeps the connection open.

Because it does so little inspection, an L4 load balancer is extremely fast and uses very little CPU. It is a good fit whenever you need raw speed, and the routing decision does not depend on the request’s content. AWS Network Load Balancer and simple TCP load balancers, such as IPVS, are common examples.

5.2. Layer 7 Load Balancing

A Layer 7 load balancer works at the application layer. For web traffic, that means it can read the full HTTP request: the URL path, headers, cookies, and even the request body. This opens the door to smart, content-aware routing decisions.

For example, an L7 load balancer can send every request that starts with /api/orders to the order service and every request that starts with /api/users to the user service. It can also terminate SSL, rewrite headers, or route a small percentage of traffic to a new version of a service for A/B testing. NGINX, HAProxy in Layer 7 mode, AWS Application Load Balancer, and Spring Cloud Gateway are common examples.

The extra intelligence comes at a cost. Parsing HTTP requests takes more CPU than blindly forwarding packets, so L7 load balancers are typically slower per request than L4 ones. In practice, most modern microservice architectures still choose L7, because path-based routing and SSL termination are usually worth the small performance cost.
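
The path-based routing described above boils down to a lookup from URL prefix to backend pool. The sketch below shows that lookup in isolation. It is not how NGINX or any other product implements routing; the class name PathRouter and the pool names are hypothetical, and the rule "longest matching prefix wins" is an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy Layer 7 routing table: URL path prefix -> name of a backend pool.
final class PathRouter {

    private final Map<String, String> poolByPrefix = new LinkedHashMap<>();
    private final String defaultPool;

    PathRouter(String defaultPool) {
        this.defaultPool = defaultPool;
    }

    // Registers a rule and returns this router so rules can be chained.
    PathRouter route(String prefix, String pool) {
        poolByPrefix.put(prefix, pool);
        return this;
    }

    // Returns the pool for the longest prefix that matches the path,
    // or the default pool when no rule matches.
    String poolFor(String path) {
        String best = null;
        for (String prefix : poolByPrefix.keySet()) {
            if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? defaultPool : poolByPrefix.get(best);
    }
}
```

An L4 load balancer cannot do this at all, because the path only becomes visible once the HTTP request has been parsed.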

5.3. L4 vs L7 at a Glance

| Aspect | Layer 4 (Transport) | Layer 7 (Application) |
|---|---|---|
| Sees | IP address and port only | Full HTTP request: path, headers, cookies |
| Routing decisions | Based on connection info | Based on the request content |
| Speed | Very fast, low CPU cost | Slower, higher CPU cost |
| SSL termination | Not aware of it | Can terminate SSL itself |
| Path-based routing | Not possible | Possible, e.g., /orders vs /users |
| Examples | AWS NLB, IPVS | NGINX, HAProxy, AWS ALB, Spring Cloud Gateway |

A simple rule of thumb for interviews: if the requirement mentions routing based on URL path, hostname, or request headers, the answer is Layer 7. If the requirement is just about spreading raw TCP connections as fast as possible, Layer 4 is enough.

5.4. Using Both Together

In practice, large systems often do not pick just one. A common pattern is to place a Layer 4 load balancer at the very edge of the network, since it is cheap and extremely fast at handling huge volumes of raw traffic, and then place Layer 7 load balancers behind it to handle smarter, content-aware routing to individual services.

Cloud providers reflect this layering directly in their product names. AWS offers a Network Load Balancer for the L4 tier and an Application Load Balancer for the L7 tier; it is common to see an NLB in front of a fleet of ALBs, or an NLB in front of a Kubernetes ingress controller that performs Layer 7 routing. Recognizing this two-tier pattern is a good way to demonstrate depth in a system design discussion, as it shows that L4 and L7 are complementary tools rather than competing choices.

6. Load Balancing Algorithms

Once a load balancer receives a request, it still needs a rule to decide which backend server to route it to. Here is a quick tour of the most common ones. Each deserves a deeper look on its own, but this overview is enough to reason about trade-offs in a design discussion.

  • Round-robin: requests are sent to servers in a fixed, rotating order. Simple, and works well when servers have similar capacity and requests take roughly the same time.
  • Weighted round-robin: same idea, but each server gets a weight based on its capacity, so a more powerful server receives proportionally more requests.
  • Least connections: the next request goes to the server with the fewest active connections. Useful when request durations vary widely, such as with WebSockets.
  • Least response time: like least connections, but it also factors in how quickly each server has been responding recently. Good for latency-sensitive services.
  • IP hash: the client’s IP address is hashed to consistently pick the same server, giving simple session affinity without cookies.
  • Consistent hashing: servers and request keys are placed on a hash ring, so adding or removing a server only remaps a small slice of keys rather than nearly all of them. This same idea powers distributed caches and sharded databases.
  • Random: each request is sent to a randomly chosen server. It sounds naive, but with enough requests, it spreads the load almost as evenly as round-robin, and it requires no shared state between load balancer instances, which makes it attractive in some distributed setups.

Each of these algorithms selects only from servers currently marked healthy. The algorithm and the health check work as a pair: health checks determine who is eligible, and the algorithm decides which eligible server receives the next request. A perfectly tuned algorithm cannot help if it keeps sending traffic to a server that health checks failed to catch in time.

| Algorithm | Best For |
|---|---|
| Round Robin | Servers with equal capacity |
| Weighted Round Robin | Mixed or uneven hardware |
| Least Connections | Long-lived or variable-duration requests |
| Least Response Time | Latency-sensitive services |
| IP Hash | Simple session affinity |
| Consistent Hashing | Caching and sharded systems, minimal reshuffling |
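
Consistent hashing is the least intuitive entry in this list, so a small sketch may help. The version below places each server on the ring at many points (often called virtual nodes) and sends a key to the first server point at or after the key's own hash, wrapping around at the end. It is a teaching sketch under stated assumptions: the class name ConsistentHashRing and the count of 100 virtual nodes are made up for the example, and CRC32 is used only because it ships with the JDK. Production systems generally pick a hash function that mixes better.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// A minimal consistent hash ring with virtual nodes.
final class ConsistentHashRing {

    private static final int VIRTUAL_NODES = 100; // illustrative value, not a recommendation
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Place the server on the ring at VIRTUAL_NODES different points.
    void addServer(String server) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // Remove only the points that still belong to this server.
    void removeServer(String server) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(server + "#" + i), server);
        }
    }

    // Walk clockwise from the key's hash to the next server point, wrapping to the start.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

The property worth noticing is what happens on removal: keys that were served by the removed server move to a neighbor on the ring, and every other key stays exactly where it was.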

A minimal round-robin selector in Java illustrates the core idea: keep a list of servers and a pointer, and advance the pointer on every request.

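// A minimal round-robin server selector.
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer {
 
    private final List<String> servers;
    // AtomicInteger keeps the counter correct when many threads call nextServer().
    private final AtomicInteger index = new AtomicInteger(0);
 
    public RoundRobinBalancer(List<String> servers) {
        // Copy the list so later changes by the caller cannot affect the rotation.
        this.servers = List.copyOf(servers);
    }
 
    public String nextServer() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(index.getAndIncrement(), servers.size());
        return servers.get(i);
    }
}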

A production load balancer builds on this same loop, adding health checks and weights on top. The AtomicInteger counter already makes this small version safe to call from multiple threads.
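
For comparison, least connections needs a little more bookkeeping than a rotating pointer, because the balancer has to know how many requests each server is currently handling. The sketch below is one simple way to express that, assuming the caller reports when a request finishes. The class name LeastConnectionsBalancer and the acquire/release method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Least connections: send the next request to the server with the fewest in-flight requests.
final class LeastConnectionsBalancer {

    // In-flight request count per server, kept in the order the servers were given.
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastConnectionsBalancer(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        for (String server : servers) {
            active.put(server, 0);
        }
    }

    // Pick the least busy server (ties go to the earliest one) and count the new request.
    synchronized String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> entry : active.entrySet()) {
            if (best == null || entry.getValue() < active.get(best)) {
                best = entry.getKey();
            }
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    // Call when the request completes so the server's count goes back down.
    synchronized void release(String server) {
        active.computeIfPresent(server, (name, count) -> Math.max(0, count - 1));
    }
}
```

The design cost is visible in the API: every acquire needs a matching release, which is why this algorithm pays off mainly when request durations really do vary.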

7. Sticky Sessions and Session Affinity

Load balancing assumes that any server can handle any request equally well. That assumption breaks the moment a server stores something in memory about a specific user, such as their shopping cart or login session. If the next request from that user lands on a different server, the new server has no idea who the user is.

7.1. The Problem in Practice

Picture a user who logs in, and the server stores their session in local memory. Their next click gets routed to a different server. That server never saw the login, so it treats the user as logged out. This is a classic bug caused by mixing stateful servers with load balancing.

7.2. Sticky Sessions as a Fix

Sticky sessions, also called session affinity, solve this by pinning a client to the same backend server for the life of their session. Two common mechanisms are used.

  • Cookie-based affinity: the load balancer sets a cookie identifying which server handled the first request. Every subsequent request from that browser carries the cookie, and the load balancer reads it to route the request back to the same server.
  • Client IP-based affinity: the client’s IP address is used to consistently select the same server, without requiring a cookie. This can break if the client’s IP changes mid-session, for example, when a mobile device switches networks.
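
The cookie-based mechanism can be sketched as a small routing decision: honor the cookie when the pinned server is still healthy, otherwise pick a fresh server. This is an illustration only. The class name StickyRouter and the cookie name LB_SERVER are hypothetical, and setting the cookie on the response is left out.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Cookie-based session affinity with a round-robin fallback.
final class StickyRouter {

    static final String COOKIE_NAME = "LB_SERVER"; // hypothetical cookie name

    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger();

    StickyRouter(List<String> servers) {
        this.servers = List.copyOf(servers);
    }

    // pinnedServer is the value of the affinity cookie, or null if the request had none.
    // healthy is the set of servers currently passing health checks.
    String route(String pinnedServer, Set<String> healthy) {
        if (pinnedServer != null && healthy.contains(pinnedServer)) {
            return pinnedServer; // stick to the server that holds the session
        }
        // First request, or the pinned server is gone: rotate to the next healthy server.
        for (int attempt = 0; attempt < servers.size(); attempt++) {
            String candidate = servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
            if (healthy.contains(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no healthy servers");
    }
}
```

Note what the fallback branch implies: when the pinned server is unhealthy, the user is routed somewhere else and any session data held only in that server's memory is gone.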

7.3. The Trade-off

Sticky sessions bring back the exact problem that horizontal scaling was meant to remove: a server that a specific user now depends on. If that server crashes, every user pinned to it loses their session. Sticky sessions can also create uneven load, since a handful of very active users might pile extra work onto one server while others sit idle.

For this reason, sticky sessions are best thought of as a short-term patch rather than a long-term design choice. They work, but they quietly reintroduce a single point of failure at the level of individual user sessions.

Sticky sessions also complicate autoscaling. When a new server joins the fleet to absorb extra load, it starts with zero pinned users, since affinity only forms as new sessions begin. Existing users stay glued to the older, already busy servers, so the new capacity helps less than expected during exactly the moment it is needed most. Removing a server is just as awkward because every user pinned to it must be migrated or will simply lose their session when the instance is terminated.

8. Externalizing Session State in Spring Boot

The better long-term answer to session affinity is to remove the need for it entirely. Instead of storing session data on a specific server, store it in a shared external store that every server instance can read from. This keeps servers stateless, which is what makes horizontal scaling and load balancing work smoothly together in the first place.

In the Spring Boot world, the standard tool for this is Spring Session backed by Redis. Every server instance still creates and reads sessions the normal way through the Servlet API, but under the hood, Spring Session stores the session data in Redis rather than in local memory.

Adding this to a Spring Boot project is mostly configuration, not code changes. The dependency and a small properties block are usually enough.

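// build.gradle
implementation 'org.springframework.session:spring-session-data-redis'
implementation 'org.springframework.boot:spring-boot-starter-data-redis'
 
# application.properties (property names as used in Spring Boot 2.x)
spring.session.store-type=redis
spring.redis.host=localhost
spring.redis.port=6379
# On Spring Boot 3.x the Redis keys are spring.data.redis.host and
# spring.data.redis.port, and spring.session.store-type is no longer needed.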

Once this is in place, a user can log in through one server, and their very next request can land on any other instance. Every server reads the same Redis session, so the user experience remains seamless, with no sticky sessions required.
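
The effect is easy to see in a stripped-down model. The sketch below is plain Java, not the Spring Session API: an in-memory map stands in for Redis, and the class names SharedSessionStore and AppServer are hypothetical. The point is only that two server instances sharing one store see the same session.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for Redis: a single store that every server instance talks to.
final class SharedSessionStore {

    private final Map<String, String> userBySessionId = new ConcurrentHashMap<>();

    void save(String sessionId, String user) {
        userBySessionId.put(sessionId, user);
    }

    Optional<String> find(String sessionId) {
        return Optional.ofNullable(userBySessionId.get(sessionId));
    }
}

// A stateless application server: it keeps no session data in its own memory.
final class AppServer {

    private final SharedSessionStore sessions;

    AppServer(SharedSessionStore sessions) {
        this.sessions = sessions;
    }

    void login(String sessionId, String user) {
        sessions.save(sessionId, user);
    }

    // Any instance can answer this, because the lookup goes to the shared store.
    String whoAmI(String sessionId) {
        return sessions.find(sessionId).orElse("anonymous");
    }
}
```

If a user logs in through one AppServer and the next request reaches a different AppServer built on the same store, whoAmI still returns the logged-in user.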

This is the pattern worth remembering: sticky sessions solve the symptom by pinning users to a server, while externalized session storage solves the root cause by removing server-side state altogether.

Externalizing session state is not entirely free. Every session read now involves a network call to Redis instead of a local memory lookup, which adds a small amount of latency, typically well under a millisecond within the same data center. It also makes Redis itself a critical dependency, so it needs to be deployed with its own replication and failover strategy. In almost every case, this small added complexity is a worthwhile trade for servers that can be added, removed, or replaced without any user noticing.

9. Load Balancer High Availability

There is one more single point of failure hiding in this whole design: the load balancer itself. If every request passes through one load balancer instance and that instance goes down, it does not matter how many healthy backend servers are waiting behind it.

Production systems solve this by running load balancers in a highly available pair or cluster, often in an active-passive setup with a floating IP address that moves to the standby instance if the primary fails. Cloud-managed load balancers, such as AWS ALB and NLB, handle this redundancy automatically across multiple availability zones.

Reaching four or five nines of availability requires redundancy at every layer, and the load balancer layer is no exception. A load balancer that runs as a single instance is just as much a single point of failure as the single server it was introduced to replace.
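A short worked calculation shows why. The availability figures below are assumptions picked for illustration, and the formulas assume that failures are independent, which real outages do not always respect.

```latex
% Assumed figures for illustration only; failures are assumed independent.
A_{\text{LB}} = 0.999, \qquad A_{\text{backend}} = 0.9999

% One load balancer in series with the backend fleet:
A_{\text{system}} = A_{\text{LB}} \cdot A_{\text{backend}}
                  = 0.999 \times 0.9999 \approx 0.9989

% Two redundant load balancers (the pair is down only if both are down):
A_{\text{pair}} = 1 - (1 - A_{\text{LB}})^2 = 1 - 0.001^2 = 0.999999

A_{\text{system}} = A_{\text{pair}} \cdot A_{\text{backend}}
                  = 0.999999 \times 0.9999 \approx 0.9999
```

Under these assumptions, a single load balancer drags a four-nines backend fleet below three nines, while a redundant pair leaves the backend fleet as the limiting factor.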

Health checks are what make all of this self-healing. The load balancer periodically checks each backend, and if one starts failing checks, traffic simply stops flowing to it. A Spring Boot Actuator health endpoint plugs directly into this: a load balancer can poll /actuator/health and automatically pull an unhealthy instance out of rotation.
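The bookkeeping behind a health check can be sketched as follows. This is an illustrative model rather than any particular load balancer's implementation: `HealthTracker` and its threshold parameters are hypothetical, and the probe is passed in as a function, where a real system would issue an HTTP GET to /actuator/health and treat a 200 response as success.

```java
import java.util.function.BooleanSupplier;

// Tracks one backend. It is pulled from rotation after N consecutive failed probes
// and returned after M consecutive successful ones (both thresholds are assumptions).
class HealthTracker {
    private final int unhealthyThreshold;
    private final int healthyThreshold;
    private boolean inRotation = true;
    private int consecutiveFailures = 0;
    private int consecutiveSuccesses = 0;

    HealthTracker(int unhealthyThreshold, int healthyThreshold) {
        this.unhealthyThreshold = unhealthyThreshold;
        this.healthyThreshold = healthyThreshold;
    }

    // Runs one probe and returns whether the backend should receive traffic afterwards.
    boolean check(BooleanSupplier probe) {
        boolean ok;
        try {
            ok = probe.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false; // a probe that throws (timeout, refused connection) counts as a failure
        }
        if (ok) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (!inRotation && consecutiveSuccesses >= healthyThreshold) {
                inRotation = true;
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (inRotation && consecutiveFailures >= unhealthyThreshold) {
                inRotation = false;
            }
        }
        return inRotation;
    }

    boolean isInRotation() {
        return inRotation;
    }
}
```

Requiring several consecutive results before changing state is one common way to keep a single slow response from ejecting an otherwise healthy instance.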

Two more mechanisms commonly reinforce load balancer redundancy at a global scale. DNS failover monitors a load balancer’s health from outside the network and updates DNS records to point elsewhere if it stops responding, though it reacts more slowly due to DNS caching. Anycast routing, used by many large content delivery networks, advertises the same IP address from multiple physical locations and lets network routing itself send traffic to the nearest healthy location, which sidesteps DNS caching delays entirely.

None of these mechanisms is something a typical application team builds from scratch. The practical takeaway is simply to know they exist and to ask, whenever a design includes a load balancer, what happens if that specific box disappears. Cloud-managed load balancers handle most of this automatically, but self-hosted setups need it designed in deliberately.

10. Common Mistakes, Edge Cases, and Interview Insights

A few misunderstandings around load balancing come up again and again, both in real production incidents and in interview rooms.

  • Assuming L7 features are free. Path-based routing and SSL termination are convenient, but they consume CPU. Do not default to L7 without a reason if raw throughput is the priority.
  • Reaching for sticky sessions too quickly. Sticky sessions are an easy fix that quietly brings back a single point of failure. Externalizing session state is almost always the better long-term choice.
  • Forgetting the load balancer can fail, too. A highly available backend fleet behind a single load balancer instance is still not highly available overall.
  • Ignoring health check tuning. Checks that are too slow leave traffic flowing to a dead server for too long. Checks that are too aggressive can flap a healthy but momentarily slow server in and out of rotation.
  • Treating DNS as instant. DNS records are cached by clients and resolvers, so DNS-based failover or region switching can take minutes to fully propagate. Do not assume a DNS change takes effect the moment it is published.
  • Mixing algorithms without a reason. Switching algorithms per environment or per service without a documented reason makes performance issues far harder to reproduce and debug later.
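The DNS caching point in the list above is easy to underestimate, so here is a small sketch of why a published change is not seen immediately. `CachingResolver` is a hypothetical model of a client or resolver cache; the lookup function and the clock are injected so the behavior can be exercised without real DNS or real waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical model of a resolver that caches an answer until its TTL expires.
class CachingResolver {
    private final Supplier<String> authoritativeLookup; // stand-in for a real DNS query
    private final long ttlMillis;
    private final LongSupplier clock;                   // injected so time can be simulated
    private String cachedAddress;
    private long expiresAt;

    CachingResolver(Supplier<String> authoritativeLookup, long ttlMillis, LongSupplier clock) {
        this.authoritativeLookup = authoritativeLookup;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached address while it is fresh; asks the authoritative source only after expiry.
    String resolve() {
        long now = clock.getAsLong();
        if (cachedAddress == null || now >= expiresAt) {
            cachedAddress = authoritativeLookup.get();
            expiresAt = now + ttlMillis;
        }
        return cachedAddress;
    }
}
```

With a 60-second TTL, a client that resolved the name just before a failover can keep sending traffic to the old address for up to a minute, and stacked caches along the path can stretch that further.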

10.1. Interview Insights

When a system design interview reaches the load balancing stage, interviewers usually want to hear you explicitly name Layer 4 or Layer 7 and justify which one fits the scenario. If the design involves multiple microservices reachable through different URL paths, say so out loud: that is an L7 requirement.
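If it helps to picture what "routing by URL path" means, the sketch below shows the core decision in plain Java. `PathRouter`, the pool names, and the longest-prefix rule are assumptions made for this example; real L7 load balancers express the same idea through their own configuration formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical L7-style router: picks a backend pool from the request path.
class PathRouter {
    private final Map<String, String> prefixToPool = new LinkedHashMap<>();
    private final String defaultPool;

    PathRouter(String defaultPool) {
        this.defaultPool = defaultPool;
    }

    PathRouter route(String pathPrefix, String pool) {
        prefixToPool.put(pathPrefix, pool);
        return this;
    }

    // The longest matching prefix wins; unmatched paths go to the default pool.
    String poolFor(String requestPath) {
        String bestPrefix = null;
        for (String prefix : prefixToPool.keySet()) {
            boolean matches = requestPath.startsWith(prefix);
            if (matches && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        return bestPrefix == null ? defaultPool : prefixToPool.get(bestPrefix);
    }
}

class PathRouterDemo {
    public static void main(String[] args) {
        PathRouter router = new PathRouter("web-pool")
                .route("/api/", "api-pool")
                .route("/api/orders/", "orders-pool");

        System.out.println(router.poolFor("/api/orders/42")); // prints "orders-pool"
        System.out.println(router.poolFor("/index.html"));    // prints "web-pool"
    }
}
```

An L4 load balancer cannot make this decision because it never sees the path; it only sees IP addresses and ports.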

Expect a follow-up question about session handling if your design involves logged-in users. Naming sticky sessions is fine as a first answer, but naming externalized session storage as the preferred long-term solution shows a deeper understanding.

A strong closing point in any load-balancing discussion is to note that the load balancer itself requires redundancy. Many candidates design a beautiful fleet of stateless servers and forget that the single load balancer in front of them is still a single point of failure.

It also helps to mention client-side load balancing when the design involves multiple internal microservices communicating with each other, rather than assuming that every internal call passes through a shared load balancer. Naming both server-side and client-side load balancing in the same answer, and explaining when each fits, is usually enough to signal senior-level thinking on this topic.
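A minimal sketch of the client-side approach is shown below. It is not the API of any specific library: `ClientSideBalancer` is a hypothetical class, the registry is modeled as a function that returns the current instance list (where a real system would query something like Eureka or Consul), and round robin is used only because it is the simplest selection rule.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical client-side balancer: the caller picks the target instance itself.
class ClientSideBalancer {
    private final Supplier<List<String>> registry; // stand-in for a service registry lookup
    private final AtomicInteger counter = new AtomicInteger();

    ClientSideBalancer(Supplier<List<String>> registry) {
        this.registry = registry;
    }

    // Round-robin over whatever instances the registry currently reports.
    String choose() {
        List<String> instances = registry.get();
        if (instances.isEmpty()) {
            throw new IllegalStateException("no instances registered");
        }
        // floorMod keeps the index valid even after the counter overflows to negative values
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index);
    }
}

class ClientSideDemo {
    public static void main(String[] args) {
        ClientSideBalancer balancer =
                new ClientSideBalancer(() -> List.of("orders-1:8080", "orders-2:8080"));

        System.out.println(balancer.choose()); // prints "orders-1:8080"
        System.out.println(balancer.choose()); // prints "orders-2:8080"
        System.out.println(balancer.choose()); // prints "orders-1:8080"
    }
}
```

The call goes straight from the caller to the chosen instance, so there is no shared load balancer in the path to add a hop or to fail.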

11. Practical Takeaways and Key Terms

A short list of habits carries most of the value from this article into real projects.

  • Choose L4 for raw speed, L7 when you need path-based routing, SSL termination, or header-based rules.
  • Match the algorithm to the traffic: round-robin for uniform requests, least connections for variable-duration ones.
  • Prefer externalized session storage over sticky sessions whenever you can, since it keeps servers stateless.
  • Run the load balancer itself in a redundant setup, so it is never the sole point of failure.
  • Wire health checks into your load balancer so that failing instances are automatically removed from rotation.
  • Use client-side load balancing for internal service-to-service calls to avoid an unnecessary extra network hop.
  • Whenever you draw a load balancer on a diagram, ask out loud what happens if that exact box fails.

11.1. Key Terms Recap

Term | Meaning in One Line
Load balancer | Distributes incoming traffic across multiple servers
Layer 4 (L4) | Routes based on IP and port only, very fast
Layer 7 (L7) | Routes based on full request content, more flexible
Round robin | Cycles requests through servers in order
Least connections | Sends the next request to the least busy server
Consistent hashing | Hash ring routing that minimizes reshuffling when servers change
Client-side load balancing | The calling service picks a healthy instance itself; no shared load balancer is needed
Service registry | A directory of live service instances, used to discover where to send a call (e.g., Eureka, Consul)
Service mesh | Infrastructure that moves service-to-service concerns like load balancing into sidecar proxies (e.g., Istio, Linkerd)
Sticky session | Pins a client to the same backend server
Session affinity | Another name for sticky sessions
Externalized session | Session data is stored in a shared store like Redis, not on the server
Health check | Periodic probe used to detect and remove unhealthy servers
Anycast | Same IP advertised from multiple locations, routed to the nearest one

12. Conclusion

A load balancer is the component that turns a pile of independent servers into a single reliable service. Layer 4 gives you speed, Layer 7 gives you intelligence, and the choice between them depends on whether your routing decisions need to look inside the request.

Sticky sessions solve an immediate problem, but bring back the fragility that horizontal scaling was meant to remove. Externalizing session state with a shared store like Redis is the more durable fix, and it keeps every server instance interchangeable.

Carry one more habit forward from this article: whenever you draw a load balancer in a design, immediately ask whether it is redundant itself. A system is only as available as its weakest single point of failure, and it is very easy to accidentally leave the load balancer as that weak point.

None of these ideas needs to be memorized in isolation. Load balancing, algorithms, sticky sessions, and high availability all answer the same underlying question: when a request arrives, who handles it, and what happens if that choice goes wrong. Keep that question in mind, and the rest of the details in this article fall into place naturally.
