In the modern day, application workloads have to be both resilient to downtime and able to scale to handle sudden traffic spikes. Most applications run inside containers, and Kubernetes is used to orchestrate them, as it scales containerized workloads efficiently while offering self-healing capabilities. Whether you are already running a microservices architecture or migrating legacy monoliths to Kubernetes, understanding the various autoscaling mechanisms can significantly enhance your infrastructure’s reliability while keeping it cost-effective.

There are various tools that can be used to dynamically adjust resources based on real-time demand for Kubernetes clusters. These tools help teams optimise performance, reduce costs, and ensure high availability, especially during peak traffic hours.

Let us dive into why having a robust autoscaling strategy for Kubernetes clusters is so important, and learn about the different types of autoscaling mechanisms in Kubernetes.

Why is Autoscaling important?

Let us understand the importance of setting up an autoscaling system for clusters with the help of a scenario. Imagine that you are handling the operations for an e-commerce website, and during certain times of day, the website traffic increases. Your applications are only given enough resources to handle a certain number of requests, so when the number of requests exceeds the available resources, the application might start lagging and users would experience degraded performance. To solve this, DevOps engineers have to manually scale up the number of Pods that are part of the Deployment, and increase the node resources or provision more nodes.

Performing this process manually can take up quite a bit of time from detection of increased traffic to actually scaling the workloads or nodes. In some cases, the e-commerce website might have a huge sale that starts at midnight. Engineers can configure the workloads to handle increased traffic, but reality can be different from projected numbers, and there could be a huge spike in traffic. By the time engineers can respond and adjust the application scale and node resources accordingly, end users have already had a negative experience.

In such scenarios, having an autoscaling mechanism in place can be very helpful to ensure that the end-user experience is not degraded and that the application can scale effectively to accommodate requests. Certain tools can be used to dynamically adjust the cluster’s resources based on real-time demand. If your application experiences a surge in traffic, Kubernetes can automatically spin up additional Pods or nodes to handle the load, ensuring that your customers continue to have a smooth, uninterrupted experience. And when the traffic subsides, it scales down the resources to save on costs, optimising your infrastructure.

Types of K8s Autoscaling

Kubernetes has an extensive list of autoscaling methods that help your applications automatically adapt to ever-changing traffic patterns. Each autoscaler performs a different scaling function within a Kubernetes cluster. Some scale Pods, while others change the number of nodes in a cluster. Let us quickly highlight what those different autoscaling methods are.

Horizontal Pod Autoscaler (HPA): Scales the number of Pods within a deployment or StatefulSet according to resource demand metrics such as CPU or memory. It can also use custom metrics as configured in your deployment.

Vertical Pod Autoscaler (VPA): Rather than scaling up the number of Pods, VPA concentrates on optimising the resource allocation for each Pod so that it runs efficiently.

Cluster Autoscaler: Automatically scales the number of nodes in a cluster, adding or removing nodes based on Pod scheduling needs. It increases the number of nodes when Pods cannot be scheduled due to insufficient resources, and decommissions nodes that are not serving workloads when scaling down.

Event-Driven Autoscaling (KEDA): Monitors external and cluster events and scales Kubernetes resources when the configured conditions are met. This is useful for scaling workloads on custom metrics, such as the number of messages in a queue or the number of HTTP requests.

Karpenter: An AWS-specific autoscaling mechanism that dynamically provisions and terminates EC2 instances depending on workload needs.

We will now delve further into the details of these autoscaling mechanisms and look at which situations each autoscaling strategy fits best.

Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaler (HPA) is a commonly used scaling mechanism for Kubernetes. It modifies the number of replicas of a Deployment or StatefulSet based on metrics like CPU or memory utilisation. It can also scale the Pods based on custom metrics collected by a tool such as Prometheus. By watching these metrics against the incoming traffic, HPA ensures there are enough replicas of the application Pods to serve that traffic: it increases the number of Pods if there are not enough to handle the incoming requests, and, on the other hand, it scales the Pods down to save costs if there are too many Pods for too little traffic.

HPA constantly monitors the application’s metrics coming from sources like the Kubernetes Metrics Server or Prometheus. These actual, real-time values are compared with the target threshold set when configuring HPA. If actual usage surpasses the target, HPA increases the number of Pods so the workload can be handled. Conversely, if actual usage drops below the target, HPA automatically scales down the Pods, improving resource utilisation and lowering costs. For instance, you can set a target of 80% CPU utilisation: once the average utilisation of the Pods crosses that threshold, HPA creates additional replicas, and if utilisation falls well below 80%, HPA removes one or more Pods.

HPA uses a formula to determine the desired number of replicas. It multiplies the current number of Pods by the ratio of current utilisation to target utilisation. For example, if the average CPU usage is twice the target, HPA will double the number of Pods. To prevent excessive scaling due to short-lived spikes, HPA includes stabilisation windows and cooldown periods, ensuring that scaling decisions are smooth and efficient.
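
To make that calculation concrete, here is a minimal Python sketch of the replica formula. The 80% target and the utilisation figures are illustrative only; the real HPA additionally applies tolerances and the stabilisation behaviour mentioned above.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Desired replica count, mirroring the HPA formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 4 Pods averaging 160% CPU against an 80% target -> HPA doubles to 8 Pods.
print(desired_replicas(4, 160, 80))  # 8

# 4 Pods averaging 30% CPU against an 80% target -> HPA scales down to 2 Pods.
print(desired_replicas(4, 30, 80))   # 2
```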

When to use HPA?

  • When the application pods have to be scaled to efficiently distribute the total load.
  • Microservices that have unpredictable request rates.
  • Applications that need to scale based on custom metrics from sources such as Open Telemetry or Prometheus.
  • SaaS applications that have a dynamic user load.

Vertical Pod Autoscaling (VPA)

The Vertical Pod Autoscaler (VPA) is a tool that adjusts the amount of resources a Pod can use, that is, it enables the Pod to scale vertically. Unlike the Horizontal Pod Autoscaler (HPA), which increases the number of Pods a workload has, VPA adjusts the CPU and memory requests and limits of each Pod individually. This way, applications get the resources they actually need: you neither over-provision resources that go unused nor under-provision and slow the applications down. VPA continuously observes how much CPU and memory the Pods use and modifies their resource requests and limits accordingly. It is composed of three main parts:

  • VPA Recommender: Gathers metrics on your Pods’ resource consumption, analyses historical usage patterns, and provides recommendations for optimal CPU and memory settings.
  • VPA Updater: Applies the Recommender’s suggestions. If a Pod’s requests or limits have to change, the Updater evicts the old Pod so it can be recreated with the new values.
  • VPA Admission Controller: When the Updater evicts a pod, the Admission controller updates the CPU and Memory requests in the pod’s manifest before starting the new pod. 

Depending on the configuration mode, the VPA can either provide recommendations only (Off mode), automatically update resource requests for running Pods (Auto mode), or restart Pods with new resource settings (Recreate mode). When set to Auto mode, the VPA adjusts the resource requests of existing Pods on the fly. By dynamically right-sizing your workloads, VPA ensures that applications run efficiently, leading to better resource utilisation and cost savings.
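
As a rough illustration of the Recommender’s job, the sketch below derives a CPU request from historical usage samples. The percentile, safety margin, and sample values are made up for illustration; the real Recommender works from decaying histograms of per-container usage rather than this simple calculation.

```python
def recommend_cpu_request(usage_samples: list[float], percentile: float = 90,
                          safety_margin: float = 0.15) -> float:
    """Toy right-sizing: take a high percentile of observed CPU usage and add headroom.

    This only illustrates the idea of deriving a request from historical samples;
    it is not the VPA Recommender's actual algorithm."""
    ordered = sorted(usage_samples)
    index = min(len(ordered) - 1, int(len(ordered) * percentile / 100))
    return ordered[index] * (1 + safety_margin)

# Hypothetical CPU usage samples (in cores) collected for one container over time.
cpu_usage = [0.21, 0.25, 0.24, 0.30, 0.28, 0.55, 0.27, 0.26, 0.31, 0.29]
print(f"Recommended CPU request: {recommend_cpu_request(cpu_usage):.2f} cores")
```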

When to use VPA?

  • Stateful applications that need consistent resource availability such as databases.
  • Workloads where right-sizing CPU and memory can lead to performance gains.
  • When you wish to optimise resource utilisation without scaling the number of Pods.
  • VPA and HPA can be used together to further optimise how resources are utilised in the cluster.

Cluster Autoscaling

The Cluster Autoscaler is a Kubernetes component that automatically changes the number of nodes in the cluster, adding or removing nodes to fulfill Pod scheduling requirements. It is especially useful for Kubernetes clusters running in the cloud, where new nodes can be provisioned dynamically, but it is not limited to that. The Cluster Autoscaler integrates with cloud providers such as AWS, Google Cloud, and Azure, and it can also be set up for an on-premises cluster to automatically adjust the number of nodes.

Imagine a case where VPA and HPA are already implemented in the cluster. They help only as long as the existing nodes have enough resources to run the new or resized Pods; once that capacity runs out, they cannot help. The main goal of the Cluster Autoscaler is to help Pods get scheduled on nodes that have adequate capacity. It automatically scales the cluster up, giving your applications the resources they need to operate. Like the other autoscalers, it also saves costs: if some nodes have low usage, it scales the cluster down by terminating the unneeded nodes.

The Cluster Autoscaler keeps looking for unschedulable Pods, i.e. Pods that cannot be placed on existing nodes because there are not enough CPU or memory resources. When it detects such Pods, it attempts to scale up by adding new nodes to the cluster, using predefined configurations or templates (like node groups or node pools). The Cluster Autoscaler uses the cloud provider’s APIs to provision the new nodes, making additional compute resources available for workloads when required.

When scaling down, i.e. removing underutilised nodes from the cluster, the Cluster Autoscaler drains and then deletes each underutilised node, helping to reduce infrastructure costs. To avoid disrupting your workloads, it respects Pod Disruption Budgets and ensures that critical Pods are not evicted during the scaling process. Note that the Cluster Autoscaler scales the cluster horizontally, i.e. it changes the number of nodes in the cluster.
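
The scale-up pass described above can be sketched roughly as follows: for each unschedulable Pod, find a node group whose node shape can hold it and plan one extra node. The node-group names and sizes here are hypothetical, and the real autoscaler simulates scheduling and calls the cloud provider’s APIs rather than returning a plan.

```python
from dataclasses import dataclass

@dataclass
class PendingPod:
    name: str
    cpu: float      # requested vCPUs
    memory: float   # requested memory in GiB

@dataclass
class NodeGroup:
    name: str
    node_cpu: float
    node_memory: float

def scale_up_plan(pending: list[PendingPod], groups: list[NodeGroup]) -> dict[str, int]:
    """Toy scale-up pass: add one node per unschedulable Pod in the first
    node group whose node shape can accommodate the Pod's requests."""
    plan: dict[str, int] = {}
    for pod in pending:
        for group in groups:
            if pod.cpu <= group.node_cpu and pod.memory <= group.node_memory:
                plan[group.name] = plan.get(group.name, 0) + 1
                break
    return plan

pending_pods = [PendingPod("checkout-7f9c4", cpu=2.0, memory=4.0),
                PendingPod("search-5b2d1", cpu=6.0, memory=24.0)]
node_groups = [NodeGroup("general-purpose", node_cpu=4.0, node_memory=16.0),
               NodeGroup("memory-optimised", node_cpu=8.0, node_memory=64.0)]
print(scale_up_plan(pending_pods, node_groups))
# {'general-purpose': 1, 'memory-optimised': 1}
```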

When to use Cluster Autoscaler?

  • When your Pods cannot be scheduled due to lack of resources on existing nodes.
  • Optimise infrastructure costs by adding/removing nodes based on demand.
  • Suitable for mixed workloads where different node types are needed.

Event-Driven Autoscaling (KEDA)

KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that scales Kubernetes workloads based on events that occur. Unlike traditional autoscalers like HPA, which scale based on resource metrics such as CPU or memory, KEDA allows you to scale your applications based on external event sources like message queues, databases, or custom metrics. It is useful for applications that have unpredictable workloads, such as processing tasks from a queue or responding to real-time events, where scaling based on CPU/memory alone may not be efficient.

KEDA extends Kubernetes’ native scaling capabilities by integrating with multiple different event sources such as Apache Kafka, RabbitMQ, AWS SQS, Azure Event Hubs, and more. This allows applications to scale up or down in response to specific event triggers, ensuring that your resources are used efficiently and your applications remain responsive under varying loads.

KEDA operates by deploying a Kubernetes Custom Resource Definition (CRD) called ScaledObject. The ScaledObject resource defines how your application should scale based on external event metrics. It continuously monitors event sources by connecting to external systems such as message brokers, databases, or HTTP endpoints to gather metrics. These metrics can be things such as the length of a message queue or the rate of incoming HTTP requests. When a specified threshold is met, KEDA triggers Kubernetes’ Horizontal Pod Autoscaler (HPA) to scale the number of Pods accordingly.

For example, if you are using KEDA with an Azure Queue, it can detect when the queue length exceeds a certain number of messages and automatically scale your Pods to process those messages faster. Once the event load decreases, KEDA scales the Pods back down to save resources. KEDA is lightweight and runs as a single Pod in your cluster, ensuring minimal overhead while adding powerful event-driven autoscaling capabilities to your Kubernetes workloads.
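
A ScaledObject is declared in YAML, but the decision it drives can be sketched as a small calculation: divide the observed queue length by a per-Pod target and clamp the result between the configured minimum and maximum replicas. The numbers below are hypothetical, and this is only the shape of the logic, not KEDA’s actual implementation.

```python
import math

def queue_based_replicas(queue_length: int, messages_per_pod: int,
                         min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Toy queue trigger: aim for one Pod per `messages_per_pod` messages,
    clamped to the configured replica bounds (including scale-to-zero)."""
    if queue_length == 0:
        return min_replicas
    desired = math.ceil(queue_length / messages_per_pod)
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical: 240 messages in the queue, a target of 50 messages per Pod -> 5 Pods.
print(queue_based_replicas(queue_length=240, messages_per_pod=50))  # 5
# Queue drained -> scale back down to the minimum (here, zero Pods).
print(queue_based_replicas(queue_length=0, messages_per_pod=50))    # 0
```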

When to use KEDA?

  • Event-driven applications like background jobs, message queue consumers, or real-time data processors.
  • Need to scale based on custom metrics or external events.
  • Ideal for serverless architectures running on Kubernetes.
  • Used alongside HPA and VPA.

Karpenter

Karpenter is an open-source, high-performance autoscaler for Kubernetes, designed to improve the efficiency and scalability of your cluster by dynamically provisioning the right compute resources in real-time. Unlike the traditional Cluster Autoscaler, which operates based on predefined node groups, Karpenter optimises infrastructure by launching nodes with custom configurations tailored to the specific resource requirements of your workloads. Karpenter works best with EKS clusters where it can rapidly adjust capacity to meet the demands of your applications.

Karpenter focuses on flexibility and speed, allowing it to launch nodes faster and with more granular control over instance types, zones, and hardware specifications. This makes it ideal for handling highly dynamic workloads, where you need to scale up quickly or use specialised hardware like GPUs or spot instances for cost optimisation.

Karpenter works similarly to the Cluster Autoscaler. It continuously monitors the Kubernetes cluster for unschedulable Pods. When it detects such Pods, Karpenter automatically provisions new nodes with the exact resources required and selects the most cost-effective instance types available from the cloud provider. Instead of relying on static node groups, it uses flexible node templates to match workload requirements with optimal infrastructure.
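
As a rough sketch of that fit-and-cost decision, the snippet below picks the cheapest instance type that satisfies the pending Pods’ combined requests. The instance names and prices are invented, and Karpenter’s real provisioning logic (node templates, consolidation, spot handling) is far more sophisticated.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstanceType:
    name: str
    cpu: float            # vCPUs
    memory: float         # GiB
    hourly_price: float   # USD per hour (made-up figures)

def cheapest_fit(catalog: list[InstanceType], cpu_needed: float,
                 memory_needed: float) -> Optional[InstanceType]:
    """Pick the lowest-priced instance type that covers the pending Pods' requests."""
    candidates = [i for i in catalog if i.cpu >= cpu_needed and i.memory >= memory_needed]
    return min(candidates, key=lambda i: i.hourly_price) if candidates else None

catalog = [
    InstanceType("small-2x", cpu=2, memory=8, hourly_price=0.08),
    InstanceType("medium-4x", cpu=4, memory=16, hourly_price=0.15),
    InstanceType("large-8x", cpu=8, memory=32, hourly_price=0.31),
]
choice = cheapest_fit(catalog, cpu_needed=3.0, memory_needed=10.0)
print(choice.name if choice else "no suitable instance")  # medium-4x
```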

When to use Karpenter?

  • High-performance, cloud-native applications with unpredictable scaling needs.
  • Need for rapid provisioning of diverse node types.
  • Optimise costs by selecting the most efficient instances based on the current workload.
  • Fine-grained autoscaling for AWS-specific resources.

Conclusion

Autoscaling is one of the critical components of a robust Kubernetes cluster, as it allows the cluster to adapt to dynamic traffic. It has become central to how organisations save resources and manage their cloud costs. Kubernetes’ different autoscaling mechanisms can be applied one by one or together to meet a wide variety of use cases. The common ones include horizontal scaling with HPA and vertical scaling with VPA. The Cluster Autoscaler dynamically scales the nodes in the cluster depending on the need, while the AWS-specific Karpenter pairs nicely with AWS products such as EC2 instances, including Graviton and Spot instances. KEDA offers an event-triggered approach, scaling application resources to meet the target demand.

The best autoscaling strategy is usually determined by the nature of your workloads, how predictable your traffic patterns are, and your operational aims. Whether you are running stateless microservices, resource-heavy databases, or event-driven apps, there is an autoscaler that can provide the right balance between efficiency and performance.

These powerful autoscaling tools allow businesses to tune their Kubernetes clusters to their applications’ needs, ensuring both reliability and cost-effectiveness. That adaptability matters most while a business is in its scaling phase, when operational demands change quickly, and it is worth considering even if your company is just starting out.

For in-depth information on AWS, Kubernetes Autoscaling, nodes, pods, and clusters, visit the CloudZenia website.