Imagine you manage a fleet of delivery trucks. Each truck carries packages (containers) that need to reach specific destinations. A single truck can carry multiple packages, but the packages are stacked together — they share the same route, the same fuel, and the same driver.
Now imagine you have hundreds of trucks and thousands of packages. You need a dispatcher who decides which truck gets which package, a routing system that tells each truck where to go, and a tracking system that monitors every delivery.
Kubernetes is that dispatcher, router, and tracking system for containerized applications. It takes your containers, groups them into pods (trucks), schedules them onto machines (nodes), and keeps everything running according to your specifications.
Kubernetes is a container orchestration platform. It automates deployment, scaling, and management of containerized applications across a cluster of machines.
The name comes from the Greek word for helmsman or pilot — someone who steers a ship. And that is exactly what Kubernetes does: it steers your containers through the waters of production infrastructure.
A Kubernetes cluster has two parts: the control plane and the worker nodes.
The control plane runs services like the API server (the front door), the scheduler (determines where pods run), and the controller manager (watches desired state vs actual state and reconciles them). Worker nodes run the kubelet (the node agent), kube-proxy (networking rules), and a container runtime like containerd.
When you tell Kubernetes “I want 3 copies of my web server running”, the control plane decides which nodes should run them, the kubelet on those nodes starts the containers, and Kubernetes continuously monitors to make sure 3 copies are always running — replacing any that crash.
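The "desired state versus actual state" idea can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the `reconcile` function, the `web-N` pod names, and the plain list standing in for cluster state are all hypothetical, invented for illustration.

```python
import itertools

_ids = itertools.count(1)  # hypothetical source of unique pod names


def reconcile(desired, running):
    """Return a pod list whose length matches the desired replica count."""
    pods = list(running)
    while len(pods) < desired:      # too few: start replacements
        pods.append("web-%d" % next(_ids))
    while len(pods) > desired:      # too many: stop the extras
        pods.pop()
    return pods


pods = reconcile(3, [])             # initial rollout creates three pods
pods.remove(pods[0])                # simulate one pod crashing
pods = reconcile(3, pods)           # the next pass restores the count
print(len(pods))
```

A real controller runs this comparison continuously against the API server rather than once per call, but the shape of the loop is the same: observe, compare, act.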
A pod is the smallest deployable unit in Kubernetes. It is one or more containers that share the same network namespace, storage volumes, and lifecycle.
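Think of a pod as a delivery truck. The truck can carry one package (a single container) or multiple packages that need to travel together (a main container plus sidecar containers). All containers in the pod share the same network namespace (one IP address and one port space), the volumes defined in the pod spec, and the pod's lifecycle.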
Why would you put multiple containers in one pod? Consider a web server that needs a log processor. The web server writes access logs to a shared volume, and a sidecar container reads those logs, processes them, and forwards them to a central logging system. They share the volume and the network, so the sidecar can access localhost to check the server health endpoint.
Here is a basic pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  labels:
    app: web
    tier: frontend
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 250m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
  restartPolicy: Always
Each pod gets a unique IP address within the cluster network. Containers inside the pod reach each other on localhost. This IP address is ephemeral — when the pod is deleted and recreated, it gets a new IP.
The demo above shows the pod spec on the left and a visual diagram on the right. Click each field — name, labels, containers, volumes, restartPolicy — to see what it controls and how it maps to the running pod.
Kubernetes actively manages the lifecycle of each container within a pod. Containers go through several states:
The kubelet on each node runs health checks:
spec:
  containers:
  - name: app
    image: my-app:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 3
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
A liveness probe tells Kubernetes whether the container is alive. If it fails, Kubernetes restarts the container. A readiness probe tells Kubernetes whether the container is ready to serve traffic. If it fails, the pod is removed from service endpoints but is not restarted. A startup probe protects slow-starting containers from being killed by liveness checks before they finish initializing.
Some applications need setup before the main containers run — database migrations, permission changes, config file generation, or waiting for a dependency to be ready.
Init containers run in order, one at a time, before the main application containers start. Each init container must complete successfully before the next one begins. If an init container fails, the kubelet restarts it until it succeeds; if the pod's restartPolicy is Never, the pod is marked as failed instead.
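spec:
  initContainers:
  # Init containers run in the order listed, so wait for the database first
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 1; done']
  # Run migrations only once the database accepts connections
  - name: migrate
    image: my-app-migrate:latest
    command: ['node', 'migrations/run.js']
  containers:
  - name: app
    image: my-app:latest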
Init containers can mount the same pod volumes as the main containers, so they can write data that the main containers later read. They can also have different resource limits, security contexts, and image pull policies. They share the pod’s network namespace, so they can reach the same services.
Key rules:
When you create a pod, it enters the Pending state. The kube-scheduler watches for unscheduled pods and assigns them to a node.
The scheduler algorithm has three phases: filtering, scoring, and binding.
The scheduler evaluates each node against the pod’s requirements. A node that fails any predicate is eliminated:
Remaining nodes are scored. The scheduler ranks nodes by how well they fit the pod. Common priority functions:
Each priority function returns a score (0-100). The scores are multiplied by weights and summed. The node with the highest total score wins.
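The weighted sum can be illustrated with a small Python sketch. The function names `least_allocated` and `image_locality`, the node names, and all scores and weights are hypothetical values chosen for the example; they are not taken from a real cluster or from the scheduler's configuration.

```python
def pick_node(scores, weights):
    """scores maps node -> {function name -> score in 0..100}."""
    totals = {}
    for node, per_function in scores.items():
        # Multiply each score by its weight, then sum per node
        totals[node] = sum(weights[fn] * s for fn, s in per_function.items())
    # The node with the highest weighted total wins
    return max(totals, key=totals.get)


scores = {
    "node-a": {"least_allocated": 80, "image_locality": 10},
    "node-b": {"least_allocated": 60, "image_locality": 90},
}
weights = {"least_allocated": 2, "image_locality": 1}

print(pick_node(scores, weights))
```

Here node-a totals 2*80 + 1*10 = 170 and node-b totals 2*60 + 1*90 = 210, so node-b is chosen even though node-a has more free resources.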
The scheduler creates a Binding object in the API server, associating the pod with the chosen node. The kubelet on that node detects the binding, pulls the container image, and starts the pod.
If no node passes the filters, the pod stays Pending. You can see why with:
kubectl describe pod my-pod
The output shows events like “0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.”
The demo walks through filtering, scoring, and binding. Adjust the pod’s resource requirements and see which nodes pass each phase.
Pods alone are not enough for production applications. You need controllers that manage pod lifecycle, replication, and identity.
A DaemonSet ensures that every node (or a subset of nodes matching a selector) runs exactly one copy of a pod. When a new node joins the cluster, the DaemonSet automatically creates a pod on it. When a node leaves, the pod is garbage collected.
Use cases:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.16
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
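A StatefulSet manages pods that need a stable, persistent identity. Unlike Deployment pods, which are interchangeable, StatefulSet pods have stable ordinal names such as web-0, web-1, web-2, per-pod storage that follows the pod across rescheduling, and ordered startup and shutdown.
StatefulSets are used for: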
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
Each pod in the StatefulSet gets its own PVC (data-postgres-0, data-postgres-1, etc.). If postgres-2 is rescheduled to a different node, it reattaches to its original PVC.
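The naming pattern is mechanical enough to write down. The following Python sketch derives the pod and PVC names for the manifest above; the helper `statefulset_identities` is a hypothetical name used only for this illustration.

```python
def statefulset_identities(name, claim, replicas):
    """Return (pod name, PVC name) pairs for ordinals 0..replicas-1."""
    return [
        ("%s-%d" % (name, i), "%s-%s-%d" % (claim, name, i))
        for i in range(replicas)
    ]


for pod, pvc in statefulset_identities("postgres", "data", 3):
    print(pod, "->", pvc)
```

Because the PVC name depends only on the claim template name, the StatefulSet name, and the ordinal, a rescheduled postgres-2 resolves to the same claim no matter which node it lands on.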
Pods are ephemeral. They crash, get rescheduled, scale up, and scale down. Every replacement pod gets a new IP address, so other components cannot hardcode pod IPs.
A Service provides a stable network endpoint that decouples clients from individual pods. When you create a Service, it gets a stable ClusterIP (virtual IP) and a DNS name. Traffic to the ClusterIP is load-balanced across the selected pods.
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: ClusterIP
The selector (app: web) matches pod labels. The Service automatically tracks which pods match the selector and updates its endpoint list as pods come and go.
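Selector matching comes down to a subset check: every key/value pair in the selector must appear in the pod's labels. A minimal Python sketch follows, covering only the equality-based selectors that Services use. The `matches` helper and the pod names are hypothetical.

```python
def matches(selector, labels):
    """True if every selector key is present in labels with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


pods = {
    "web-server": {"app": "web", "tier": "frontend"},
    "fluentd-abcde": {"name": "fluentd"},
}

endpoints = [name for name, labels in pods.items()
             if matches({"app": "web"}, labels)]
print(endpoints)
```

Extra labels on the pod (such as tier: frontend) do not prevent a match; only the keys named in the selector are compared.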
The kube-proxy daemon on every node watches the API server for Services and Endpoints. It creates iptables rules on the node to redirect traffic from the Service IP to a real pod IP.
The iptables rules work like this:
# Simplified iptables rules for a Service with 3 pods
-A PREROUTING -d 10.96.1.1/32 -p tcp --dport 80 -j KUBE-SVC-XXXXX
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.333 -j KUBE-SEP-A
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.500 -j KUBE-SEP-B
-A KUBE-SVC-XXXXX -j KUBE-SEP-C
-A KUBE-SEP-A -j DNAT --to-destination 10.1.0.2:8080
-A KUBE-SEP-B -j DNAT --to-destination 10.1.0.3:8080
-A KUBE-SEP-C -j DNAT --to-destination 10.1.0.4:8080
The first rule matches traffic to the Service IP. The next three rules pick an endpoint: the first takes 1/3 of the traffic, the second takes 1/2 of what remains, and the last takes everything left, so each pod receives 1/3 overall. Each KUBE-SEP chain performs DNAT to a specific pod.
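It can look odd that per-rule probabilities of 1/3, 1/2, and 1 produce an even split. The arithmetic can be checked with exact fractions in a short Python sketch. The function `overall_shares` is a hypothetical helper written for this check and is not part of kube-proxy.

```python
from fractions import Fraction


def overall_shares(n):
    """Overall traffic share per endpoint when rule i matches with 1/(n-i)."""
    shares = []
    remaining = Fraction(1)             # traffic that reached this rule
    for i in range(n):
        p = Fraction(1, n - i)          # 1/3, 1/2, 1/1 when n == 3
        shares.append(remaining * p)    # share taken by this rule
        remaining *= 1 - p              # the rest falls through
    return shares


print(overall_shares(3))
```

Each rule only sees the traffic that earlier rules did not claim, which is why the later rules need larger probabilities to end up with the same overall share.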
The demo visualizes the request flow. Click “Send Request” to watch traffic travel from a client through the Service IP to one of three pods via iptables DNAT rules. The rules panel shows the full iptables chain.
With a NodePort Service, traffic sent to <nodeIP>:<NodePort> is forwarded to the Service.
# Create a NodePort service
kubectl expose deployment web --type=NodePort --port=80
# Get the NodePort
kubectl get svc web
# web NodePort 10.96.1.1 <none> 80:31234/TCP
# Access via any node's IP
curl http://<node-ip>:31234
Kubernetes runs a DNS service (CoreDNS by default) that provides name resolution for Services. CoreDNS runs as a Deployment in the kube-system namespace.
Every Service gets a DNS record in the format:
<service-name>.<namespace>.svc.cluster.local
A pod in the same namespace can reach the Service by just its name:
# From any pod in the same namespace
curl http://web-service:80
# Cross-namespace
curl http://web-service.default.svc.cluster.local:80
CoreDNS watches the API server for Services and Endpoints, updating DNS records automatically. The cluster’s pod DNS configuration (in /etc/resolv.conf) points to the CoreDNS Service IP (typically 10.96.0.10).
You can test DNS resolution from any pod:
kubectl run debug --image=busybox --rm -it --restart=Never -- nslookup web-service
# Output:
# Server: 10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name: web-service
# Address 1: 10.96.1.1 web-service.default.svc.cluster.local
By default, pods get DNS records too. The format is:
<pod-ip-with-dashes>.<namespace>.pod.cluster.local
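These IP-based names change whenever the pod IP changes. StatefulSet pods get a stable name instead, of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local, through the headless Service named in serviceName:
# StatefulSet pod DNS (PostgreSQL speaks its own protocol, not HTTP)
pg_isready -h postgres-0.postgres.default.svc.cluster.local -p 5432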
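The DNS name formats in this section are plain string templates. The Python sketch below builds each one. The helper names are hypothetical, and the `cluster.local` default is an assumption that matches the examples here; a cluster can be configured with a different domain.

```python
def service_fqdn(service, namespace, domain="cluster.local"):
    """<service-name>.<namespace>.svc.<domain>"""
    return "%s.%s.svc.%s" % (service, namespace, domain)


def pod_ip_record(pod_ip, namespace, domain="cluster.local"):
    """<pod-ip-with-dashes>.<namespace>.pod.<domain>"""
    return "%s.%s.pod.%s" % (pod_ip.replace(".", "-"), namespace, domain)


def statefulset_pod_fqdn(pod, service, namespace, domain="cluster.local"):
    """<pod-name>.<service-name>.<namespace>.svc.<domain>"""
    return "%s.%s" % (pod, service_fqdn(service, namespace, domain))


print(service_fqdn("web-service", "default"))
print(pod_ip_record("10.1.0.2", "default"))
print(statefulset_pod_fqdn("postgres-0", "postgres", "default"))
```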
Services handle L4 (TCP/UDP) load balancing. But real-world applications need L7 routing — path-based routing, host-based routing, TLS termination, and virtual hosting.
Ingress provides L7 routing rules. An Ingress Controller (like NGINX Ingress, Traefik, or HAProxy) implements those rules.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1-service
            port:
              number: 80
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-v2-service
            port:
              number: 80
  - host: admin.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: admin-service
            port:
              number: 80
The Ingress resource defines routing rules. The Ingress Controller (running as a Deployment behind a LoadBalancer Service) watches the API server for Ingress resources and configures its reverse proxy accordingly.
# Install NGINX Ingress Controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml
# Check the controller
kubectl get pods -n ingress-nginx
The Ingress Controller creates an external LoadBalancer Service. DNS points your domain to the load balancer IP. The controller terminates TLS, inspects the HTTP Host header and path, and routes traffic to the correct backend Service.
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls-secret
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
The TLS secret must contain tls.crt and tls.key. The Ingress Controller terminates TLS and forwards plain HTTP to the backend.
By default, all pods in a Kubernetes cluster can communicate with each other. Network Policies restrict that communication using label selectors and CIDR rules.
A Network Policy defines ingress and egress rules:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
This policy:
Allows pods with app: frontend to reach pods with app: api on port 8080.
Allows pods with app: api to reach pods with app: database on port 5432.
Allows pods with app: api to reach CoreDNS (any namespace, label k8s-app: kube-dns) on UDP 53.
Network Policies require a CNI plugin that supports them, such as Calico, Cilium, Weave Net, or Antrea. Flannel on its own does not enforce policies.
# List network policies
kubectl get networkpolicies
# Describe a policy
kubectl describe networkpolicy api-policy
Common patterns:
A Deployment manages a set of identical pods (a ReplicaSet) with rolling update capabilities. It declares the desired state, and the Deployment Controller reconciles the actual state toward it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
When you update the image, Kubernetes performs a rolling update:
kubectl set image deployment/web nginx=nginx:1.26
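The rolling update strategy is controlled by two parameters: maxSurge (how many pods may exist above the desired replica count during the update) and maxUnavailable (how many pods may be unavailable during the update).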
If replicas=5, maxSurge=1, and maxUnavailable=1, the update proceeds like this:
This continues until all pods are v2. At any point, there are at least 4 running pods (replicas - maxUnavailable) and at most 6 (replicas + maxSurge).
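The bounds can be made concrete with a simplified simulation. The Python sketch below is not the Deployment controller's actual algorithm: it assumes every new pod becomes ready immediately, and `rolling_update` is a hypothetical helper. It only shows how the floor and ceiling constrain each wave.

```python
def rolling_update(replicas, max_surge, max_unavailable):
    """Return the (old, new) pod counts after each wave of the update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    floor = replicas - max_unavailable   # minimum pods that must stay up
    ceiling = replicas + max_surge       # maximum pods that may exist
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Create v2 pods as far as the ceiling allows (assumed ready at once)
        new += min(replicas - new, ceiling - (old + new))
        # Delete v1 pods as far as the floor allows
        old -= min(old, (old + new) - floor)
        history.append((old, new))
    return history


print(rolling_update(5, 1, 1))
```

With replicas=5, maxSurge=1, and maxUnavailable=1 the recorded states are (5, 0), (3, 1), (1, 3), (0, 5). Inside each wave the total briefly reaches 6 after the scale-up and drops back to 4 after the scale-down, which matches the stated limits.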
The demo shows each wave of the rolling update: create a new v2 pod, wait for readiness, delete an old v1 pod, repeat. The counters track desired, current, updated, and v1 remaining counts.
If the new version has issues, roll back:
kubectl rollout undo deployment/web
# Roll back to a specific revision
kubectl rollout undo deployment/web --to-revision=2
# View rollout history
kubectl rollout history deployment/web
Deployments store revision history (controlled by revisionHistoryLimit, default 10). Each change to the pod template creates a new revision.
Configuration should be separated from application code. Kubernetes provides two resources for this.
A ConfigMap stores non-sensitive configuration data as key-value pairs:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_COLOR: blue
  APP_MODE: production
  log_level: info
  nginx.conf: |
    server {
      listen 80;
      server_name example.com;
    }
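A Secret stores sensitive data such as passwords, API keys, and certificates. Values under data must be base64-encoded. Base64 is an encoding, not encryption, so anyone who can read the Secret can decode it: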
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  DB_PASSWORD: czNjcmV0IQ==
  API_KEY: c2stYWJjMTIzeHl6
  DB_USERNAME: YWRtaW4=
Create secrets imperatively to avoid manual encoding:
kubectl create secret generic app-secret \
--from-literal=DB_PASSWORD='s3cret!' \
--from-literal=API_KEY='sk-abc123xyz'
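The encoded values in the Secret manifest above can be reproduced, and reversed, with Python's standard library. This is only a sketch to show that the encoding is trivially reversible; the `encode` and `decode` helper names are hypothetical.

```python
import base64


def encode(value):
    """Encode a plain string the way Secret data values are stored."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode(value):
    """Recover the plain string from a Secret data value."""
    return base64.b64decode(value).decode("utf-8")


print(encode("s3cret!"))        # czNjcmV0IQ==
print(encode("sk-abc123xyz"))   # c2stYWJjMTIzeHl6
print(decode("YWRtaW4="))       # admin
```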
As environment variables:
apiVersion: v1
kind: Pod
metadata:
  name: config-pod
spec:
  containers:
  - name: app
    image: my-app:latest
    envFrom:
    - configMapRef:
        name: app-config
    - secretRef:
        name: app-secret
As mounted files:
spec:
  containers:
  - name: app
    image: my-app:latest
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
    - name: secret-volume
      mountPath: /etc/secret
      readOnly: true
  volumes:
  - name: config-volume
    configMap:
      name: app-config
  - name: secret-volume
    secret:
      secretName: app-secret
When mounted as files, each key becomes a file. For Secrets, the file content is the decoded value (not base64). Updating a ConfigMap or Secret automatically updates the mounted files (with a propagation delay of minutes). Environment variables are not updated — the pod must be restarted.
The demo shows ConfigMap and Secret data side by side. Toggle between environment variables and file mount modes to see how a pod consumes configuration. Secret values are displayed base64-encoded in the definition but decoded when consumed.
The control plane is the brain of the cluster. It runs on dedicated control plane nodes (or as a managed service in EKS, GKE, AKS).
The API server is the front door to the cluster. It exposes the Kubernetes API (RESTful over HTTPS). Every operation — listing pods, creating deployments, watching for changes — goes through the API server.
It is the only component that talks to etcd. All other components (scheduler, controller manager, kubelet) communicate with the API server, never with etcd directly. This serializes access and ensures consistency.
The API server:
etcd is a distributed, consistent key-value store. It is the single source of truth for cluster state. Kubernetes stores everything here: pod specs, deployments, secrets, configmaps, service definitions, node status.
etcd uses the Raft consensus protocol to maintain consistency across replicas (typically 3 or 5). Writes require a majority (quorum) to commit. If quorum is lost, the cluster cannot accept writes.
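The quorum arithmetic explains why clusters of 3 or 5 members are typical. A short Python sketch follows; the helper names `quorum` and `tolerated_failures` are hypothetical.

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
```

A 4-member cluster needs 3 votes and tolerates only 1 failure, the same as a 3-member cluster, so an even member count adds cost without adding fault tolerance.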
# Backup etcd (control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
The scheduler watches the API server for newly created pods with no node assignment. It selects a node using the filtering and scoring algorithm described earlier. The scheduling decision considers resource requirements, affinity/anti-affinity rules, taints and tolerations, and node health.
The controller manager runs a set of controllers in a single binary. Each controller watches the API server for changes to a specific resource type and reconciles actual state with desired state:
The kubelet is the node agent that runs on every worker node. It is usually not run as a container; it typically runs as a systemd service directly on the node OS.
The kubelet:
kube-proxy runs on every node (as a DaemonSet). It implements the Service abstraction by maintaining network rules (iptables or IPVS) that forward traffic from Service IPs to pod IPs.
The container runtime is the software that actually runs containers. Kubernetes uses the Container Runtime Interface (CRI) to support multiple runtimes:
The demo visualizes the full cluster architecture. Click through each step to trace a pod creation request from kubectl through the API server, etcd, scheduler, and kubelet to the container runtime.
Kubernetes lets you run multiple schedulers in the same cluster. Each pod selects which scheduler to use via the schedulerName field:
spec:
  schedulerName: my-custom-scheduler
  containers:
  - name: app
    image: nginx
Run a custom scheduler as a Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler-sa
      containers:
      - name: scheduler
        image: my-custom-scheduler:latest
        command:
        - /scheduler
        - --scheduler-name=my-custom-scheduler
The custom scheduler watches pods with schedulerName: my-custom-scheduler and binds them to nodes. The default scheduler ignores those pods entirely.
This is useful when you have specialized hardware (GPUs, TPUs, FPGA) that requires domain-specific scheduling logic, or when you need custom bin-packing algorithms.
For smaller changes, you can extend the default scheduler with a scheduler extender — an HTTP webhook that the scheduler calls during filtering and scoring. The extender can filter nodes or adjust scores based on external data.
Not all pods are equally important. A user-facing web server should take priority over a batch data processing job. Kubernetes uses PriorityClasses to express importance:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for user-facing services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
globalDefault: false
description: "Low priority for batch jobs"
Pods reference a PriorityClass by name:
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: my-app:latest
When the scheduler cannot fit a high-priority pod, it can preempt (evict) lower-priority pods to free resources. The preempted pods are gracefully terminated (SIGTERM, then SIGKILL after grace period).
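Preemption is a hard eviction: the victims are removed even if that disrupts their workload. PodDisruptionBudgets offer some protection, because the scheduler tries to choose victims whose budgets would not be violated, but during preemption they are honored only on a best-effort basis: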
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web
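This PDB keeps at least 3 pods matching app: web available during voluntary disruptions such as node drains. The scheduler also takes it into account when choosing preemption victims, though only on a best-effort basis.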
The Cluster Autoscaler automatically adjusts the number of nodes in the cluster based on pod resource requests. When pods cannot be scheduled due to insufficient resources, it adds nodes. When nodes are underutilized for an extended period, it removes them.
The autoscaler integrates with cloud providers (AWS, GCP, Azure) to create and terminate VM instances. It does not work with on-premise bare metal unless you have an equivalent provisioning mechanism.
# Install Cluster Autoscaler on EKS
kubectl apply -f https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# Check autoscaler logs
kubectl logs -n kube-system deployment/cluster-autoscaler
Scale-down can be disabled entirely:
# Cluster Autoscaler deployment flag
--scale-down-enabled=false
The HPA scales the number of pod replicas based on CPU, memory, or custom metrics:
kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=10
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
The HPA controller queries the metrics server every 15 seconds and adjusts the replica count. Scaling is proportional to the ratio of current utilization to target utilization. By default, a 5-minute stabilization window applies to scale-down (scale-up has none), which prevents thrashing.
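The proportional rule can be written as a one-line formula: desired = ceil(current * currentUtilization / targetUtilization), clamped to the min and max replica counts. The Python sketch below applies it to a single metric. The `desired_replicas` helper is hypothetical, the 10% tolerance band is assumed to be the controller's default, and the sketch ignores stabilization windows and multiple metrics.

```python
import math


def desired_replicas(current, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count suggested by the proportional scaling rule."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current                  # close enough to target: no change
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, 90, 70, 3, 10))   # overloaded: scale up
print(desired_replicas(4, 72, 70, 3, 10))   # within tolerance: unchanged
print(desired_replicas(4, 20, 70, 3, 10))   # idle: clamped to the minimum
```

With 4 replicas at 90% CPU against a 70% target, the ratio is about 1.29, so the suggestion is ceil(5.14) = 6 replicas. At 20% the raw result would be 2, but minReplicas keeps it at 3.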
Make sure you can answer these questions before closing this page:
If you got them all, you understand the core of Kubernetes orchestration. If not, revisit the demos above — each one illustrates a specific concept that builds on the previous ones.