Kubernetes crossed from experimental to production-standard in most mid-to-large engineering organisations by 2023. The cluster management problems are largely solved. What remains challenging, and where teams still regularly make costly mistakes, is the operational layer: cost management, security posture, observability, and right-sizing workloads. This is what three years of production Kubernetes actually teaches you.
Lesson 1: Resource Requests and Limits Are Not Optional
The most common production incident pattern in Kubernetes clusters is node pressure caused by pods without resource requests. When containers do not declare CPU and memory requests, the scheduler has nothing to base placement decisions on, and nodes become over-committed. The Kubernetes resource management documentation is clear on this, but the step is routinely skipped in early deployments.
Set requests based on actual measured usage (from Prometheus metrics) and set limits 20–30% above requests for burst tolerance. Avoid CPU limits tight enough to throttle a container under normal load. Throttling is enforced per scheduling period, so this common misconfiguration shows up as latency spikes even when average CPU utilisation looks far from saturated.
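As a rough illustration of the sizing rule above, the following Python sketch derives requests from measured usage samples and places limits 25% above them. The function names, the choice of the 95th percentile, and the sample values are assumptions made for this example, not a prescribed method. The throttling helper assumes you have already read the two cAdvisor counters named in its comment from Prometheus.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer arithmetic avoids floating-point surprises when computing the rank.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def recommend_resources(cpu_millicores, memory_mib, headroom=0.25):
    """Requests from measured usage, limits from requests plus headroom.

    Assumptions of this sketch: the request is the 95th percentile of the
    samples, and headroom=0.25 sits inside the 20-30% band.
    """
    cpu_request = math.ceil(percentile(cpu_millicores, 95))
    mem_request = math.ceil(percentile(memory_mib, 95))
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {
            "cpu": f"{math.ceil(cpu_request * (1 + headroom))}m",
            "memory": f"{math.ceil(mem_request * (1 + headroom))}Mi",
        },
    }


def throttled_ratio(throttled_periods, total_periods):
    """Share of CPU scheduling periods in which the container was throttled.

    Inputs are increases over the same time window of the cAdvisor counters
    container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total.
    """
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods


if __name__ == "__main__":
    cpu = [100, 120, 150, 200, 180, 160, 140, 130, 110, 190]  # hypothetical samples
    mem = [256] * 10
    print(recommend_resources(cpu, mem))
    print(throttled_ratio(30, 120))
```

A throttled ratio that stays well above zero while average CPU usage looks low is a sign that the CPU limit, not the workload, is causing the latency.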
Lesson 2: Cost Visibility Must Come Before Optimisation
Kubernetes clusters are deceptively expensive without namespace-level cost attribution. Tools like Kubecost or OpenCost provide per-namespace, per-workload cost visibility that is essential before you can optimise. Without this, teams optimise the wrong workloads and miss the 20–40% waste that typically exists in under-utilised nodes running low-priority jobs.
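To make the idea of namespace-level attribution concrete, here is a deliberately simplified Python sketch that splits one node's hourly cost across namespaces by their share of requested CPU and reports the unrequested remainder as idle cost. This is not how Kubecost or OpenCost compute allocations. The function name, the `__idle__` key, and the prices are hypothetical.

```python
def attribute_node_cost(node_hourly_cost, node_cpu_millicores, requests_by_namespace):
    """Split a node's hourly cost by requested CPU share (simplified model).

    requests_by_namespace maps namespace name -> total requested millicores
    on this node. Capacity nobody requested is reported under "__idle__".
    """
    if node_cpu_millicores <= 0:
        raise ValueError("node_cpu_millicores must be positive")
    allocation = {
        namespace: node_hourly_cost * millicores / node_cpu_millicores
        for namespace, millicores in requests_by_namespace.items()
    }
    idle = node_hourly_cost - sum(allocation.values())
    allocation["__idle__"] = max(idle, 0.0)
    return allocation


if __name__ == "__main__":
    # Hypothetical node: 4 vCPU at 1.00 per hour, half of it requested.
    costs = attribute_node_cost(1.0, 4000, {"checkout": 1000, "batch": 1000})
    for namespace, cost in costs.items():
        print(f"{namespace}: {cost:.2f}/hour")
```

Even this toy model shows the point of attribution: the idle share is a cost that no team sees unless it is reported explicitly.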
Cluster Autoscaler and Karpenter (an open-source node provisioner originally built by AWS) reduce idle node cost significantly. Karpenter in particular bin-packs pods more aggressively than Cluster Autoscaler and supports Spot instance handling natively.
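Bin-packing here means fitting pod resource requests onto as few nodes as possible. The Python sketch below uses the classic first-fit-decreasing heuristic on CPU requests alone. It is a toy illustration of the idea, not Karpenter's or Cluster Autoscaler's actual algorithm, and the pod sizes and node capacity are hypothetical.

```python
def first_fit_decreasing(pod_requests, node_capacity):
    """Pack pod CPU requests (millicores) onto identical nodes.

    Pods are placed largest first, each onto the first node with enough room.
    A new node is opened only when no existing node fits.
    """
    nodes = []  # each entry is the list of pod requests placed on that node
    for request in sorted(pod_requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"pod requesting {request}m does not fit on any node")
        for node in nodes:
            if sum(node) + request <= node_capacity:
                node.append(request)
                break
        else:
            # No existing node had room, so provision another one.
            nodes.append([request])
    return nodes


if __name__ == "__main__":
    placement = first_fit_decreasing([500, 1500, 1000, 2000, 700, 300], 2000)
    for index, pods in enumerate(placement):
        print(f"node-{index}: {pods} ({sum(pods)}m used)")
```

Tighter packing only helps when requests reflect real usage, which is another reason to size requests before tuning the autoscaler.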
Lesson 3: Security Defaults Are Insufficient
Default Kubernetes configurations are not production-secure. Containers run as root unless the image or pod spec says otherwise, inter-namespace network traffic is unrestricted, and RBAC is typically over-permissive in early clusters. A baseline secure posture requires: Pod Security admission enforcing non-root containers, read-only root filesystems set in each container's securityContext, NetworkPolicies isolating namespace traffic, encryption at rest for Secrets (by default they are stored in etcd base64-encoded, not encrypted), and RBAC scoped to least privilege.
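Two of the baseline settings above, non-root containers and read-only root filesystems, can be checked mechanically. The following Python sketch audits a pod spec that has already been parsed into a dictionary (for example from a YAML manifest). The function name and the wording of the findings are hypothetical. The field names `securityContext`, `runAsNonRoot`, and `readOnlyRootFilesystem` are the real pod spec fields.

```python
def audit_pod_spec(pod_spec):
    """Return a list of findings for containers missing baseline settings.

    pod_spec is the "spec" section of a Pod (or of a pod template) as a dict.
    A container-level runAsNonRoot takes precedence over the pod-level value.
    """
    findings = []
    pod_context = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        context = container.get("securityContext") or {}
        non_root = context.get("runAsNonRoot", pod_context.get("runAsNonRoot", False))
        if not non_root:
            findings.append(f"{name}: runAsNonRoot is not set to true")
        # readOnlyRootFilesystem exists only at container level.
        if not context.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: readOnlyRootFilesystem is not set to true")
    return findings


if __name__ == "__main__":
    spec = {
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {"name": "api", "securityContext": {"readOnlyRootFilesystem": True}},
            {"name": "sidecar"},
        ],
    }
    for finding in audit_pod_spec(spec):
        print(finding)
```

A check like this is best treated as a reporting aid. Enforcement still belongs in admission control, so that non-compliant pods are rejected rather than discovered later.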
The Kubernetes Security Checklist provides a practical audit framework. Run it quarterly — cluster configurations drift over time as teams add workloads.
Lesson 4: Observability Is Not Optional Post-MVP
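A baseline observability stack covers metrics, traces, logs, and alerts:
Prometheus + Grafana for metrics: standard and battle-tested
OpenTelemetry for distributed tracing: a vendor-neutral CNCF project that is widely treated as the default choice
Loki or Elasticsearch for log aggregation: Loki is cheaper to run; Elasticsearch offers richer full-text querying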
Alerts on pod restart rate, pending pod count, and node memory pressure as baseline signals
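The three baseline alert signals can be expressed as simple threshold checks. In a real cluster they would normally be Prometheus alerting rules, typically built on kube-state-metrics series such as kube_pod_container_status_restarts_total, kube_pod_status_phase, and kube_node_status_condition. The Python sketch below only illustrates the logic. The class, the function, and the default thresholds are hypothetical and should be tuned per cluster.

```python
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Point-in-time values for the three baseline signals."""
    pod_restarts_last_hour: int
    pending_pods: int
    nodes_under_memory_pressure: int


def baseline_alerts(snapshot, max_restarts_per_hour=5, max_pending_pods=0):
    """Return the names of baseline alerts that should fire for this snapshot.

    The default thresholds are placeholders chosen for the example.
    """
    alerts = []
    if snapshot.pod_restarts_last_hour > max_restarts_per_hour:
        alerts.append("pod restart rate above threshold")
    if snapshot.pending_pods > max_pending_pods:
        alerts.append("pods stuck in Pending")
    if snapshot.nodes_under_memory_pressure > 0:
        alerts.append("node memory pressure")
    return alerts


if __name__ == "__main__":
    snapshot = ClusterSnapshot(
        pod_restarts_last_hour=12,
        pending_pods=0,
        nodes_under_memory_pressure=1,
    )
    print(baseline_alerts(snapshot))
```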
Cynaris designs and operates Kubernetes infrastructure across AWS EKS, Azure AKS, and GCP GKE for engineering teams that want cloud-native without the operational overhead. Talk to our cloud engineering team about a Kubernetes readiness assessment for your organisation.