Kubernetes Production Readiness Checklist for Enterprise
Running and managing applications anywhere, on premises or in the cloud, with agility and scalability is why Kubernetes is the number one orchestrator. Its ability to self-heal nodes and applications, autoscale the infrastructure and adapt to an expanding business is a very attractive proposition for enterprises. Kubernetes is still an emerging technology, and rapid changes are taking place within its framework as well as in the supporting toolsets delivered through multiple open source projects.
The following guidelines are important when creating a robust and reliable Kubernetes Production setup for running critical applications.
- Keep security vulnerabilities and attack surfaces to a minimum for the cluster and applications. Lock down the pods and nodes, with traceable break-glass policies. Ensure that the applications you are running are secure and that the data you are storing is protected against attack. And because Kubernetes is a rapidly evolving open source project, stay on top of updates and patches so that they can be applied in a timely manner.
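As an illustration of locking down pods, the sketch below combines a default-deny NetworkPolicy with a hardened pod security context. The namespace, image and resource names are placeholders, not prescribed values:

```yaml
# Default-deny NetworkPolicy: blocks all ingress traffic to pods in the
# "prod" namespace unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
---
# Hardened pod: non-root, read-only root filesystem, no privilege escalation.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  namespace: prod
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

Starting from default-deny and selectively opening traffic keeps the attack surface explicit and reviewable.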
- Segregate the Kubernetes cluster and configure usage limits. Separate the production Kubernetes cluster from DEV/UAT ones so that rapid changes at the infrastructure and application level do not impact production workloads. This segregation can be physical or logical, and proper guardrails need to be implemented based on the setup. Because Kubernetes is mostly used as shared infrastructure, apply usage limits to running applications based on the type and criticality of workloads, to minimize the impact of an outlier. Namespace-level isolation and resource limits are common practice for this type of enforcement.
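The namespace-level enforcement above is typically expressed with a ResourceQuota (capping total consumption in the namespace) and a LimitRange (per-container defaults so nothing runs unbounded). The namespace name and figures below are illustrative:

```yaml
# Cap aggregate CPU, memory and pod count for one team's namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Default requests/limits applied to any container that does not set its own.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
```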
- Implement secure CI/CD pipelines for continuous delivery. An automated continuous deployment and delivery pipeline allows the development team to maximize velocity and improve productivity through increased release frequency and repeatable deployments. Enable GitOps with an approval workflow for traceability. Test, integrate, scan for vulnerabilities, then build and publish container artifacts to the enterprise registry. Artifacts should be tagged with the Git commit SHA to enable auditability.
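A minimal sketch of the SHA-tagging step in such a pipeline is shown below. The registry and application names are placeholders, and the build/push commands are left commented since they depend on your tooling:

```shell
#!/bin/sh
# Tag the image with the Git commit SHA so every deployed artifact
# traces back to exactly one commit in the repository.
set -eu

REGISTRY="registry.example.com/payments"   # placeholder enterprise registry
APP="order-service"                        # placeholder application name
# In a real pipeline: GIT_SHA=$(git rev-parse --short HEAD)
GIT_SHA="${GIT_SHA:-a1b2c3d}"

IMAGE="${REGISTRY}/${APP}:${GIT_SHA}"
echo "Building and pushing ${IMAGE}"
# docker build -t "${IMAGE}" .
# docker push "${IMAGE}"
```

Because the tag is the commit SHA, the running image can always be matched to the exact source revision during an audit.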
- Apply observability as a deployment catalyst. Observability is not only about being able to monitor your system; it is also about having a high-level view into your running services so that you can make decisions before you deploy. To achieve true observability you need the processes and tools in place to act on that monitoring through timely incident alerts.
- Enforce secret management. No passwords on any file system. Provide a secret/keystore service with self-service provisioning and updates for infrastructure and applications. Allow privileged access only under a break-glass policy that includes approval and auditability of actions.
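Within the cluster itself, one common pattern is to deliver credentials as a Kubernetes Secret mounted as an in-memory (tmpfs) volume, rather than baking them into the image or a config file on disk. The names and values below are placeholders; in practice the secret material would be injected by a vault or the pipeline, never committed:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: prod
type: Opaque
stringData:
  username: app_user
  password: change-me   # placeholder; inject from a vault, never commit
---
# The pod mounts the secret read-only; kubelet backs the mount with tmpfs,
# so the credentials never land on the node's disk.
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: prod
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      volumeMounts:
        - name: db-creds
          mountPath: /var/run/secrets/db
          readOnly: true
  volumes:
    - name: db-creds
      secret:
        secretName: db-credentials
```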
- Manage storage for infrastructure and applications. Store the persistent state of your application and critical infrastructure artifacts beyond the pod's and node's lifetime. Use the recommended Persistent Volume Claim (PVC) settings per your service provider's documentation.
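A PVC requesting durable storage might look like the sketch below; the storage class name is provider-specific and shown here only as a placeholder. A pod then references the claim via a `persistentVolumeClaim` volume, so the data survives pod restarts and rescheduling:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce          # single-node read/write; pick per workload
  storageClassName: standard # placeholder; use your provider's recommended class
  resources:
    requests:
      storage: 20Gi
```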
- Set up monitoring and alerting. Build a monitoring and alerting pipeline (with open source tools such as Prometheus and Grafana) and integrate it with other enterprise toolsets. Identify the list of infrastructure and application metrics to be collected and alerted on.
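As one example of such an alert, the Prometheus rule below fires when a pod restarts repeatedly; it uses the standard `kube_pod_container_status_restarts_total` counter exported by kube-state-metrics, with thresholds chosen for illustration:

```yaml
groups:
  - name: kubernetes-apps
    rules:
      - alert: PodCrashLooping
        # Restart counter increasing over a 15-minute window indicates a crash loop.
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```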
- Provide a framework for infrastructure and application logging and analysis. Make sure operations and application teams do not need to log in to Kubernetes to gather and analyze logs. Both infrastructure and application logs need to be stored in a centralized logging framework with indexing and RBAC for analysis, alerting and archival. Set up log rotation at the application level to reduce storage growth and avoid performance issues.
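A typical way to centralize logs is a node-level agent such as Fluent Bit tailing container logs and shipping them to an indexed store. The fragment below is a sketch; the Elasticsearch host is a placeholder for whatever centralized backend your enterprise runs:

```
[INPUT]
    Name   tail
    Path   /var/log/containers/*.log
    Tag    kube.*

[FILTER]
    # Enrich records with pod, namespace and label metadata from the API server.
    Name   kubernetes
    Match  kube.*

[OUTPUT]
    Name   es
    Match  *
    Host   elasticsearch.logging.svc   # placeholder central log store
    Port   9200
```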
- Set up an Ingress controller and/or API gateway. Create a common routing point for all inbound traffic. This also enables common, centralized tooling for tracing, logging and authentication.
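With an Ingress controller installed (NGINX is assumed here purely as an example), a single resource like the following routes all inbound traffic for a host through that common entry point. Hostnames and service names are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: prod
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"  # force HTTPS at the edge
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 80
```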
- Adopt Kubernetes best practices for resiliency and create an HA/DR plan. Maximize resiliency by adopting rolling and/or blue-green deployment models. Ensure that you have a high availability and disaster recovery plan in place, meaning that if a node, cluster or site fails, you can recover with full automation and little or no downtime. For production workloads, run multiple master and etcd nodes, and set up an etcd backup and recovery strategy at a remote site.
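The rolling-deployment and availability points above can be sketched as a Deployment strategy plus a PodDisruptionBudget; replica counts, names and images are placeholders:

```yaml
# Rolling update: add one new pod before removing an old one, so serving
# capacity never drops below the configured replica count during a deploy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: prod
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: app
          image: registry.example.com/order-service:1.0.0
---
# Keep at least 3 replicas up during voluntary disruptions (e.g. node drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: prod
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: order-service
```

For the etcd side of the DR plan, snapshots can be taken with `etcdctl snapshot save` on a schedule and shipped to the remote site.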
About The Author
Head of Product, Cloud Control
Sanjeev has more than 20 years of experience in design, development and architecture of FinTech solutions at BNY Mellon and State Street. He is passionate about automating and reducing the challenges of overall IT implementation. He is a firm believer in IT becoming a utility with major Cloud vendors like AWS, Azure and Google providing the backbone with simple/standardized interfaces for secure and faster application development at reduced cost and complexity.