Kubernetes Production Readiness Checklist

Kubernetes is the leading container orchestrator because it lets teams run and manage applications anywhere, on-premises or in the cloud, with agility and scalability. Its ability to self-heal nodes and applications, autoscale the infrastructure, and adapt to an expanding business is a very attractive proposition for enterprises. Kubernetes is still evolving as a technology, and rapid changes are taking place both within its framework and in the supporting toolsets developed through multiple open source projects.
Use the checklist below to create or validate a robust and reliable production Kubernetes setup for running critical applications.

  1. Availability
    1. Configure liveness and readiness probes
    2. Set up replicated master nodes in odd numbers, with a minimum of 3
    3. Set up isolated etcd replicas on dedicated nodes
    4. Schedule regular etcd backup
    5. Distribute master nodes across zones
    6. Distribute worker nodes across zones
    7. Configure Autoscaling for both master and worker nodes
    8. Configure active-passive setup for scheduler and controller manager
    9. Configure the correct number of pod replicas for high availability
    10. Set up an Ingress controller and/or API gateway
  2. Resource Management 
    1. Segregate the production Kubernetes cluster from DEV/UAT (physically or logically) and configure usage limits
    2. Configure resource requests and limits for containers
    3. Create separate namespaces for your business units and teams
    4. Configure default resource requests and limits for namespaces
    5. Attach labels to Kubernetes objects
    6. Limit the number of pods that can run on a node
    7. Reserve compute resources for system daemons
    8. Configure out of resource handling
  3. Storage Management
    1. Use Cloud provider recommended settings for Persistent Volumes
    2. Include Persistent Volume Claims in the config and never reference Persistent Volumes directly
    3. Create a default storage class
    4. Give the user the option of providing a storage class name
    5. Enable log rotation
  4. Security
    1. Use the latest Kubernetes stable GA version
    2. Enable RBAC (Role-Based Access Control)
    3. Follow user access best practices
    4. Enable audit logging
    5. Set up a bastion host for controlled access
    6. Choose a network plugin and configure network policies
    7. Enable data encryption at rest
    8. Disable default service account
    9. Scan containers for security vulnerabilities
    10. Configure security context for pods, containers and volumes
    11. Enable Kubernetes logging
    12. Lock down the pods and nodes, with traceable break-glass policies
    13. Provide secret/keystore with self-service provisioning & updates for infrastructure and applications
  5. Scalability
    1. Configure the Horizontal Pod Autoscaler for deployed pods and ReplicaSets
    2. Configure the Vertical Pod Autoscaler
    3. Configure the Cluster Autoscaler
  6. Monitoring, Alerting, Logging & Analysis
    1. Set up a monitoring pipeline for Kubernetes infrastructure and deployed pods
    2. Select a list of metrics to monitor
    3. Integrate with other enterprise toolsets, if any
    4. Set up alerting and self-healing based on thresholds
    5. Store both infrastructure and application logs in a centralized logging framework with indexing and RBAC
    6. Set up alerting, report/summary generation, and archival based on the collected logs
    7. Set up log rotation at the application level to reduce storage growth and avoid performance issues
  7. CI/CD
    1. Implement Secure CI/CD pipelines for Continuous Delivery
    2. Enable GitOps with approval workflow to have traceability
    3. Test, integrate and scan for vulnerabilities
    4. Build and deposit container artifacts to the Enterprise registry
    5. Tag the artifacts with Git commit SHA to enable auditability
    6. Adopt rolling and/or blue-green deployment models to avoid downtime
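
Several items in the checklist above (liveness and readiness probes, pod replica counts, container resource requests and limits, labels on objects) can be verified mechanically before a manifest reaches the cluster. The following is a minimal sketch in Python and is not part of the original checklist. The function name `check_deployment`, the wording of the findings, and the two-replica threshold are illustrative assumptions. It inspects a Deployment manifest that has already been loaded into a dictionary and reads only standard Deployment fields.

```python
from typing import Any, Dict, List

# Assumption for illustration: treat fewer than 2 replicas as "not highly available".
MIN_REPLICAS = 2


def check_deployment(manifest: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for a Deployment-shaped dictionary.

    An empty list means none of the sketched checks failed.
    """
    findings: List[str] = []

    # Resource Management: attach labels to Kubernetes objects.
    if not manifest.get("metadata", {}).get("labels"):
        findings.append("metadata: no labels attached")

    spec = manifest.get("spec", {})

    # Availability: spec.replicas defaults to 1 when omitted.
    if spec.get("replicas", 1) < MIN_REPLICAS:
        findings.append(f"replicas: fewer than {MIN_REPLICAS} pod replicas")

    pod_spec = spec.get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Availability: liveness and readiness probes.
        for probe in ("livenessProbe", "readinessProbe"):
            if probe not in container:
                findings.append(f"{name}: missing {probe}")

        # Resource Management: resource requests and limits.
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            if not resources.get(kind):
                findings.append(f"{name}: missing resource {kind}")

    return findings


if __name__ == "__main__":
    # Hypothetical manifest used only to demonstrate the checks.
    example = {
        "kind": "Deployment",
        "metadata": {"name": "web"},
        "spec": {
            "replicas": 1,
            "template": {
                "spec": {"containers": [{"name": "web", "image": "example/web:1.0"}]}
            },
        },
    }
    for finding in check_deployment(example):
        print(finding)
```

A check like this could run as an early CI/CD pipeline step so that manifests missing probes or resource limits are rejected before deployment. It covers only a handful of checklist items and does not replace cluster-side policy enforcement.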

About The Author

Sanjeev Kumar

Head of Product, Cloud Control

Sanjeev has more than 20 years of experience in design, development and architecture of FinTech solutions at BNY Mellon and State Street. He is passionate about automating and reducing the challenges of overall IT implementation. He is a firm believer in IT becoming a utility with major Cloud vendors like AWS, Azure and Google providing the backbone with simple/standardized interfaces for secure and faster application development at reduced cost and complexity.