SITE RELIABILITY ENGINEERING (SRE)
DEVELOPMENT & OPERATIONS
Designing, building, and deploying large, complex software systems to production environments is challenging. Running and maintaining those live production systems is equally important, and equally hard.
Traditionally, companies employed system administrators to operate large, complex computing systems and respond to events in them. However, the skills required of these system admins (operations teams) differed from those needed by software developers, leading to the creation of separate development (DEV) and operations (OPS) teams.
THE DEV-OPS CONFLICT!
Dividing teams into DEV and OPS has several pitfalls and disadvantages. The first is higher operations cost, both direct and indirect. Maintaining a separate operations team, and scaling it up as load and events increase, raises direct costs. Indirect costs arise mostly from the split itself: the teams diverge in skill sets, interests, objectives, and attitudes toward risk.
Development teams want to push new features and releases to production as soon as possible, whereas operations teams want to keep their systems stable, without service disruptions or outages, by keeping changes (and therefore risk) to a minimum. This naturally leads to a structural conflict between the two teams over the pace of innovation versus service stability.
WHAT IS SITE RELIABILITY ENGINEERING (SRE)?
A new approach to run production systems!
Google introduced another approach: it brought in software engineers to automate the operational processes previously performed manually by system admins running its products and services. According to Google, SRE is what happens when you ask a software engineer to design an operations team.
Knowledge of Unix system internals, networking, and an aptitude for developing complex software systems are the key skill sets required for any SRE. Thus, we can consider SRE as a specific implementation of DevOps with some extensions.
SREs, being software engineers, possess the skills and knowledge to design and implement automation for processes that a traditional operations team performs manually. Traditional operations teams must scale up linearly (by hiring more people) as the load and size of the services grow, which adds to overall project costs. SREs, by contrast, use continuous engineering and automation to keep team size independent of the size of the services provided. SREs usually spend 50% of their time on engineering/development and 50% on the operational side of running services.
SRE PROS & CONS
Advantages of SRE Approach
- SREs promote innovation and faster change.
- They have smaller team sizes and costs compared to traditional ops teams.
- They help reduce the Dev/Ops split.
Disadvantages of SRE Approach
- SRE skill sets are not easily available.
- SRE engineers are more costly to hire.
- There’s not enough industry information on SRE management.
- SREs require strong management support to be successful.
RESPONSIBILITIES OF SRE
SRE teams interact with the environments, development teams, testing teams, users, etc., to understand the work practices and business requirements, while focusing on engineering the changes. An SRE is responsible for the following with respect to each of the services running in production:
- Monitoring
One of the critical responsibilities of any operations team member is monitoring. SREs monitor the system 24/7 to keep track of its health and availability. In a traditional environment, email alerts are generated and reviewed by an operations team member, who then takes the necessary actions. In the SRE world, software interprets the alerts and tries to resolve them by itself, notifying the SREs only when human intervention is required. Based on severity, there are three types of monitoring outputs:
Alerts: High severity where human intervention and action are required immediately.
Tickets: Medium severity where a ticket is created for the operations team to take action, but not necessarily immediate action.
Logs: Low severity. In such cases, the information is recorded for audit and forensic purposes and to be used whenever required.
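The three severity tiers above can be sketched as a simple routing rule. This is only an illustration of the idea; the event fields, severity names, and thresholds are assumptions, not any real monitoring API.

```python
# Illustrative sketch: routing monitoring events by severity into the three
# output types described above. Names and values are hypothetical.

def route_event(severity: str) -> str:
    """Map an event's severity to a monitoring output type."""
    if severity == "high":
        return "alert"    # page a human: immediate intervention required
    if severity == "medium":
        return "ticket"   # queue for the team: action needed, not immediately
    return "log"          # record only, for audit/forensic use

print(route_event("high"))    # -> alert
print(route_event("medium"))  # -> ticket
print(route_event("low"))     # -> log
```

In practice, the routing decision would be driven by alerting rules in the monitoring system rather than a hand-written function, but the three-way split is the same.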
- Emergency Response
Failures and system emergencies can happen at any time. What matters is how fast the response team can bring the system back to its normal state; in a manual setup, this recovery takes longer.
Mean Time To Repair (MTTR) measures how effective the emergency response is. Automation helps increase system availability by bringing down the MTTR by a factor of three or more.
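As a quick illustration of the metric itself, MTTR is just the average time from detecting an incident to restoring service. The incident data below is made up for the example.

```python
# Hypothetical incident records as (detected_at, restored_at) timestamps
# in minutes. The data is illustrative only.
incidents = [(10, 40), (120, 135), (300, 320)]

# MTTR = total repair time / number of incidents
mttr = sum(end - start for start, end in incidents) / len(incidents)
print(f"MTTR: {mttr:.1f} minutes")  # -> MTTR: 21.7 minutes
```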
- Change Management
One of the critical responsibilities of an SRE is applying changes to the system without causing downtime. Around 70% of system failures and outages occur while changes are being made to the live system. SRE employs automation and best practices such as progressive rollouts, mechanisms to detect issues quickly, and safe rollback of changes if any problem occurs. Automation increases both the safety and the velocity of change management.
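A progressive rollout with automatic rollback might look like the sketch below. The stage sizes and health-check callback are assumptions for illustration; real systems use deployment tooling rather than a loop like this.

```python
# Sketch of a progressive (canary) rollout with automatic rollback.
# Stage fractions and the health check are hypothetical.

def progressive_rollout(stages, is_healthy):
    """Roll a change out stage by stage; roll back on the first failure.

    stages: fractions of the fleet to cover at each step, e.g. [0.01, 0.1, 0.5, 1.0]
    is_healthy: callable returning True if the service looks healthy at that coverage
    """
    deployed = 0.0
    for fraction in stages:
        deployed = fraction
        if not is_healthy(fraction):
            return ("rolled_back", 0.0)  # revert the change safely
    return ("completed", deployed)

# A healthy rollout reaches the full fleet:
print(progressive_rollout([0.01, 0.1, 0.5, 1.0], lambda f: True))
# A problem detected at 50% coverage triggers a rollback:
print(progressive_rollout([0.01, 0.1, 0.5, 1.0], lambda f: f < 0.5))
```

The key property is that a defect is caught while it affects only a small fraction of users, which is what makes fast, frequent change safe.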
- Ensure Availability, Latency, Performance, and Efficiency
SREs will monitor and modify the services or provision more capacity to meet the expected loads and maintain required performance and efficiency levels for systems. The efficient use of resources will also reduce the overall costs incurred.
- Capacity Planning
An SRE constantly monitors the system's resources and performance to identify expected future demands and, using this data, plans the system's capacity accordingly. The SRE ensures sufficient capacity and redundancy to meet such demands (organic and inorganic) and run the services with the expected efficiency and availability. Load tests can also be used to determine the available capacity and correlate it with the required capacity.
Based on change management and capacity planning, SREs provision resources when necessary. Because increasing capacity is expensive, it should be done only when needed. During such changes, the SRE validates them, ensuring they deliver the correct results and the expected performance for the services.
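A back-of-the-envelope version of this sizing calculation is shown below: given an expected peak load and the per-instance capacity measured in load tests, compute the fleet size with spares for failures and upgrades. All numbers are illustrative assumptions.

```python
# Hypothetical capacity plan: size a fleet for peak load with N+2 redundancy.
import math

def required_instances(peak_qps: float, qps_per_instance: float,
                       redundancy: int = 2) -> int:
    """Instances needed to serve peak load, plus spares for failures/upgrades."""
    return math.ceil(peak_qps / qps_per_instance) + redundancy

# E.g. 12,000 requests/second peak, 500 requests/second per instance:
print(required_instances(peak_qps=12_000, qps_per_instance=500))  # -> 26
```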
HOW CAN YOU MEASURE SRE SERVICES?
It is critical to measure the level of service against defined indicators and to take action when those service levels are not met. These metrics and measures also help SREs ascertain that a service is healthy.
- Service Level Indicators (SLI)
An SLI is a quantitative measure of some aspect of the service provided. Examples include request latency (1 second), error rate (0.001%), throughput (N transactions or requests per second), and availability (99.999%).
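Two of these SLIs can be computed directly from request data, as sketched below. The log format and values are assumptions made up for the example.

```python
# Computing example SLIs (availability, error rate, average latency) from a
# hypothetical request log. The record fields are illustrative only.
requests = [
    {"latency_ms": 120, "ok": True},
    {"latency_ms": 340, "ok": True},
    {"latency_ms": 90,  "ok": False},
    {"latency_ms": 45,  "ok": True},
]

total = len(requests)
availability = sum(r["ok"] for r in requests) / total * 100  # % successful
error_rate = 100 - availability
avg_latency = sum(r["latency_ms"] for r in requests) / total

print(f"availability={availability:.2f}%  "
      f"error_rate={error_rate:.2f}%  avg_latency={avg_latency:.2f}ms")
```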
- Service Level Objective (SLO)
An SLO assigns a target value or range of values to an SLI. For example, the latency value (SLI) for API requests can be set to at most 1 second (SLO). Similarly, we can set a specific value or range for each of the Service Level Indicators defined for a particular service. Defining and publishing SLOs sets the right expectations for users as well as SREs and can also reduce complaints. The SRE's function is to meet these SLOs consistently.
- Service Level Agreements (SLA)
Service Level Agreements are contracts between the parties that clearly state the SLOs and the implications of meeting or missing them. The consequence of meeting or missing an SLO is usually financial, such as a rebate or penalty for the party offering SRE as a service.
SREs take the following steps to measure, monitor, and act (if required) to meet the SLAs:
- Monitor and measure the SLIs defined.
- Compare the SLIs measured with the SLOs defined.
- Identify if any action is required.
- If yes, take necessary actions to meet the SLOs.
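The four steps above amount to a single check-and-act loop, sketched below. The SLO targets and measured SLI values are illustrative assumptions.

```python
# Sketch of the measure -> compare -> identify -> act loop described above.
# Target and measured values are hypothetical.

slos = {"availability_pct": 99.9, "p99_latency_s": 1.0}  # defined SLO targets

def check_slos(slis: dict, slos: dict) -> list:
    """Compare measured SLIs against SLOs; return those needing action."""
    violations = []
    if slis["availability_pct"] < slos["availability_pct"]:
        violations.append("availability")
    if slis["p99_latency_s"] > slos["p99_latency_s"]:
        violations.append("latency")
    return violations

measured = {"availability_pct": 99.95, "p99_latency_s": 1.4}  # step 1: measure
print(check_slos(measured, slos))  # steps 2-3: compare and identify
# Step 4 would trigger remediation, e.g. provisioning capacity or rolling back
# a recent change, for each SLO found in violation.
```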
SLIs and SLOs are the two key factors used to measure SRE service levels.
SRE is a relatively new discipline that is now gaining traction. It applies software engineering practices to operations processes and brings many advantages to running and managing large, complex production systems. To a great extent, it bridges the gap and reduces the split between developers and the operations team.
About the Author
Dr. Anil Kumar
VP Engineering, Cloud Control
Founder | Architect | Consultant | Mentor | Advisor | Faculty
A Solution Architect and IT Consultant with more than 25 years of IT experience, he has served in various roles with national and international institutions and has expertise in both legacy and modern technology stacks and business domains.