We often see operations and development teams collaborate to achieve service or product delivery goals aligned with business objectives. Yet they often have different priorities, with the development team focused on faster deployment of code and the operations team focused on effective service management or incident response.
To make production operations management more efficient and agile, the concept of combining DevOps and Site Reliability Engineering (SRE) has gained traction across many industries and organizations. DevOps focuses on rolling out new features or incorporating feature enhancements while encouraging tighter collaboration between IT operations and software engineering teams. SRE applies software engineering principles to technology operations for improving resiliency, security, performance, and system reliability. Site Reliability Engineers are precious and expensive commodities; hence, they should be utilized on the most critical services, where there is a high level of Agile/DevOps maturity. Many industries, especially financial services, are highly regulated. Establishing an SRE team requires a strong risk management framework that clearly defines role-based access to satisfy a requirement for segregation of duties for global regulators.
We have come a long way from legacy environments that operated in silos to today’s integrated DevOps environments that encourage cross-team collaboration and continuous feedback. Maintaining the reliability, scalability, efficiency, and performance of services has become even more important, and SRE can influence executive or business-level decisions such as whether to build features or run services. SRE culture treats a disruption of service as an opportunity to strengthen the system: a speed bump, not a roadblock. The SRE team learns from incidents or outages related to service, reliability, or performance. Combining automation with SRE principles helps minimize the number of unplanned incidents or production breaks and shortens recovery time.
Incorporating SRE culture and adopting SRE methods will help organizations realize several benefits, including improved customer experience, increased revenue, cross-collaboration between team members, reduced operational toil, self-learning, data-driven insights, and improved service reliability.
Among the leading practices for SRE, the top seven are:
- Secure Executive Sponsorship/Buy-in: Executive sponsorship or buy-in for SRE is necessary to ensure maximum service availability and minimum system downtime, to set aside error budgets for setbacks, and to prioritize among reliability, business objectives, and customer experience during new feature rollouts. An error budget allows for a certain amount of downtime; once this budget is exhausted, the SRE team focuses purely on reliability. Senior management plays a crucial role in addressing any barriers related to SRE, while also supporting investment decisions related to improving reliability or adding new features.
- Focus on User Experience: SRE’s primary goal is to improve service reliability, manage scalability and automate operational toil. Assessing user adoption of services allows the SRE team to evaluate the benefits of keeping service(s) reliable. Additionally, incorporating customer feedback on reliable services or system performance allows an organization to gauge customer impact. Service Level Indicators (SLIs) are metrics that enable organizations to measure service behavior for making better decisions by evaluating, monitoring, and aligning service health with user needs or business objectives.
- Ensure Business Alignment: SRE culture encourages alignment with business objectives and monitoring by employing service level agreements and metrics. To understand performance and availability errors, the SRE team monitors systems on an ongoing basis to identify root causes and learn how to resolve them. SRE emphasizes lowering the risk of incident recurrence through continuous monitoring and alerting. To monitor and measure the overall health of a system, Google has defined four golden signals: latency, traffic, errors, and saturation.
  SRE uses three primary factors to determine whether the health of a system is aligned with stakeholder and customer expectations:
  - Service level indicators (SLIs) are aggregated over time to determine the threshold for reliability.
  - Service level objectives (SLOs) ensure that the service reliability expectations align with the expectations of product owners and key stakeholders.
  - Service level agreements (SLAs) determine whether reliability expectations are in line with customer agreements and how to proceed if they are not.
- Leverage Automation and Self-Healing: Automation is a fundamental component of SRE, as it increases reliability, robustness, and precision by automating manual, repetitive, and redundant operational tasks. SRE leverages the self-healing process to identify common error scenarios early and uses automation for simple recovery steps. This reduces the mean time to repair and the cost of fixing errors, encouraging rapid incident response. SRE uses intelligent, data-driven techniques to make observations and improve the process.
- Establish a Collaborative Command Center: SRE culture promotes effective collaboration and communication between team members to encourage transparency, support cross-collaboration and continuous learning, eliminate silos, and encourage pragmatic thinking while also reducing downtime. SRE coaches can help guide the team to strategically assess the bigger picture. An SRE team is cross-disciplinary, as it focuses on both the customer experience and systems reliability, balancing them with business benefits. It requires a combination of development and operational skills.
- Standardize Tools/Process: SRE encourages standardization of tools and processes for greater scalability, reliability, and robust performance. Standardization is a key aspect of SRE culture, since maintaining service reliability within an organization requires specialized skill sets. SRE is a unique role that requires the skills of a software developer with additional operations experience, such as deployment, configuration, monitoring, latency, change management, emergency response, and capacity management of production environments. Additionally, to enhance and automate operational tasks more efficiently, an SRE team should have technical systems know-how.
- Embrace Blameless Postmortems: The SRE team works with cross-functional teams such as product, business, quality, enterprise infrastructure, accounting, and payroll to investigate an incident’s root cause and resolve it logically. This fosters teamwork and encourages the teams to resolve the issue together rather than putting them at odds. SRE accepts that failures are unavoidable while encouraging the team to learn from its mistakes and avoid recurrence of the incident. Although SRE principles push for the highest practical service reliability or uptime through automation, they do not demand 100 percent perfection; they support incremental changes that enhance reliability and increase automation.
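
The relationship between SLIs, SLOs, and error budgets described in the practices above comes down to simple arithmetic. The following is a minimal sketch, not a prescribed implementation: the `ErrorBudget` class, its field names, the request-based success ratio used as the SLI, and the 99.9 percent target are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # assumed objective, e.g. 0.999 = 99.9% of requests succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that did not meet the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio over the window (the SLI)."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Fraction of the budget still unspent (negative if overspent)."""
        if self.allowed_failures <= 0:
            # No budget at all: unspent only while nothing has failed.
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1 - self.failed_requests / self.allowed_failures

    def freeze_feature_work(self) -> bool:
        """True once the budget is exhausted and reliability takes priority."""
        return self.remaining <= 0
```

For example, with a 99.9 percent target over one million requests, roughly 1,000 failures are allowed. After 250 failures about three quarters of the budget remains, so feature work can continue; after 1,200 failures the budget is overspent and the team would shift its focus to reliability.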
Applying SRE
SRE is not a “one size fits all” approach, but it helps an organization improve its service reliability through automation, reduced downtime, and improved scalability, efficiency, and performance. Organizations that adopt SRE principles are more likely to benefit from automating manual or redundant tasks, self-healing processes, improved customer experience, a stronger team, and, ultimately, more reliable systems.
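
To make the self-healing idea concrete, here is a minimal sketch of an automated check-and-recover loop. The function name `self_heal`, the `check` and `recover` callables, and the three-attempt limit are hypothetical choices for illustration; a real system would plug in its own health probes and recovery actions.

```python
from typing import Callable


def self_heal(
    check: Callable[[], bool],
    recover: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Try simple automated recovery before involving a human.

    Returns True if the service is healthy (possibly after recovery),
    False if it is still unhealthy once all attempts are used up.
    """
    for _ in range(max_attempts):
        if check():
            return True   # healthy, nothing more to do
        recover()         # simple recovery step, e.g. restart a process
    # Final probe after the last recovery attempt.
    return check()
```

A `False` result is the signal to escalate to the on-call engineer, which keeps automation limited to the simple recovery steps and leaves harder failures to people.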
Joseph Etame and Elise Berk also contributed to this post. To learn more about Protiviti’s SRE and Technology Consulting capabilities, contact us.