Not long ago, automation in IT was a fairly new concept. Most IT operations, particularly the administrative jobs, were carried out manually. Automation started gaining momentum in the middle of the last decade and eventually led to the idea of continuous everything. This improved collaboration between dev and ops, which was aimed at continuous integration and continuous deployment backed by agile practices. So, to achieve seamless product deployment and delivery, DevOps gradually became an integral part of every IT organization.
While DevOps tries to close the gap between dev and ops, it is important to understand that the main motives of each team are still the same. The dev team focuses on creating value in the product and pushing changes to production faster. The ops team focuses on the stability of the environments through better management of software and hardware. Dev wants faster releases while ops wants stable production, and this is where Site Reliability Engineering (SRE) came into the picture. The term was coined by Benjamin Treynor Sloss, who founded Google’s Site Reliability team. According to Google, the SRE team is not only responsible for the stability of the production environment but is also committed to new application features and operational improvement at the same time. At first, SRE teams were formed by trying out different ratios of dev and ops members. While the exact proportion of dev and ops in an SRE team is open to debate, I believe it should vary based on the project needs.
DevOps and SRE dominate the software development world, but they tend to confuse people because they overlap in some respects. Both focus on automation and monitoring to reduce the time between a developer committing a change and that change being deployed to production. According to Google, SRE and DevOps are not so different from each other. Just like DevOps, SRE is about bringing dev and ops teams together to increase the speed of delivery while also increasing visibility into the entire application life cycle. If you see DevOps as a philosophy, then SRE is a way of accomplishing that philosophy. They are not two methods competing against each other. Rather, one is a cultural change while the other is a practice that complements that change. You can look at them as teams that work toward breaking down organizational barriers to deliver superior software faster.
So, what exactly is SRE?
To start with its composition, the SRE team is also made up of dev and ops. However, in SRE the ops team has more of a coding core. Naturally, they try to codify all aspects of operations. Once the codified monitoring measures are in place, a consolidated method is developed to calculate reliability at every single stage. SREs are responsible for measuring the SLIs and SLOs. SLI stands for Service Level Indicator, a quantitative measure of some aspect of the level of service being provided. SLO stands for Service Level Objective, the target value or range that an SLI is expected to meet. In simple terms, SRE is a software engineering approach to IT operations that takes on the tasks traditionally managed by the operations teams in the organization. The manual tasks carried out by operations teams are assigned to engineers who explicitly use software and automation to resolve issues. These engineers are also responsible for managing production systems. The SRE teams in any organization use software as a tool to maximize their problem-solving ability, manage their systems, and automate operational tasks.
More reliability means more immunity against failures. However, SRE is more about accepting these failures. By accepting failures and measuring everything in the software development life cycle, I think SRE makes organizations focus more on the reliability aspect. SREs measure SLIs and SLOs, whereas DevOps measures failure and success rates with the help of various tools and methods. Reliability is related not only to the infrastructure but also to the quality, performance, and security of your application. To measure this reliability, having reliable data is very important. Your reliable data can consist of the implementation stack and bytecode; the total variable state covered across the full source code; the JVM state, which comprises threads and environment variables; applicable log statements at DEBUG and TRACE level in production; and analytics of each event in terms of frequency, failure rate, deployment, and application. You can also use methods such as setting up alerts for various scenarios, peer code review, and unit tests to make your data reliable as well as actionable.
Let’s see how DevOps and SRE complement or differ from each other, and how SRE can work within the DevOps paradigm.
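To make the SLI and SLO terms concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (successful requests divided by total requests) and a 99.9% target; the `RequestCounts` type, the function names, and the numbers are illustrative assumptions of mine, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical counters, e.g. collected from a load balancer."""
    total: int
    successful: int


def availability_sli(counts: RequestCounts) -> float:
    """SLI: the measured fraction of requests that were served successfully."""
    if counts.total == 0:
        # Assumption: with no traffic there is nothing to count against the service.
        return 1.0
    return counts.successful / counts.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the target value the measured SLI is compared against."""
    return sli >= slo_target


# Illustrative numbers only.
counts = RequestCounts(total=100_000, successful=99_950)
sli = availability_sli(counts)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, slo_target=0.999)}")
```

The point of the split is that the SLI is what you measure and the SLO is what you promise yourself; a real service would likely track several SLIs (latency, error rate, and so on), each with its own objective.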
DevOps and SRE
DevOps is about “what” needs to be done; SRE is about “how” to do it. The similarities and differences between DevOps and SRE can be explained in terms of the five main pillars of DevOps:
Minimized organizational silos
Large organizations mostly have teams working in silos, with each team responsible for a different aspect of the product and with little or no communication between them. This can lead to problems in deployment and frustration amongst teams.
DevOps changes this culture of working in silos and unites these teams into one group. This brings the teams onto the same page and helps them align with the overall picture and vision behind the application. In the case of SRE, developers are asked to share the ownership of production. SRE makes sure that everyone across the organization uses the same tools and techniques, so that teams communicate effectively and share the vision behind the product.
So, in short, DevOps has the Dev and Ops teams work together, whereas SRE asks Dev teams to share the responsibilities. Effective communication in SRE helps break down silos and build a shared vision, which is the crux of DevOps. So, you can see they are not different.
Accept failure as normal
DevOps believes in sharing the responsibility for failure amongst everyone involved in the process. Rather than passing on the responsibility for these failures, DevOps believes in embracing them and minimizing their effects. SRE, on the other hand, believes in making gradual changes within permissible limits, which may introduce failures that also stay within those limits. SRE calls these limits the error budget.
SREs follow a formula to balance accidents and failures against new releases. The formula is based on two major aspects: Service Level Indicators (SLIs) and Service Level Objectives (SLOs). For every service, the SRE team has a Service Level Agreement (SLA) that defines the reliability of the system for end users. If the SRE team agrees on a 99.9% SLA, that gives them an error budget of 0.1%. SRE engineers are responsible for verifying the code quality of changes to the application, and they communicate with the development team to provide evidence in the form of automated test results. The SRE team can set SLOs to evaluate the performance of changes made to the application, and in doing so sets an error budget: the maximum allowable threshold for errors and outages. This helps SRE engineers verify code quality after changes or upgrades to the application.
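The arithmetic behind the error budget is simple enough to sketch. Taking the 99.9% figure above as the target, the budget is 0.1%, which over a 30-day window (43,200 minutes) works out to about 43.2 minutes of full outage. The function names, the 30-day window, and the release rule below are my own assumptions for illustration; teams choose their own windows and policies.

```python
def error_budget(target: float) -> float:
    """Fraction of unreliability the target permits, i.e. 1 - target."""
    return 1.0 - target


def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of full outage over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget(target) * window_minutes


def budget_remaining(target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means it is overspent."""
    return allowed_downtime_minutes(target, window_days) - downtime_minutes


def can_release(target: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Hypothetical policy: ship new changes only while budget remains."""
    return budget_remaining(target, downtime_minutes, window_days) > 0


# Illustrative numbers: 99.9% target, 20 minutes of downtime so far.
print(f"Allowed downtime: {allowed_downtime_minutes(0.999):.1f} min")
print(f"Remaining budget: {budget_remaining(0.999, 20):.1f} min")
print(f"OK to release:    {can_release(0.999, 20)}")
```

A rule like `can_release` is one way the budget balances failures against new releases: while budget remains, changes keep flowing, and once it is spent, the focus shifts to stability.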
Execute measured changes
With increasing competition, everybody is focused on making products better and better. This means frequent upgrades to the product. DevOps, with its continuous everything practices, makes this possible. SRE, for its part, believes in frequent but small, measured changes that won’t affect reliability. It won’t be wrong to say that both DevOps and SRE want to move faster. However, SRE makes a point of reducing the cost of failure.
Make the most of tooling and automation
One of the central points for both DevOps and SRE is automation. They both make the most of automation and tools. SRE involves all the teams that work on the same application or service in selecting the same technology solutions for efficient functioning. DevOps is all about breaking the silo culture. However, different versions, vendors, tools, and technologies may introduce a little isolation into the process. SRE makes sure that all the stakeholders use the same tools and platforms, and the automation part makes sure the whole idea of DevOps is strengthened by SRE.
Measure everything
A fast, automated workflow requires continuous monitoring. It is important to make sure your DevOps and SRE practices are taking your development processes in the right direction. For this, both DevOps and SRE need to be evaluated. DevOps believes in measuring performance and results, whereas SRE believes in evaluation against previously agreed-upon SLOs. SRE largely treats operations as a software problem and adopts prescriptive ways of computing availability, uptime, outages, etc.
People in ‘system administration’ or ‘system engineering’ roles who have software development skills can move into ‘site reliability’ easily, and development teams with ops knowledge can just as easily be molded into the SRE role. The people who take up this role are responsible for judging performance all the way from the infrastructure to the end-user experience. It may seem that SRE teams are necessary only for large enterprises, but in my opinion, the organization size has nothing to do with it. Leading organizations like Dropbox, Netflix, and GitHub are already embracing it. It is time for all of us to welcome SRE the way we are welcoming DevOps to achieve the common goal of speed and reliability.
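As one example of what a prescriptive way of computing availability might look like, the sketch below derives time-based availability from a list of recorded outages and compares it with an agreed SLO. The function name, the outage records, and the assumption that outages do not overlap are all mine; a real setup would pull these intervals from an incident tracker or monitoring system.

```python
from datetime import datetime


def availability(window_start: datetime, window_end: datetime,
                 outages: list[tuple[datetime, datetime]]) -> float:
    """Fraction of the window during which the service was up.

    Assumes the outage intervals do not overlap each other.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")

    downtime_seconds = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            downtime_seconds += (clipped_end - clipped_start).total_seconds()

    return (window_seconds - downtime_seconds) / window_seconds


# Illustrative data: a 30-day window with a single 30-minute outage.
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 31)
outages = [(datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 12, 30))]

measured = availability(start, end, outages)
print(f"Availability: {measured:.4%}, meets 99.9% SLO: {measured >= 0.999}")
```

Agreeing on the calculation up front matters as much as the number itself, since it removes any argument later about whether the objective was met.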