DevOps and SRE are the latest trends in the software world, but do you know how to implement the most successful strategy for your needs? If not, don’t worry, that’s what we’re here for.
In today’s world, it seems that almost every company is shifting to a software-driven mentality. This shift comes with a lot of changes and challenges, the main one being the need to accelerate and automate the write-test-deploy cycle – without breaking anything along the way.
DevOps and SRE teams are the guards at the deployment door, connecting developers with operations and making sure everything is running as fast and as stable as possible. And as such, these teams are becoming a critical part of just about every software company today.
Where Did DevOps (and SRE) Teams Come From – And Why
To understand where DevOps came from, we turned to two experts in this field, John Arundel and Justin Domingus, co-authors of ‘Cloud Native DevOps with Kubernetes’. While the book’s main focus is Kubernetes, it also covers some of the key principles of DevOps and how it came to be:
Before DevOps, developing and operating software were essentially two separate jobs, performed by two different groups of people. Developers wrote software, and they passed it on to operations staff, who ran and maintained the software in production.
There was very little overlap; indeed, the two departments had quite different goals and incentives, which often conflicted with each other. Developers tend to be focused on (and rewarded for) shipping new features quickly, while operations teams care about making services stable and reliable over the long term.
Based on these conflicting goals, there tends to be a lot of contention between the developer and operations roles. “Throwing issues over the fence” and “dev vs. ops blame game” are common terms for describing the relationship between the two teams. Unfortunately, this situation can be a major roadblock for organizations aiming to out-innovate and out-perform their competition.
The people who write the software have to understand how it relates to the rest of the system, and the people who operate the system have to understand how the software works—or fails.
In their book, the two authors attribute the origins of the DevOps movement to attempts to break down the fence and bring these two groups together: to collaborate, to share understanding, to share responsibility for system reliability and software quality and to improve the scalability of both the software systems and the teams of people who build them.
Rather than being concentrated in a single operations team that services other teams, ops expertise will become distributed among many teams. Each development team will need at least one ops specialist, responsible for the health of the systems or services the team provides. She will be a developer, too, but she will also be the domain expert on networking, Kubernetes, performance, resilience, and the tools and systems that enable the other developers to deliver their code to the cloud.
And indeed, if we look for the basic definition of DevOps we can see that it’s “a set of software development practices that combines software development (Dev) and information technology operations (Ops) to shorten the systems development life cycle while delivering features, fixes, and updates frequently in close alignment with business objectives.”
In addition, some companies choose to create a Site Reliability Engineering (SRE) team instead of a DevOps team. SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. SRE originated at Google, and an SRE team’s main goal is to create ultra-scalable and highly reliable software systems.
These two teams have pretty much the same responsibilities and focus area inside the app, as well as the same main goal: to optimize and improve CI/CD (Continuous Integration and Continuous Deployment) workflows. In other words, DevOps and SRE are the ones who’ll make sure nothing breaks, regardless of how frequent your code-build-deploy cycle is.
It’s About More Than Just Bridging the Gap Between Dev and Ops
Now that we know the purpose of creating a DevOps or an SRE team, it’s time to understand what they need in order to be unstoppable. But first, we once again turned to an expert for the “secret sauce” of the ideal team. His answer was quite… surprising.
Dave Farley, founder and director of Continuous Delivery Ltd. and a thought-leader in the field of Continuous Delivery, DevOps and Software Development in general, states that there’s no such thing as an “ideal” DevOps team.
Most experts in this field think that the idea of a “DevOps Team” is the wrong idea. The aim of “DevOps” is really to ensure that operations thinking is part of development and that development teams own responsibility for their software in production. There is much more to it than that, but the idea of a separate “DevOps Team” is usually an anti-pattern, creating a new silo between Dev and Ops.
While Dave’s quote might seem a bit extreme, it’s pretty accurate. Bringing DevOps to your organization takes much more than a handful of developers and IT Ops people sitting down in the same room. The process of implementing DevOps requires a cultural shift within the organization that allows for the combination of two teams originally driven by goals that often conflict with each other (shipping code quickly and maintaining system stability).
And that’s exactly what DevOps and SRE should be all about: finding a balance between the need to ship code faster, on the one hand, and the need to close new and widening reliability gaps on the other. In a survey we conducted about a year ago, 73% of respondents answered that both dev and ops teams ultimately share responsibility for application reliability. Successfully implementing a DevOps culture is ultimately about leveraging that existing belief and building on it a deeper culture of accountability and enablement in place of finger-pointing and the blame game.
Proactive Reliability vs. Reactive Processes
For most DevOps and SRE teams, the go-to tools of choice are APMs that can monitor the application and alert on anomalous behavior, together with log analysis tools that reveal additional system information and (sometimes) help pinpoint the cause of the anomaly.
These tools take a top-down IT Ops approach to reliability, focusing on trace-level diagnostics (symptoms). But they’re not much help in finding the true cause of slow transactions, application crashes and general issues inside the application.
To enable Continuous Reliability, DevOps and SRE teams need tools that capture bottom-up, code-level diagnostics (causes) at a lower level than what many think is accessible. This data should include the complete state of the application at the time of failure, and map to the transaction, microservice, and deployment that introduced it.
That was our guiding principle when we created OverOps Reliability Dashboards, which show the current status of the application and help identify and prioritize errors across it. They detect issues, slowdowns and anomalies, and allow teams to resolve the errors that contribute to high error volumes more quickly and easily.
OverOps alerts you when new errors are introduced into your system, when there’s an increase in errors or when slowdowns occur, allowing you to stay on top of things – rather than waiting for your customers to complain about issues.
With proper identification and classification of anomalies in code behavior and performance, you can preemptively block releases from being deployed to production if circumstances suggest that a Sev1 issue is likely to be introduced.
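To make the idea of a release gate more concrete, here is a minimal sketch in Python. It is not the OverOps API or its actual algorithm. The names `ErrorStats`, `classify` and `should_block_release`, the error identifiers, and the "twice the baseline rate" threshold are all assumptions made up for illustration. The sketch compares error counts from a baseline release with those from a candidate release, labels each error as new or increasing, and blocks the release if any flagged error is on a list the team considers critical. Slowdown detection is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorStats:
    """Per-error counters for one release (hypothetical shape)."""
    count: int  # how many times the error occurred
    calls: int  # how many times the surrounding code path was invoked

    @property
    def rate(self) -> float:
        # Guard against division by zero when a code path was never called.
        return self.count / self.calls if self.calls else 0.0


def classify(baseline, candidate, increase_factor=2.0):
    """Label each error seen in the candidate release as 'new' or 'increasing'.

    `increase_factor` is an assumed threshold: an error counts as increasing
    when its rate is more than `increase_factor` times the baseline rate.
    """
    findings = {}
    for error_id, stats in candidate.items():
        previous = baseline.get(error_id)
        if previous is None:
            if stats.count > 0:
                findings[error_id] = "new"
        elif stats.rate > previous.rate * increase_factor:
            findings[error_id] = "increasing"
    return findings


def should_block_release(findings, critical_errors):
    """Block the release when any flagged error is on the critical list."""
    return any(error_id in critical_errors for error_id in findings)


# Illustrative data only; these identifiers and numbers are made up.
baseline = {"NullPointerException@Checkout": ErrorStats(count=5, calls=1000)}
candidate = {
    "NullPointerException@Checkout": ErrorStats(count=30, calls=1000),
    "TimeoutError@Payments": ErrorStats(count=3, calls=500),
}

findings = classify(baseline, candidate)
blocked = should_block_release(findings, critical_errors={"TimeoutError@Payments"})
print(findings, blocked)
```

In a real pipeline, a check like this would run as a step before deployment and fail the build when `blocked` is true. Which errors are critical and how large an increase is acceptable are decisions each team has to make for itself.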