SRE

What is SRE?

These are just my notes on Site Reliability Engineering (SRE). It is something I have chosen to study as part of my work for a company called BankBI. By all means read on, but I'm not really writing for readability. This is just a notes-drop from various courses.

SRE is often described as "what happens when devs design operations". It is hard to say definitively that it is one thing or another, because different aspects of SRE appeal to different authors.

Often there is a lot of focus on delivering feature after feature, on the assumption that once a system is up it is going to stay that way. When has that ever happened? SRE is concerned with recognising and planning for failure. Among other things, it involves setting goals for the reliability of a system and making sure that, when the system falls short of a goal, the team focuses on getting back to that reliability goal rather than on new features. Obviously there's more to it than that, but it's one of the key things that comes up early in many definitions.

SRE sits alongside DevOps. Some people like to conflate the two, but they have distinct aims. Nearly every author is against treating them as opposites of which only one can be implemented. In a nutshell, DevOps gets the sysadmins and developers to work together, and SRE gets the support/operations team and developers to work together. That would tend to suggest us devs aren't good at working with anybody, since there are all these grand initiatives in place to get us to work well with people!

SRE is a child of Google's inner workings. For years Google kept it to themselves and then suddenly they went public.

The Problem

One way that we can define SRE is to consider the problem that we are trying to solve. We can define SRE as "a collection of principles, ideals and ideas which are brought in to solve a divide between Ops and Development".

Why would there ever be a divide in a company? In this case it comes about because these two teams, while seeming to work very closely together, actually face very different difficulties and problems. They meet customers who are having very different experiences, so they end up with very different views of what customers want.

The developers will traditionally hear about the latest features wanted by existing and potential customers. Because they are always hearing requests for new features, they will reasonably assume that customers always want new features. Their focus will be to deliver new features as fast as possible, to keep existing customers happy and to help the sales team onboard new customers who want different features.

Operations (or Ops) manage the system on an ongoing basis. They can see how many computing resources are required, and they are often the first port of call when something goes wrong. They are often face-to-face with customers who are experiencing bugs or instability. They will likely assume that customers want stability and no bugs.

As you can see, these two teams often have a very different idea of what the customer wants, and SRE is the solution for preventing potential warfare between the two factions.

In SRE you set a target for reliability and stability. When the system falls short of that target then development's priority changes from delivering new features to dealing with stability. With DevOps the development team and the Ops team begin to overlap more.
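The rule above is simple enough to sketch in code. This is only an illustration under my own assumptions: the function names, the idea of measuring availability as the share of successful requests, and the example numbers are all made up for the sketch and do not come from any of the courses.

```python
def availability(total_requests: int, failed_requests: int) -> float:
    """Measured availability as the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def development_priority(target: float, total_requests: int, failed_requests: int) -> str:
    """Return what the team should focus on, given a reliability target.

    Hypothetical policy: if measured availability is below the target,
    stability work comes first; otherwise feature work carries on.
    """
    measured = availability(total_requests, failed_requests)
    return "reliability" if measured < target else "features"


if __name__ == "__main__":
    # Assumed target of 99.9% availability over some measurement window.
    print(development_priority(0.999, 100_000, 50))
    print(development_priority(0.999, 100_000, 150))
```

With a 99.9% target, 50 failures in 100,000 requests leaves the team on features, while 150 failures switches the priority to reliability.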

Core Tenets of SRE

The highest priorities of any SRE team or agent are the following:

  1. Availability
  2. Latency
  3. Performance
  4. Efficiency
  5. Change Management
  6. Monitoring
  7. Emergency Response
  8. Capacity Planning

Toil

While this is not a definition of SRE, it is a concept that comes up in a few of the definitions. Toil is hard work. It often refers specifically to repetitive tasks which do not require much thought. In SRE these tasks are usually automated to save time. Remember that if it takes longer to automate a task than to do it by hand a few times, you need to think about whether it is worth automating. In an environment that is always changing it might not be worth it, because the task might only be done a few more times.
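That trade-off is just arithmetic, so here is a rough sketch of it. The function names and the numbers are my own invention for illustration, and it ignores things like the cost of maintaining the automation.

```python
def break_even_runs(manual_minutes: float, automation_minutes: float) -> float:
    """Number of manual runs whose total time equals the time spent automating."""
    if manual_minutes <= 0:
        raise ValueError("manual_minutes must be positive")
    return automation_minutes / manual_minutes


def worth_automating(manual_minutes: float, automation_minutes: float, expected_runs: int) -> bool:
    """True if automating costs less time than doing the task by hand
    for the number of runs we still expect to need."""
    return automation_minutes < manual_minutes * expected_runs


if __name__ == "__main__":
    # Assumed figures: a 10-minute task that would take 2 hours to automate.
    print(break_even_runs(10, 120))
    print(worth_automating(10, 120, expected_runs=5))
    print(worth_automating(10, 120, expected_runs=50))
```

In the example, a 10-minute task that takes two hours to automate breaks even at 12 runs, so it is not worth automating if it will only be done 5 more times, but it is if it will be done 50 more times.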

Graeme