Book Notes – Site Reliability Engineering

Notes from Chapter 1 of the Site Reliability Engineering book.

SRE teams are characterized by both rapid innovation and a large acceptance of change.

SRE teams are responsible for the availability, latency, performance, efficiency, change management, monitoring and capacity planning of their services.

Core Tenets of Google SRE:

  • Ensuring a Durable Focus on Engineering – Operational work is capped at 50% of an SRE’s time, leaving at least 50% for engineering. Safety valves exist in case the volume of operational effort exceeds 50%. Keeping the on-call load bounded also ensures that engineers have enough time to respond to an incident, restore service and conduct a postmortem.
  • Pursue Maximum Change Velocity without violating the SLO – Having an error budget (the gap between the SLO and 100%) allows change to happen and removes the need to adhere to a 100% uptime target.
  • Keep track of system health and availability through Monitoring – 3 kinds of valid outputs: alerts, tickets and logging.
  • Ensure an effective Emergency Response by reducing MTTR – Leverage Playbooks to ensure consistent and efficient responses by all members of the team.
  • Reduce bad changes by automating Change Management – Use automation to implement progressive rollouts, detect problems quickly and accurately, and roll back changes safely if necessary.
  • Ensure sufficient capacity and redundancy to serve projected future demand through Demand Forecasting and Capacity Planning – Should take both organic and inorganic growth into account.
  • Provisioning should be conducted quickly and only when necessary – Adding capacity is expensive, so it should be done only when needed. When it is done, it must be done correctly so that the new capacity works when it is called on.
  • Efficiency and Performance ensure effective management of a service’s costs – Demand, capacity and software efficiency are a large part of a system’s efficiency. SREs provision to meet a capacity target at a specific response speed.
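
The error budget idea from the change-velocity tenet above is easy to make concrete with a little arithmetic. The following is only a rough sketch, not something taken from the book: the function names are hypothetical, the 30-day window is an assumed default, and measuring the budget purely in minutes of downtime is a simplification.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes.

    `slo` is a fraction such as 0.999. The 30-day default window is an
    assumption made for this example.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_release(slo: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Hypothetical release gate: allow changes only while budget remains."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
    print(round(error_budget_minutes(0.999), 1))
    print(can_release(0.999, downtime_minutes=10.0))
    print(can_release(0.999, downtime_minutes=50.0))
```

With a 100% target the budget is zero, so under this gate no change could ever ship, which is why the tenet avoids a 100% uptime target.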
