Monthly Archives: January 2021

Book Notes – Site Reliability Engineering

Notes from Chapter 1 of the Site Reliability Engineering book.

SRE teams are characterized by both rapid innovation and a large acceptance of change.

SRE teams are responsible for the availability, latency, performance, efficiency, change management, monitoring and capacity planning of their services.

Core Tenets of Google SRE:

  • Ensuring a Durable Focus on Engineering – Cap operations at 50% of an SRE’s time, leaving the other 50% for engineering work. Safety valves exist in case the volume of operational effort exceeds 50%. This ensures SREs have enough time to respond to incidents, restore service and conduct a postmortem.
  • Pursue Maximum Change Velocity Without Violating the SLO – An error budget derived from the SLO allows change to happen without requiring adherence to a 100% uptime target.
  • Keep track of system health and availability through Monitoring – 3 kinds of valid outputs: alerts, tickets and logging.
  • Ensure an effective Emergency Response by reducing MTTR – Leverage Playbooks to ensure consistent and efficient responses by all members of the team.
  • Reduce bad changes by automating Change Management – Use automation to implement progressive rollouts, quickly and accurately detect problems and rollback changes safely if necessary.
  • Ensure sufficient capacity and redundancy to serve projected future demand through Demand Forecasting and Capacity Planning – Should take both organic and inorganic growth into account.
  • Provisioning should be conducted quickly and only when necessary – Adding capacity is expensive, so it must only be done when needed; when it is done, it must be done correctly so the capacity works when called upon.
  • Efficiency and Performance ensure effective management of a service’s costs – Demand, capacity and software efficiency are large contributors to a system’s efficiency. SREs provision to meet a capacity target at a specific response speed.
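The error-budget idea above comes down to simple arithmetic: whatever reliability the SLO does not promise is budget available for change. A minimal sketch (the function name and the 30-day period are illustrative assumptions, not from the book):

```python
# Hypothetical sketch: deriving a downtime budget from an availability SLO.
def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return (1.0 - slo) * period_minutes

# A 99.9% SLO over a 30-day month leaves about 43.2 minutes of downtime budget.
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

This is why a 100% uptime target is unworkable: it leaves a budget of exactly zero, so no change could ever be risked.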

GCP Cloud Architect Study Guide – Statefulness and Measurements

Statefulness

State should be moved as far back in the stack as possible to improve scalability. By replicating and sharding state on the back end, stability and reliability are improved.

Load balancers should be used to distribute incoming load to the frontend servers, and then again to map that load to the backend systems hosting the replicated state.
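One common way a backend load balancer maps requests onto sharded state is key-based hashing: the same session key always lands on the same shard, while keys spread roughly evenly across shards. A minimal sketch (the function and key names are illustrative assumptions):

```python
import hashlib

# Hypothetical sketch: routing a session key to one of N state shards,
# the kind of mapping a backend load balancer might perform.
def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

sessions = [f"user-{i}" for i in range(1000)]
counts = [0] * 4
for s in sessions:
    counts[shard_for(s, 4)] += 1
print(counts)  # roughly even distribution of 1000 keys across 4 shards
```

Determinism is the important property here: replicas of the routing layer agree on which shard holds a given key without coordinating.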

Measurements

Service Level Objectives are technical attributes describing the quality of service that you want to achieve.

Service Level Indicators (performance indicators) are how you measure the objectives. They provide information on how close you are to meeting each objective.

Service Level Agreements are the codification of the SLOs into legal documents.

Objectives should be used to guide the design. They can start off as estimates or as a range and get more specific and refined as the system evolves.

Objectives should be relevant to the user experience. Indicators measure that experience, and alerts should be generated when degradation is noticeable and causing the user pain.
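The SLI/SLO relationship above is mechanical: the indicator is a measured ratio, the objective is the target it is compared against. A minimal sketch (function name and request counts are illustrative assumptions):

```python
# Hypothetical sketch: computing an availability SLI (good requests / total)
# and checking it against an SLO target.
def availability_sli(good: int, total: int) -> float:
    return good / total if total else 1.0

slo_target = 0.999
sli = availability_sli(good=99_950, total=100_000)
print(sli >= slo_target)  # → True (0.9995 meets the 99.9% target)
```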

Alerts are defined through the use of:

  • Policies – define the conditions under which the service is considered unhealthy.
  • Conditions – determine when a policy triggers. Should be tuned to filter out false positives.
  • Notifications – channels used to inform those who must take action.
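The tuning of conditions to filter false positives usually means requiring the bad signal to persist for a whole window rather than firing on a single spike. A minimal sketch of that idea (the class and thresholds are illustrative assumptions, not a Cloud Monitoring API):

```python
from collections import deque

# Hypothetical sketch of an alert condition: fire only when the error rate
# stays above the threshold for the entire rolling window, so one-off
# spikes (false positives) are filtered out.
class ErrorRateCondition:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        return full and all(s > self.threshold for s in self.samples)

cond = ErrorRateCondition(threshold=0.05, window=3)
print([cond.observe(r) for r in [0.10, 0.02, 0.09, 0.08, 0.07]])
# → [False, False, False, False, True]  (the lone 0.10 spike never fires)
```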

GCP Cloud Architect Study Guide – Microservices

Serverless Solutions:

| Solution | Language Support | Triggers | Startup Latency | Cost |
| --- | --- | --- | --- | --- |
| Cloud Functions | node.js, Python, Go, Java | HTTP, Pub/Sub, Cloud Storage, Cloud Firestore, Firebase | High (cold start) | Number of invocations plus tiered usage |
| App Engine Standard | node.js, Python, Go, Java, Ruby, PHP | HTTP using service name, Pub/Sub | Seconds | Tiered usage pricing based on instance size |
| App Engine Flex | node.js, Python, Go, Java, Ruby, PHP, .NET, custom | n/a – containerized app, limited by the code | Minutes | Compute Engine based pricing |
| Cloud Run | Any | n/a – containerized app, limited by the code | Minutes | For native, based on usage |

Note: In all cases (except App Engine Flex which runs on Compute Engine), for the services to be able to connect to internal resources using private IP address, Serverless VPC Access must be enabled.

Cloud Functions Best Practices and Notes

  • Design for Idempotency – the same input produces the same result every time
  • Do not start background processes
  • Due to cold start, limit libraries to only those necessary
  • Use global variables to reuse objects across invocations
  • Load global variables lazily, only as needed

App Engine Standard Best Practices and Notes

  • Runs in a Sandbox on shared infrastructure
  • Rapid scaling from 0 (when no traffic)
  • App engine specific pricing

App Engine Flex Best Practices and Notes

  • Runs in a Docker Container on dedicated Compute Engine instances
  • Consistent traffic with gradual scaling
  • Cannot scale to 0
  • Compute engine pricing

Cloud Run

  • Cloud Run native seems to be the replacement for App Engine Flex
  • Cloud Run can also run as Cloud Run for Anthos leveraging Knative to provide a serverless environment
  • Can scale to 0 or can be kept at a minimum for faster responses
  • Supports concurrent requests to a given revision if the application supports concurrency
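Because Cloud Run runs any container, the application's only real obligation is to serve HTTP on the port Cloud Run provides via the `PORT` environment variable (8080 by default). A minimal stdlib sketch of that contract (port 0 is used here only so the demo picks a free local port):

```python
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from threading import Thread
from urllib.request import urlopen

# Minimal sketch of the Cloud Run container contract: listen on the port
# given in the PORT environment variable.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from cloud run"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Cloud Run sets PORT (default 8080); 0 here means "any free port" for the demo.
port = int(os.environ.get("PORT", "0"))
server = HTTPServer(("127.0.0.1", port), Handler)
Thread(target=server.serve_forever, daemon=True).start()

response = urlopen(f"http://127.0.0.1:{server.server_address[1]}/").read().decode()
server.shutdown()
print(response)  # → hello from cloud run
```

A single instance like this can be allowed to serve many requests at once, which is the concurrency setting referred to in the last bullet above.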

Use Cases for Serverless Solutions

App Engine

  • Modern Web Applications – serving basic web content (static and dynamic) to users
  • Scalable Mobile Backends – the backend infrastructure to support mobile applications (gaming, etc.)

Cloud Functions

  • Trigger-Based Notifications – send emails, etc. based on changes or workflow events
  • Real-Time File Processing – perform actions based on changes to new or existing files: generate thumbnails, validate, aggregate, enhance, transcode, etc.
  • Real-Time Stream Processing – enrich, process and transform streaming data from Pub/Sub, IoT sensors, activity tracking, etc.

Cloud Run

  • Containerized native applications with the benefits of serverless architecture
  • REST API backends
  • Lightweight data transformation