Monthly Archives: January 2021

Book Notes – Site Reliability Engineering

Notes from Chapter 1 of the Site Reliability Engineering book.

SRE teams are characterized by both rapid innovation and a large acceptance of change.

SRE teams are responsible for the availability, latency, performance, efficiency, change management, monitoring and capacity planning of their services.

Core Tenets of Google SRE:

  • Ensuring a Durable Focus on Engineering – Cap operations at 50% of an SRE’s time, leaving the other 50% for engineering work. Safety valves exist in case the volume of operational effort exceeds 50%. This ensures SREs have enough time to respond to incidents, restore service and conduct a postmortem.
  • Pursue Maximum Change Velocity Without Violating the SLO – An error budget derived from the SLO allows change to happen without requiring adherence to a 100% uptime target.
  • Keep track of system health and availability through Monitoring – 3 kinds of valid outputs: alerts, tickets and logging.
  • Ensure an effective Emergency Response by reducing MTTR – Leverage Playbooks to ensure consistent and efficient responses by all members of the team.
  • Reduce bad changes by automating Change Management – Use automation to implement progressive rollouts, quickly and accurately detect problems and rollback changes safely if necessary.
  • Ensure sufficient capacity and redundancy to serve projected future demand through Demand Forecasting and Capacity Planning – Should take both organic and inorganic growth into account.
  • Provisioning should be conducted quickly and only when necessary – Adding capacity is expensive, so it must only be done when needed; when it is done, it must be done correctly so the capacity works when called upon.
  • Efficiency and Performance ensure effective management of a service’s costs – Demand, capacity and software efficiency are large contributors to a system’s efficiency. SREs provision to meet a capacity target at a specific response speed.
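The error-budget idea above comes down to simple arithmetic: whatever reliability the SLO does not promise is budget available for change. A minimal sketch (the function name and the 30-day period are illustrative assumptions, not from the book):

```python
# Hypothetical sketch: deriving a downtime budget from an availability SLO.
def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return (1.0 - slo) * period_minutes

# A 99.9% SLO over a 30-day month leaves about 43.2 minutes of downtime budget.
print(round(error_budget_minutes(0.999), 1))  # → 43.2
```

This is why a 100% uptime target is unworkable: it leaves a budget of exactly zero, so no change could ever be risked.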

GCP Cloud Architect Study Guide – Statefulness and Measurements

Statefulness

State should be moved as far back in the stack as possible to improve scalability. By replicating and sharding state on the back end, stability and reliability are improved.

Load balancers should be used to distribute incoming load to the frontend servers, and then again to map that load to the backend systems hosting the replicated state.
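One common way a backend load balancer maps requests onto sharded state is key-based hashing: the same session key always lands on the same shard, while keys spread roughly evenly across shards. A minimal sketch (the function and key names are illustrative assumptions):

```python
import hashlib

# Hypothetical sketch: routing a session key to one of N state shards,
# the kind of mapping a backend load balancer might perform.
def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

sessions = [f"user-{i}" for i in range(1000)]
counts = [0] * 4
for s in sessions:
    counts[shard_for(s, 4)] += 1
print(counts)  # roughly even distribution of 1000 keys across 4 shards
```

Determinism is the important property here: replicas of the routing layer agree on which shard holds a given key without coordinating.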

Measurements

Service Level Objectives are technical attributes describing the quality of service that you want to achieve.

Service Level Indicators (performance indicators) are how you measure the objectives. They provide information on how close you are to meeting each objective.

Service Level Agreements are the codification of the SLOs into legal documents.

Objectives should be used to guide the design. They can start off as estimates or as a range and get more specific and refined as the system evolves.

Objectives should be relevant to the user experience. Indicators measure that experience, and alerts should be generated when degradation is noticeable and causing the user pain.
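The SLI/SLO relationship above is mechanical: the indicator is a measured ratio, the objective is the target it is compared against. A minimal sketch (function name and request counts are illustrative assumptions):

```python
# Hypothetical sketch: computing an availability SLI (good requests / total)
# and checking it against an SLO target.
def availability_sli(good: int, total: int) -> float:
    return good / total if total else 1.0

slo_target = 0.999
sli = availability_sli(good=99_950, total=100_000)
print(sli >= slo_target)  # → True (0.9995 meets the 99.9% target)
```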

Alerts are defined through the use of:

  • Policies – define the conditions under which the service is considered unhealthy.
  • Conditions – determine when a policy triggers. Should be tuned to filter out false positives.
  • Notifications – channels used to inform those who must take action.
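The tuning of conditions to filter false positives usually means requiring the bad signal to persist for a whole window rather than firing on a single spike. A minimal sketch of that idea (the class and thresholds are illustrative assumptions, not a Cloud Monitoring API):

```python
from collections import deque

# Hypothetical sketch of an alert condition: fire only when the error rate
# stays above the threshold for the entire rolling window, so one-off
# spikes (false positives) are filtered out.
class ErrorRateCondition:
    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        return full and all(s > self.threshold for s in self.samples)

cond = ErrorRateCondition(threshold=0.05, window=3)
print([cond.observe(r) for r in [0.10, 0.02, 0.09, 0.08, 0.07]])
# → [False, False, False, False, True]  (the lone 0.10 spike never fires)
```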

GCP Cloud Architect Study Guide – Microservices

Serverless Solutions:

| Solution | Language Support | Triggers | Startup Latency | Cost |
| --- | --- | --- | --- | --- |
| Cloud Functions | node.js, Python, Go, Java | HTTP, Pub/Sub, Cloud Storage, Cloud Firestore, Firebase | High (cold start) | Number of invocations plus tiered usage |
| App Engine Standard | node.js, Python, Go, Java, Ruby, PHP | HTTP using service name, Pub/Sub | Seconds | Tiered usage pricing based on instance size |
| App Engine Flex | node.js, Python, Go, Java, Ruby, PHP, .NET, custom | n/a – containerized app, limited by the code | Minutes | Compute Engine based pricing |
| Cloud Run | Any | n/a – containerized app, limited by the code | Minutes | For native, based on usage |

Note: In all cases (except App Engine Flex which runs on Compute Engine), for the services to be able to connect to internal resources using private IP address, Serverless VPC Access must be enabled.

Cloud Functions Best Practices and Notes

  • Design for Idempotency – the same input produces the same result every time
  • Do not start background processes
  • Due to cold start, limit libraries to only those necessary
  • Use global variables to reuse objects across invocations
  • Load global variables lazily, only as needed

App Engine Standard Best Practices and Notes

  • Runs in a Sandbox on shared infrastructure
  • Rapid scaling from 0 (when no traffic)
  • App engine specific pricing

App Engine Flex Best Practices and Notes

  • Runs in a Docker Container on dedicated Compute Engine instances
  • Consistent traffic with gradual scaling
  • Cannot scale to 0
  • Compute engine pricing

Cloud Run

  • Cloud Run native seems to be the replacement for App Engine Flex
  • Cloud Run can also run as Cloud Run for Anthos leveraging Knative to provide a serverless environment
  • Can scale to 0 or can be kept at a minimum for faster responses
  • Supports concurrent requests to a given revision if the application supports concurrency
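Because Cloud Run runs any container, the application's only real obligation is to serve HTTP on the port Cloud Run provides via the `PORT` environment variable (8080 by default). A minimal stdlib sketch of that contract (port 0 is used here only so the demo picks a free local port):

```python
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from threading import Thread
from urllib.request import urlopen

# Minimal sketch of the Cloud Run container contract: listen on the port
# given in the PORT environment variable.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from cloud run"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Cloud Run sets PORT (default 8080); 0 here means "any free port" for the demo.
port = int(os.environ.get("PORT", "0"))
server = HTTPServer(("127.0.0.1", port), Handler)
Thread(target=server.serve_forever, daemon=True).start()

response = urlopen(f"http://127.0.0.1:{server.server_address[1]}/").read().decode()
server.shutdown()
print(response)  # → hello from cloud run
```

A single instance like this can be allowed to serve many requests at once, which is the concurrency setting referred to in the last bullet above.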

Use Cases for Serverless Solutions

App Engine

  • Modern Web Applications – serving basic web content (static and dynamic) to users
  • Scalable Mobile Backends – the backend infrastructure to support mobile applications (gaming, etc.)

Cloud Functions

  • Trigger-Based Notifications – send emails, etc. based on changes or workflow events
  • Real-Time File Processing – perform actions based on changes to new or existing files: generate thumbnails, validate, aggregate, enhance, transcode, etc.
  • Real-Time Stream Processing – enrich, process and transform streaming data from Pub/Sub, IoT sensors, activity tracking, etc.

Cloud Run

  • Containerized native applications with the benefits of serverless architecture
  • REST API backends
  • Lightweight data transformation