AWS re:Invent 2022 -- Building modern apps: Architecting for observability & resilience (ARC217-L)
Observability & Resilience
-----------------------------
AWS's Well-Architected Framework
Resilience in the cloud
- Anticipating
- Monitoring
- Responding
- Learning
Categories of failure
Recovery-oriented patterns
- Backoff and retry
- Circuit breaker
- Graceful degradation
- Throttling
- Load shedding
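The first two recovery-oriented patterns can be sketched in a few lines. This is a minimal, single-threaded illustration, not AWS code: `retry_with_backoff` retries with capped exponential backoff and "full jitter" (a random wait between 0 and the cap), and `CircuitBreaker` stops calling a failing dependency for a cool-off period, then lets one trial call through. All names and default values (`base_delay`, `failure_threshold`, `reset_timeout`, and so on) are hypothetical choices for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call op(); on failure wait a capped exponential delay with full jitter, then retry."""
    rng = rng if rng is not None else random.Random()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, cap))  # full jitter: random wait in [0, cap]
    raise ValueError("max_attempts must be at least 1")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures and fails fast until
    `reset_timeout` seconds pass; then a trial call is allowed (half-open).
    A successful trial closes the breaker, a failed one re-opens it."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, op: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this trial call through.
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the breaker
        return result
```

Catching `CircuitOpenError` and returning a cached or reduced response is one simple way to add graceful degradation on top of the breaker. A production breaker would also need locking if shared across threads.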
AWS investments in resilience
Amazon Route 53 Application Recovery Controller (ARC) helps you manage and coordinate application recovery across multiple AWS Regions, Availability Zones, and on-premises environments.
Zonal shift for Amazon Route 53 ARC - New
Speeds recovery for multi-AZ applications
[18:15 - 18:22] Capital One
Implementing essential metrics
To detect, investigate, and respond to impact:
- Customer experience metrics
- Impact assessment metrics
- Operational health metrics
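As a rough illustration of customer experience and impact assessment metrics, the sketch below computes a success rate, a nearest-rank latency percentile (such as p99), and a count of impacted requests from raw request records. The `RequestRecord` shape, the "5xx means failure" rule, and the latency objective are assumptions made for this example, not definitions from the talk.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestRecord:
    latency_ms: float
    status: int  # HTTP status code returned to the customer


def success_rate(records: Sequence[RequestRecord]) -> float:
    """Fraction of requests not failed by the service (5xx counts as failure)."""
    if not records:
        return 1.0  # no traffic: nothing failed
    ok = sum(1 for r in records if r.status < 500)
    return ok / len(records)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=99 for p99 latency."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def impacted_requests(records: Sequence[RequestRecord], latency_slo_ms: float) -> int:
    """Requests that either failed or breached the latency objective."""
    return sum(1 for r in records if r.status >= 500 or r.latency_ms > latency_slo_ms)
```

Counting slow requests as impacted, not just failed ones, reflects that customers experience a very slow response much like an error.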
Monitoring in the cloud
Three pillars of observability tooling
- Metrics
- Logs
- Traces
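The three pillars are most useful when they can be joined. A minimal sketch: emit each log line as structured JSON and attach the trace ID taken from an incoming W3C Trace Context `traceparent` header, so a log entry can be looked up next to its trace. The choice of W3C Trace Context and the field names `trace_id` and `parent_span_id` are assumptions for illustration; AWS X-Ray, for example, uses its own `X-Amzn-Trace-Id` header format.

```python
import json
import re
import time
from typing import Optional, Tuple

# W3C Trace Context "traceparent": version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_id), or None if the header is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, _flags = m.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # reserved version and all-zero IDs are invalid
    return trace_id, parent_id


def log_event(message: str, traceparent: Optional[str] = None, **fields) -> str:
    """Build one structured (JSON) log line; trace IDs let it be joined with traces."""
    record = {"ts": time.time(), "msg": message, **fields}
    ids = parse_traceparent(traceparent) if traceparent else None
    if ids is not None:
        record["trace_id"], record["parent_span_id"] = ids
    return json.dumps(record, sort_keys=True)
```

Numeric fields such as `latency_ms` in the log line can also be extracted into metrics, which ties the third pillar back to the first.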
[34:15 - 43:02] FINRA
Reliability design principles
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally to increase aggregate workload availability
- Stop guessing capacity
- Manage change in automation
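Why scaling horizontally raises aggregate availability can be shown with simple arithmetic. Assuming replicas fail independently, each with availability a, and any one replica can serve the workload, the aggregate availability is 1 - (1 - a)^n; if at least k of n replicas must be up, it is a binomial sum. The independence assumption is the key caveat: correlated failures (for example, a shared Availability Zone) break it, which is why replicas are spread across fault isolation boundaries.

```python
import math


def parallel_availability(a: float, n: int) -> float:
    """Availability when any 1 of n independent replicas is enough: 1 - (1 - a)^n."""
    return 1 - (1 - a) ** n


def k_of_n_availability(a: float, n: int, k: int) -> float:
    """Probability that at least k of n independent replicas are up (binomial sum)."""
    return sum(math.comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))
```

For example, three replicas at 99% each give roughly 99.9999% when one is enough, but only about 97.03% if all three must be up.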
Operational excellence principles
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
Game days
Regularly test your people, processes, and systems
- Exercise your procedures
- Ensure no impact to users
- Simulate exceptional events
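One way to simulate exceptional events while keeping users unaffected is to inject faults only into traffic explicitly marked as part of the game day. The sketch below wraps a request handler and fails a configurable fraction of tagged requests. The `game_day` request flag and the `InjectedFault` exception are hypothetical names for illustration, not part of any AWS service.

```python
import random
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Deliberately raised during a game day; not a real outage."""


def with_fault_injection(
    handler: Callable[[dict], Any],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[dict], Any]:
    """Wrap a request handler so that a fraction of game-day requests fail."""
    rng = rng if rng is not None else random.Random()

    def wrapped(request: dict) -> Any:
        # Only requests explicitly tagged as game-day (synthetic) traffic are
        # eligible, so real users never see an injected fault.
        if request.get("game_day") and rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return handler(request)

    return wrapped
```

Raising a dedicated exception type makes injected failures easy to distinguish from real ones in logs and metrics during the exercise.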
Infrastructure event management
- Planning for large-scale events
- Framework to ensure alignment
- Planning, review, and risk mitigation
- Post-event recap
AWS uses this process for events such as Amazon Prime Day.
AWS Well-Architected Framework
https://aws.amazon.com/cn/architecture/well-architected/?wa-lens-whitepapers.sort-by=item.additionalFields.sortDate&wa-lens-whitepapers.sort-order=desc&wa-guidance-whitepapers.sort-by=item.additionalFields.sortDate&wa-guidance-whitepapers.sort-order=desc
AWS Fault Isolation Boundaries Whitepaper
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html
AWS Solutions Library
https://aws.amazon.com/cn/solutions/resilience/