CAP - consistency (eventual/data latency), availability, partition tolerance (network/quorum)
Consistency: some drift is acceptable, provided the data is eventually consistent
Backup/restore
RTO - Recovery Time Objective
RPO - Recovery Point Objective (PITR: point-in-time recovery)
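A minimal sketch of how the two objectives above could be checked after an incident. RPO bounds how much data may be lost (time between the last recoverable point and the failure); RTO bounds how long recovery may take (time between the failure and service restoration). The function names and the example timestamps are illustrative assumptions, not part of any existing tooling.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data written after the last recoverable point is lost on restore."""
    return failure_time - last_recovery_point


def meets_rpo(last_recovery_point: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data-loss window is within the Recovery Point Objective."""
    return data_loss_window(last_recovery_point, failure_time) <= rpo


def meets_rto(failure_time: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True if the outage duration is within the Recovery Time Objective."""
    return (service_restored - failure_time) <= rto


# Hypothetical incident: last recoverable point 10 minutes before the failure,
# service back 45 minutes after it.
failure = datetime(2024, 1, 1, 12, 0)
last_point = datetime(2024, 1, 1, 11, 50)
restored = datetime(2024, 1, 1, 12, 45)

rpo_ok = meets_rpo(last_point, failure, rpo=timedelta(minutes=15))
rto_ok = meets_rto(failure, restored, rto=timedelta(hours=1))
```

With PITR the last recoverable point is typically close to the failure time, which is why it tends to be the mechanism for tight RPO targets.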
Monitoring, alerting, escalation
Synthetic application monitoring (white, grey and blackbox)
Latency monitoring
Timeseries metrics
Versioned Kubernetes Secrets and ConfigMaps
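One possible way to version Secrets and ConfigMaps is to derive the object name from its content, so any change produces a new name and the old object stays available for rollback. The sketch below assumes that approach; `versioned_name` and the example data are hypothetical. It is similar in spirit to the name-suffix hash that Kustomize generators append, but it is not the same algorithm.

```python
import hashlib
import json
from typing import Mapping


def versioned_name(base_name: str, data: Mapping[str, str], length: int = 10) -> str:
    """Return base_name plus a short hash of the data (assumed naming scheme)."""
    # Canonical JSON: sorted keys so key order does not change the hash.
    canonical = json.dumps(dict(data), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{base_name}-{digest[:length]}"


# Hypothetical application config.
config_v1 = {"LOG_LEVEL": "info", "DB_HOST": "db.internal"}
config_v2 = {"LOG_LEVEL": "debug", "DB_HOST": "db.internal"}

name_v1 = versioned_name("app-config", config_v1)
name_v2 = versioned_name("app-config", config_v2)
```

A Deployment that references `name_v2` rolls its pods when the name changes; rolling back means pointing at `name_v1` again, which still exists.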
So What's Next?
Upgrade/rollback Kubernetes application - relies on versioned product
Full ORM abstraction layer for all database communication, to stage backward/forward compatible schema changes between application deltas - urgh, potentially loads of effort (handled per application, via a shared library, or via an abstraction service)
Geographical/datacentre fault tolerance: deploy all components to at least n+1 availability zones - mostly done
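A rough sketch of the staged schema change mentioned in the ORM item above, using the common "expand, then contract" pattern: add a nullable column first so the old application version keeps working, then let the new version use it with a fallback. It uses raw `sqlite3` rather than an ORM only to stay self-contained; the `users` table, the `display_name` column and the `v1_`/`v2_` functions are hypothetical.

```python
import sqlite3


def create_v1_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def v1_insert(conn: sqlite3.Connection, name: str) -> None:
    """Old application version: only knows about the name column."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def expand(conn: sqlite3.Connection) -> None:
    """Expand step: additive and nullable, so v1 code is unaffected."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def v2_insert(conn: sqlite3.Connection, name: str, display_name: str) -> None:
    """New application version: writes the new column."""
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


def v2_read(conn: sqlite3.Connection) -> list:
    """New version tolerates rows written by v1 by falling back to name."""
    return conn.execute(
        "SELECT name, COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_v1_schema(conn)
v1_insert(conn, "alice")             # before the schema change
expand(conn)
v1_insert(conn, "bob")               # old version still works after expand
v2_insert(conn, "carol", "Carol C.")  # new version uses the new column
rows = v2_read(conn)
```

The contract step (dropping or tightening old columns) would only follow once no deployed version depends on the old shape, which is what makes upgrade and rollback safe in both directions.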