Location: Victoria, London, UK. Occasional travel to international offices.
Core Roles/Skills: Large-scale distributed systems administration, capacity and service-continuity planning and implementation, out-of-hours on-call cover, software releases of Google advertising services (AdWords, AdSense, etc.), close liaison with software development and other SRE teams.
Within the Ads Frontend SRE team, I was responsible for administering globally distributed, highly available (99.9999% uptime), high-QPS, customer-facing frontend web services and backend RPC services built on custom storage layers.
This involved configuring custom software-layer load balancers, and carrying out detailed load testing, analysis and capacity planning to ensure no interruption of service, even during catastrophic failure of entire clusters.
Other duties included frequent on-call shifts for critical services (three and four nines), code reviews, launch reviews for new products, technical consultation and production advice to help development teams design performant products, and regular software releases to production.
Diagnosing problems in components at all layers of the application stack required knowledge of Java, C++ and Python.
During my time there I developed a number of tools that were used not only by my own team but by many SRE teams within Google. One ensured that recently committed software revisions were deployed into production in a timely manner, keeping service behaviour predictable and reliable. Another automatically generated a dashboard giving a single central view of all services maintained by any given SRE team, with links to each service's internal status pages, diagnostic interfaces and documentation.