Job Summary
Senior Cloud Platform Engineer
About the role
You will own the reliability, security, and scalability of our GCP-based AI platform infrastructure. Everything runs on Cloud Run, managed via Terraform, deployed through Cloud Build. You are responsible for zero-downtime deployments, cloud cost control, end-to-end observability, and ensuring that IAM, VPC, and data security posture meet enterprise standards. You are also the person the data and AI engineers call when their Terraform apply fails or their Cloud Run service won't start.
Key responsibilities
Own and evolve the Terraform IaC codebase — write and maintain reusable modules for Cloud Run services, AlloyDB clusters, Spanner instances, BigQuery datasets, Memorystore Redis, Vertex AI endpoints, Artifact Registry, and VPC networking
Manage Cloud Build CI/CD pipelines across all services — branching strategy (GitOps), build triggers, test gate enforcement, multi-environment promotion (dev → staging → prod), and automated rollback on failed health checks
Design and maintain GCP security posture — IAM least-privilege service accounts, Identity-Aware Proxy (IAP) for all internal services, VPC Service Controls, Private Service Connect for AlloyDB and Redis, and Secret Manager integration
Build and maintain the full observability stack — Cloud Monitoring dashboards, OpenTelemetry collector configuration, structured JSON logging standards, distributed tracing across FastAPI and LangGraph services, and PagerDuty or equivalent on-call alerting
Define and track SLOs for all platform services — API p50/p95/p99 latency, data pipeline freshness, AI pipeline throughput, Cloud Run error rate — and run monthly reliability reviews
Manage Docker image strategy — multi-stage build patterns to minimise image size, distroless base images, Artifact Registry lifecycle policies, and automated vulnerability scanning with Artifact Analysis (formerly Container Analysis)
Implement FinOps practices — BigQuery slot monitoring and reservation management, Cloud Run CPU/memory right-sizing, committed use discount planning, and per-team cost allocation using labels
Conduct quarterly infrastructure security reviews and respond to GCP Security Command Center findings
Must-have skills
Terraform — write modules from scratch, not just modify existing ones; HCL fluency, remote state backends (GCS), workspace management, and Terraform Cloud or Atlantis for GitOps CI/CD integration
GCP — 3+ years of hands-on production experience: Cloud Run, BigQuery, Cloud Build, IAM, VPC networking, Cloud Monitoring, Secret Manager, Artifact Registry; Google Cloud Associate Cloud Engineer or Professional Cloud DevOps Engineer certification strongly preferred
Docker — multi-stage builds, layer caching optimisation, distroless base images, image security scanning, and Artifact Registry management
CI/CD — Cloud Build or GitHub Actions: pipeline design from scratch, artifact versioning, environment-specific config management, and deployment gating strategies
Linux / bash — comfortable debugging inside running containers, writing shell automation scripts, managing file permissions and system resources
GCP networking — VPC design, subnet allocation, firewall rules, Private Service Connect, Cloud NAT, and DNS configuration for private service endpoints
Good to have
OpenTelemetry — collector configuration, exporter setup (Cloud Trace, Prometheus), and custom instrumentation for Python FastAPI services
Kubernetes / GKE — even if the current stack is Cloud Run, GKE knowledge is valuable for future scale requirements
Python scripting for infrastructure automation — Cloud Functions, custom Cloud Build steps, GCP Admin SDK scripts
Cloud cost management tooling — Looker Studio billing dashboards, Budget Alerts, committed use planning models, and BigQuery billing export analysis
Azure networking basics — enough to understand the cross-cloud connectivity between Azure Databricks and GCP services
Google Cloud Professional Cloud Security Engineer certification or equivalent security background