Job Summary
We are looking for a Systems Storage Site Reliability Engineer (SRE) to support and scale our global storage platforms. This is a contractor position focused on applying SRE principles to storage systems: improving reliability, reducing operational toil, and enabling sustainable growth through automation and observability.
You will work at the intersection of storage engineering and reliability engineering, partnering closely with infrastructure and application teams to operate production systems at scale.

What You’ll Do
Own the reliability, availability, and performance of production NAS and/or Object Storage services.
Apply SRE principles to storage platforms: define reliability goals, improve observability, and reduce manual operational work through automation.
Design and build automation and Infrastructure‑as‑Code to manage storage systems at scale.
Lead troubleshooting and resolution of complex storage incidents; participate in on‑call and incident response.
Perform capacity planning, forecasting, and demand modeling to support business growth.
Partner with engineering teams to support application onboarding, testing, and production readiness.
Contribute to global storage initiatives, including lab and infrastructure deployments.
Create and maintain runbooks, documentation, and operational best practices to improve team efficiency.

What We’re Looking For
8+ years of experience in SRE, infrastructure automation, or platform engineering, with strong storage exposure.
Hands‑on experience operating NAS and/or Object Storage platforms, such as Lustre or Ceph, in production.
Strong proficiency with automation and IaC tools (e.g., Ansible, Terraform, Puppet, SaltStack).
Experience running highly available, scalable systems in 24×7 environments.
Familiarity with containers and orchestration (Docker, Kubernetes).
Experience with CI/CD pipelines, monitoring, logging, and version control systems (Git, Perforce).
Strong incident management, troubleshooting, and communication skills.
Bachelor’s degree in Computer Science, Engineering, or a related field.

Nice to Have
Experience with large‑scale distributed systems.
Strong understanding of SRE concepts such as SLIs, SLOs, error budgets, observability, and logging.
Ability to debug and optimize infrastructure and automate repetitive workflows.
Proven ability to work independently and deliver results as a contractor in a global team environment.

Key Responsibilities
Automation
Skill Requirements
NAS, Storage Platforms, Lustre, Ceph, CI/CD
Other Requirements
Kubernetes, Docker