Job Summary
Role Summary
The Operations SME is responsible for the day-to-day availability, reliability, and performance of production systems, and ensures that site reliability engineering (SRE) practices are embedded in operational processes.
Key Responsibilities
• Maintain and operate production applications, infrastructure, and databases
• Implement and manage monitoring, alerting, and performance tools
• Perform incident response, troubleshooting, and root cause analysis (RCA)
• Execute hardware and software upgrades
• Conduct capacity planning and performance analysis
• Maintain runbooks, SOPs, and operational documentation
• Support cloud, enterprise, SaaS, COTS, and legacy platforms
Must Have
• 6+ years of experience in production operations
• Experience managing VMs, servers, networks, and applications
• Strong experience with monitoring, logging, and alerting
• Understanding of reliability metrics and operational KPIs
• Proficiency in scripting (Python, Bash, Groovy, Go)
• Understanding of container platforms
• Familiarity with ITSM processes
Good to Have
• Exposure to Java or .NET application architecture
• CI/CD and automation experience
• Knowledge of Chaos Engineering
• SRE certification