Job Summary
As a Subject Matter Expert in Support & Operations, you will play a pivotal role in ensuring the timely resolution of escalated incidents while adhering to quality norms and service level agreements (SLAs). Your expertise in Kubernetes, Ansible, and cloud technologies will be essential in driving customer satisfaction and operational excellence.
==========================================================================================================================
Job Title: Platform Site Reliability Engineer (SRE) – OpenShift (8+ Years)
Location: Bangalore / Kolkata / Pune
Band: L2
Role Summary
We are looking for a Platform SRE (8+ years) to engineer, run, and continuously improve an OpenShift-heavy container platform. This role combines Day‑1 responsibilities (platform setup, standardization, onboarding enablement) with Day‑2 operations (stability, upgrades, performance, incident management, and automation).
Key Responsibilities
Day‑1 (Build / Enablement)
- Support OpenShift platform onboarding: cluster setup assistance, baseline configurations, and environment readiness.
- Implement platform standards: namespaces/projects, RBAC/SCC, resource quotas/limits, routes/ingress patterns, and operator enablement.
- Create reusable deployment patterns using Helm (standard charts/templates, values structure, versioning).
- Build and standardize GitLab CI templates/pipelines for build-test-deploy and environment promotion.
- Develop automation using Ansible to enable repeatable provisioning/configuration workflows.
Day‑2 (Run / Operate / Optimize)
- Own cluster health and reliability: monitoring, capacity planning, scaling, patching and upgrades, and performance troubleshooting.
- Troubleshoot issues across OpenShift components, nodes, networking/storage basics, and workload behaviour.
- Participate in incident response: triage, mitigation, RCA, post-incident actions, and runbook/SOP improvements.
- Reduce operational toil through automation, improved alerts, and self-service enablement for application teams.
- Collaborate with stakeholders to improve security posture and operational governance (access controls, platform hygiene).
Mandatory Skills
- 8+ years’ experience in SRE / DevOps / Platform / Infrastructure Engineering
- Strong hands-on OpenShift Administration
- Helm (deployments + chart maintenance; chart authoring preferred)
- Ansible (playbooks/roles; automation mindset)
- Linux fundamentals (logs, processes, system services, basic networking)
- CI/CD with GitLab CI (pipelines, runners, templates, variables/secrets)
Good-to-Have Skills
- ArgoCD (GitOps)
- vSphere, NSX, VMware Cloud Foundation (VCF)
- Exposure to observability stacks (Prometheus/Grafana, ELK/EFK, Splunk, Datadog, etc.)
Traits We Value
- Strong troubleshooting, ownership, and production support mindset
- Comfortable operating in structured on-call rotations and handling high-severity incidents
- Good documentation habits (runbooks, SOPs, RCA notes)
=================================
Skill Requirements
2. Strong understanding of Ansible for automation and configuration management.
3. Familiarity with Red Hat Linux and Red Hat cluster technologies.
4. Knowledge of Azure cloud services and their integration with Kubernetes.
5. Excellent analytical and problem-solving skills, with a focus on customer satisfaction and operational efficiency.