Job Summary
The Real Time Payments International team is looking for a Site Reliability Engineer (SRE) to drive application deployment readiness, manage day-to-day operational stability, and support the reliability of critical payment platforms. You will implement automation, apply best practices, and work with a high‑impact team responsible for production readiness, reliability, and DevOps automation across Mastercard platforms.
This role plays a key part in incident management, change readiness, and platform operations, while contributing to continuous improvement initiatives.
Key Responsibilities
Platform Operations & Stability
- Support end-to-end availability, monitoring, and performance of critical payment platforms.
- Execute operational processes to ensure platform health and stability.
- Participate in capacity checks, readiness validations, and environment monitoring.
Incident Management & Execution
- Actively manage and coordinate incident triage and resolution.
- Serve as incident commander for medium- to high-severity incidents.
- Ensure timely updates, accurate impact assessment, and appropriate escalation.
- Contribute to root cause analysis with clear identification of actions and ownership.
Change & Release Support
- Highlight gaps and help define the test cases a change requires in lower environments, and validate that lower-environment testing is complete.
- Ensure adherence to change governance processes (test case reviews, checklists, approvals, rollback readiness).
- Help create change plans and support the execution of production changes, deployments, and validations.
Technical Troubleshooting
- Perform hands-on troubleshooting across:
  - Application behaviour and dependencies.
  - Infrastructure components (compute, network, storage).
  - Database and performance issues.
- Collaborate with engineering, infrastructure and other technical teams to isolate and resolve issues efficiently.
Monitoring & Observability
- Improve system health monitoring using observability tools and alerts.
- Identify gaps in alerting and contribute to improving quality of alerting and dashboards.
- Ensure proactive detection of anomalies using observability tools.
Automation & Process Improvement
- Contribute to automation initiatives to reduce toil and errors.
- Identify repetitive operational tasks and drive improvements.
- Support implementation of DevOps best practices.
- Leverage AI-driven tools to improve monitoring, incident detection, and operational efficiency, enabling faster troubleshooting and reduced manual effort in day-to-day operations.
Stakeholder Coordination
- Work closely with engineering, program teams, and external partners during incidents and changes.
- Provide structured updates to stakeholders with clarity and consistency.
- Ensure alignment during critical activities.
Risk Identification
- Highlight operational and platform risks, including test coverage gaps, infrastructure constraints, and dependency risks.
- Escalate issues proactively and support mitigation tracking.
Team Contribution & Mentorship
- Support onboarding and guidance of junior team members.
- Contribute to runbooks, documentation, and knowledge sharing.
- Drive consistency in execution and adherence to operational standards.
Success in This Role Looks Like:
1. Deep Operational Ownership (“Built to Run” Mindset)
A successful Lead SRE Engineer is fully accountable for the operational health of their program, not just responsive to incidents.
- Monitoring, alerting, and dashboards that reflect real customer impact.
- Emergency response and incident leadership, including clear communications and post-incident follow‑ups.
- Capacity planning and readiness aligned with product and business growth.
- Change management discipline, ensuring safe, compliant releases.
2. Strong Technical & System-Level Understanding
A Lead SRE Engineer is expected to operate at the system and dependency level, not just the ticket or tool level.
- Strong understanding of application business logic and workflows.
- Clear grasp of upstream and downstream dependencies.
- Expertise in observability (alerts, dashboards, synthetic monitoring).
- Ability to drive automation to reduce manual toil and recurring issues.
- End-to-end ownership of tasks and activities.
3. Incident Leadership & Decision-Making Under Pressure
Beyond technical skill, Leads are distinguished by how they lead during high‑severity situations.
- Takes command of major incidents without waiting to be asked.
- Maintains calm, structured communication with engineering, product, and leadership.
- Balances speed vs risk in decision-making.
- Ensures clear ownership of actions, timelines, and follow‑ups.
- Drives root cause analysis and systemic fixes, not just recovery.
4. Proactive Risk & Reliability Engineering
A successful Lead SRE prevents incidents more often than they fight them.
- Identifies systemic risks before they become outages.
- Pushes for design, monitoring, or process improvements.
- Challenges “tribal knowledge” by insisting on documentation and runbooks.
- Drives improvements aligned with operational maturity models.
5. Leadership Without Formal Authority
Lead SRE Engineers often lead without being people managers, which requires strong influence skills.
- Mentors and coaches senior and mid-level SREs.
- Sets the technical and behavioural bar for the team.
- Gives clear, constructive feedback.
- Acts as a role model for ownership, urgency, and professionalism.
- Builds trust with Engineering, Product, and Platform teams.
- Is flexible with working hours where needed.
6. Excellent Cross‑Functional Communication
SRE Leads sit at the intersection of technology, operations, and business.
- Translates technical issues into business impact to communicate clearly with senior stakeholders during incidents.
- Sets expectations early and transparently with junior team members.
- Represents SRE confidently in planning, reviews, and retrospectives.
- Ensures post‑incident learnings are shared and acted upon.
7. Continuous Learning & Product Mastery
A Lead SRE Engineer is expected to continuously deepen product and platform knowledge.
- Actively closes knowledge gaps in their program by driving learning within the team.
- Stays current with platform changes, dependencies, and risks.
- Ensures knowledge is documented and reusable, not person‑dependent.
Skill Requirements
- Experience in production support, SRE, or BizOps roles.
- Exposure to managing incidents and supporting distributed systems.
- Experience in the payments ecosystem is preferred.
- Knowledge of monitoring and alerting tools such as Splunk, Dynatrace, and BlazeMeter.
- Knowledge of automation and DevOps practices.
- Demonstrated ability to design end‑to‑end CI/CD flows that deliver high‑quality software to production with minimal manual intervention, including centralized configuration and unified pipelines across environments.
- Experience working in cross-functional and high-pressure environments.
- Ability to organize, multi-task and prioritise work based on current business needs.
- Strong verbal and written communication skills.
- Strong relationship-building, collaboration, and stakeholder management skills.
- Experience in one or more scripting languages is preferred.
- Interest in designing, analysing and troubleshooting large-scale distributed systems.
- Ability to work with little or no supervision.
Technical Skills
- Operating system: Unix (commands and scripting).
- Database: Oracle or equivalent.
- DevOps: Chef, Jenkins, GitHub, or equivalent tools.
- Application support: experience supporting a Java-based application in a virtualized environment, with basic knowledge of VMs and hypervisors.
- Monitoring: Splunk, Dynatrace, or equivalent.
- Scripting: Shell or Python preferred; an equivalent language is acceptable.
- Experience supporting production systems (preferably in finance) is mandatory.