Job Summary
Support engineers will onboard into the AIDA ecosystem to ensure high availability and reliability of all AIDA features across a global user base. The engineer(s) will be responsible for proactive monitoring, incident triage, troubleshooting, and resolution, and will also deliver small enhancements and bug fixes across Python services, web applications, data/analytics pipelines, and cloud infrastructure.
This role requires a strong production-support mindset, hands-on troubleshooting skills, and the ability to work in a shift-based model for near-24x7 coverage.
Key Responsibilities
Production Monitoring & Uptime (Primary)
Monitor all features/services within the AIDA ecosystem to ensure uptime, stability, and performance
Perform proactive checks on health dashboards, logs, alerts, and metrics; identify issues before users are impacted
Own incident management (triage → diagnosis → mitigation → resolution) and drive service restoration within SLA
Conduct root cause analysis (RCA) and implement corrective/preventive actions (CAPA)
Maintain and improve runbooks, SOPs, knowledge articles, and on-call procedures
Support Operations (L2/L3)
Handle support tickets and production issues, including functional issues, system errors, integration failures, and data pipeline disruptions
Manage escalations effectively, coordinating with product/engineering, infrastructure, and vendor teams as required
Track reliability KPIs (availability, MTTR, incident trends) and contribute to continuous improvement
Enhancements & Fixes (Secondary)
Deliver small enhancements, configuration changes, and bug fixes in:
Python services/scripts and automation
Web UI components (Angular/React) and/or backend (.NET where applicable)
Data workflows (Dataiku pipelines/recipes, job scheduling)
AWS infrastructure/app services
Release & Change Support
Support deployments from lower environments to production, validate smoke tests, and assist in release readiness activities
Ensure proper change documentation and rollback readiness for production changes
Skill Requirements
Required Technical Skills
Python: troubleshooting, scripting, API debugging, automation
Web/App Support: exposure to Angular or React, and understanding of web app troubleshooting (frontend/backend)
Backend exposure: knowledge of .NET is desirable (or the ability to troubleshoot service-side issues)
Dataiku: ability to monitor and troubleshoot Dataiku jobs and pipelines, including failures and scheduling issues
AWS: working knowledge of cloud monitoring, logs, and typical services (e.g., IAM, EC2/ECS/EKS, S3, CloudWatch, Lambda, depending on the stack in use)
Observability & Support Tools: experience with logging/monitoring, alerting systems, ticketing tools (ServiceNow/Jira), and dashboards