Job Summary
We are seeking a highly motivated Data Platform Engineer with expertise in Big Data technologies, Cloud Platforms, Kubernetes, and Site Reliability Engineering (SRE) to design, build, manage, and optimize large-scale distributed data platforms. The ideal candidate will have strong experience supporting Hadoop and Spark ecosystems, managing cloud-based infrastructure, improving platform reliability, and enabling real-time data processing at scale.
Key Responsibilities
Big Data Platform Administration
• Design, deploy, and manage large-scale Hadoop clusters supporting enterprise data workloads.
• Administer and optimize Hadoop ecosystem components including HDFS, YARN, Hive, HBase, Spark, Kafka, and Oozie.
• Ensure high availability, scalability, and performance of distributed data platforms.
• Support data ingestion, storage, processing, and analytics workloads across batch and streaming environments.
Spark & Data Processing
• Develop, optimize, and troubleshoot Spark (PySpark) applications for large-scale batch and streaming workloads.
• Perform Spark performance tuning through partitioning, caching, memory optimization, and executor configuration.
• Analyze and resolve Spark job failures, resource bottlenecks, and performance issues.
• Support Spark Streaming applications integrated with Kafka.
Cloud & Infrastructure Engineering
• Build and manage cloud infrastructure on AWS using EC2, S3, IAM, VPC, CloudWatch, RDS, and related services.
• Automate infrastructure provisioning and management using Terraform and Infrastructure-as-Code principles.
• Implement scalable, secure, and resilient cloud architectures.
Kubernetes & Containerization
• Containerize applications using Docker and deploy workloads on Kubernetes.
• Manage Kubernetes clusters, deployments, services, pods, and Helm charts.
• Monitor cluster health, troubleshoot workload issues, and optimize resource utilization.
• Support Spark and data workloads running on Kubernetes platforms.
Site Reliability Engineering (SRE)
• Implement SRE best practices to improve platform reliability, availability, and operational efficiency.
• Define and monitor SLIs, SLOs, and reliability metrics.
• Participate in incident response, root cause analysis (RCA), and post-incident reviews.
• Drive continuous improvements to reduce MTTR and prevent recurring issues.
Monitoring & Observability
• Build and maintain monitoring, alerting, and logging solutions using:
  o Grafana
  o ELK Stack
  o Splunk
  o CloudWatch
  o Nagios
  o Datadog
• Create dashboards and proactive alerting mechanisms to ensure system health and performance.
DevOps & Automation
• Develop CI/CD pipelines using Jenkins, Git, and related DevOps tools.
• Automate operational tasks using Python and Shell scripting.
• Implement deployment automation and configuration management practices.
Security & Governance
• Implement security controls using IAM, RBAC, Apache Ranger, and network security best practices.
• Ensure compliance with organizational security and governance standards.
• Support secure access management and data protection initiatives.
Skill Requirements
• Bachelor's or Master's degree in Computer Science, Information Technology, Data Engineering, or a related field.
• 5+ years of experience in Data Engineering, Platform Engineering, SRE, or Big Data Administration.
• Strong experience with:
  o Hadoop Ecosystem (HDFS, YARN, Hive, HBase, Oozie)
  o Apache Spark (PySpark, Spark Streaming)
  o Apache Kafka
  o Kubernetes and Docker
  o AWS Cloud Platform
  o Terraform
  o Linux Administration
• Experience with monitoring and observability platforms.
• Strong troubleshooting and incident management skills.
• Proficiency in Python and Shell scripting.
Technical Skills
Cloud Platforms
• AWS (EC2, S3, IAM, VPC, CloudWatch, RDS)
• Microsoft Azure
Big Data Technologies
• Hadoop (HDFS, YARN, MapReduce)
• Apache Spark (PySpark, Spark Streaming)
• Kafka
• Hive
• HBase
• Apache Druid
• Oozie
Container & Orchestration
• Docker
• Kubernetes
• Helm
DevOps & Automation
• Terraform
• Jenkins
• Git
• Chef
Monitoring & Logging
• Grafana
• ELK Stack
• Splunk
• CloudWatch
• Nagios
• Datadog
Programming
• Python
• Shell Scripting
Operating Systems
• Linux (RHEL, CentOS, Ubuntu)
• Unix
Methodologies
• Agile/Scrum
• DevOps
• Site Reliability Engineering (SRE)
Other Requirements
• Experience with Apache Druid for real-time analytics.
• Exposure to Azure cloud services.
• Experience implementing SRE frameworks and reliability engineering practices.
• Knowledge of CI/CD and DevOps methodologies.
• Experience supporting large-scale production data platforms.