Job Summary
Role: Core Data Engineer (AWS)
Summary: You will design, build, and operate secure, reliable, and
cost-efficient data pipelines on AWS, covering batch, streaming, and CDC ingestion
through to curated datasets ready for analytics. You will embed testing,
governance, observability, and CI/CD across the data lifecycle.
Key Responsibilities
• Ingestion and CDC: Implement pipelines using AWS Lambda and EMR for
batch/streaming (future need); manage schema evolution and backfills.
• Orchestration: Build workflows with Step Functions and/or MWAA
(Airflow), including retries, alerts, and SLAs.
• Lakehouse and warehouse: Model datasets on S3 (Parquet/Iceberg) with Glue
Data Catalog and Athena; handle partitioning, compaction, and performance tuning.
• Data quality and testing: Add automated tests (Great Expectations/Deequ,
PySpark unit tests), data contracts, and CI gates.
• Security and governance: Implement IAM, KMS, VPC endpoints, Lake
Formation policies, PII handling, audit trails (CloudTrail), and RBAC/RLS.
• Observability and FinOps: Set up CloudWatch metrics/alerts, lineage, and usage
dashboards; drive cost optimisation (S3 lifecycle, compression, job sizing).
• CI/CD and IaC: Provision with Terraform/CloudFormation/CDK; build
pipelines with GitHub Actions; environment promotion and rollbacks.
• Documentation and runbooks: Maintain pipeline diagrams, SLAs/SLOs, and
incident playbooks.
• Adoption: Adopt capabilities from other teams rather than building
duplicates, and contribute back to the community.
Outcomes (first 60–90 days)
• Productionise one end-to-end pipeline with tests, monitoring, and
alerting.
• Establish a governance baseline (Lake Formation, tagging/classification,
encryption) and data contracts for two key sources.
• Stand up CI/CD and IaC for data services; reduce at least one cost
driver via storage/compute optimisation.
Skill Requirements
Skills and experience
• 10+ years building data platforms on AWS; strong Python (incl. PySpark)
and SQL.
• Hands-on with DMS, Kinesis/MSK, Glue/EMR, Lambda, Step Functions/MWAA,
S3/Parquet/Iceberg, Glue Catalog, Athena, Redshift/Redshift Serverless.
• IaC (Terraform) and CI/CD (GitHub Actions).
• Data quality frameworks (Great Expectations/Deequ), schema evolution,
backfills.
• Security and compliance: IAM/KMS, private networking, Lake Formation,
audit/lineage.
• Performance and cost tuning across storage and compute.
Nice to have
• Snowflake, Apache Flink, Redshift Spectrum, OpenLineage, OPA/policy-as-code.