Job Summary
This requirement calls for a modern, AWS-native Data Engineering/Lakehouse profile rather than a traditional ETL/MSBI-only profile.
The customer is looking for candidates who can help build and scale a native AWS Lakehouse platform handling very large-scale datasets, IoT/device feeds, streaming pipelines, unified data models, and analytical workloads. The expected core skills center on the AWS-native data engineering stack, including:
• AWS Glue (ETL + Catalog)
• EMR / Spark-based distributed processing. Spark on AWS EMR and Glue is commonly written in PySpark, the Python API for Apache Spark, so Python knowledge is expected for data engineering and Spark pipelines rather than for application development.
• S3-based Lakehouse architecture
• Step Functions (pipeline orchestration)
• Redshift / Athena for analytics
• CDC/data ingestion patterns
• Apache Iceberg / open table format concepts
• Kinesis / Firehose for streaming ingestion
The major pain points the customer is trying to solve are:
• Large-scale data migration and transformation
• Unifying multiple application datasets/entities into a common analytical model
• Handling billions of rows of IoT/device/solar/wind farm data efficiently
• Building scalable analytical and reporting architecture on AWS
• Incremental/Agile modernization using CDC and a phased migration strategy
Traditional ETL/Data Warehouse experience is helpful as a foundation, but profiles should also demonstrate strong AWS-native cloud data engineering capabilities and exposure to distributed processing in order to match the customer's actual expectations.