Job Summary
Job Description: Senior Data Engineer (PySpark / Dataproc / GCP)
We are looking for a Senior Data Engineer with strong hands-on expertise in Python (PySpark) and Google Cloud Dataproc to design, develop, and operate scalable data pipelines on Google Cloud Platform. This role focuses on building reliable, production-grade data solutions across batch and streaming use cases.
Key Responsibilities
• Design, build, and optimize data pipelines using PySpark on Dataproc
• Develop performant, maintainable Spark jobs using Python, with a strong focus on reliability and cost efficiency
• Manage Dataproc clusters, including provisioning, tuning, autoscaling, and ephemeral cluster usage
• Design end-to-end data architectures from ingestion to analytics and downstream consumption
• Collaborate with data consumers, platform teams, and stakeholders to deliver scalable solutions
• Ensure data quality, observability, and operational excellence in production environments
Skill Requirements
Required Skills & Experience
Core Skills: PySpark & Dataproc
• Strong expertise in Python, with extensive hands-on experience using PySpark
• Deep experience developing, tuning, and optimizing Spark batch and streaming workloads
• Practical experience with Google Cloud Dataproc, including:
  ◦ Cluster lifecycle management
  ◦ Initialization actions and custom configurations
  ◦ Autoscaling policies and cost optimization
  ◦ Use of ephemeral clusters for job-based execution
• Solid understanding of Spark internals (execution plans, caching, partitions, joins, shuffles, checkpointing)
Google Cloud Platform (GCP)
• Strong working experience with core GCP services, including:
  ◦ BigQuery for analytics and data warehousing
  ◦ Google Cloud Storage (GCS) as a data lake
  ◦ Cloud Run for containerized data services and microservices
  ◦ Cloud SQL for relational and transactional workloads
  ◦ Pub/Sub for event-driven and streaming ingestion
• Familiarity with IAM, service accounts, and secure service-to-service communication
Programming Languages
• Advanced proficiency in Python for production data pipelines
• Experience with Scala and/or Java for Spark development is a plus
• Ability to write clean, testable, and well-documented code
Data Storage & Processing
• Proven experience designing data lakes on GCS, including:
  ◦ Partitioning strategies and lifecycle management
  ◦ Optimized file formats such as Parquet and Avro
• Strong experience integrating Spark pipelines with BigQuery
• Knowledge of data modeling concepts for analytics and reporting
Workflow Orchestration
• Experience orchestrating pipelines using:
  ◦ Apache Airflow (Cloud Composer), or
  ◦ Native Dataproc job submissions and workflow templates
• Familiarity with monitoring, alerting, retries, and dependency management
Data Pipeline Design
• Strong experience designing and developing end-to-end data pipelines
• Ability to build scalable, fault-tolerant, and maintainable systems
• Hands-on experience implementing data validation, error handling, logging, and monitoring
• Experience working with both batch and streaming processing patterns
Streaming & Event-Driven Processing
• Hands-on experience with streaming data pipelines
• Practical understanding of event-based ingestion and near real-time processing