Key Responsibilities
- Design, build, and maintain robust, scalable, and secure data pipelines for batch and real-time data processing.
- Develop and optimize ETL/ELT workflows to extract, transform, and load data from multiple sources.
- Architect and implement data warehouses, data lakes, and lakehouse solutions on cloud or on-prem platforms.
- Ensure data quality, lineage, governance, and versioning using metadata management tools.
- Collaborate with Data Scientists, Analysts, and Software Engineers to deliver reliable and accessible data solutions.
- Optimize SQL queries, data models, and storage layers for performance and cost efficiency.
- Develop and maintain automation scripts for data ingestion, transformation, and orchestration.
- Integrate and process large-scale data from APIs, flat files, streaming services, and legacy systems.
- Implement data security, access control, and compliance standards (e.g., GDPR and relevant ISO standards).
- Monitor and troubleshoot data pipeline failures, latency, and performance bottlenecks.
Data Engineering & Architecture
- Strong expertise in data modeling (dimensional/star/snowflake schemas) and data normalization techniques.
- Proficient in ETL/ELT tools such as Apache NiFi, Talend, Informatica, SSIS, or Airbyte.
- Advanced knowledge of SQL and distributed computing concepts.
- Experience with data lake and warehouse technologies such as Snowflake, Redshift, BigQuery, Azure Synapse, or Databricks.
- Deep understanding of data partitioning, indexing, and query optimization.
Big Data & Distributed Systems
- Hands-on experience with the Hadoop ecosystem (HDFS, Hive, HBase, Oozie, Sqoop).
- Proficiency in Apache Spark / PySpark for distributed data processing.
- Exposure to streaming frameworks like Kafka, Flink, or Kinesis.
- Familiarity with NoSQL databases such as MongoDB, Cassandra, or Elasticsearch.
- Knowledge of data versioning and catalog systems (e.g., Delta Lake, Apache Hudi, Iceberg, or AWS Glue Data Catalog).
Programming & Automation
- Strong programming skills in Python, Scala, or Java for data manipulation and ETL automation.
- Experience with API integration, REST/GraphQL, and data serialization formats (JSON, Parquet, Avro, ORC).
- Proficient in shell scripting, automation, and orchestration tools (Apache Airflow, Prefect, or Luigi).
Cloud Platforms
- Expertise in at least one cloud ecosystem:
AWS: S3, Redshift, Glue, EMR, Lambda, Athena, Kinesis
Azure: Data Factory, Synapse, Blob Storage, Databricks
GCP: BigQuery, Dataflow, Pub/Sub, Cloud Composer
- Strong understanding of IAM, VPC, encryption, and data access policies within cloud environments.
Data Governance & Security
- Experience implementing and enforcing data quality frameworks (DQ checks, profiling, validation rules).
- Knowledge of metadata management, lineage tracking, and master data management (MDM).
- Familiarity with role-based access control (RBAC) and data encryption mechanisms.
Preferred Skills
- Experience with machine learning data pipelines (MLOps) or feature store management.
- Knowledge of containerization and orchestration tools (Docker, Kubernetes).
- Familiarity with CI/CD pipelines for data deployment.
- Exposure to business intelligence (BI) tools like Power BI, Tableau, or Looker for data delivery.
- Understanding of data mesh or domain-driven data architecture principles.
Leadership & Collaboration
- Work closely with cross-functional teams to define data requirements and best practices.
- Mentor junior engineers and enforce coding and documentation standards.
- Provide technical input on data strategy, architecture reviews, and technology evaluations.
- Collaborate with security and compliance teams to ensure data integrity and protection.
Qualifications
- Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field.
- 5–10 years of professional experience as a Data Engineer or similar role.
- Preferred professional certifications:
AWS Certified Data Analytics – Specialty (formerly Big Data Specialty)
Microsoft Certified: Azure Data Engineer Associate
Google Professional Data Engineer
Databricks Certified Data Engineer