Responsibilities
• Integrate data from multiple sources, such as databases, APIs, or streaming platforms, to provide a unified view of the data
• Implement data quality checks and validation processes to ensure the accuracy, completeness, and consistency of data
• Identify and resolve data quality issues, monitor data pipelines for errors, and implement data governance and data quality frameworks
• Enforce data security and compliance with relevant regulations and industry-specific standards
• Implement data access controls, encryption mechanisms, and monitor data privacy and security risks
• Optimize data processing and query performance by tuning database configurations, implementing indexing strategies, and leveraging distributed computing frameworks
• Optimize data structures for efficient querying and develop data dictionaries and metadata repositories
• Identify and resolve performance bottlenecks in data pipelines and systems
• Collaborate with cross-functional teams, including data scientists, analysts, and business stakeholders
• Document data pipelines, data schemas, and system configurations, making it easier for others to understand and work with the data infrastructure
• Monitor data pipelines, databases, and data infrastructure for errors, performance issues, and system failures
• Set up monitoring tools, alerts, and logging mechanisms to proactively identify and resolve issues, ensuring the availability and reliability of data
• A software engineering background is a plus
Requirements
• Bachelor's or Master's degree in computer science, information technology, data engineering, or a related field
• Strong knowledge of databases, data structures, and algorithms
• Proficiency with data engineering tools and technologies, including data integration tools (e.g., Apache Kafka, Azure IoT Hub, Azure Event Hubs), ETL/ELT frameworks (e.g., Apache Spark, Azure Synapse), big data platforms (e.g., Apache Hadoop), and cloud platforms (e.g., Amazon Web Services, Google Cloud Platform, Microsoft Azure)
• Expertise in working with relational databases (e.g., MySQL, PostgreSQL, Azure SQL) and analytical data stores (e.g., Azure Data Explorer), as well as data warehousing concepts
• Familiarity with data modeling, schema design, indexing, and optimization techniques is valuable for building efficient and scalable data systems
• Proficiency in languages such as Python, SQL, KQL, Java, and Scala
• Experience with scripting languages like Bash or PowerShell for automation and system administration tasks
• Strong knowledge of data processing frameworks like Apache Spark, Apache Flink, or Apache Beam for efficiently handling large-scale data processing and transformation tasks
• Understanding of data serialization formats (e.g., JSON, Avro, Parquet) and the libraries that implement them is valuable
• Experience with CI/CD and GitHub, demonstrating the ability to work in a collaborative and iterative development environment
• Experience with visualization tools (e.g., Power BI, Plotly, Grafana, Redash) is beneficial
Preferred Skills & Characteristics
• A goal-oriented, self-motivated professional who consistently displays independent work habits, a growth mindset, and a ‘can do’ attitude
• Self-driven and proactive in keeping up with new technologies and programming practices