Job description
AI-Driven Big Data Engineer
Employment Type: Full-Time
Location: Remote, Singapore
Level: Entry to Mid Level (PhD Required)
Bridge Cutting-Edge AI Research with Petabyte-Scale Data Systems
Pixalate is an online trust and safety platform that protects businesses, consumers, and children from deceptive, fraudulent, and non-compliant mobile apps, CTV apps, and websites.
We're seeking a PhD-level Big Data Engineer to revolutionize how AI transforms massive-scale data operations.
Our impact is real and measurable.
Our software has uncovered:
+ Gizmodo: An iCloud Feature Is Enabling a $65 Million Scam (https://gizmodo.com/apple-icloud-private-relay-ad-fraud-scam-research-1849803510)
+ Washington Post: Your kids' apps are spying on them (https://www.washingtonpost.com/technology/2022/06/09/apps-kids-privacy/)
+ ProPublica: Porn, Piracy, Fraud: What Lurks Inside Google's Black Box Ad Empire (https://www.propublica.org/article/google-display-ads-piracy-porn-fraud)
About the Role
You'll work at the intersection of big data and AI, developing intelligent, self-healing data systems that process trillions of data points daily.
You'll have autonomy to pursue research in distributed ML systems and AI-enhanced data optimization, with your innovations deployed at unprecedented scale within months, not years.
This isn't traditional data engineering - you'll implement agentic AI for autonomous pipeline management, leverage LLMs for data quality assurance, and create ML-optimized architectures that redefine what's possible at petabyte scale.
Key Research Areas & Responsibilities
AI-Enhanced Data Infrastructure
+ Design intelligent pipelines with autonomous optimization and self-healing capabilities using agentic AI
+ Implement ML-driven anomaly detection for terabyte-scale datasets
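For illustration only, a minimal sketch of what ML-driven anomaly detection over traffic data can look like, using scikit-learn's IsolationForest on a toy pandas frame. The column names (impressions, clicks, ctr) and the contamination setting are hypothetical; at terabyte scale this logic would run on a distributed engine such as Spark rather than in pandas.

import pandas as pd
from sklearn.ensemble import IsolationForest

def flag_anomalous_traffic(df: pd.DataFrame) -> pd.DataFrame:
    # Fit an Isolation Forest over a few hypothetical traffic features and
    # mark rows the model scores as outliers (fit_predict returns -1 for outliers).
    features = df[["impressions", "clicks", "ctr"]]
    # contamination is set high only because this toy sample has four rows.
    model = IsolationForest(contamination=0.25, random_state=42)
    df["is_anomaly"] = model.fit_predict(features) == -1
    return df

# Toy data: the last row is a deliberately exaggerated spike.
sample = pd.DataFrame({
    "impressions": [1000, 1200, 980, 50000],
    "clicks": [30, 36, 29, 12000],
    "ctr": [0.03, 0.03, 0.0296, 0.24],
})
print(flag_anomalous_traffic(sample))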
Distributed Machine Learning at Scale
+ Build distributed ML pipelines
+ Develop real-time feature stores for billions of transactions
+ Optimize feature engineering with AutoML and neural architecture search
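As a rough, non-authoritative sketch of the AutoML-style search named in the previous bullet, the snippet below runs a small Optuna study over a hypothetical random-forest search space on synthetic data; real feature-engineering or NAS searches would target distributed training and far larger spaces.

import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for real feature data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

def objective(trial: optuna.Trial) -> float:
    # Hypothetical, deliberately small search space.
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 3, 12)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)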
Required Qualifications
Education & Research
+ PhD in Computer Science, Data Science, or Distributed Systems (exceptional Master's with research experience considered)
+ Published research or expertise in distributed computing, ML infrastructure, or stream processing
Technical Expertise
+ Core Languages: Expert SQL (window functions, CTEs; a brief example follows this list), Python (Pandas, Polars, PyArrow), Scala/Java
+ Big Data Stack: Spark 3.5+, Flink, Kafka, Ray, Dask
+ Storage & Orchestration: Delta Lake, Iceberg, Airflow, Dagster, Temporal
+ Cloud Platforms: GCP (BigQuery, Dataflow, Vertex AI), AWS (EMR, SageMaker), Azure (Databricks)
+ ML Systems: MLflow, Kubeflow, Feature Stores, Vector Databases, scikit-learn with search CV utilities (e.g., GridSearchCV/RandomizedSearchCV), H2O AutoML, auto-sklearn, GCP Vertex AI AutoML Tables
+ Neural Architecture Search: KerasTuner, AutoKeras, Ray Tune, Optuna, PyTorch Lightning + Hydra
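To make the SQL expectation under Core Languages concrete, here is a small, hypothetical example of a CTE combined with a window function, executed through Spark SQL; the table and column names are invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("windowing-sketch").getOrCreate()

# Hypothetical daily traffic table.
spark.createDataFrame(
    [("app_a", "2024-01-01", 100), ("app_a", "2024-01-02", 140),
     ("app_b", "2024-01-01", 90), ("app_b", "2024-01-02", 60)],
    ["app_id", "day", "impressions"],
).createOrReplaceTempView("daily_traffic")

# CTE + LAG window function: day-over-day change in impressions per app.
spark.sql("""
    WITH ranked AS (
        SELECT app_id, day, impressions,
               LAG(impressions) OVER (PARTITION BY app_id ORDER BY day) AS prev_impressions
        FROM daily_traffic
    )
    SELECT app_id, day, impressions, impressions - prev_impressions AS day_over_day
    FROM ranked
""").show()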
Research Skills
+ Track record with 100TB+ datasets
+ Experience with lakehouse architectures, streaming ML, and graph processing at scale
+ Understanding of distributed systems theory and ML algorithm implementation
Preferred Qualifications
+ Experience applying LLMs to data engineering challenges
+ Ability to translate complex AutoML/NAS research into practical production workflows
+ Hands-on project examples of feature engineering automation or NAS experiments
+ Proven success in automating ML pipelines, from raw data to an optimized model architecture
+ Contributions to Apache projects (Spark, Flink, Kafka)
+ Knowledge of privacy-preserving techniques and data mesh architectures
What Makes This Role Unique
You'll work with one of the few truly petabyte-scale production datasets outside of major tech companies, with the freedom to experiment with cutting-edge approaches.
Unlike traditional big data roles, you'll apply the latest AI research to fundamental data challenges - from using LLMs to understand data quality issues to implementing agentic systems that autonomously optimize and heal data pipelines.
Required Skill Profession: Other General