About the Role
We’re looking for a Research Scientist who blends frontier research curiosity with engineering discipline.
You’ll work at the core of our research efforts, training state-of-the-art models and building training infrastructure.
This role is ideal for someone who thrives in high-performance environments, understands the nuances of training large models, and is obsessed with making experimentation fast, reproducible, and reliable.
What You’ll Do
Own and maintain a modular, high-quality PyTorch training codebase
Design and build training workflows for scaling, checkpointing, logging, and reproducibility
Implement new ideas, debug training runs, and accelerate iteration
Develop and maintain efficient data loading pipelines and training utilities
Ensure training jobs can scale across multiple GPUs and nodes (e.g., with DDP, NCCL)
Optimize model training for performance, stability, and hardware utilization
Maintain long-term code health: organize modules, enforce standards, write clean and testable code
Contribute to experiment tracking, reproducibility, and versioning infrastructure
You Should Have
Deep expertise in PyTorch, including custom modules, loss functions, and distributed training
Proven experience training deep learning models in real-world research or production settings
Strong engineering skills in Python (and optionally C++ for performance-critical components)
Experience working with large datasets, complex pipelines, and real-world debugging
Understanding of training dynamics: what goes wrong, and how to fix it
Familiarity with job launchers, logging tools (e.g., Weights & Biases, TensorBoard), and checkpointing systems
A mindset of engineering rigor applied to research — readable code, thoughtful design, and reproducibility
Bonus Points For
Experience with TorchScript, ONNX, or custom inference runtimes
Contributions to PyTorch or open-source ML tooling
Experience working on transformer models, diffusion models, or large-scale vision/NLP tasks
Familiarity with batch schedulers (SLURM), cluster environments, and GPU resource management
Ability to collaborate closely with systems engineers or MLOps teams to ensure smooth integration
Why Join Us
Collaborate with a world-class research team on meaningful, high-impact projects
Own and shape the core training code infrastructure used daily by the team
Work on real models, real data, and real scale — not toy problems
Help bridge the gap between research velocity and engineering quality
Flexible work environment with a culture that values depth, clarity, and curiosity