ML Infrastructure Engineer
We are building AI physics models that don't just predict, but capture cause and effect in the climate system.
What You’ll Do
- Architect and operate distributed training clusters (e.g., 12+ nodes with 8 GPUs per node) on Kubernetes/GKE and cloud-native infrastructure
- Design scalable, efficient data pipelines for petabyte-scale datasets
- Implement and optimize model/data/pipeline parallelism across foundation models
- Deploy, monitor, and debug large-scale multi-node GPU training jobs using DDP, FSDP, and DeepSpeed
- Tune low-level system components (e.g., CUDA, NCCL, network interfaces) for maximum throughput
- Build cluster observability tools: failure detection, logging, monitoring, and autoscaling
- Collaborate with research and modeling teams to productionize experiments at scale
You’d Be Great If You
- Have deep hands-on experience with distributed training frameworks (FSDP, DeepSpeed, DDP)
- Know how to set up and debug Kubernetes/GKE GPU clusters, from CUDA to networking
- Are fluent in PyTorch and familiar with its performance quirks (e.g., data loading, sampler design)
- Have worked on ML infra at scale (multi-node, multi-GPU setups, 100B+ param models)
- Understand sampling techniques, data sharding, and performance tuning across the ML stack
- Can spot an NCCL timeout from a mile away and know how to fix it
- Value rapid iteration, ownership, and scaling up ambitious systems with a lean team