ML Infrastructure Engineer
We are building AI physics models that don't just predict, but capture cause and effect in the climate system.
What You’ll Do
- Architect and operate distributed training clusters (e.g., 12+ nodes with 8 GPUs per node) on Kubernetes/GKE and cloud-native infrastructure
- Design scalable, efficient data pipelines for petabyte-scale datasets
- Implement and optimize model/data/pipeline parallelism across foundation models
- Deploy, monitor, and debug large-scale multi-node GPU training jobs using DDP, FSDP, and DeepSpeed
- Tune low-level system components (e.g., CUDA, NCCL, network interfaces) for maximum throughput
- Build cluster observability tools: failure detection, logging, monitoring, and autoscaling
- Collaborate with research and modeling teams to productionize experiments at scale
You’d Be Great If You
- Have deep hands-on experience with distributed training frameworks (FSDP, DeepSpeed, DDP)
- Know how to set up and debug Kubernetes/GKE GPU clusters, from CUDA to networking
- Are fluent in PyTorch and familiar with its performance quirks (e.g., data loading, sampler design)
- Have worked on ML infra at scale (multi-node, multi-GPU setups, 100B+ param models)
- Understand sampling techniques, data sharding, and performance tuning across the ML stack
- Can spot an NCCL timeout from a mile away and know how to fix it
- Value rapid iteration, ownership, and scaling up ambitious systems with a lean team