Impax Recruitment

Machine Learning Engineer

Impax Recruitment, San Francisco Bay Area

ML Infrastructure Engineer


We are building AI physics models that don't just predict but understand cause and effect within the climate system.


What You’ll Do


  • Architect and operate distributed training clusters (e.g. 12+ nodes, 8 GPUs per node) using GKE, Kubernetes, and cloud-native infra
  • Design scalable, efficient data pipelines for petabyte-scale datasets
  • Implement and optimize model/data/pipeline parallelism across foundation models
  • Deploy, monitor, and debug large-scale multi-node GPU training jobs using DDP, FSDP, and DeepSpeed
  • Tune low-level system components (e.g. CUDA, NCCL, network interfaces) for max throughput
  • Build cluster observability tools: failure detection, logging, monitoring, and autoscaling
  • Collaborate with research and modeling teams to productionize experiments at scale


You’d Be Great If You


  • Have deep hands-on experience with distributed training frameworks (FSDP, DeepSpeed, DDP)
  • Know how to set up and debug Kubernetes/GKE GPU clusters, from CUDA to networking
  • Are fluent in PyTorch and familiar with its performance quirks (e.g., dataset loading, sampler design)
  • Have worked on ML infra at scale (multi-node, multi-GPU setups, 100B+ param models)
  • Understand sampling techniques, data sharding, and performance tuning across the ML stack
  • Can spot a NCCL timeout from a mile away and know how to fix it
  • Value rapid iteration, ownership, and scaling up ambitious systems with a lean team

  • Seniority level: Not Applicable
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industries: Staffing and Recruiting and Software Development
