Site Reliability Engineer (SRE)

xAI • London, UK & Palo Alto, CA • 4d ago

About the Role

We’re looking for an experienced site reliability engineer (SRE) who can thrive in a dynamic start-up environment. The main responsibilities for this role are:

Improving our observability by adding and adjusting metrics
Building easily parsable dashboards
Building reliable alerts
Designing and overseeing our on-call rotations
Improving our deployment process to increase reliability

An ideal candidate meets at least the following requirements:

Expert in at least one programming language that compiles to machine code, such as Rust, C++, or Go (Rust or C++ experience is preferred)
Expert knowledge of monitoring technologies such as Prometheus, Grafana, and PagerDuty
Expert knowledge of deployment technologies such as Pulumi or Terraform
Expert knowledge of Kubernetes

Location

The role is based in our London office, close to Piccadilly Circus underground station. We usually work from the office five days a week but allow work-from-home days when required. Candidates must be willing to attend late meetings at least twice a week to coordinate with the rest of our team, which is based in California. This role includes semi-regular business trips to California. We are also open to hiring in our HQ office in Palo Alto, CA.

Interview process

After you submit your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 15-minute interview (“phone interview”), during which a member of our team will ask some basic questions. If you clear the phone interview, you will enter the main process, which consists of two technical interviews.

Our goal is to finish the process within one week. All interviews will be conducted via Google Meet.

Benefits

Competitive cash-based compensation
xAI equity
Private health and dental insurance
Unlimited time off subject to prior approval