Full-time | On-site | San Francisco | Founded in 2024
Compensation: $165K - $250K
About The Role
As a Research Scientist at a seed-stage AI startup, you'll be at the forefront of developing benchmarks and evaluation methodologies for large language models. Your work will directly impact how cutting-edge AI systems are tested, validated, and deployed in enterprise environments.
Key Responsibilities
- Evaluate newly released AI models (e.g., DeepSeek, Gemini).
- Design and build new benchmarks from scratch, including dataset construction, hiring labelers, and authoring white papers.
- Enhance methodologies for automated evaluation of generated text.
- Work closely with engineering teams to implement and scale evaluation frameworks.
- Collaborate with leading AI labs and enterprise customers to refine evaluation strategies.
Who We're Looking For
- 2+ years of experience in applied AI, with a focus on benchmarking, evaluation methodologies, or language models.
- Experience designing and developing new evaluation methodologies is highly valued.
Tech Stack
- Backend: Django
- Infrastructure: AWS
- Frontend: React with TypeScript (TSX)
Perks & Benefits
- Equity: 0.3% - 5% (flexible for the right candidate).
- Visa sponsorship available.
- Excellence is well rewarded.
- Relocation and transportation support.
- Health/dental insurance coverage.
- Lunch and dinner provided, free snacks/coffee/drinks.
- Unlimited PTO.
- Friday happy hours with friends and community members.
- Occasional team outings like rock climbing, hiking, and bowling.
About The Company
Current AI model benchmarks are inadequate for real-world applications. This company provides industry-specific performance evaluations for language models, ensuring models are tested on the exact tasks where they will be deployed.
They have built proprietary evaluation infrastructure that enables large-scale assessment of any LLM. Their platform collects expert review criteria and applies them to general and task-specific LLM applications, delivering actionable insights into model performance.