Full-time | On-site | San Francisco | Founded in 2024
Compensation: $165K - $250K
About The Role
As a Research Scientist at a seed-stage AI startup, you'll be at the forefront of developing benchmarks and evaluation methodologies for large language models. Your work will directly impact how cutting-edge AI systems are tested, validated, and deployed in enterprise environments.
Key Responsibilities
- Evaluate newly released AI models (e.g., DeepSeek, Gemini).
- Design and build new benchmarks from scratch, including dataset construction, hiring labelers, and authoring white papers.
- Enhance methodologies for automated evaluation of generated text.
- Work closely with engineering teams to implement and scale evaluation frameworks.
- Collaborate with leading AI labs and enterprise customers to refine evaluation strategies.
Who We're Looking For
- 2+ years of experience in applied AI, with a focus on benchmarking, evaluation methodologies, or language models.
- Experience designing and developing new evaluation methodologies is highly valued.
Tech Stack
- Backend: Django
- Infrastructure: AWS
- Frontend: React with TypeScript (TSX)
Perks & Benefits
- Equity: 0.3% - 5% (flexible for the right candidate).
- Visa sponsorship available.
- Excellence is well rewarded.
- Relocation and transportation support.
- Health/dental insurance coverage.
- Lunch and dinner provided, free snacks/coffee/drinks.
- Unlimited PTO.
- Friday happy hours with friends and community members.
- Occasional team outings like rock climbing, hiking, and bowling.
About The Company
Current AI model benchmarks are inadequate for real-world applications. This company provides industry-specific performance evaluations for language models, ensuring models are tested on the exact tasks where they will be deployed.
They have built proprietary evaluation infrastructure that enables large-scale assessment of any LLM. Their platform collects expert review criteria and applies them to general and task-specific LLM applications, delivering actionable insights into model performance.