Vals Fellowship

Monday, May 25, 2026 8:30 AM
Tuesday, June 30, 2026 5:30 PM

At a glance

Duration: 3–6 months
Location: We are excited to have fellows work in person in closer partnership with the team, but we are also happy to support a more remote arrangement with regular check-ins.
Focus: hard, unsolved problems in AI evaluation with real frontier models + real customers
Apply here: Vals Fellowship Application
Deadline: Apply by June 30, 2026

Why this fellowship exists

Most AI evaluation today is a mess. Benchmarks saturate, and domain experts disagree on what “correct” even means. Real-world performance often diverges from leaderboard numbers in ways nobody can predict.

The field needs better methodology, better measurement, better infrastructure, and better theory. Most of that work isn’t getting done because the people best positioned to do it (PhD students or academics) usually don’t have access to frontier models, customer-grade evaluation problems, sufficient API credits/funding, or the engineering infrastructure to run experiments at scale.

Vals builds evaluations for AI in legal, finance, science, and other high-stakes domains. We work with the leading AI labs and Fortune 500 companies, which means fellows get access to problems and infrastructure that are difficult to replicate in a university lab. Our goal is to develop better benchmarks and evaluation techniques, and to ensure we’re measuring what matters. As a fellow, you’ll help tackle some of these problems!

What fellows do

Fellows apply with a proposal for a new benchmark they want to build. If accepted, the fellowship provides time and support to design, implement, and validate that benchmark. Some domains we’re interested in include:

Long-horizon agentic benchmarking in computer use or software engineering
Cybersecurity
Finance, law, insurance
AI for Science evals: research-level mathematics, biology, materials science, theoretical physics, and more

These are example domains we’re particularly interested in seeing applications for, but other domains and benchmark ideas are welcome. We are looking for benchmarks that have construct validity and that reflect real usage in the domain they target.

While preference will be given to applications that propose building new benchmarks, we will also consider strong applications that deal with science-of-evals work. Such work can include:

How do we measure capabilities in the regime where agents operate for extremely long horizons?
How can we audit benchmarks, analyze logs effectively, detect reward hacking and incorrect benchmark tests, and capture qualitative observations about model behavior over the course of an evaluation?
New modes/techniques of evaluation, including but not limited to: evaluation of multi-agent systems, forecasting, games, and evaluation of collaboration skills.

You’re not limited to these; the strongest applications could be proposals we hadn’t thought of. If you have ideas around new measurement techniques or theory, we’d love to hear about them.

What we provide

Stipend: $1,000–$2,500 / week

Compute: unlimited API credits, plus budget for GPUs and human data
Mentorship: weekly 1:1s with a Vals research lead, plus regular access to the broader team
Access: frontier model APIs, our internal evaluation infrastructure, and (where appropriate) real customer evaluation problems
Workspace: desk in our San Francisco office (if you’re interested in being in-person)
Network: intros to researchers across frontier labs

Who we’re looking for

Background in CS, ML, statistics, or an adjacent field is preferred, although all applications will be considered on the merits of the ideas and team
Genuine interest in evaluation as a research discipline, not just as a stepping stone
Comfortable working in a startup environment: faster iteration, less hand-holding, more direct contact with work that ends up mattering

You don’t need to have published in ML venues specifically. Great evaluation work can come from measurement, psychometrics, statistics, HCI, and social science backgrounds.

We ask fellows to commit at least 20 hours per week during the fellowship. This can be done during a leave, over the summer, or during a flexible period of a PhD program.

Timeline

Applications open: May 25, 2026
Application deadline: June 30, 2026 (rolling review — earlier is better)
Decisions: Within 2 weeks of submission