Role Overview

Part-time position for PhD-level experts who design challenging STEM benchmark problems and use them to evaluate AI models.

Key Responsibilities

Design challenging, real-world STEM benchmark problems in data science, machine learning, finance, and software engineering
Implement tasks within an agentic development environment using Python
Create reproducible problem setups with clear specifications and executable tests
Evaluate and analyze AI model behavior, including reasoning traces and agent workflows
Diagnose reasoning failures, logic gaps, and problem-solving limitations in AI systems
Contribute to improving benchmark quality and evaluation frameworks for frontier AI models

Requirements

Current PhD candidate or recent PhD graduate
Deep expertise in data science, machine learning, finance, and/or Python-based software development
Strong research background in advanced STEM topics
Ability to reliably commit 30+ hours per week
Demonstrated technical output such as high-quality open-source contributions or research work
Ability to analyze agent behavior traces and diagnose failures beyond surface-level errors