Python Developer: AI Benchmark Task Construction, Review & Validation at Biz-Tech Analytics | Torre

You'll build and validate AI benchmarks, directly shaping the future of AI model capabilities.
Freelance
Recurrent

USD75.4K - 100K/year

~COP150M - 200M/year

+ Equity

+ Bonuses

Remote (anywhere)
Shared by
Emma of Torre.ai
10 days ago

Requirements and responsibilities


What you'll actually do: review and validate AI benchmark tasks in Python repos, and build new ones. That includes writing tasks with working solutions that are hard enough to make current AI coding agents fail. Run Docker-based test suites, verify oracle solutions, debug flaky tests, and assess task quality for reproducibility and correctness.

This is an evaluation and task-design role, not a feature-building role. If you'd rather ship product than construct hard problems and find what's broken, this isn't for you.

Must haves
- 3+ years production Python
- Deep pytest knowledge
- Docker (building images, debugging containers)
- Linux CLI fluency
- Ability to read large open-source repos quickly

Note: there's more to the role than Python alone. Docker, Linux, debugging depth, task design, and a few bonus areas (security, Kubernetes, async) all factor in.

Also open to staffing/recruitment consultants who place contract tech roles; happy to discuss terms.