Who We Are:

We build infrastructure that delivers massive amounts of web data to the companies training the world’s most powerful AI models.

We're the team that helps to power and support Grass, a bandwidth-sharing network that lets us operate a massive distributed crawler, giving us unique access to high-quality public web data at global scale. On top of that, we’ve built pipelines for ingesting, segmenting, and annotating billions of videos, transcripts, and audio files, powering dataset creation for frontier labs.

We’re lean, technical, and move fast. No red tape, no slow decision-making; just a team of builders pushing to expand what’s possible for open web data and AI.

Overview:

As a Research Crawling Engineer, you will design and operate large-scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.

Responsibilities:

- Build and maintain large-scale web crawlers across diverse domains
- Design high-throughput, fault-tolerant systems for data collection (millions to billions of URLs/day)
- Handle anti-bot systems, rate limits, and dynamic/JS-heavy sites
- Develop pipelines for cleaning, deduplication, filtering, and normalization
- Construct and maintain datasets for research and model training
- Monitor crawl performance, coverage, and data quality; iterate quickly
- Collaborate with research teams to align data collection with modeling needs
- Optimize infrastructure for cost, latency, and reliability

Requirements:

- Strong programming experience in one or more of: Go, Rust, Python, Java, or C++
- Experience building web crawlers or large-scale data pipelines
- Solid understanding of HTTP, networking, and browser behavior
- Familiarity with distributed systems and parallel processing
- Experience working with large datasets (TB–PB scale preferred)
- Ability to debug unstable or adversarial environments

Preferred / Bonus:

- Experience with NLP pipelines or dataset curation for ML
- Familiarity with LLM pretraining data or retrieval systems
- Experience with headless browsers (e.g., Chrome DevTools Protocol, Playwright, Puppeteer)
- Knowledge of proxy systems, IP rotation, and large-scale request orchestration
- Background in data quality evaluation or benchmarking
- Experience running workloads on cloud or bare-metal infrastructure

What This Role Involves:

- Operating at the boundary of scale and reliability
- Adapting to constantly changing web environments
- Balancing throughput, coverage, and data quality
- Owning end-to-end data acquisition pipelines

Evaluation Criteria:

- Ability to design systems that scale without degrading quality
- Practical problem-solving under real-world constraints
- Speed of iteration and ownership
- Measurable improvements in data coverage, quality, or efficiency

Compensation:

Based on experience and demonstrated ability to operate at scale

Example Projects:

- Build a distributed crawler for a continuously updated, high-quality web project
- Design a system to classify and filter billions of pages for pretraining
- Extract structured data from dynamic, JS-heavy sites at scale
- Improve deduplication and quality scoring across multimodal datasets

Why Work With Us:

Opportunity. We are at the forefront of developing a web-scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.

Culture. We're a lean team with a high bar. We come to work not to be comfortable, but to find out what we're capable of and to do work that matters. We're not calling for people who keep things moving. We're calling for people who make everyone around them better. We prioritize low ego and high output. This is a fully remote team.

Compensation. You’ll receive a competitive salary, benefits, and equity package.

Research Crawling Engineer

Emma

About Wynd Labs:

wyndlabs.ai
