Company Overview:

We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.

Solving AI’s data problem is a generational opportunity. We’re backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI, and in tech.

We’re a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.

About the Role

Protege is hiring a Senior Software Engineer to own the data processing layer at ingestion: the part of the platform that takes large-scale source data and turns it into clean, structured, enriched, validated, AI-ready datasets. This is a hands-on, backend- and data-heavy role with end-to-end ownership of the pipelines that move and process data at volume.

Protege connects organizations that hold high-value data with the AI builders who need it. The value of that exchange depends on what happens at ingestion: raw, varied, high-volume source data has to be processed reliably, securely, and at scale before it's useful to anyone.

You'll work across imaging, audio, video, and other data modalities, crossing healthcare, media, and other disparate industries and data partners. You’ll partner closely with product, Data Lab, and partner engineering teams to build robust ingestion and processing systems for structured and unstructured data at massive scale, from millions to billions of records, files, and other source objects.

This role is ideal for engineers who are energized by messy data at scale, want deep ownership of critical infrastructure, and like turning ambiguity into reliable systems.

What You'll Do

Ingestion & Processing Systems
- Design, build, and operate the ingestion systems that process large volumes of multimodal data into usable, well-structured datasets
- Own the ingestion path end to end, from how data lands to how it is validated, processed, tracked, and made available downstream
- Build modality-specific processing steps for real-world source data, such as medical imaging processing, audio and video metadata extraction, quality validation, and notes processing
- Build parsers, validators, and normalization logic that can systematically handle messy, non-standard, and high-variance source formats
- Turn repeated one-off data handling work into reusable processing patterns, internal tooling, and platform capabilities

Scale, Performance & Reliability
- Build for high volume and high throughput, optimizing systems for reliability, cost, and speed
- Work across distributed and parallel compute systems to process workloads that do not fit well on a single machine
- Choose the right execution model for the workload, including batch processing, distributed execution, and modern compute patterns for unstructured data and inference-heavy processing
- Diagnose and resolve bottlenecks across ingestion and processing systems, and keep performance from degrading as volume and modality complexity grow

Data Quality, Security & Compliance
- Build validation and quality checks that catch bad, incomplete, or malformed data before it propagates downstream
- Handle sensitive and regulated data, including PHI, with the security and care the domain demands, including de-identification where required
- Track provenance, metadata, and usage constraints through the ingestion path so downstream use remains compliant and auditable
- Raise the quality bar for observability, debuggability, and operational reliability across the ingestion layer

Cross-Functional Partnership
- Partner with product and Data Lab to support new modalities, new partner requirements, and non-standard source data
- Work directly with partner engineering teams when needed to translate source-system realities into robust ingestion and processing design
- Surface recurring patterns that are worth standardizing into reusable transforms, validators, and internal tooling
- Help shape how Protege handles new data types as the platform expands into more complex data environments

What Success Looks Like

30 days: Ramp
- Get productive in the codebase and ship your first improvements to existing pipelines
- Build a working map of the ingestion and processing stack, the major data flows, and how we handle each modality
- Meet the engineering, product, and Data Lab teams to understand how the function operates across the company

60 days: Take Ownership
- Own a processing pipeline or modality end to end, from ingestion through delivery of AI-ready output
- Develop depth in how we handle one or two data types at scale
- Start raising the bar on data quality, observability, and processing best practices

90 days: Operate Independently
- Own a significant part of the ingestion and processing layer and lead design on new modalities or scaling challenges
- Ship reliably with minimal hand-holding, and help unblock others working in the data layer
- Identify at least one leverage opportunity (a reusable transform, tool, or architectural improvement) worth investing in, and drive it

What You Bring

Must Haves
- 5+ years building and operating production backend or data systems, with real experience in data processing at scale
- Hands-on experience designing and running large-scale data pipelines
- Strong programming skills in Python
- Experience with distributed data processing
- Strong proficiency with AWS
- Comfort with messy, varied, high-volume data and high ambiguity, with a knack for finding patterns in complex environments
- Attention to detail without losing speed, and a bias to action
- Excited to work on a product built around moving and processing large volumes of data
- Curious, tenacious, and proactive

Nice to Haves
- Experience processing one or more specific modalities at scale: medical imaging (e.g., DICOM), text, audio, or video
- Background working with sensitive or regulated data environments (HIPAA, healthcare compliance, PHI handling)
- Experience with streaming systems or workflow orchestration (e.g., Airflow, Dagster)
- Experience with GCP and Azure
- Prior startup experience as a founding or early engineer
- Familiarity with ML, NLP, or LLM-based systems, including embeddings and fine-tuning

Protege Values

Pass the Loved Ones’ Test
We act with integrity and do the right thing, especially when it’s hard and no one is watching.

Always Find a Way
We are resourceful, resilient builders who solve hard problems and push through obstacles.

Go Fast and Grow Fast
Velocity matters. We move with urgency, learn quickly, and continuously improve as individuals and as a company.

Practice Kindness and Candor
We communicate directly and respectfully, building trust through honest feedback and genuine care for one another.

Deliver Together
We win as one team. Collaboration, accountability, and shared ownership drive our success.

Own the Outcome. Hone the Craft.
We take pride in our work, sweat the details, and continuously raise the bar for excellence.

Senior Software Engineer, Data Processing

Richard Ho

About Protege:

www.withprotege.ai/
