Data/Pipeline Engineer at PharosGraph

You'll build foundational AI-powered data pipelines, shaping political and policy intelligence with high autonomy and impact.
Full-time

Legal agreement: Depends on the location of the candidate

Currency exchange and taxes to be paid by: Depends on the location of the candidate

Compensation: Hidden (the job admins requested that it be hidden from the public)

+ Equity (up to 0.5% of the company)

Location: Remote (specific timezone), GMT-09:00 to GMT-03:00
Posted about 2 months ago

Requirements and responsibilities


About Us

We're PharosGraph, a US-based startup building AI-powered political and policy intelligence tools. Our platform analyzes political communications, projects issue perceptions onto highly granular geography, scores narratives across moral foundations and emotional dimensions, and delivers geographic intelligence that no one else in the market has. We're not pre-product. We have production pipelines processing real data: election issue tracking across 15+ US states, candidate communication analysis, geographic audience targeting, and an interactive map dashboard with highly granular vector tiles. We're now expanding into government affairs and public affairs with a new platform applying our proven election-side methodologies to year-round policy intelligence, legislator profiling, and regulatory tracking.

What we've already built:
- A multi-step Python data pipeline that ingests news + social data, runs multi-key parallel LLM analysis, and produces Parquet datasets at high geo resolution.
- Dynamic issue lifecycle management: auto-detecting emerging topics, tracking salience, evaluating issue swaps.
- An async LLM processing pool that distributes work across multiple API keys for high-volume parallel requests (a sketch of this pattern follows the list).
- A React + FastAPI web application for geographic audience targeting with population-weighted scoring.
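To make the pool pattern concrete for candidates, here is a minimal, hypothetical sketch of a multi-key async worker pool in the spirit of the one described above. This is illustrative only, not PharosGraph's actual code: the endpoint URL, request schema, and LLM_API_KEYS environment variable are all invented, and any async HTTP client would do.

```python
# Illustrative sketch only: a multi-key async LLM pool (not PharosGraph's code).
import asyncio
import os

import httpx  # assumption: any async HTTP client would work here

API_URL = "https://llm.example.com/v1/analyze"  # hypothetical endpoint
API_KEYS = [k for k in os.environ.get("LLM_API_KEYS", "").split(",") if k]

async def worker(key: str, jobs: asyncio.Queue, results: list) -> None:
    """Each worker owns one API key, so provider rate limits apply per key."""
    async with httpx.AsyncClient(timeout=30) as client:
        while True:
            text = await jobs.get()
            if text is None:  # sentinel: queue drained, shut this worker down
                return
            resp = await client.post(
                API_URL,
                headers={"Authorization": f"Bearer {key}"},
                json={"input": text},  # hypothetical request schema
            )
            resp.raise_for_status()
            results.append(resp.json())

async def analyze_all(texts: list[str]) -> list:
    jobs: asyncio.Queue = asyncio.Queue()
    results: list = []
    for t in texts:
        jobs.put_nowait(t)
    for _ in API_KEYS:
        jobs.put_nowait(None)  # one shutdown sentinel per worker
    await asyncio.gather(*(worker(k, jobs, results) for k in API_KEYS))
    return results

# Example: asyncio.run(analyze_all(["floor speech ...", "press release ..."]))
```

Pinning one worker to one key is what lets throughput scale with the number of keys: each key hits its own rate limit independently.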
You'd be joining as the first dedicated data/pipeline engineer, working closely with a co-founder (who built all current pipelines) to extend this architecture into government affairs intelligence.

The Role

You'll build the data ingestion and processing pipelines that power the GovScape platform. This means bringing in new data sources, such as Congressional Record speeches, committee hearing transcripts, legislator social media (535 accounts), Federal Register filings, and state regulatory feeds, and processing them through our existing analytical engine. This is a builder role, not a maintenance role. You'll design and implement new pipelines from scratch, following established patterns but solving novel data engineering problems. The analytical engine (LLM scoring, emotion/moral foundations analysis, geo projection) already exists; your job is to feed it the right data and ensure that data flows reliably, on schedule, at quality.

How your time breaks down:
- Data Ingestion Pipelines (50%): Build production data pipelines for new sources: Congressional Record API, committee hearing transcripts, legislator social media feeds (535 accounts), press releases, Federal Register API, state regulatory feeds. Handle rate limiting, pagination, deduplication, schema normalization, and incremental updates (a sketch of this pattern follows the list).
- Data Processing & Transformation (25%): Transform raw ingested data into the formats our analytical engine expects. Entity extraction (identifying company/industry mentions in political speech). Text preprocessing for LLM analysis. Parquet schema design and evolution. Population-weighted aggregation across geographic hierarchies.
- Pipeline Reliability & Operations (15%): Monitoring, alerting, data freshness checks, pipeline recovery. Build confidence that data arrives on time, at quality, every run.
- Collaboration & Architecture (10%): Work with a founder on pipeline architecture decisions. Document pipeline designs and data schemas.
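As a concrete, purely illustrative example of the ingestion work in the first bullet, here is a minimal sketch of a paginated REST pull with rate-limit handling, retries, deduplication, and Parquet output via pyarrow. The endpoint and field names are hypothetical, not a real government API.

```python
# Illustrative sketch only: paginated ingestion -> dedupe -> Parquet.
import time

import pyarrow as pa
import pyarrow.parquet as pq
import requests

BASE_URL = "https://api.example.gov/v1/speeches"  # hypothetical endpoint

def fetch_page(page: int, retries: int = 3) -> dict:
    """GET one page, backing off exponentially when rate limited."""
    for attempt in range(retries):
        resp = requests.get(BASE_URL, params={"page": page}, timeout=30)
        if resp.status_code == 429:  # rate limited: wait, then retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"page {page} failed after {retries} attempts")

def ingest(out_path: str = "speeches.parquet") -> None:
    seen: set[str] = set()
    rows: list[dict] = []
    page = 1
    while True:
        payload = fetch_page(page)
        for rec in payload.get("results", []):
            if rec["id"] in seen:  # dedupe across overlapping pages
                continue
            seen.add(rec["id"])
            rows.append({"id": rec["id"], "date": rec["date"], "text": rec["text"]})
        if not payload.get("next_page"):
            break
        page += 1
        time.sleep(0.5)  # crude rate limiting between pages
    # Arrow/Parquet rather than pandas, matching the stack described below.
    pq.write_table(pa.Table.from_pylist(rows), out_path)

if __name__ == "__main__":
    ingest()
```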
What You'll Work On (Real Examples)

In months 1-2, you'll:
- Onboard to the election-pipeline codebase and understand the existing multi-step pipeline architecture.
- Build the Congressional Record API integration, ingesting floor speeches and committee hearing statements for all 535 members of Congress.
- Scaffold the legislator social media ingestion pipeline (Twitter/X, press releases).

In months 2-4, you'll:
- Build the full StakeholderScape data ingestion pipeline: the multi-source feed that powers legislator intelligence profiles.
- Integrate the Federal Register API for RegulatoryRadar: tracking regulatory proceedings, comment periods, and enforcement actions.
- Build the NarrativeScape entity extraction pipeline: identifying company and industry mentions in political speech data using LLM-powered extraction.

In months 4-6, you'll:
- Add state regulatory feed ingestion for top-priority states.
- Build data quality monitoring dashboards and pipeline freshness alerting.
- Harden pipeline reliability: retry logic, failure recovery, incremental processing.
- Document all pipeline designs, data schemas, and operational runbooks.

Our Tech Stack
- Language: Python 3.11 (all pipelines are Python).
- Data Format: Apache Arrow / Parquet via pyarrow (NOT pandas).
- LLM Processing: Async multi-key LLM pool for high-volume parallel requests.
- Data Sources: NewsAPI, Reddit, Congressional Record API, Federal Register API, social media APIs.
- Configuration: YAML configs drive all pipeline behavior (race configs, issue configs, scenario templates; an illustrative example appears at the end of this posting).
- Intermediate Formats: JSON / JSONL for pipeline intermediates.
- Geographic Data: 220K Census Block Groups, GeoJSON, MBTiles.
- Infrastructure: Docker, AWS (EC2, S3), GitHub Actions.

What you WON'T need to know:
- React / TypeScript.
- FastAPI routing or API design.
- Design systems or CSS.

Requirements

Must Have
- 4+ years of professional experience building data pipelines in Python.
- Strong experience with pyarrow or Apache Arrow (or willingness to learn quickly; we do NOT use pandas in pipelines).
- Experience integrating with REST APIs at scale: handling pagination, rate limiting, retries, and incremental data pulls.
- Experience with data transformation: schema design, normalization, deduplication, format conversion.
- Experience with structured and semi-structured data: JSON, JSONL, Parquet, CSV.
- Familiarity with async Python (asyncio): our LLM pool is fully async.
- Experience with Git and code review workflows.
- Fluent English communication, written and verbal (we're an async-first team).
- Available to overlap 4+ hours with US Eastern time (9am-1pm ET or 2pm-6pm ET).

Strongly Preferred
- Experience with LLM/AI API integration: calling language model APIs, handling token limits, managing costs, parsing structured outputs.
- Experience with web scraping or social media API integration (Twitter/X API, RSS feeds, government data APIs).
- Experience with text processing and NLP: entity extraction, text cleaning, tokenization.
- Experience with pipeline orchestration: scheduling, dependency management, failure recovery (Airflow, Prefect, Dagster, or custom).
- Experience with Docker and containerized development environments.
- Experience with AWS (S3, EC2) or similar cloud platforms.
- Previous startup or small-team experience: high autonomy, minimal process overhead.

Nice to Have
- Experience with political data, government data, or civic tech (Congressional Record, Federal Register, OpenSecrets, ProPublica Congress API).
- Experience with geographic data: Census data, FIPS codes, GeoJSON, spatial joins.
- Experience with YAML-driven configuration systems.
- Experience with data quality monitoring and alerting.
- Interest in political technology, policy intelligence, or government affairs.

Working Arrangement
- Location: Remote.
- Hours: Flexible, with a required overlap of 4+ hours with US Eastern (9am-1pm ET or 2pm-6pm ET preferred).
- Employment: Full-time contractor via EOR/COR (Deel, Remote.com, Oyster, or similar).
- Communication: Async-first. Direct Slack communication with the founder. Minimal meetings.

What We Offer
- Early-stage with production systems: not a wireframe startup. You'll extend working, revenue-generating pipelines.
- High autonomy, high impact: you'll be the first dedicated pipeline engineer. The architecture decisions you make will last.
- Direct founder access: a founder built every current pipeline. Knowledge transfer is direct, not through docs.
- Novel data engineering: political/policy data at geographic resolution is a genuinely interesting problem space. Nobody else does this at a granular geo level.
- Growth potential: if GovScape succeeds (and the election-side platform is already proving the model), this role grows into a Data Engineering Lead role.

How We Hire
1. First call (25 min): We talk about your background. I show you the pipeline architecture and the GovScape vision. We see if there's mutual interest.
2. Technical interview (60 min): Data pipeline design discussion. We'll walk through a real scenario: "Design the ingestion pipeline for 535 legislators' public communications across 4 source types." Plus a code review of actual pipeline code from our codebase.
3. Take-home project (3-4 hours): Build a small data pipeline that ingests from a public API, transforms the data, and outputs Parquet files. We provide the API and schema spec.
4. Take-home review (30 min): Walk through your solution together. What decisions did you make? What would you change with more time?
5. Final call (20 min): Mutual confirmation and offer discussion.

We move fast. Expect the full process to take 2-3 weeks.
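Finally, for the "YAML configs drive all pipeline behavior" point in the tech stack above, here is a small illustration of what a YAML-driven pipeline config can look like. Every key and value below is invented for illustration; it is not PharosGraph's actual config schema.

```python
# Illustrative sketch only: loading a YAML-driven pipeline config.
import yaml  # assumption: PyYAML is available

EXAMPLE_CONFIG = """
pipeline: congressional_record          # hypothetical pipeline name
schedule: daily
sources:
  - name: floor_speeches
    incremental: true                   # only pull records since the last run
  - name: committee_hearings
    incremental: true
output:
  format: parquet
  path: s3://example-bucket/congressional_record/   # hypothetical location
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["pipeline"], "->", config["output"]["format"])
```

Driving behavior from config like this is what lets one pipeline codebase serve many races, issues, and scenarios without code changes.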