Crawler + Sentiment Pipeline

Project overview

I started this project as a tool for analyzing public sentiment and propaganda, meant to highlight the gap between the two at a time of growing uncertainty. The flow is: pull text from the open web at scale, score it for sentiment, and never lie about how confident the model is. Two halves fell out naturally: a crawler that has to be polite and survive crashes, and an analysis pipeline that has to be auditable. Both are generic enough to outlive the thing I first built them for, which is why I'm writing them up on their own.

Ingestion (Go). A crash-resumable crawler with a SQLite frontier, per-domain rate limiting, robots.txt compliance, and content-addressed storage.
Analysis (Python). A hybrid sentiment analyzer: deterministic signals first, an optional LLM pass second, a human-review loop third.

The contract between them is a content-addressed store on disk and a SQLite database: the crawler writes, the pipeline reads, and neither needs to know the other's internals.

A crawler that survives being killed

The whole crawler is organized around a frontier, a SQLite table of URLs moving through a small state machine: QUEUED → INFLIGHT → DONE | FAILED. The interesting requirement is crash-resumability. If the process dies mid-fetch, the URLs it had claimed stay stranded in INFLIGHT unless something reclaims them. So on every startup, anything that has been in flight past a cutoff gets swept back to QUEUED:

// RecoverStale requeues items stuck INFLIGHT past a deadline, called on startup
// so a crashed run resumes cleanly instead of leaking work.
result, err := f.db.Conn().ExecContext(ctx, `
    UPDATE pages SET state = ?, next_fetch_at = ?, inflight_at = 0
    WHERE state = ? AND inflight_at < ? AND inflight_at > 0
`, model.StateQueued, now, model.StateInflight, cutoff)

Workers claim work in atomic batches with UPDATE … RETURNING, so the same URL is never handed to two workers, even with several crawlers running at once. Failures back off exponentially (1m, 2m, 4m, 8m) and only become permanent after a retry cap. All of the frontier's state lives in SQLite.
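The retry schedule above can be sketched as a small pure function. This is written in Python for brevity rather than the crawler's Go, and the names BASE_DELAY_SECONDS, MAX_RETRIES, and next_retry are illustrative assumptions. The cap of four retries is inferred from the four delays listed, not taken from the real configuration.

```python
from typing import Optional

BASE_DELAY_SECONDS = 60  # first retry waits one minute, per the schedule above
MAX_RETRIES = 4          # assumed cap; the real value is a config setting


def next_retry(attempt: int, now: float) -> Optional[float]:
    """Return the timestamp of the next retry after a failed attempt (1-based).

    Returns None once the retry cap is exceeded, meaning the URL should be
    marked FAILED permanently instead of being requeued.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if attempt > MAX_RETRIES:
        return None
    delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)  # 60, 120, 240, 480 seconds
    return now + delay
```

The returned value would be what gets written to next_fetch_at when a URL goes back to QUEUED.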

Polite by construction

Crawling the open web means not being a nuisance. Politeness is enforced in two independent places. First, each domain gets its own token-bucket rate limiter, created lazily and capped at a configured requests-per-second:

func (f *Fetcher) getRateLimiter(domain string) *rate.Limiter {
    f.limitersMu.Lock()
    defer f.limitersMu.Unlock()
    if l, ok := f.limiters[domain]; ok {
        return l
    }
    l := rate.NewLimiter(rate.Limit(f.ratePerSec), 1)
    f.limiters[domain] = l
    return l
}

Second, robots.txt is fetched and cached per host (1-hour TTL) before any path on that host is requested. Each page flows through a single pipeline (robots check → rate-limited fetch → store → extract links → mark done), and a failure at any stage routes to either retryable backoff or permanent failure, depending on the error.
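A per-host robots.txt cache with a TTL might look roughly like the sketch below. It is in Python rather than the crawler's Go, and RobotsCache, fetch_robots, and the injected clock are hypothetical names used only for illustration. Parsing is delegated to the standard library's urllib.robotparser, and error handling for a failed robots.txt fetch is left out.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 3600  # the 1-hour TTL described above


class RobotsCache:
    """Caches one parsed robots.txt per host until its TTL expires."""

    def __init__(self, fetch_robots, ttl=ROBOTS_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch_robots  # hypothetical callable: host -> robots.txt text
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # host -> (expires_at, parser)

    def allowed(self, user_agent: str, url: str) -> bool:
        host = urlsplit(url).netloc
        now = self._clock()
        entry = self._entries.get(host)
        if entry is None or entry[0] <= now:
            # Missing or expired: fetch and parse again, then restart the TTL.
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            entry = (now + self._ttl, parser)
            self._entries[host] = entry
        return entry[1].can_fetch(user_agent, url)
```

Injecting the fetcher and the clock keeps the cache testable without touching the network or waiting an hour.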

Store once, by content

Raw fetched bytes are written to a content-addressed store: the SHA-256 of the content is the filename, sharded into subdirectories by hash prefix so that no single directory ends up holding millions of files. Identical content always lands at the same path, so deduplication is automatic, and the write is idempotent and atomic (temp file + rename):

h := sha256.Sum256(data)
hashStr := hex.EncodeToString(h[:])
path := filepath.Join(s.baseDir, "sha256", hashStr[:2], hashStr+ext)
if _, err := os.Stat(path); err == nil {
    return hashStr, nil          // already have it, no-op
}
// atomic: write to .tmp, then rename into place

The hash becomes the join key downstream: every analyzed document traces back to the exact bytes it came from.
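For readers who want the whole write path in one place, here is a minimal Python sketch of the same idea: hash, shard, skip if present, otherwise write to a temp file and rename. It is not the crawler's Go code; the function name put and its parameters are assumptions, and the directory layout simply mirrors the snippet above.

```python
import hashlib
import os
import tempfile


def put(base_dir: str, data: bytes, ext: str = "") -> str:
    """Store data under its SHA-256 digest and return the hex digest."""
    digest = hashlib.sha256(data).hexdigest()
    shard_dir = os.path.join(base_dir, "sha256", digest[:2])
    path = os.path.join(shard_dir, digest + ext)
    if os.path.exists(path):
        return digest  # identical content already stored: no-op
    os.makedirs(shard_dir, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one
    # filesystem; os.replace then moves it into place in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=shard_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # don't leave partial files behind
        raise
    return digest
```

Because readers only ever see the final path after the rename, a crash mid-write leaves at worst a stray temp file, never a truncated document.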

Sentiment, hybrid by design

The analyzer never blindly trusts a model. It computes deterministic signals first (positive/negative lexicon hits, intensifiers, entity proximity), then decides whether to involve an LLM at all. If the model is disabled or unreachable, those same signals drive a heuristic classification, so the pipeline degrades instead of failing.
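To make the fallback path concrete, here is a deliberately tiny sketch of a lexicon-based heuristic. Everything in it is hypothetical: the word lists, the 2x intensifier weight, and the confidence formula are placeholders chosen for the example, not the pipeline's actual signals, and entity proximity is left out entirely.

```python
import re
from typing import Tuple

# Tiny illustrative lexicons; a real deployment would load far larger ones.
POSITIVE = {"good", "great", "excellent", "praise"}
NEGATIVE = {"bad", "terrible", "corrupt", "failure"}
INTENSIFIERS = {"very", "extremely", "utterly"}


def heuristic_sentiment(text: str) -> Tuple[str, float]:
    """Classify text from lexicon hits alone; returns (label, confidence)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    hits = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if polarity == 0:
            continue
        hits += 1
        # An intensifier directly before a hit doubles its weight (arbitrary).
        weight = 2.0 if i > 0 and tokens[i - 1] in INTENSIFIERS else 1.0
        score += polarity * weight
    if hits == 0 or score == 0:
        return "neutral", 0.0
    label = "positive" if score > 0 else "negative"
    # Placeholder confidence: 1.0 only when every hit agrees and is intensified.
    confidence = min(1.0, abs(score) / (2.0 * hits))
    return label, confidence
```

The point is the shape rather than the numbers: the same function can run with or without a model behind it, and its output always comes with a confidence the rest of the pipeline can filter on.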

When the LLM does run, its job is refinement, and it's held to a hard rule: any evidence span it cites must be a verbatim substring of the source. Spans that fail the check are dropped, and if none survive, the confidence is capped:

sent_spans, had_invalid = _validate_evidence_spans(
    response.get("sentiment_evidence_spans", []), text
)
if had_invalid and not sent_spans:
    conf = min(conf, UNVERIFIED_EVIDENCE_CONFIDENCE_CAP)   # don't trust unbacked claims
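A validator along these lines would satisfy the verbatim-substring rule. This is a guess at the shape of the check rather than the real _validate_evidence_spans: the public name validate_evidence_spans is used here to mark it as a stand-in, and treating empty or non-string spans as invalid is an assumption.

```python
from typing import List, Tuple


def validate_evidence_spans(spans: List[str], text: str) -> Tuple[List[str], bool]:
    """Keep only spans that appear verbatim in text.

    Returns (valid_spans, had_invalid), where had_invalid is True if at
    least one span was rejected.
    """
    valid = []
    had_invalid = False
    for span in spans:
        # Empty or non-string spans count as invalid (an assumption).
        if isinstance(span, str) and span and span in text:
            valid.append(span)
        else:
            had_invalid = True
    return valid, had_invalid
```

With the source "The minister praised the new policy.", the span "praised the new policy" passes, while "condemned it" is rejected and sets had_invalid.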

Every stored output carries its label, a confidence score, the model ID, the prompt version, and whether it came from the LLM or the heuristic fallback. Aggregates drop anything below a confidence threshold rather than averaging noise into the result.
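Dropping low-confidence outputs before aggregating can be sketched as below. The threshold value and the names MIN_CONFIDENCE and label_shares are hypothetical; the real threshold is whatever the pipeline is configured with.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_CONFIDENCE = 0.6  # hypothetical threshold, for illustration only


def label_shares(
    outputs: Iterable[Tuple[str, float]],
    min_confidence: float = MIN_CONFIDENCE,
) -> Dict[str, float]:
    """Share of each label among outputs at or above the threshold."""
    kept = [label for label, conf in outputs if conf >= min_confidence]
    if not kept:
        return {}  # nothing trustworthy enough to aggregate
    counts = Counter(kept)
    return {label: n / len(kept) for label, n in counts.items()}
```

Low-confidence outputs are excluded from both the numerator and the denominator, so they cannot dilute the result.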

The model backend is pluggable

The LLM sits behind a factory, chosen by one environment variable, so the rest of the pipeline never names a vendor:

def get_llm_client():
    if get_settings().llm_backend.lower() == "ollama":
        return get_ollama_client()    # local
    return get_gemini_client()        # cloud

Both backends enforce the same JSON schema at generation time (Gemini via its response_schema, Ollama via its format parameter), and both retry with exponential backoff. Swapping a cloud model for a local one (or back) is a config change, not a code change.

Heavy aggregation never happens at request time. The pipeline pre-computes dashboard snapshots into JSON files (written atomically), and the API just serves those, so reads are cheap and the expensive work runs on a schedule.

Keeping the model honest

The piece I'm most attached to is the human-in-the-loop review queue. Instead of asking people to label random samples, it surfaces the outputs the model was least sure about (lowest confidence first) because that's where review buys the most:

sql = """
    SELECT a.output_id, a.output_json, a.confidence, d.title, d.text
    FROM ai_outputs a
    JOIN docs d ON d.doc_id = a.doc_id
    LEFT JOIN ai_output_evals e ON e.ai_output_id = a.output_id
    WHERE a.task_type = ? AND e.ai_output_id IS NULL
    ORDER BY a.confidence ASC LIMIT ? OFFSET ?
"""

Human verdicts feed a golden set and accuracy stats, which is what lets me actually claim the pipeline is calibrated rather than just hoping.
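The queue query can be exercised end to end against an in-memory database. The table and column names in the query come from the snippet above; everything else (the minimal schema, the verdict column, the sample rows, and the helper names make_demo_db and review_queue) is assumed for the sake of a runnable example.

```python
import sqlite3

REVIEW_QUEUE_SQL = """
    SELECT a.output_id, a.output_json, a.confidence, d.title, d.text
    FROM ai_outputs a
    JOIN docs d ON d.doc_id = a.doc_id
    LEFT JOIN ai_output_evals e ON e.ai_output_id = a.output_id
    WHERE a.task_type = ? AND e.ai_output_id IS NULL
    ORDER BY a.confidence ASC LIMIT ? OFFSET ?
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a minimal, assumed schema with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, title TEXT, text TEXT);
        CREATE TABLE ai_outputs (
            output_id INTEGER PRIMARY KEY, doc_id INTEGER, task_type TEXT,
            output_json TEXT, confidence REAL);
        CREATE TABLE ai_output_evals (ai_output_id INTEGER, verdict TEXT);
    """)
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?, ?)",
        [(1, "Doc A", "text a"), (2, "Doc B", "text b"), (3, "Doc C", "text c")],
    )
    conn.executemany(
        "INSERT INTO ai_outputs VALUES (?, ?, ?, ?, ?)",
        [
            (10, 1, "sentiment", '{"label": "positive"}', 0.90),
            (11, 2, "sentiment", '{"label": "negative"}', 0.30),
            (12, 3, "sentiment", '{"label": "neutral"}', 0.55),
            (13, 1, "topic", '{"label": "economy"}', 0.10),
        ],
    )
    return conn


def review_queue(conn, task_type, limit, offset=0):
    """Return unreviewed output IDs for a task, least confident first."""
    rows = conn.execute(REVIEW_QUEUE_SQL, (task_type, limit, offset)).fetchall()
    return [row[0] for row in rows]
```

On the sample data the sentiment queue comes back as 11, 12, 10. Once a verdict for output 11 is inserted into ai_output_evals, the LEFT JOIN ... IS NULL filter removes it and the queue becomes 12, 10.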

What's next

A real evaluation harness on top of the golden set: precision/recall by source and confidence band.
Pluggable fetchers beyond the current set, behind the same frontier interface.
Packaging the crawler and the analyzer as standalone services: the domain-specific glue is the only thing tying them together today.
