Overview

Training examples grounded in real outcomes

Dataset generation turns raw data into labeled forecasting samples through a configurable pipeline. You define the sources, question style, and labeling approach; the SDK runs the stages and returns the resulting dataset.

Pipeline Stages

A typical pipeline runs through these stages:

  1. Seed generation — Fetch raw data (news articles, documents, etc.)

  2. Question generation — Create forecasting questions from seeds using AI

  3. Deduplication (optional) — Remove near-duplicate questions via exact or fuzzy matching

  4. Context (optional) — Enrich samples with relevant news or RAG-retrieved documents

  5. Labeling — Resolve questions with ground truth via web search

  6. Rendering — Format questions into prompts for model input

  7. Rollouts (optional) — Send prompts to one or more LLMs

  8. Scoring (optional) — Score model outputs (from rollouts) against ground truth
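To make the flow of data through the stages above concrete, here is a toy sketch that runs without the SDK. Every name in it (Sample, generate_questions, dedupe_fuzzy, render, brier_score) is hypothetical and is not part of the SDK. Question generation is a plain string template rather than an AI call, labels and model predictions are hard-coded instead of coming from web search and rollouts, and the Brier score is just one common metric for probabilistic forecasts, not necessarily the one the SDK uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Sample:
    """Hypothetical record passed between stages (not an SDK type)."""
    question: str
    label: Optional[int] = None         # 1 = resolved yes, 0 = resolved no
    prompt: Optional[str] = None
    prediction: Optional[float] = None  # model's probability of "yes"


def generate_questions(seeds: List[str]) -> List[Sample]:
    # Stand-in for question generation: one templated question per seed.
    return [Sample(question=f"Will {seed} by 2030?") for seed in seeds]


def is_near_duplicate(a: str, b: str, threshold: float) -> bool:
    # Case-insensitive similarity ratio in [0, 1]; 1.0 means identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dedupe_fuzzy(samples: List[Sample], threshold: float = 0.9) -> List[Sample]:
    # Keep a sample only if it is not too similar to one already kept.
    kept: List[Sample] = []
    for sample in samples:
        if not any(is_near_duplicate(sample.question, k.question, threshold) for k in kept):
            kept.append(sample)
    return kept


def render(samples: List[Sample]) -> List[Sample]:
    # Format each question into a prompt for model input.
    for sample in samples:
        sample.prompt = (
            f"Question: {sample.question}\n"
            "Answer with a probability between 0 and 1."
        )
    return samples


def brier_score(samples: List[Sample]) -> float:
    # Mean squared error between predicted probability and the 0/1 label.
    scored = [s for s in samples if s.label is not None and s.prediction is not None]
    if not scored:
        raise ValueError("no samples have both a label and a prediction")
    return sum((s.prediction - s.label) ** 2 for s in scored) / len(scored)


seeds = [
    "the central bank cut rates",
    "The central bank cut rates",  # near-duplicate of the first seed
    "a crewed mission land on Mars",
]
samples = render(dedupe_fuzzy(generate_questions(seeds)))

# Hard-coded stand-ins for the labeling and rollout stages.
for sample, label, prediction in zip(samples, [1, 0], [0.8, 0.3]):
    sample.label = label
    sample.prediction = prediction

score = brier_score(samples)
print(len(samples), round(score, 3))  # prints "2 0.065"
```

The duplicate seed is dropped at the deduplication step, so only two samples reach rendering and scoring.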

QuestionPipeline

QuestionPipeline orchestrates these stages. You configure each stage and pass the config to lr.transforms.run().

Running the Pipeline

  • lr.transforms.run(config, input_dataset=None, max_seeds=None, max_cost_dollars=None, name=None, detach=False) — Submit and wait for completion. Returns a Dataset.

  • lr.transforms.submit(config, input_dataset=None, max_seeds=None, max_cost_dollars=None, name=None) — Submit without waiting. Returns a TransformJob.

  • lr.transforms.estimate_cost(config, max_seeds=None) — Estimate cost in dollars before running.

Use detach=True for long-running jobs so the job continues even if your local process exits.
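One way to combine the functions listed above is to estimate the cost first and run only when the estimate fits a budget. The sketch below is illustrative: run_within_budget and FakeTransforms are hypothetical names, and FakeTransforms is a stand-in that only mirrors the documented signatures so the example runs without the SDK. It assumes that estimate_cost returns a number in dollars and that max_cost_dollars acts as a spending cap; this page lists that parameter but does not describe its behavior.

```python
from typing import Any, Optional


def run_within_budget(transforms: Any, config: Any, budget_dollars: float,
                      max_seeds: Optional[int] = None) -> Any:
    """Estimate the cost first, then run only if it fits the budget."""
    estimate = transforms.estimate_cost(config, max_seeds=max_seeds)
    if estimate > budget_dollars:
        raise RuntimeError(
            f"estimated ${estimate:.2f} exceeds budget ${budget_dollars:.2f}"
        )
    # Assumption: max_cost_dollars is treated as a cap on spend by the service.
    return transforms.run(config, max_seeds=max_seeds, max_cost_dollars=budget_dollars)


class FakeTransforms:
    """Hypothetical stand-in for lr.transforms; not the real SDK object."""

    def __init__(self, cost_per_seed: float) -> None:
        self.cost_per_seed = cost_per_seed

    def estimate_cost(self, config: Any, max_seeds: Optional[int] = None) -> float:
        return self.cost_per_seed * (max_seeds or 0)

    def run(self, config: Any, input_dataset: Any = None, max_seeds: Optional[int] = None,
            max_cost_dollars: Optional[float] = None, name: Optional[str] = None,
            detach: bool = False) -> dict:
        # The documented run() returns a Dataset; a dict stands in for it here.
        return {"config": config, "max_seeds": max_seeds,
                "max_cost_dollars": max_cost_dollars}


fake = FakeTransforms(cost_per_seed=0.05)
result = run_within_budget(fake, {"name": "demo"}, budget_dollars=10.0, max_seeds=100)

try:
    run_within_budget(fake, {"name": "demo"}, budget_dollars=10.0, max_seeds=1000)
    over_budget_rejected = False
except RuntimeError:
    over_budget_rejected = True
```

With the real SDK, lr.transforms would take the place of the fake object, provided the assumptions above hold.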
