Overview

Training examples grounded in real outcomes

Dataset generation turns raw data into labeled forecasting samples through a configurable pipeline. You define the sources, question style, and labeling approach; the SDK handles the rest.

Pipeline Stages

A typical pipeline runs through these stages:

  1. Seed generation — Fetch raw data (news articles, documents, etc.)

  2. Question generation — Create forecasting questions from seeds using AI

  3. Context (optional) — Enrich samples with relevant news or RAG-retrieved documents

  4. Labeling — Resolve questions with ground truth via web search

  5. Rendering — Format questions into prompts for model input

  6. Rollouts (optional) — Send prompts to one or more LLMs

  7. Scoring (optional) — Score model outputs (from rollouts) against ground truth

QuestionPipeline

QuestionPipeline orchestrates these stages. You configure each stage, then pass the config to lr.transforms.run().
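As a rough sketch, a config might be assembled and run like this. The field names (`seeds`, `questions`, `labeling`, `rendering`) and the lightweight stand-in for `lr.transforms.run()` are assumptions for illustration; consult the SDK reference for the real schema.

```python
# Illustrative config only: these keys are assumptions, not the SDK's
# actual QuestionPipeline schema.
config = {
    "seeds": {"source": "news", "max_items": 100},   # stage 1: seed generation
    "questions": {"style": "binary"},                # stage 2: question generation
    "labeling": {"method": "web_search"},            # stage 4: labeling
    "rendering": {"template": "default"},            # stage 5: rendering
}

def run(config, max_questions=None):
    # Stand-in for lr.transforms.run(): pretend every stage succeeded
    # and return labeled samples, capped at max_questions.
    samples = [{"question": f"q{i}", "label": "yes"}
               for i in range(config["seeds"]["max_items"])]
    return samples[:max_questions] if max_questions else samples

dataset = run(config, max_questions=10)
print(len(dataset))  # capped at 10 by max_questions
```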

Running the Pipeline

  • lr.transforms.run(config, input_dataset=None, max_questions=None, max_cost_dollars=None, detach=False) — Submit and wait for completion. Returns a Dataset.

  • lr.transforms.submit(config, ...) — Submit without waiting. Returns a TransformJob.

  • lr.transforms.estimate_cost(config, max_questions=None) — Estimate cost in dollars before running.

Use detach=True for long-running jobs so the job continues even if your local process exits.
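A common pattern is to estimate cost first, then submit detached only if the estimate fits your budget. The stubs below mimic the documented `estimate_cost()` and `submit()` signatures; the per-question cost and the job fields are invented for the sketch.

```python
# Budget-guarded submission sketch. Both functions are stand-ins that
# mirror the documented signatures, with made-up internals.

def estimate_cost(config, max_questions=None):
    # Stub: assume roughly $0.02 per question generated and labeled.
    return 0.02 * (max_questions or 1000)

def submit(config, max_questions=None, detach=False):
    # Stub TransformJob: with detach=True the job keeps running even if
    # the local process exits.
    return {"id": "job-123", "detached": detach}

config = {"questions": {"style": "binary"}}  # illustrative config
cost = estimate_cost(config, max_questions=500)
if cost <= 25.0:  # only launch if the estimate is within budget
    job = submit(config, max_questions=500, detach=True)
    print(job["id"], job["detached"])
```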
