Overview

Dataset generation turns raw data into labeled forecasting samples through a configurable pipeline. You define the sources, question style, and labeling approach; the SDK handles the rest.
Pipeline Stages
A typical pipeline runs through these stages:
Seed generation — Fetch raw data (news articles, documents, etc.)
Question generation — Create forecasting questions from seeds using AI
Context (optional) — Enrich samples with relevant news or RAG-retrieved documents
Labeling — Resolve questions with ground truth via web search
Rendering — Format questions into prompts for model input
Rollouts (optional) — Send prompts to one or more LLMs
Scoring (optional) — Score model outputs (from rollouts) against ground truth
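The stages above form a simple data flow: each stage's output feeds the next. The toy sketch below illustrates that flow with plain functions; the function names and sample fields are illustrative only, not the SDK's actual API.

```python
# Toy illustration of the pipeline stages (NOT the real SDK).
# Each stage takes the previous stage's output and returns enriched samples.

def seed_generation():
    # Stage 1: fetch raw data (here, a hard-coded stand-in for a news article).
    return [{"article": "Election scheduled in country X"}]

def question_generation(seeds):
    # Stage 2: turn each seed into a forecasting question.
    return [{"question": f"Will the event described occur? Context: {s['article']}"}
            for s in seeds]

def labeling(questions):
    # Stage 4: attach a ground-truth label (stubbed here).
    return [dict(q, label=True) for q in questions]

def rendering(samples):
    # Stage 5: format each labeled question into a model-ready prompt.
    return [f"Q: {s['question']}\nAnswer yes or no." for s in samples]

prompts = rendering(labeling(question_generation(seed_generation())))
```

In the real pipeline each stage is a configured component (seed generators, question generators, labelers, renderers) rather than a free function, but the data flow is the same.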
QuestionPipeline
QuestionPipeline orchestrates these stages. You configure each stage and pass the config to lr.transforms.run():
Running the Pipeline
lr.transforms.run(config, input_dataset=None, max_questions=None, max_cost_dollars=None, detach=False) — Submit and wait for completion. Returns a Dataset.
lr.transforms.submit(config, ...) — Submit without waiting. Returns a TransformJob.
lr.transforms.estimate_cost(config, max_questions=None) — Estimate cost in dollars before running.
Use detach=True for long-running jobs so the job continues even if your local process exits.
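The relationship between run() and submit() can be summarized as: run() is submit() followed by waiting on the returned job. The sketch below models that relationship with a toy job class; the class and function names are illustrative stand-ins, not the SDK's implementation.

```python
# Toy model of run() vs. submit() semantics (NOT the real SDK).
import threading

class ToyTransformJob:
    """Stand-in for a TransformJob: runs work in the background."""

    def __init__(self, work):
        self._result = None
        self._thread = threading.Thread(target=self._execute, args=(work,))
        self._thread.start()

    def _execute(self, work):
        self._result = work()

    def wait(self):
        # Block until the background work finishes, then return its result.
        self._thread.join()
        return self._result

def toy_submit(work):
    # Like submit(): start the job and return immediately.
    return ToyTransformJob(work)

def toy_run(work):
    # Like run(): submit, then block until the result (a dataset) is ready.
    return toy_submit(work).wait()
```

With detach=True, the real SDK goes one step further: the job runs server-side, so it survives even if the local process that submitted it exits.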
Next Steps
Core Concepts — Sample, Pipeline, Dataset, Transform Job
Seed Generators — News, GDELT, FileSet, FileSetQuery
Question Generators — Question, ForwardLooking, QuestionAndLabel, Template
Labeling and Context — WebSearchLabeler, NewsContextGenerator
Rollouts & Scoring — QuestionRenderer, RolloutGenerator, RolloutScorer, model consensus analysis
Answer Types — Binary, Continuous, MultipleChoice, FreeResponse
Datasets — Creating, fetching, and using datasets
Examples — Notebooks and tutorials