Rollouts and Scoring
Rollouts send rendered prompts to one or more LLMs and score their outputs against ground truth. Use them to benchmark models, compare forecasts, or analyze model consensus on forecasting questions.
Pipeline Flow
When you add rollout_generator and scorer to a QuestionPipeline, the pipeline runs two extra stages after rendering:
1. **Rollout** — each rendered prompt is sent to every configured model; each model returns a completion (and optionally structured output).
2. **Scoring** — each model's output is scored against the sample's label using a reward function that matches the answer type.
Samples end up with a rollouts list: one entry per model, each with model_name, content, parsed_output, and reward.
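To make the shape concrete, here is an illustrative (pure-Python) sketch of what a scored sample's `rollouts` list looks like; the field names come from this page, but the model names and values are hypothetical:

```python
# Illustrative shape of a scored sample's rollouts list.
# Field names (model_name, content, parsed_output, reward) are as documented;
# the model names and all values below are hypothetical.
sample_rollouts = [
    {
        "model_name": "model-a",
        "content": "Probability: 0.72",  # raw completion text
        "parsed_output": 0.72,           # parsed/structured answer
        "reward": -0.0784,               # reward assigned by the scorer (value hypothetical)
    },
    {
        "model_name": "model-b",
        "content": "Probability: 0.40",
        "parsed_output": 0.40,
        "reward": -0.36,
    },
]

# One entry per configured model.
model_names = [r["model_name"] for r in sample_rollouts]
```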
QuestionRenderer
Formats questions into prompts for model input. The rendered prompt becomes sample.prompt, which RolloutGenerator sends to each model. Optional; a default renderer is used if omitted.
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `template` | `str` | No | — | Custom template. Placeholders: `{question_text}`, `{context}`, `{answer_instructions}` |
| `answer_type` | `AnswerType` | No | — | Used to render answer instructions |
```python
QuestionRenderer(
    answer_type=BinaryAnswerType(),
)
```

RolloutGenerator
Sends prompts to multiple models. Requires a list of ModelConfig objects.
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `models` | `list[ModelConfig]` | Yes | — | Models to run (e.g. via `open_router_model()`) |
| `prompt_template` | `str \| None` | No | — | Template with `{column}` placeholders; if `None`, uses `sample.prompt` |
| `input_columns` | `list[str]` | No | — | Columns from `meta` to substitute into the template |
| `output_schema` | `Any` | No | — | Pydantic model for structured output |
RolloutScorer
Scores each rollout against the sample's label. The reward function depends on the answer type (Brier score for binary, log score for continuous, etc.).
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `answer_type` | `AnswerType` | Yes | — | Must match the pipeline's answer type |
| `multiple_choice_options` | `dict \| None` | No | — | For multiple choice: option key → display value |
| `is_mutually_exclusive` | `bool` | No | `True` | Whether options are mutually exclusive |
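The reward functions mentioned above (Brier score for binary, log score for continuous) can be sketched in plain Python. This is an illustration of the scoring rules themselves, not the library's implementation — the exact conventions `RolloutScorer` uses (sign, clipping, scaling) are not specified on this page:

```python
import math

def brier_score(prob: float, outcome: int) -> float:
    # Squared error between the forecast probability and the realized
    # outcome (0 or 1); lower is better, 0.0 is a perfect forecast.
    return (prob - outcome) ** 2

def log_score(prob: float, outcome: int, eps: float = 1e-12) -> float:
    # Negative log-likelihood of the realized outcome; lower is better.
    # Probabilities are clipped away from 0 and 1 to avoid infinities
    # (the clipping value here is an arbitrary choice for the sketch).
    p = min(max(prob, eps), 1 - eps)
    return -math.log(p if outcome == 1 else 1 - p)
```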
Analysis Utilities
After downloading samples with rollouts, use these utilities to analyze model performance and consensus.
compute_metrics_summary
Computes per-model accuracy and calibration metrics. For multiple choice, pass the options mapping as well.
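The exact signature and return shape of `compute_metrics_summary` are not shown here, but the kind of per-model aggregation it describes can be sketched in plain Python, assuming binary samples shaped like the rollout entries above (`label` plus a `rollouts` list):

```python
from collections import defaultdict

def metrics_summary(samples):
    """Sketch of per-model accuracy (forecast on the same side of 0.5 as
    the label) and mean Brier score. Assumes samples shaped like:
    {"label": 1, "rollouts": [{"model_name": "a", "parsed_output": 0.7}, ...]}"""
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "brier_sum": 0.0})
    for s in samples:
        label = s["label"]
        for r in s["rollouts"]:
            p = r["parsed_output"]
            st = stats[r["model_name"]]
            st["n"] += 1
            st["correct"] += int((p >= 0.5) == (label == 1))
            st["brier_sum"] += (p - label) ** 2
    return {
        m: {"accuracy": st["correct"] / st["n"],
            "mean_brier": st["brier_sum"] / st["n"]}
        for m, st in stats.items()
    }

# Hypothetical data: model "a" is right on both questions, "b" on neither.
samples = [
    {"label": 1, "rollouts": [{"model_name": "a", "parsed_output": 0.8},
                              {"model_name": "b", "parsed_output": 0.3}]},
    {"label": 0, "rollouts": [{"model_name": "a", "parsed_output": 0.2},
                              {"model_name": "b", "parsed_output": 0.6}]},
]
summary = metrics_summary(samples)
```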
compute_consensus
For binary/continuous probability forecasts: where do models agree or disagree?
Each entry has:
- `predictions`: model name → predicted probability
- `spread`: max minus min probability across models (higher = more disagreement)
- `all_agree`: whether all models predict the same side of 0.5
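These fields are straightforward to compute by hand; here is a pure-Python sketch of one consensus entry (not the library's implementation — in particular, how a forecast of exactly 0.5 is sided is a convention this sketch has to pick):

```python
def consensus_entry(predictions):
    """Consensus stats for one question, given {model_name: probability}."""
    probs = list(predictions.values())
    return {
        "predictions": predictions,
        # Max minus min probability: higher means more disagreement.
        "spread": max(probs) - min(probs),
        # Same side of 0.5 for every model (>= 0.5 counts as "yes" here,
        # an arbitrary tie-breaking convention for this sketch).
        "all_agree": all(p >= 0.5 for p in probs) or all(p < 0.5 for p in probs),
    }

entry = consensus_entry({"model-a": 0.72, "model-b": 0.40})
```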
compute_multi_choice_consensus / compute_consensus_summary
For multiple-choice tasks, use compute_multi_choice_consensus or compute_consensus_summary with the options mapping. See document_classification.ipynb.
When to Use Rollouts
| Goal | Approach |
| --- | --- |
| Benchmark multiple models on the same questions | Add `RolloutGenerator` + `RolloutScorer` to the pipeline |
| Compare model consensus (agreement/disagreement) | Run the pipeline, then `compute_consensus(samples)` |
| Per-model accuracy and calibration | Run the pipeline, then `compute_metrics_summary(samples)` |
| Training data only (no model comparison) | Omit `rollout_generator` and `scorer` |
Example: Rollouts in a QuestionPipeline
RolloutGenerator and RolloutScorer plug into QuestionPipeline as optional stages after rendering. The pipeline runs seeds → questions → labeling → rendering, then sends each rendered prompt to every configured model and scores the outputs.
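A minimal wiring sketch using only the names documented on this page; the other pipeline stages and any constructor arguments beyond those documented above are elided, not part of a known signature:

```python
# Sketch only — earlier stages (seeds, questions, labeling, rendering)
# and their arguments are elided; only the rollout/scoring wiring is shown.
pipeline = QuestionPipeline(
    # ... seed, question, labeling, and rendering configuration ...
    rollout_generator=RolloutGenerator(models=models),  # e.g. built via open_router_model()
    scorer=RolloutScorer(answer_type=BinaryAnswerType()),
)
```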
For a full end-to-end example including seed generation, question generation, and consensus analysis, see model_consensus.ipynb.
Related
Answer Types — Binary, Continuous, MultipleChoice affect scoring