Evaluation
Run evals on your trained model against a test dataset. Access via lr.evals on your LightningRod client.
EvalModel
Specify each model to evaluate when using create or run (you supply the full list). For run_from_training_job, the SDK adds the base and fine-tuned models for you; use EvalModel only for extra_models.
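| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | str | Yes | - | Model ID to evaluate (trained model or any supported model) |
| label | str \| None | No | None | Human-readable label shown in the results display |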
from lightningrod import EvalModel
EvalModel(model_id=job.model_id, label="my-fine-tune")
Methods
create
Create an eval job without waiting. You pass the complete list of models to compare:
from lightningrod import EvalModel
eval_job = lr.evals.create(
    dataset=test_dataset,
    models=[EvalModel(model_id=job.model_id, label="my-fine-tune")],
)
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | SampleDataset | - | Test dataset |
| models | list[EvalModel] | - | Models to evaluate |
| reasoning_comparison | ReasoningComparisonOptions \| None | Unset | Optional reasoning comparison config (see Reasoning Comparison) |
run
Create an eval job and poll until completion. In notebooks, this shows a live progress display. You pass the full models list, as with create, but the call blocks until the job finishes.
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | SampleDataset | - | Test dataset (e.g. test split). |
| models | list[EvalModel] | - | Models to evaluate (complete list). |
| reasoning_comparison | ReasoningComparisonOptions \| None | Unset | Optional reasoning comparison config (see Reasoning Comparison) |
run_from_training_job
Same waiting behavior as run, but builds the model list from your training config and completed job.
The benchmark always includes two models, in order:
1. Base: config.base_model_id (label "Base").
2. Fine-tuned: job.model_id after training completes (label "Fine-tuned").
You do not pass these explicitly. Use extra_models only for additional models (e.g. OpenAI baselines).
SFT: run_from_training_job raises NotImplementedError if config is SFTTrainingConfig, because SFT-specific eval metrics are not implemented yet. Use GRPO with run_from_training_job, or use run / create with your own EvalModel list.
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| config | GRPOTrainingConfig \| SFTTrainingConfig | - | Same training config you used with lr.training.run (SFT raises until supported). |
| job | TrainingJob | - | Completed job from lr.training.run; must have model_id set. |
| dataset | SampleDataset | - | Test dataset (e.g. test split). |
| extra_models | list[EvalModel] \| None | None | Optional extra models appended after Base and Fine-tuned. |
| reasoning_comparison_sample_size | int | 0 | Number of sample pairs for reasoning comparison between base and fine-tuned models. Set to 0 to disable. |
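The model ordering described for run_from_training_job (Base, Fine-tuned, then any extra_models) can be illustrated with a small sketch. It uses a stand-in EvalModel dataclass so that it runs without the SDK installed, and build_model_list is a hypothetical helper written for this example, not a lightningrod function. It mirrors the documented order only and says nothing about how the SDK implements it; the model IDs are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalModel:
    """Stand-in for lightningrod.EvalModel (same two documented fields)."""
    model_id: str
    label: Optional[str] = None


def build_model_list(
    base_model_id: str,
    fine_tuned_model_id: str,
    extra_models: Optional[List[EvalModel]] = None,
) -> List[EvalModel]:
    """Hypothetical helper: reproduces the documented order, not SDK internals."""
    models = [
        EvalModel(model_id=base_model_id, label="Base"),
        EvalModel(model_id=fine_tuned_model_id, label="Fine-tuned"),
    ]
    # Extra models, if any, come after the two built-in entries.
    models.extend(extra_models or [])
    return models


models = build_model_list(
    "example/base-model",          # placeholder for config.base_model_id
    "example/fine-tuned-adapter",  # placeholder for job.model_id
    extra_models=[EvalModel(model_id="example/baseline", label="Baseline")],
)
print([m.label for m in models])
```

If you need a different order or different labels, build the list yourself and pass it to run or create instead.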
get
Fetch a single eval job by ID:
list
List eval jobs with pagination:
print_eval
Pretty-print eval results:
Evaluating Intermediate Checkpoints
run_from_training_job always evaluates the final job.model_id in the fine-tuned slot. To compare intermediate checkpoints, whether by swapping a different checkpoint in as one of the EvalModel entries or by evaluating several checkpoints in one job, use run / create and pass the full EvalModel list yourself.
After training finishes, a completed TrainingJob exposes:
- job.model_id: the final trained adapter (the same ID run_from_training_job uses as "Fine-tuned").
- job.model_id_by_step: a mapping from training step to checkpoint model ID. Keys are strings (e.g. "10", "20"). Values are the same model ID strings you pass to EvalModel. Inspect which steps exist on your job (print(job.model_id_by_step) or iterate over the keys) instead of assuming a particular step number.
Checkpoint cadence follows your save_frequency (and server defaults when omitted). Which step keys appear depends on the run; read them from the completed job.
model_id_by_step may be None or unset for some job states; only use keys that are actually present. SFT workflows that cannot use run_from_training_job for metrics still use the same pattern: build an explicit list of EvalModels, including IDs from model_id_by_step when available.
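As a minimal sketch of the pattern above, the following turns a model_id_by_step mapping into an explicit list of models. It assumes only what is stated here: keys are step numbers stored as strings, and the mapping may be None. EvalModel is again a stand-in dataclass so the snippet runs without the SDK, checkpoint_models is a hypothetical helper, and the checkpoint IDs and labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalModel:
    """Stand-in for lightningrod.EvalModel (same two documented fields)."""
    model_id: str
    label: Optional[str] = None


def checkpoint_models(
    model_id_by_step: Optional[Dict[str, str]],
    final_model_id: str,
) -> List[EvalModel]:
    """Hypothetical helper: one entry per saved checkpoint, then the final model."""
    # The mapping may be None or unset for some job states.
    steps = model_id_by_step or {}
    models = []
    # Keys are strings, so sort by integer value; a plain string sort
    # would put "100" before "20".
    for step in sorted(steps, key=int):
        models.append(EvalModel(model_id=steps[step], label=f"step-{step}"))
    models.append(EvalModel(model_id=final_model_id, label="final"))
    return models


# Placeholder values standing in for job.model_id_by_step and job.model_id.
example_steps = {"20": "ckpt-20", "100": "ckpt-100", "10": "ckpt-10"}
models = checkpoint_models(example_steps, "final-adapter")
print([m.label for m in models])
```

A list built this way would then be passed as models to lr.evals.run or lr.evals.create.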
Use run if you want the eval to block until completion with the live progress display; use create if you only need to submit the eval and then poll or fetch it yourself.
Same models with create (no built-in wait; fetch or poll eval_job.id as needed):
Reasoning Comparison
Compare the reasoning quality of two models side-by-side using an LLM judge. The judge evaluates n sample pairs and produces a report accessible on the completed EvalJob.
ReasoningComparisonOptions
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model_a_id | str | - | First model ID (typically the base model) |
| model_b_id | str | - | Second model ID (typically the fine-tuned model) |
| judge_model_id | str | "anthropic/claude-sonnet-4.6" | Model used as the judge |
| n | int | 10 | Number of sample pairs to compare |
| instructions | str | "Compare the reasoning quality of the two models." | Custom instructions for the judge |
With run or create
Pass a ReasoningComparisonOptions directly:
With run_from_training_job
Use the shorthand reasoning_comparison_sample_size parameter; the SDK fills in model_a_id (base) and model_b_id (fine-tuned) for you:
The reasoning comparison report is available on the completed job as eval_job.reasoning_comparison_report.
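To make the relationship between the two entry points concrete, here is a rough sketch of how the shorthand could map onto the full options object. The ReasoningComparisonOptions class below is a stand-in dataclass that copies the documented fields and defaults so the snippet runs without the SDK. options_from_sample_size is a hypothetical helper, and the assumption that the sample size becomes n is ours, inferred from the two parameter descriptions rather than confirmed by the SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasoningComparisonOptions:
    """Stand-in mirroring the documented fields and defaults."""
    model_a_id: str
    model_b_id: str
    judge_model_id: str = "anthropic/claude-sonnet-4.6"
    n: int = 10
    instructions: str = "Compare the reasoning quality of the two models."


def options_from_sample_size(
    base_model_id: str,
    fine_tuned_model_id: str,
    sample_size: int,
) -> Optional[ReasoningComparisonOptions]:
    """Hypothetical helper: 0 disables the comparison, otherwise fill in both IDs."""
    if sample_size <= 0:
        return None
    return ReasoningComparisonOptions(
        model_a_id=base_model_id,        # base model goes in slot A
        model_b_id=fine_tuned_model_id,  # fine-tuned model goes in slot B
        n=sample_size,                   # assumed mapping: sample size -> n
    )


disabled = options_from_sample_size("example/base", "example/fine-tuned", 0)
enabled = options_from_sample_size("example/base", "example/fine-tuned", 5)
print(disabled, enabled.n, enabled.judge_model_id)
```

With run or create you would construct the options object yourself, which also lets you change judge_model_id or instructions.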
Downloading Results
After an eval completes, download the per-model rollout results as Parquet files.
get_results
Get signed download URLs for eval result files:
download_results
Download result files to disk:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| eval_id | str | - | Eval job ID |
| output_dir | str \| Path | "." | Directory to save files |
| timeout | float | 1800.0 | HTTP download timeout in seconds |
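Once download_results has written files to output_dir, you may want to locate them before loading. The sketch below uses only the standard library and assumes the result files carry a .parquet extension and sit directly inside output_dir; neither detail is confirmed here, and the file names in the demo are placeholders. find_result_files is a hypothetical helper, not part of the SDK.

```python
import tempfile
from pathlib import Path
from typing import List


def find_result_files(output_dir: str = ".") -> List[Path]:
    """Return Parquet files found directly inside output_dir, sorted by path."""
    return sorted(Path(output_dir).glob("*.parquet"))


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("model_b.parquet", "model_a.parquet", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_result_files(tmp)]

print(found)
```

Each real file could then be opened with a Parquet reader such as pandas.read_parquet.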
load_results
Load result files directly into pandas DataFrames (requires pandas):
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| eval_id | str | - | Eval job ID |
| timeout | float | 1800.0 | HTTP download timeout in seconds |
Example
See notebooks/getting_started/05_grpo_training.ipynb for the full GRPO workflow including evals.run_from_training_job.