Evaluation

Run evals on your trained model against a test dataset. Access via lr.evals on your LightningRod client.

EvalModel

Specify each model to evaluate when using create or run (you supply the full list). For run_from_training_job, the SDK adds the base and fine-tuned models for you; use EvalModel only for extra_models.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | str | Yes | | Model ID to evaluate (trained model or any supported model) |
| label | str \| None | No | None | Human-readable label shown in the results display |

from lightningrod import EvalModel

EvalModel(model_id=job.model_id, label="my-fine-tune")

Methods

create

Create an eval job without waiting. You pass the complete list of models to compare:

from lightningrod import EvalModel

eval_job = lr.evals.create(
    dataset=test_dataset,
    models=[EvalModel(model_id=job.model_id, label="my-fine-tune")],
)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | SampleDataset | | Test dataset |
| models | list[EvalModel] | | Models to evaluate |
| reasoning_comparison | ReasoningComparisonOptions \| None | Unset | Optional reasoning comparison config (see Reasoning Comparison) |

run

Create an eval job and poll until completion. In notebooks, run shows a live progress display. Pass the full models list, as with create; the difference is that run waits for the job to finish.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dataset | SampleDataset | | Test dataset (e.g. test split). |
| models | list[EvalModel] | | Models to evaluate (complete list). |
| reasoning_comparison | ReasoningComparisonOptions \| None | Unset | Optional reasoning comparison config (see Reasoning Comparison) |

run_from_training_job

Same waiting behavior as run, but builds the model list from your training config and completed job.

The benchmark always includes two models, in order:

  1. Base: config.base_model_id (label "Base").

  2. Fine-tuned: job.model_id after training completes (label "Fine-tuned").

You do not pass these explicitly. Use extra_models only for additional models (e.g. OpenAI baselines).
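The ordering described above (Base, then Fine-tuned, then any extras) can be sketched in plain Python. This is only an illustration of the documented behavior, not the SDK's implementation: `ModelSpec` and `build_benchmark_models` are hypothetical names, and the model IDs are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for EvalModel (a model_id plus an optional label)."""
    model_id: str
    label: Optional[str] = None


def build_benchmark_models(base_model_id, fine_tuned_model_id, extra_models=None):
    """Return Base, then Fine-tuned, then any extra models, in that order."""
    if not fine_tuned_model_id:
        # Mirrors the documented requirement that the job has model_id set.
        raise ValueError("the training job has no model_id; did it complete?")
    models = [
        ModelSpec(model_id=base_model_id, label="Base"),
        ModelSpec(model_id=fine_tuned_model_id, label="Fine-tuned"),
    ]
    # Extras are appended after the two built-in entries.
    models.extend(extra_models or [])
    return models


models = build_benchmark_models(
    "example/base-model",        # placeholder ID
    "example/fine-tuned-model",  # placeholder ID
    extra_models=[ModelSpec("example/other-model", label="Baseline")],
)
print([m.label for m in models])
```

If you need a different order, or a different model in either of the first two slots, build the list yourself and pass it to run or create instead.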

SFT: run_from_training_job raises NotImplementedError if config is SFTTrainingConfig, because SFT-specific eval metrics are not implemented yet. Use GRPO with run_from_training_job, or use run / create with your own EvalModel list.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| config | GRPOTrainingConfig \| SFTTrainingConfig | | Same training config you used with lr.training.run (SFT raises until supported). |
| job | TrainingJob | | Completed job from lr.training.run; must have model_id set. |
| dataset | SampleDataset | | Test dataset (e.g. test split). |
| extra_models | list[EvalModel] \| None | None | Optional extra models appended after Base and Fine-tuned. |
| reasoning_comparison_sample_size | int | 0 | Number of sample pairs for reasoning comparison between base and fine-tuned models. Set to 0 to disable. |

get

Fetch a single eval job by ID.

list

List eval jobs with pagination.

Pretty-print eval results.

Evaluating Intermediate Checkpoints

run_from_training_job always evaluates the final job.model_id as the fine-tuned slot. To compare intermediate checkpoints, swap in a different checkpoint as one of the EvalModel entries, or evaluate several checkpoints in one job, use run / create and pass the full EvalModel list yourself.

After training finishes, a completed TrainingJob exposes:

  • job.model_id — the final trained adapter (same ID run_from_training_job uses as "Fine-tuned").

  • job.model_id_by_step — a mapping from training step to checkpoint model ID. Keys are strings (e.g. "10", "20"). Values are the same model ID strings you pass to EvalModel. Inspect which steps exist on your job (print(job.model_id_by_step) or iterate keys) instead of assuming a particular step number.

Checkpoint cadence follows your save_frequency (and server defaults when omitted). Which step keys appear depends on the run; read them from the completed job.

model_id_by_step may be None or unset for some job states; only use keys that are actually present. SFT workflows that cannot use run_from_training_job for metrics still use the same pattern: build an explicit list of EvalModels, including IDs from model_id_by_step when available.
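Because the step keys are strings, sorting them directly would place "100" before "20". The sketch below orders checkpoints numerically and tolerates a missing mapping. `checkpoint_entries` is a hypothetical helper, not part of the SDK, and the IDs are placeholders.

```python
def checkpoint_entries(model_id_by_step, final_model_id=None):
    """Return (label, model_id) pairs ordered by training step.

    model_id_by_step may be None; its keys are step numbers stored as strings.
    """
    entries = []
    steps = model_id_by_step or {}
    # Sort numerically: plain string sorting would put "100" before "20".
    for step in sorted(steps, key=int):
        entries.append((f"step-{step}", steps[step]))
    # Add the final model unless a checkpoint already points at the same ID.
    if final_model_id and final_model_id not in steps.values():
        entries.append(("final", final_model_id))
    return entries


# Placeholder IDs for illustration only.
example = {"100": "ckpt-c", "20": "ckpt-b", "10": "ckpt-a"}
for label, model_id in checkpoint_entries(example, final_model_id="ckpt-final"):
    print(label, model_id)
```

Each pair can then be turned into an `EvalModel(model_id=..., label=...)` entry in the list you pass to run or create.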

Use run if you want the same waited, live progress behavior as elsewhere; use create if you only need to submit the eval and poll or fetch it yourself.

The same model list works with create, which has no built-in wait; fetch or poll eval_job.id as needed.
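If you submit with create and want to wait on your own terms, a generic polling loop is one option. The sketch below is deliberately SDK-agnostic: `poll_until` is a hypothetical helper, and both the fetch call and the completion check are supplied by you, since this page does not document the job's status fields.

```python
import time


def poll_until(fetch, is_done, interval=5.0, timeout=1800.0, sleep=time.sleep):
    """Call fetch() repeatedly until is_done(result) is true, then return result."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("eval job did not finish before the timeout")
        sleep(interval)


# Demonstration with a fake job that reports completion on the third fetch.
_states = iter(["running", "running", "done"])
final = poll_until(
    fetch=lambda: next(_states),
    is_done=lambda state: state == "done",
    sleep=lambda _seconds: None,  # skip real sleeping in the demo
)
print(final)
```

In real use, `fetch` would wrap a call such as `lr.evals.get` with the eval job's ID (the exact call shape is an assumption here), and `is_done` would inspect whatever completion indicator the returned job exposes.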

Reasoning Comparison

Compare the reasoning quality of two models side-by-side using an LLM judge. The judge evaluates n sample pairs and produces a report accessible on the completed EvalJob.

ReasoningComparisonOptions

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_a_id | str | | First model ID (typically the base model) |
| model_b_id | str | | Second model ID (typically the fine-tuned model) |
| judge_model_id | str | "anthropic/claude-sonnet-4.6" | Model used as the judge |
| n | int | 10 | Number of sample pairs to compare |
| instructions | str | "Compare the reasoning quality of the two models." | Custom instructions for the judge |

With run or create

Pass a ReasoningComparisonOptions directly.

With run_from_training_job

Use the shorthand reasoning_comparison_sample_size parameter; the SDK fills in model_a_id (base) and model_b_id (fine-tuned) for you.

The reasoning comparison report is available on the completed job as eval_job.reasoning_comparison_report.
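To make the relationship between the shorthand and the full options concrete, here is a rough sketch of how a sample size could map onto the option fields. It is an assumption about the mapping, not the SDK's code: `ComparisonSpec` is a hypothetical stand-in that mirrors the ReasoningComparisonOptions table, `comparison_from_sample_size` is a hypothetical helper, and the IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComparisonSpec:
    """Hypothetical stand-in mirroring the ReasoningComparisonOptions fields."""
    model_a_id: str
    model_b_id: str
    judge_model_id: str = "anthropic/claude-sonnet-4.6"
    n: int = 10
    instructions: str = "Compare the reasoning quality of the two models."


def comparison_from_sample_size(base_model_id, fine_tuned_model_id, sample_size=0):
    """Return None when disabled (0), else a spec comparing base vs fine-tuned."""
    if sample_size < 0:
        raise ValueError("sample_size must be zero or positive")
    if sample_size == 0:
        return None  # documented behavior: 0 disables the comparison
    return ComparisonSpec(
        model_a_id=base_model_id,        # base model fills slot A
        model_b_id=fine_tuned_model_id,  # fine-tuned model fills slot B
        n=sample_size,                   # assumed: sample size becomes n
    )


spec = comparison_from_sample_size(
    "example/base-model", "example/fine-tuned-model", sample_size=25
)
print(spec)
```

When you need a different judge model or custom instructions, the shorthand is not enough; pass a full ReasoningComparisonOptions to run or create instead.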

Downloading Results

After an eval completes, download the per-model rollout results as Parquet files.

get_results

Get signed download URLs for eval result files.

download_results

Download result files to disk.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| eval_id | str | | Eval job ID |
| output_dir | str \| Path | "." | Directory to save files |
| timeout | float | 1800.0 | HTTP download timeout in seconds |
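Once the files are on disk, you may want to locate them programmatically. The sketch below uses only the standard library and assumes the downloaded files carry a `.parquet` extension; the actual file names are not specified on this page, and `find_result_files` is a hypothetical helper.

```python
from pathlib import Path


def find_result_files(output_dir="."):
    """Map each Parquet file's stem to its path, ordered by file name."""
    directory = Path(output_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    # Assumption: result files end in ".parquet"; other files are ignored.
    return {path.stem: path for path in sorted(directory.glob("*.parquet"))}


for name, path in find_result_files(".").items():
    print(name, path)
```

Each path can then be opened with `pandas.read_parquet(path)` if pandas and a Parquet engine are installed, or you can skip the intermediate files and use load_results.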

load_results

Load result files directly into pandas DataFrames (requires pandas).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| eval_id | str | | Eval job ID |
| timeout | float | 1800.0 | HTTP download timeout in seconds |

Example

See notebooks/getting_started/05_grpo_training.ipynb for the full GRPO workflow including evals.run_from_training_job.
