Data Preparation
prepare_for_training is the main entry point for preparing Lightning Rod datasets for model training. It filters invalid samples, deduplicates, splits into train/test, and returns SampleDataset objects. Those objects are ready for lr.training.run(...) with a GRPOTrainingConfig or SFTTrainingConfig, and for evaluation through lr.evals.run_from_training_job(config, job, test_dataset) (GRPO) or lr.evals.create(...) when you supply the full model list (e.g. SFT or custom evals).
What It Does
Filter — Drops invalid samples, optionally by resolution horizon and context presence
Deduplicate — Removes duplicates by (question_text, resolution_date) or custom key
Split — Splits into train/test by temporal order or random shuffle
Parameters
```python
from lightningrod.training import prepare_for_training, FilterParams, DedupParams, SplitParams

train_dataset, test_dataset = prepare_for_training(
    dataset,
    filter=FilterParams(...),
    dedup=DedupParams(...),
    split=SplitParams(...),
    verbose=True,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset` | `SampleDataset` | - | Dataset to prepare (required) |
| `filter` | `FilterParams \| None` | `None` | Filtering options (see below) |
| `dedup` | `DedupParams \| None` | `None` | Deduplication options (see below) |
| `split` | `SplitParams \| None` | `None` | Train/test split options (see below) |
| `verbose` | `bool` | `True` | Print filter/dedup/split stats |
FilterParams
Controls which samples are kept before splitting.
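| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `days_to_resolution_range` | `(min, max) \| None` | `None` | Filter by resolution horizon. E.g. `(90, None)` keeps ≥ 90 days; `(14, 60)` keeps 14–60 days |
| `drop_missing_context` | `bool` | `False` | Exclude samples with no rendered context |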
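The `(min, max)` semantics of `days_to_resolution_range` can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `within_horizon` is hypothetical, and treating both bounds as inclusive is an assumption based on the "14–60 days" wording above.

```python
from typing import Optional, Tuple

# (min_days, max_days); either side may be None to leave that bound open.
Range = Tuple[Optional[int], Optional[int]]


def within_horizon(days_to_resolution: int, horizon: Optional[Range]) -> bool:
    """Hypothetical helper: True if the horizon lies inside the range.

    Assumes both bounds are inclusive, which the docs do not state explicitly.
    """
    if horizon is None:
        return True  # no horizon filter configured
    lo, hi = horizon
    if lo is not None and days_to_resolution < lo:
        return False
    if hi is not None and days_to_resolution > hi:
        return False
    return True


horizons = [3, 14, 45, 60, 90, 200]
long_only = [d for d in horizons if within_horizon(d, (90, None))]
mid_range = [d for d in horizons if within_horizon(d, (14, 60))]
print(long_only)  # [90, 200]
print(mid_range)  # [14, 45, 60]
```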
DedupParams
Controls duplicate removal.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `key_fn` | `Callable \| None` | `None` | Custom dedup key; default is `(question_text, resolution_date)` |
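As a rough illustration of key-based deduplication, the sketch below keeps the first sample seen for each key. It is not the library's code: samples are modeled as plain dicts, the `dedup` and `default_key` helpers are hypothetical, and both the keep-first policy and the assumption that `key_fn` receives a single sample are guesses.

```python
from typing import Any, Callable, Dict, Hashable, List, Optional

Sample = Dict[str, Any]  # hypothetical sample shape, for illustration only


def default_key(sample: Sample) -> Hashable:
    """Mirror the documented default key: (question_text, resolution_date)."""
    return (sample["question_text"], sample["resolution_date"])


def dedup(
    samples: List[Sample],
    key_fn: Optional[Callable[[Sample], Hashable]] = None,
) -> List[Sample]:
    """Keep the first sample for each key (keep-first is an assumption)."""
    key = key_fn or default_key
    seen = set()
    kept = []
    for sample in samples:
        k = key(sample)
        if k in seen:
            continue  # duplicate key: skip
        seen.add(k)
        kept.append(sample)
    return kept


samples = [
    {"question_text": "Will X happen?", "resolution_date": "2025-03-01"},
    {"question_text": "Will X happen?", "resolution_date": "2025-03-01"},
    {"question_text": "Will X happen?", "resolution_date": "2025-06-01"},
    {"question_text": "will x happen?", "resolution_date": "2025-03-01"},
]

by_default = dedup(samples)
by_text = dedup(samples, key_fn=lambda s: s["question_text"].strip().lower())
print(len(by_default))  # 3
print(len(by_text))     # 1
```

A custom key like the case-insensitive one above is stricter than the default: it collapses the same question asked for different resolution dates, which may or may not be what you want.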
SplitParams
Controls how samples are divided into train and test sets.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `strategy` | `str` | `"temporal"` | `"temporal"` (chronological) or `"random"` |
| `test_size` | `float \| None` | `0.2` | Fraction of samples for test set (0.0–1.0) |
| `test_start` | `str \| None` | `None` | ISO date for temporal split cutoff; alternative to `test_size` for exact date control |
| `random_state` | `int` | `196` | Seed for reproducible random splits |
| `filter_leaky_train` | `bool` | `True` | Remove train samples whose `date_close` or `resolution_date` falls within the test period |
| `sort_key` | `Callable \| None` | `None` | Custom sort key for temporal ordering; defaults to `resolution_date` |
| `leakage_keys` | `list[Callable] \| None` | `None` | Additional fields to check for temporal leakage |
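To make the interaction between `test_size`, `test_start`, `sort_key`, and `filter_leaky_train` more concrete, here is a minimal pure-Python sketch of a temporal split. It is not the library's implementation. Samples are modeled as dicts, the `temporal_split` helper is hypothetical, and the sketch assumes the test period starts at the cutoff date and is open-ended, which the docs do not spell out.

```python
from datetime import date
from typing import Any, Callable, Dict, List, Optional, Tuple

Sample = Dict[str, Any]  # hypothetical sample shape, for illustration only


def by_resolution_date(sample: Sample) -> date:
    return sample["resolution_date"]


def temporal_split(
    samples: List[Sample],
    test_size: float = 0.2,
    test_start: Optional[str] = None,
    sort_key: Callable[[Sample], date] = by_resolution_date,
    filter_leaky_train: bool = True,
) -> Tuple[List[Sample], List[Sample]]:
    ordered = sorted(samples, key=sort_key)
    if not ordered:
        return [], []
    if test_start is not None:
        cutoff = date.fromisoformat(test_start)  # exact date control
    else:
        # Fixed proportion: the latest share of samples defines the cutoff.
        n_test = min(len(ordered), max(1, round(len(ordered) * test_size)))
        cutoff = sort_key(ordered[-n_test])
    train = [s for s in ordered if sort_key(s) < cutoff]
    test = [s for s in ordered if sort_key(s) >= cutoff]
    if filter_leaky_train:
        # Assumed rule: drop train samples that close or resolve at/after the cutoff.
        train = [
            s for s in train
            if s["date_close"] < cutoff and s["resolution_date"] < cutoff
        ]
    return train, test


samples = [
    {"id": "a", "date_close": date(2025, 1, 10), "resolution_date": date(2025, 1, 31)},
    {"id": "b", "date_close": date(2025, 1, 20), "resolution_date": date(2025, 3, 15)},
    {"id": "c", "date_close": date(2025, 2, 5), "resolution_date": date(2025, 2, 20)},
    {"id": "d", "date_close": date(2025, 3, 2), "resolution_date": date(2025, 3, 20)},
    {"id": "e", "date_close": date(2025, 3, 10), "resolution_date": date(2025, 4, 1)},
]

# Ordering by date_close: "b" closes before the cutoff but resolves after it,
# so the leak filter removes it from the train set.
train, test = temporal_split(
    samples, test_start="2025-03-01", sort_key=lambda s: s["date_close"]
)
print([s["id"] for s in train])  # ['a', 'c']
print([s["id"] for s in test])   # ['d', 'e']
```

In this sketch the leak filter only changes the result when the ordering field differs from the fields being checked. With the default ordering by `resolution_date`, sample "b" lands in the test set instead.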
Guidelines
Prefer the `"temporal"` split for forecasting: it reflects real-world deployment, where the model never sees future questions during training.
Use `test_start` vs `test_size`: use `test_start` when you want the test set to cover a specific time window (e.g. a recent month); use `test_size` when you just want a fixed proportion.
Use `days_to_resolution_range` to control difficulty: short horizons (e.g. `(7, 30)`) are easier; longer horizons (e.g. `(90, None)`) test genuine uncertainty.
Set `drop_missing_context=True` only if your dataset was generated with `context_generators` (e.g. `NewsContextGenerator`) in `QuestionPipeline`. Context is absent by default, so enabling this on a dataset without context will drop all samples. See Labeling and Context.
Keep `filter_leaky_train=True` (default) unless you intentionally want to test without temporal leak protection.
Use `verbose=True` on first runs to understand how many samples are dropped at each stage and why.
Example
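A minimal usage sketch, built only from the parameters documented on this page. The wrapper name `prepare_forecasting_splits` is hypothetical, and passing the listed fields as keyword arguments to `FilterParams` and `SplitParams` is an assumption. The import sits inside the function only so the snippet can be loaded without the package installed.

```python
def prepare_forecasting_splits(dataset):
    """Filter to a 14-60 day horizon, then split chronologically (sketch).

    `dataset` is expected to be a SampleDataset; dedup is left at its default.
    """
    # Lazy import: assumes the lightningrod package is installed at call time.
    from lightningrod.training import (
        FilterParams,
        SplitParams,
        prepare_for_training,
    )

    train_dataset, test_dataset = prepare_for_training(
        dataset,
        # Keyword names are taken from the parameter tables on this page.
        filter=FilterParams(days_to_resolution_range=(14, 60)),
        split=SplitParams(strategy="temporal", test_size=0.2),
        verbose=True,  # print per-stage drop counts on the first run
    )
    return train_dataset, test_dataset
```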
Output
Returns (train_dataset, test_dataset): two SampleDataset objects. Pass train_dataset to lr.training.run(training_config, dataset=train_dataset) with a GRPO or SFT config, then evaluate on test_dataset via lr.evals.run_from_training_job(training_config, job, test_dataset) or lr.evals.create(...).
