Data Preparation
`filter_and_split` is the main entry point for preparing Lightning Rod datasets for model training. It filters invalid samples, deduplicates, splits into train/test, and returns `SampleDataset` objects ready for `lr.training.run()` or `lr.evals.run()`.
What It Does
- **Filter** — drops invalid samples, optionally by resolution horizon and context presence
- **Deduplicate** — removes duplicates by `(question_text, resolution_date)` or a custom key
- **Split** — splits into train/test by temporal order or random shuffle
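The three stages can be sketched in plain Python on toy dictionary samples. This is an illustrative sketch, not the library's implementation; only the dedup key fields (`question_text`, `resolution_date`) and the default `test_size=0.2` come from this page — the `valid` flag and sample shape are assumptions.

```python
from datetime import date

# Toy samples; field names mirror the default dedup key documented below.
samples = [
    {"question_text": "Will X happen?", "resolution_date": date(2024, 6, 1), "valid": True},
    {"question_text": "Will X happen?", "resolution_date": date(2024, 6, 1), "valid": True},  # duplicate
    {"question_text": "Will Y happen?", "resolution_date": date(2024, 9, 1), "valid": True},
    {"question_text": "Will Z happen?", "resolution_date": date(2024, 3, 1), "valid": False},
]

# 1. Filter: drop invalid samples.
filtered = [s for s in samples if s["valid"]]

# 2. Deduplicate by (question_text, resolution_date).
seen, deduped = set(), []
for s in filtered:
    key = (s["question_text"], s["resolution_date"])
    if key not in seen:
        seen.add(key)
        deduped.append(s)

# 3. Temporal split: order chronologically, hold out the most recent 20%.
deduped.sort(key=lambda s: s["resolution_date"])
cut = int(len(deduped) * (1 - 0.2))
train, test = deduped[:cut], deduped[cut:]
```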
Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset` | `SampleDataset` | — | Dataset to prepare (required) |
| `test_size` | `float` | `0.2` | Fraction of samples for the test set (0.0–1.0) |
| `split_strategy` | `str` | `"temporal"` | `"temporal"` (chronological) or `"random"` |
| `test_start` | `str \| None` | `None` | ISO date for the temporal split cutoff; alternative to `test_size` for exact date control |
| `drop_missing_context` | `bool` | `False` | Exclude samples with no rendered context |
| `days_to_resolution_range` | `(min, max) \| None` | `None` | Filter by resolution horizon, e.g. `(90, None)` keeps ≥ 90 days; `(14, 60)` keeps 14–60 days |
| `random_state` | `int` | `196` | Seed for reproducible random splits |
| `filter_leaky_train` | `bool` | `True` | Remove train samples whose `date_close` or `resolution_date` falls within the test period |
| `deduplicate_key_fn` | `Callable \| None` | `None` | Custom dedup key; default is `(question_text, resolution_date)` |
| `verbose` | `bool` | `False` | Print filter/dedup/split stats |
Guidelines
- **Prefer the `"temporal"` split for forecasting** — it reflects real-world deployment, where the model never sees future questions during training.
- **`test_start` vs. `test_size`** — use `test_start` when you want the test set to cover a specific time window (e.g. a recent month); use `test_size` when you just want a fixed proportion.
- **Use `days_to_resolution_range` to control difficulty** — short horizons (e.g. `(7, 30)`) are easier; longer horizons (e.g. `(90, None)`) test genuine uncertainty.
- **Set `drop_missing_context=True` only if your dataset was generated with `context_generators`** (e.g. `NewsContextGenerator`) in `QuestionPipeline`. Context is absent by default, so enabling this on a dataset without context will drop all samples. See Labeling and Context.
- **Keep `filter_leaky_train=True` (the default)** unless you intentionally want to test without temporal-leak protection.
- **Use `verbose=True` on first runs** to understand how many samples are dropped at each stage, and why.
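The temporal-leak check behind `filter_leaky_train` can be illustrated in plain Python. This is a sketch of the rule implied by the parameter description (drop train samples whose `date_close` or `resolution_date` falls within the test period), not the library's internal code; the sample shape and cutoff value are assumptions.

```python
from datetime import date

test_start = date(2024, 7, 1)  # start of the test period (assumed cutoff)

train = [
    {"question_text": "A", "date_close": date(2024, 5, 1), "resolution_date": date(2024, 6, 1)},
    {"question_text": "B", "date_close": date(2024, 6, 15), "resolution_date": date(2024, 8, 1)},  # resolves inside the test period: leaky
]

# Keep only train samples that both close and resolve before the test period begins.
safe_train = [
    s for s in train
    if s["date_close"] < test_start and s["resolution_date"] < test_start
]
```

A sample that closes before the cutoff but resolves after it would let the model train on an outcome revealed during the test window, which is why both dates are checked.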
Example
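The example block for this page did not survive extraction, so here is a hedged usage sketch built only from the parameters documented above. It assumes `filter_and_split` is already imported from the Lightning Rod package and that `dataset` is an existing `SampleDataset`; the specific argument values are illustrative, not recommendations.

```python
# Assumes `filter_and_split` is imported from the Lightning Rod package
# and `dataset` is a SampleDataset; parameters follow the table above.
train_dataset, test_dataset = filter_and_split(
    dataset,
    split_strategy="temporal",          # chronological split for forecasting
    test_start="2024-07-01",            # exact cutoff; alternative to test_size
    days_to_resolution_range=(14, 60),  # keep 14-60 day horizons
    filter_leaky_train=True,            # drop train samples overlapping the test period
    verbose=True,                       # print filter/dedup/split stats
)
```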
Output
Returns `(train_dataset, test_dataset)` — two `SampleDataset` objects ready for `lr.training.run(config, dataset=train_dataset)` and `lr.evals.run(model_id=..., dataset=test_dataset)`.