Seed Generators

Seed generators produce the raw data (seeds) that feeds question generation: news articles, GDELT results, BigQuery rows, or chunks from your documents. They are the first stage of the pipeline, so choose one based on where your source data lives.

NewsSeedGenerator

Fetches news articles from Google News search.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| start_date | datetime | Yes | | Start date for seed search |
| end_date | datetime | Yes | | End date for seed search |
| search_query | str or list[str] | Yes | | Search query. Multiple queries run separate searches |
| interval_duration_days | int | No | 7 | Duration of each interval in days |
| articles_per_search | int | No | 10 | Articles per search (max 100) |
| filter_criteria | FilterCriteria or list | No | | Optional LLM-based filtering before scraping |
| source_domain | str or list[str] | No | | Optional URL source (e.g. https://reuters.com/business) |

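```python
from datetime import datetime

# NewsSeedGenerator is assumed to be imported from the SDK already.
NewsSeedGenerator(
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2025, 3, 1),
    search_query=["Trump", "Fed rates"],  # each query runs its own search
    articles_per_search=20,  # default is 10, maximum is 100
)
```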
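
To get a feel for how start_date, end_date, and interval_duration_days interact, here is a minimal sketch of splitting a date range into consecutive windows. The helper split_into_intervals is hypothetical and is not part of the library. The sketch also assumes one search per query per interval, which the parameter descriptions suggest but do not confirm.

```python
from datetime import datetime, timedelta


def split_into_intervals(start_date, end_date, interval_duration_days=7):
    """Return consecutive (start, end) windows covering [start_date, end_date).

    Hypothetical helper for illustration; the library's own splitting
    logic may differ (for example in how it treats the final short window).
    """
    if interval_duration_days <= 0:
        raise ValueError("interval_duration_days must be positive")
    if end_date <= start_date:
        raise ValueError("end_date must be after start_date")

    step = timedelta(days=interval_duration_days)
    intervals = []
    cursor = start_date
    while cursor < end_date:
        # Clamp the last window so it never runs past end_date.
        window_end = min(cursor + step, end_date)
        intervals.append((cursor, window_end))
        cursor = window_end
    return intervals


# Same dates and queries as the NewsSeedGenerator example.
intervals = split_into_intervals(datetime(2025, 1, 1), datetime(2025, 3, 1))
queries = ["Trump", "Fed rates"]

# Assumption: one search per query per interval.
searches = len(intervals) * len(queries)
# Upper bound on fetched articles with articles_per_search=20.
max_articles = searches * 20
print(len(intervals), searches, max_articles)
```

Under these assumptions the 59-day range yields nine windows (the last one is three days long), so a shorter interval_duration_days increases the number of searches and therefore the article ceiling.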

GdeltSeedGenerator

Fetches articles from the GDELT global news database via BigQuery.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| start_date | datetime | Yes | | Start date for seed search |
| end_date | datetime | Yes | | End date for seed search |
| interval_duration_days | int | No | 7 | Duration of each interval in days |
| articles_per_interval | int | No | 1000 | Articles to fetch per interval from BigQuery |

BigQuerySeedGenerator

Runs a BigQuery SQL query and converts results into seeds. Use when your source data lives in BigQuery.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| query | str | Yes | | BigQuery SQL to execute |
| seed_text_column | str | No | "text" | Column mapped to seed text |
| date_column | str | No | | Column mapped to seed creation date |
| max_rows | int | No | 10000 | Total rows to fetch across all batches |
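
The column mapping can be pictured with a small sketch that runs on plain dictionaries rather than a live BigQuery result. The Seed dataclass and rows_to_seeds function are hypothetical stand-ins, not the library's types. Skipping rows with empty text is also an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import islice
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class Seed:
    """Hypothetical stand-in for a seed; not the library's seed type."""
    text: str
    created_at: Optional[datetime] = None


def rows_to_seeds(
    rows: Iterable[Dict[str, Any]],
    seed_text_column: str = "text",
    date_column: Optional[str] = None,
    max_rows: int = 10000,
) -> List[Seed]:
    """Map query result rows to seeds, reading at most max_rows rows."""
    seeds = []
    for row in islice(rows, max_rows):
        text = row.get(seed_text_column)
        if not text:
            continue  # assumption: rows without text produce no seed
        created_at = row.get(date_column) if date_column else None
        seeds.append(Seed(text=str(text), created_at=created_at))
    return seeds


# Rows shaped like the output of a query selecting headline and published_at.
rows = [
    {"headline": "Fed holds rates steady", "published_at": datetime(2025, 1, 29)},
    {"headline": "", "published_at": datetime(2025, 1, 30)},
    {"headline": "Airline earnings beat forecasts", "published_at": datetime(2025, 2, 3)},
]

seeds = rows_to_seeds(
    rows, seed_text_column="headline", date_column="published_at", max_rows=2
)
print(seeds)
```

If your query does not return a column named "text", either alias one in the SQL or set seed_text_column to the column that holds the seed text.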

FileSetSeedGenerator

Chunks documents from an uploaded file set. Use when you have PDFs, text files, or other documents.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes | | FileSet ID to read from |
| chunk_size | int | No | 4000 | Characters per chunk |
| chunk_overlap | int | No | 200 | Overlapping characters between chunks |
| metadata_filters | list[str] | No | | Metadata filters (e.g. ["ticker='AAL'"]). Files matching ANY filter are included |
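
The sketch below illustrates two behaviors from the table: files are kept when they match ANY metadata filter, and consecutive chunks share chunk_overlap characters. Both matches_any and chunk_text are hypothetical helpers written for this example. The key='value' filter parsing and the fixed-width character splitting are assumptions, and the library may chunk on different boundaries.

```python
def matches_any(file_metadata, metadata_filters):
    """Return True if the file satisfies at least one "key='value'" filter.

    Assumption: an empty or missing filter list includes every file.
    """
    if not metadata_filters:
        return True
    for expression in metadata_filters:
        key, _, value = expression.partition("=")
        key = key.strip()
        expected = value.strip().strip("'\"")
        if key in file_metadata and str(file_metadata[key]) == expected:
            return True
    return False


def chunk_text(text, chunk_size=4000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


files = [
    {"name": "aal_10k.pdf", "ticker": "AAL"},
    {"name": "dal_10k.pdf", "ticker": "DAL"},
    {"name": "ual_10k.pdf", "ticker": "UAL"},
]
filters = ["ticker='AAL'", "ticker='UAL'"]
selected = [f["name"] for f in files if matches_any(f, filters)]

# A 10,000-character document chunked with the default settings.
document = "".join(str(i % 10) for i in range(10000))
chunks = chunk_text(document)
print(selected, [len(c) for c in chunks])
```

With the defaults each chunk starts 3,800 characters after the previous one, so a larger chunk_overlap produces more chunks for the same document.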

FileSetQuerySeedGenerator

Runs RAG-style queries against a file set. Seeds come from the chunks each query retrieves rather than from every chunk in the file set.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes | | FileSet ID to query |
| prompts | list[str] | Yes | | Queries to run against the file set |
| metadata_filters | list[str] | No | | Optional metadata filters |

Using with QuestionPipeline

Set QuestionPipeline.seed_generator to any of the seed generators above.

Custom Input Seeds

To use your own samples instead of a seed generator, create a dataset with lr.datasets.create_from_samples() and pass it as input_dataset to lr.transforms.run(). The pipeline will skip seed generation and use your samples as input.
