Labeling and Context
Labeling resolves questions with ground truth; context enriches samples with relevant information, which leads to better results in training. This page covers the labelers, context generators, and filter criteria you plug into QuestionPipeline.
WebSearchLabeler
Resolves questions with ground truth via web search. Use when you need real-world answers.
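
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| answer_type | AnswerType | No | - | Expected answer type (guides the labeler) |
| confidence_threshold | float | No | 0.9 | Minimum confidence to include a question |
| resolve_redirects | bool | No | False | Resolve redirect URLs to destinations |
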
    WebSearchLabeler(
        answer_type=BinaryAnswerType(),
        confidence_threshold=0.9,
    )

Omit labeler when using QuestionAndLabelGenerator, which produces labels synthetically.
NewsContextGenerator
Enriches samples with relevant news articles. Add to context_generators in QuestionPipeline.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| num_search_queries | int | No | 5 | Search queries per question |
| articles_per_query | int | No | 3 | Articles per search query |
| num_articles | int | No | 10 | Max articles in final output |
| relevance_threshold | int | No | 2 | Min relevance (1–6) to include |
| min_articles | int | No | 6 | Minimum articles to ensure |
| time_delta_days | int | No | 30 | Days to look back for news |
| enable_relevance_ranking | bool | No | True | Use LLM-based relevance ranking |

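One plausible reading of how these parameters combine is sketched below in plain Python. The selection logic (rank by relevance, keep articles at or above relevance_threshold, backfill to min_articles, cap at num_articles) is an assumption made for illustration, not the library's confirmed behavior. `Article` and `select_articles` are hypothetical names.

```python
from typing import List, NamedTuple


class Article(NamedTuple):
    # Hypothetical stand-in for a retrieved article; not a library type.
    url: str
    relevance: int  # assumed 1-6 scale, higher means more relevant


def select_articles(
    candidates: List[Article],
    relevance_threshold: int = 2,
    min_articles: int = 6,
    num_articles: int = 10,
) -> List[Article]:
    """Assumed logic: filter by relevance, backfill if too few, then cap."""
    ranked = sorted(candidates, key=lambda a: a.relevance, reverse=True)
    kept = [a for a in ranked if a.relevance >= relevance_threshold]
    if len(kept) < min_articles:
        # Backfill with the best remaining candidates to approach min_articles.
        kept = ranked[:min_articles]
    return kept[:num_articles]


# With the defaults, at most 5 queries x 3 articles = 15 candidates are fetched.
max_candidates = 5 * 3
candidates = [
    Article(f"https://example.com/{i}", relevance=(i % 6) + 1)
    for i in range(max_candidates)
]
selected = select_articles(candidates)
```

Under these assumptions the default settings fetch up to 15 candidates and return no more than 10 of them, so raising num_search_queries widens the candidate pool without growing the final context.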
QdrantContextGenerator
Retrieves context from documents in a FileSet via vector search. Builds a Qdrant index on first use (chunking + embedding with BAAI/bge-small-en-v1.5), then retrieves the top-k most semantically relevant chunks per question. Use when your seeds come from a FileSet and you want to enrich questions with passages that may be scattered across many documents. Add to context_generators in QuestionPipeline.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes* | - | FileSet ID to load the Qdrant collection from (*or collection_name for direct injection) |
| top_k | int | No | 5 | Number of chunks to retrieve |
| temporal_direction | str | No | - | "before" (timestamp ≤ seed date, includes seed's doc) or "after" (timestamp > seed date) |
| payload_filters | dict[str, str] | No | - | {qdrant_payload_key: sample_meta_key} mapping; values are pulled from the sample's metadata at query time |
| index_chunk_size | int | No | 1500 | Chunk size used when building the backing Qdrant index |
| index_chunk_overlap | int | No | 150 | Chunk overlap used when building the backing Qdrant index |
| embedding_model | str | No | "BAAI/bge-small-en-v1.5" | FastEmbed model name for query embedding |

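The semantics of payload_filters and temporal_direction can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's implementation: `resolve_payload_filters` and `matches_direction` are hypothetical helpers, and raising on a missing metadata key or an unknown direction is an assumption.

```python
from datetime import date
from typing import Dict, Optional


def resolve_payload_filters(
    payload_filters: Dict[str, str], sample_meta: Dict[str, str]
) -> Dict[str, str]:
    """Look up each sample metadata key and attach its value to the Qdrant payload key."""
    # Assumption: a missing metadata key is an error (raises KeyError).
    return {
        payload_key: sample_meta[meta_key]
        for payload_key, meta_key in payload_filters.items()
    }


def matches_direction(
    chunk_date: date, seed_date: date, temporal_direction: Optional[str]
) -> bool:
    """'before' keeps chunks dated on or before the seed; 'after' keeps strictly later ones."""
    if temporal_direction is None:
        return True  # no temporal restriction
    if temporal_direction == "before":
        return chunk_date <= seed_date
    if temporal_direction == "after":
        return chunk_date > seed_date
    raise ValueError(f"unknown temporal_direction: {temporal_direction!r}")


# Illustrative metadata; the keys and values here are made up.
sample_meta = {"region": "Boston", "source": "beige_book"}
filters = resolve_payload_filters({"district": "region"}, sample_meta)
same_day_before = matches_direction(date(2024, 3, 1), date(2024, 3, 1), "before")
same_day_after = matches_direction(date(2024, 3, 1), date(2024, 3, 1), "after")
```

Here the mapping {"district": "region"} means "match the Qdrant payload field district against whatever the sample stores under region", and a chunk dated the same day as the seed counts as "before" but not "after".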
FileSetDocumentContextGenerator
Resolves a single document from a FileSet by temporal ordering, downloads its full text, and appends it as context. No vector search — picks the one document whose chronological position matches temporal_constraint. Optionally processes the document through an LLM before injection. Add to context_generators in QuestionPipeline.
Use this instead of QdrantContextGenerator when you want the complete text of one document rather than RAG-retrieved chunks from multiple documents.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes | - | FileSet ID to resolve documents from |
| temporal_constraint | TemporalConstraint | No | - | Temporal filtering direction relative to the seed document's date |
| metadata_filter_keys | list[str] | No | - | Keys from sample's file metadata for exact-match filtering |
| system_instruction | str \| None | No | None | System prompt for optional LLM processing of the document |
| model | ModelConfig \| None | No | None | Model for LLM processing; if None, raw document text is used as context |
| max_document_chars | int \| None | No | None | Character limit for document text; truncates from the end if exceeded |

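The interplay of max_document_chars and the optional LLM step might look like the sketch below. It assumes that "truncates from the end" means trailing characters are dropped, and that truncation happens before any LLM processing; neither detail is confirmed by this page. `prepare_document_context` is a hypothetical helper, and `process` stands in for the LLM call.

```python
from typing import Callable, Optional


def prepare_document_context(
    text: str,
    max_document_chars: Optional[int] = None,
    process: Optional[Callable[[str], str]] = None,
) -> str:
    """Truncate the document if a limit is set, then optionally post-process it."""
    if max_document_chars is not None and len(text) > max_document_chars:
        # Assumption: keep the first max_document_chars characters, drop the tail.
        text = text[:max_document_chars]
    if process is not None:
        # Stand-in for LLM processing; with no model, the raw text is used as context.
        text = process(text)
    return text


raw = prepare_document_context("abcdef", max_document_chars=4)
processed = prepare_document_context("abcdef", max_document_chars=4, process=str.upper)
```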
QdrantRAGLabeler
Resolves questions by vector-searching a FileSet for answer evidence and using an LLM to extract a structured label from the retrieved chunks. Use when your seeds come from a FileSet and the answer may appear in chunks scattered across multiple documents (e.g. forward-looking questions resolved by future quarterly reports).

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes* | - | FileSet ID to load the Qdrant collection from (*or collection_name) |
| answer_type | AnswerType | No | - | Expected answer type (guides the labeler) |
| payload_filters | dict[str, str] | No | - | {qdrant_payload_key: sample_meta_key} mapping |
| temporal_direction | str | No | - | "before" or "after" relative to the seed date |
| confidence_threshold | float | No | 0.9 | Minimum confidence to include a question |
| top_k | int | No | 5 | Number of chunks to retrieve |
| extraction_model | ModelConfig \| None | No | gemini-2.5-flash | LLM used for structured label extraction |
| index_chunk_size | int | No | 1500 | Chunk size used when building the backing Qdrant index |
| index_chunk_overlap | int | No | 150 | Chunk overlap used when building the backing Qdrant index |
| embedding_model | str | No | "BAAI/bge-small-en-v1.5" | FastEmbed model name for query embedding |

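To make index_chunk_size and index_chunk_overlap concrete, here is a minimal sliding-window chunker. It assumes sizes are measured in characters and that chunks are cut at fixed offsets; the library may count tokens or split on sentence boundaries instead. `chunk_text` is a hypothetical helper, not part of the API.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 3000-character document made of repeating digits.
text = "".join(str(i % 10) for i in range(3000))
chunks = chunk_text(text)
chunk_lengths = [len(c) for c in chunks]
```

With the defaults, each new chunk starts 1350 characters after the previous one, so a passage that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.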
FileSetDocumentLabeler
Resolves a single document from a FileSet by temporal ordering, downloads its full text, and uses an LLM to extract a structured label. No vector search — picks the one document whose chronological position matches temporal_constraint. Use this instead of QdrantRAGLabeler when you want to label from the full content of a specific document rather than RAG-retrieved chunks.

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| file_set_id | str | Yes | - | FileSet ID to resolve documents from |
| temporal_constraint | TemporalConstraint | No | - | Temporal filtering direction relative to the seed document's date |
| metadata_filter_keys | list[str] | No | - | Keys from sample's file metadata for exact-match filtering (e.g. ["district"]) |
| confidence_threshold | float | No | 0.7 | Minimum confidence threshold for valid labels |
| answer_type | AnswerType | No | - | Expected answer type (guides the labeler) |
| model | ModelConfig \| None | No | None | Model for label extraction; defaults to gemini-2.5-flash |
| system_instruction | str \| None | No | None | Domain-specific system instruction (e.g. "You are labeling Federal Reserve Beige Book questions.") |

TemporalConstraint
Enum used by FileSetDocumentContextGenerator and FileSetDocumentLabeler to pick the single document resolved relative to the seed. (The Qdrant transforms use a separate temporal_direction string — "before" / "after" — instead.)

| Value | Description |
| --- | --- |
| BEFORE | Documents on or before the seed date (no lookahead, multiple docs) |
| AFTER | Documents after the seed date (future docs, multiple docs) |
| NEXT_DOCUMENT | First document after the seed timestamp (single-doc resolution) |
| PREVIOUS_DOCUMENT | Most recent document before the seed timestamp (single-doc context) |
| EQUAL | Document with an exact matching date (single-doc) |

Use BEFORE for context (historical documents). Use AFTER for labeling from multiple future documents. Use NEXT_DOCUMENT or PREVIOUS_DOCUMENT when you want exactly one document relative to the seed — useful with the document-level transforms (FileSetDocumentContextGenerator, FileSetDocumentLabeler).
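The five constraints can be illustrated over a list of document dates. The enum below is a local stand-in that mirrors the documented member names (its values are made up), and `resolve` is a hypothetical function written from the descriptions above, not the library's resolver.

```python
from datetime import date
from enum import Enum
from typing import List


class TemporalConstraintSketch(Enum):
    # Local stand-in; member names follow the docs, values are illustrative.
    BEFORE = "before"
    AFTER = "after"
    NEXT_DOCUMENT = "next_document"
    PREVIOUS_DOCUMENT = "previous_document"
    EQUAL = "equal"


def resolve(doc_dates: List[date], seed: date, constraint: TemporalConstraintSketch) -> List[date]:
    """Return the document dates selected by the constraint, in chronological order."""
    ordered = sorted(doc_dates)
    if constraint is TemporalConstraintSketch.BEFORE:
        return [d for d in ordered if d <= seed]  # on or before the seed date
    if constraint is TemporalConstraintSketch.AFTER:
        return [d for d in ordered if d > seed]  # strictly after the seed date
    if constraint is TemporalConstraintSketch.NEXT_DOCUMENT:
        return [d for d in ordered if d > seed][:1]  # first document after the seed
    if constraint is TemporalConstraintSketch.PREVIOUS_DOCUMENT:
        return [d for d in ordered if d < seed][-1:]  # most recent document before the seed
    if constraint is TemporalConstraintSketch.EQUAL:
        return [d for d in ordered if d == seed][:1]  # exact date match, single document
    raise ValueError(f"unsupported constraint: {constraint!r}")


# Four illustrative document dates; the seed is the second one.
dates = [date(2024, 1, 15), date(2024, 3, 1), date(2024, 4, 15), date(2024, 6, 1)]
seed = date(2024, 3, 1)
before = resolve(dates, seed, TemporalConstraintSketch.BEFORE)
next_doc = resolve(dates, seed, TemporalConstraintSketch.NEXT_DOCUMENT)
previous_doc = resolve(dates, seed, TemporalConstraintSketch.PREVIOUS_DOCUMENT)
```

In this sketch BEFORE includes the seed's own document, while PREVIOUS_DOCUMENT skips it and returns only the one immediately preceding it.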
FilterCriteria
LLM-based content scoring and filtering. An LLM scores each item against your rubric; items below min_score are excluded.
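
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| rubric | str | Yes | - | Scoring rubric/prompt |
| min_score | float | No | 0.5 | Minimum score threshold |
| model_name | str | No | google/gemini-3-flash-preview | Model for scoring |
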
Use case 1 — filter seeds (news snippets): Pass filter_criteria to NewsSeedGenerator. Snippets scored below min_score are dropped before scraping.
Use case 2 — filter generated questions: Pass filter_ to question generators. Questions scored below min_score are dropped after generation.
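Both use cases reduce to the same thresholding step, sketched here with a toy scorer in place of the LLM. `FilterCriteriaSketch`, `apply_filter`, and `keyword_score` are hypothetical names for illustration; the real scoring is done by the configured model, and the sample items are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FilterCriteriaSketch:
    # Local stand-in with the two documented fields that drive filtering.
    rubric: str
    min_score: float = 0.5


def apply_filter(
    items: List[str],
    criteria: FilterCriteriaSketch,
    score: Callable[[str, str], float],
) -> List[str]:
    """Keep items whose score is not below min_score."""
    return [item for item in items if score(criteria.rubric, item) >= criteria.min_score]


def keyword_score(rubric: str, item: str) -> float:
    # Toy replacement for the LLM: full score if the item mentions "rate".
    return 1.0 if "rate" in item.lower() else 0.0


criteria = FilterCriteriaSketch(rubric="Keep questions about interest rates.")
questions = ["Will the Fed cut rates in March?", "What is your favorite color?"]
kept = apply_filter(questions, criteria, keyword_score)
```

An item scoring exactly min_score is kept here, since only items below the threshold are excluded.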
Composing in QuestionPipeline
