Filesets

A FileSet is a collection of documents with optional metadata that you can use as a data source for question generation or labeling. Use filesets when you have PDFs, text files, or other documents (e.g. quarterly reports, 10-Ks, internal memos) that you want to chunk, query, or use for context and labeling.

Creating a FileSet

Create a fileset with lr.filesets.create(). Optionally define a metadata schema so you can filter and organize documents by fields like ticker, quarter, or document_type.

from lightningrod import (
    FileSetMetadataSchemaInput,
    MetadataFieldDefinitionInput,
    MetadataFieldType,
)

schema = FileSetMetadataSchemaInput(fields=[
    MetadataFieldDefinitionInput(
        name="ticker",
        field_type=MetadataFieldType.STRING,
        required=True,
        description="Company ticker symbol",
        extraction_hint="The stock ticker symbol mentioned in the document.",
    ),
    MetadataFieldDefinitionInput(
        name="quarter",
        field_type=MetadataFieldType.STRING,
        required=True,
        description="Fiscal quarter (e.g. Q1 2024)",
        extraction_hint="The fiscal quarter covered by the report.",
    ),
])

# Assumes `lr` is an already-initialized SDK client.
fileset = lr.filesets.create(
    name="Quarterly Reports",
    description="Company quarterly investor reports.",
    metadata_schema=schema,
)

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | FileSet name |
| description | str | No | Optional description |
| metadata_schema | FileSetMetadataSchemaInput | No | Schema for file metadata fields |
| rag_enabled | bool | No (default True) | Enable RAG indexing for this FileSet; set to False for document-only workflows |

MetadataFieldDefinitionInput fields: name, field_type (MetadataFieldType.STRING or MetadataFieldType.NUMBER), required, description, extraction_hint.
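
To make the role of required and field_type concrete, here is a minimal sketch that checks a metadata dict against a schema before upload. It uses plain dicts as a stand-in for the SDK classes; validate_metadata, the SCHEMA layout, and the revenue_musd field are hypothetical names for illustration, and the SDK's own validation may behave differently.

```python
from numbers import Number

# Plain-dict stand-in for a metadata schema; not the SDK's own classes.
# "revenue_musd" is a made-up optional NUMBER field for illustration.
SCHEMA = [
    {"name": "ticker", "field_type": "string", "required": True},
    {"name": "quarter", "field_type": "string", "required": True},
    {"name": "revenue_musd", "field_type": "number", "required": False},
]


def validate_metadata(metadata: dict, schema: list) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in schema:
        name = field["name"]
        if name not in metadata or metadata[name] is None:
            if field["required"]:
                problems.append(f"missing required field: {name}")
            continue
        value = metadata[name]
        if field["field_type"] == "string" and not isinstance(value, str):
            problems.append(f"{name} should be a string")
        # bool is a Number subclass in Python, so exclude it explicitly.
        if field["field_type"] == "number" and (
            isinstance(value, bool) or not isinstance(value, Number)
        ):
            problems.append(f"{name} should be a number")
    return problems


print(validate_metadata({"ticker": "AAPL", "quarter": "Q1 2024"}, SCHEMA))
# []
print(validate_metadata({"ticker": "AAPL", "revenue_musd": "n/a"}, SCHEMA))
# ['missing required field: quarter', 'revenue_musd should be a number']
```

Running a check like this locally can surface missing required fields before any files are sent.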

Uploading Files

The SDK provides high-level methods that handle all upload complexity:

upload_files() — Upload a list of files

Auto-extract metadata during upload

Install the optional extraction dependencies before using LLM-based metadata extraction.

Then pass auto_extract_metadata=True. The SDK fetches the FileSet metadata schema, extracts values from supported text files and PDFs, and uploads the generated metadata manifest with the files. Values you provide manually still take precedence.
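
The precedence rule can be illustrated with a small sketch in plain Python. The manifest layout (a dict keyed by filename) and the merge_manifests helper are assumptions made for this example; the SDK's internal manifest format is not described here.

```python
def merge_manifests(extracted, manual):
    """Merge per-file metadata; manual values override extracted ones."""
    merged = {}
    for filename in sorted(extracted.keys() | manual.keys()):
        # Later keys win in a dict merge, so manual entries take precedence.
        merged[filename] = {**extracted.get(filename, {}), **manual.get(filename, {})}
    return merged


# Hypothetical extraction output and a manual correction for one file.
extracted = {
    "aapl_q1.pdf": {"ticker": "AAPL", "quarter": "Q4 2023"},
    "msft_q2.pdf": {"ticker": "MSFT", "quarter": "Q2 2024"},
}
manual = {"aapl_q1.pdf": {"quarter": "Q1 2024"}}

merged = merge_manifests(extracted, manual)
print(merged["aapl_q1.pdf"])  # {'ticker': 'AAPL', 'quarter': 'Q1 2024'}
print(merged["msft_q2.pdf"])  # {'ticker': 'MSFT', 'quarter': 'Q2 2024'}
```

Only the field you supply is overridden; other extracted fields for the same file are kept.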

You can also inspect or edit extracted values before uploading.

Visual document seed summaries

Use this for image-heavy PDFs, exported decks, and page images where plain text extraction misses charts or visual layout. It creates page-level executive seed summaries, not raw OCR.

Install with pip install "lightningrod-ai[visual]". To use an OpenRouter-compatible endpoint, pass an OpenAI(base_url=..., api_key=...) client and the model name. The SDK only converts the files; you still upload the results with upload_files(). Generated filenames include a short hash of the source so that source files with the same basename do not collide during upload.
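
The collision-avoidance idea can be sketched as follows. This is not the SDK's naming scheme: the hashed_name helper, the choice of SHA-256 over the full source path, the 8-character digest, and the .md page suffix are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def hashed_name(source: str, page: int, digest_chars: int = 8) -> str:
    """Build '<stem>-<hash>-p<page>.md' from a source path and page number."""
    source_path = Path(source)
    # Hash the full path, not just the basename, so same-named files differ.
    digest = hashlib.sha256(source_path.as_posix().encode("utf-8")).hexdigest()
    return f"{source_path.stem}-{digest[:digest_chars]}-p{page:03d}.md"


# Two decks with the same basename in different folders.
a = hashed_name("reports/2023/deck.pdf", page=1)
b = hashed_name("reports/2024/deck.pdf", page=1)
print(a)
print(b)
print(a != b)  # True
```

Because the digest depends on the whole path, the two deck.pdf files produce distinct output names even though their basenames match.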

upload_directory() — Upload all files from a directory

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file_set_id | str | | FileSet ID |
| file_paths / directory | list or str | | Files to upload |
| metadata / metadata_fn | dict or callable | None | File metadata |
| pattern | str | "*" | Glob pattern (for upload_directory) |
| max_workers | int | 10 | Parallel upload threads |
| use_transfer_manager | bool | True | Use GCS Transfer Manager when available for large uploads |
| show_progress | bool | False | Display upload progress; requires google-cloud-storage |
| auto_extract_metadata | bool | False | Extract metadata from files before upload using the FileSet schema |
| extraction_max_pages | int | None | Limit PDF extraction to the first N pages |
| extraction_model | str | "gpt-4.1-mini" | Model to use for auto-extraction |
| extraction_max_chars | int | 20000 | Max extracted characters to include in each prompt |
| extraction_max_workers | int | max_workers | Parallel extraction calls |
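
As an illustration of the pattern and metadata_fn parameters, the sketch below selects files with a glob pattern and derives metadata from each filename. It runs without the SDK: select_files and metadata_from_filename are hypothetical helpers, the TICKER_Qn_YYYY filename convention is invented for the example, and the assumption that a metadata callable receives a file path and returns a dict should be checked against the SDK reference.

```python
import re
import tempfile
from pathlib import Path

# Assumed filename convention for this sketch: <TICKER>_<Qn>_<YYYY>.<ext>
FILENAME_RE = re.compile(r"^(?P<ticker>[A-Z]+)_(?P<q>Q[1-4])_(?P<year>\d{4})$")


def metadata_from_filename(path) -> dict:
    """Derive ticker and quarter from a filename; return {} if it does not match."""
    match = FILENAME_RE.match(Path(path).stem)
    if match is None:
        return {}
    return {"ticker": match["ticker"], "quarter": f"{match['q']} {match['year']}"}


def select_files(directory: str, pattern: str = "*") -> list:
    """List regular files in a directory that match a glob pattern, sorted by name."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())


# Demonstrate on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("AAPL_Q1_2024.pdf", "MSFT_Q2_2024.pdf", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder")
    selected = [p.name for p in select_files(tmp, "*.pdf")]
    for name in selected:
        print(name, metadata_from_filename(name))
```

With the pattern "*.pdf", notes.txt is skipped, and each remaining file yields a ticker and quarter that match the schema fields defined earlier.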

The vector index is built automatically when the FileSet is first used in a pipeline.

Using FileSets in Pipelines

Use the FileSet with:

  • FileSetSeedGenerator — chunks documents into seeds (see Seed Generators)

  • QdrantContextGenerator — retrieves context from the FileSet during question generation (see Labeling and Context)

  • QdrantRAGLabeler — resolves questions by searching the FileSet for answers (see Labeling and Context)

For document-level transforms:

  • FileSetDocumentContextGenerator — adds full document text as context

  • FileSetDocumentLabeler — extracts labels from full documents

See the Custom Filesets examples for a full workflow.
