# Core Concepts

The dataset generation system is built around three concepts: **samples** (the data unit), **transforms** (dataset generation pipelines), and **datasets** (collections of samples). Understanding these makes the rest of the docs easier to follow.

## Sample

A sample is the fundamental unit of data. It contains the following fields:

| Field      | Description                                          |
| ---------- | ---------------------------------------------------- |
| `seed`     | Raw starting data (news articles, documents, etc.)   |
| `question` | Forecasting question generated from the seed         |
| `label`    | Ground truth answer with confidence score            |
| `prompt`   | Formatted prompt ready for model input               |
| `context`  | Additional context (news, RAG results)               |
| `meta`     | Custom metadata                                      |
| `is_valid` | Whether the sample passed validation (default: True) |

Samples flow through the pipeline: seeds produce questions, questions get labeled, and the result is a sample ready for training or analysis.
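To make the field table concrete, here is a minimal sketch of a sample as a plain Python dataclass. `SampleSketch` is a hypothetical stand-in, not the SDK's own class, and the field types (strings, dicts, a list for context) are assumptions; only the field names and the `is_valid` default come from the table above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleSketch:
    """Illustrative stand-in for a sample; not the SDK's actual class."""

    seed: Optional[str] = None       # raw starting data
    question: Optional[str] = None   # forecasting question generated from the seed
    label: Optional[dict] = None     # ground truth answer with confidence (shape assumed)
    prompt: Optional[str] = None     # formatted prompt ready for model input
    context: list = field(default_factory=list)  # additional context items
    meta: dict = field(default_factory=dict)     # custom metadata
    is_valid: bool = True            # default matches the table above


# A sample is filled in progressively as it moves through the stages.
sample = SampleSketch(seed="Article text about an upcoming rate decision.")
sample.question = "Will the central bank raise rates at its next meeting?"
sample.label = {"answer": "yes", "confidence": 0.9}  # made-up values for illustration
```

The point of the sketch is the progression: a sample starts with only a seed, and later stages populate the remaining fields.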

## Transform

A transform is a dataset generation pipeline that processes data through multiple stages. A pipeline chains these components:

* **Seed generator** — Produces raw data
* **Question generator** — Creates questions from seeds
* **Deduplication** (optional) — Removes near-duplicate questions
* **Labeler** — Resolves questions with ground truth
* **Context generators** (optional) — Add relevant context
* **Renderer** (optional) — Formats prompts

`QuestionPipeline` is the main orchestrator. You configure each stage and run it via `lr.transforms.run(config)`.

When you run a pipeline, the SDK submits a job to the server. The job executes the pipeline stages and produces an output dataset.
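The chaining of stages can be illustrated with a small local sketch. Everything here is hypothetical: the stage functions, the dict-based samples, and `run_pipeline` are stand-ins written for this example and are not part of the SDK, which runs the real stages as a server-side job. The deduplication step is also simplified to exact matching on normalized question text rather than near-duplicate detection.

```python
def seed_generator():
    # Hypothetical seeds; two of them are identical on purpose.
    return [{"seed": "Article A"}, {"seed": "Article A"}, {"seed": "Article B"}]


def question_generator(samples):
    # Creates one question per seed.
    return [{**s, "question": f"What happens next after {s['seed']}?"} for s in samples]


def deduplicate(samples):
    # Simplification: drops exact repeats of the normalized question text.
    seen, unique = set(), []
    for s in samples:
        key = s["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique


def labeler(samples):
    # Placeholder label; a real labeler would resolve the question.
    return [{**s, "label": {"answer": "unknown", "confidence": 0.5}} for s in samples]


def renderer(samples):
    # Formats a prompt from the question.
    return [{**s, "prompt": f"Question: {s['question']}"} for s in samples]


def run_pipeline(source, stages):
    # Feeds the output of each stage into the next one.
    samples = source()
    for stage in stages:
        samples = stage(samples)
    return samples


output = run_pipeline(seed_generator, [question_generator, deduplicate, labeler, renderer])
for row in output:
    print(row["prompt"], row["label"])
```

Each stage takes a list of samples and returns a new list, so optional stages such as deduplication or rendering can be left out of the `stages` list without changing the others.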

## Dataset

A dataset is a collection of samples. Datasets serve as:

* **Input** — Feed custom samples into a pipeline via `input_dataset`
* **Output** — The result of `lr.transforms.run()` is a dataset

Datasets have an `id` and `num_rows`. Use `dataset.download()` or `dataset.samples()` to fetch samples, and `dataset.flattened()` for a flat list of dicts.
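As a rough illustration of these accessors, the sketch below uses a local stand-in class. `DatasetSketch` is hypothetical and only mirrors the names mentioned above (`id`, `num_rows`, `samples()`, `flattened()`); it does not call the SDK. In particular, the dotted-key layout produced by `flattened()` here is an assumption about what "flat" could look like, not the SDK's documented output.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for a dataset; not the SDK's class."""

    id: str
    rows: list = field(default_factory=list)

    @property
    def num_rows(self) -> int:
        return len(self.rows)

    def samples(self) -> list:
        # Returns a copy of the stored samples.
        return list(self.rows)

    def flattened(self) -> list:
        # Assumption: nested dict fields become dotted top-level keys.
        flat_rows = []
        for row in self.rows:
            flat = {}
            for key, value in row.items():
                if isinstance(value, dict):
                    for sub_key, sub_value in value.items():
                        flat[f"{key}.{sub_key}"] = sub_value
                else:
                    flat[key] = value
            flat_rows.append(flat)
        return flat_rows


dataset = DatasetSketch(
    id="ds_example",  # made-up identifier
    rows=[{"question": "Will X happen?", "label": {"answer": "yes", "confidence": 0.8}}],
)
print(dataset.id, dataset.num_rows)
print(dataset.flattened())
```

A flat list of dicts like this is convenient for loading into tabular tools, while `samples()` keeps the nested structure.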

