# Evaluation as you build it

> The interview harvests labeled ground truth, so you can measure accuracy with vs. without context.

The distinctive thing about building context through an interview is that **the measurement comes
for free**. Every disambiguation the analyst makes ("active client means X, not Y") is at once a
context entry and a labeled eval pair. Building context *is* harvesting ground truth.

## The model evaluates itself

During the interview, the model checks its own work against your source of truth: *"I get this
number for revenue last month — can you enter the value from your dashboard so I can validate
myself?"* When the numbers match, that answer is saved as a **verified** eval pair. This is
build-measure-learn applied to context: slower on the initial build, but you see the value of each
incremental piece of context with real questions, in real time.
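A minimal sketch of that validation step, assuming the answer is a single number. `EvalPair`, `validate_answer`, and the tolerance are hypothetical names chosen for illustration, not the product's API, and the revenue figures are made up.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalPair:
    """Hypothetical record: a question plus the analyst-confirmed answer."""
    question: str
    expected: float
    verified: bool


def validate_answer(question: str, model_value: float, dashboard_value: float,
                    rel_tol: float = 1e-6) -> Optional[EvalPair]:
    """Return a verified eval pair when the model's number matches the
    value the analyst entered from the dashboard, otherwise None."""
    if math.isclose(model_value, dashboard_value, rel_tol=rel_tol):
        return EvalPair(question=question, expected=dashboard_value, verified=True)
    return None


# Illustrative values only.
pair = validate_answer("What was revenue last month?", 128_450.0, 128_450.0)
mismatch = validate_answer("What was revenue last month?", 131_000.0, 128_450.0)
```

Here `pair` is saved as a verified eval pair, while `mismatch` is `None`: the numbers disagree, so nothing is recorded as verified.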

## The local eval delta

Each domain ends with a verify step: the same questions run twice against your warehouse, once
with the context **on** and once **off**. You can re-run it at any time:

```
"Run the eval delta on session-financials."
```

The open-source `eval_harness/` reports the difference in accuracy between the two runs, a concrete number you can show.

<Note>
  The delta is honest in both directions. In [the demo](https://youtu.be/y6rvhNmHjx8), context made
  one revenue question *better* and another *worse*, because a reference file routed every revenue
  question to a table with no channel dimension. Catching that during the interview, while the
  analyst is in the room to fix it, is the point.
</Note>
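As a rough illustration of what an accuracy delta with per-question regressions looks like, here is a sketch in Python. It is not the `eval_harness/` code: the function name, the exact-match scoring, and the question labels and values are all assumptions made for the example.

```python
def eval_delta(expected: dict, with_context: dict, without_context: dict) -> dict:
    """Score two runs of the same questions against verified answers.

    Each argument maps a question label to an answer. Exact equality is
    used as the correctness check, which is an assumption of this sketch.
    """
    if not expected:
        raise ValueError("no verified eval pairs to score")

    on = {q: with_context.get(q) == a for q, a in expected.items()}
    off = {q: without_context.get(q) == a for q, a in expected.items()}

    acc_on = sum(on.values()) / len(expected)
    acc_off = sum(off.values()) / len(expected)
    return {
        "accuracy_on": acc_on,
        "accuracy_off": acc_off,
        "delta": acc_on - acc_off,
        # Questions that context fixed, and questions that context broke.
        "improved": sorted(q for q in expected if on[q] and not off[q]),
        "regressed": sorted(q for q in expected if off[q] and not on[q]),
    }


# Made-up answers for four hypothetical questions.
expected = {"q1": 100, "q2": 250, "q3": 40, "q4": 7}
with_context = {"q1": 100, "q2": 250, "q3": 55, "q4": 7}
without_context = {"q1": 100, "q2": 300, "q3": 40, "q4": 9}

report = eval_delta(expected, with_context, without_context)
```

Reporting `improved` and `regressed` alongside the headline delta keeps a net gain from hiding a question that context made worse.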

## Format-agnostic

The harness reads ACF, dbt models and docs, or raw markdown, normalizes them, and measures the
delta the same way — so you can evaluate context you already have, not just ACF.
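One way to picture the normalization is a pair of small adapters that turn each source into the same flat list of entries. The sketch below is an assumption about how that could work, not the harness's implementation: the entry shape (`subject`, `text`, `source`) and both function names are hypothetical, the dbt input is an already-parsed `schema.yml` dictionary, and ACF is left out because its layout is not described here.

```python
import re


def from_dbt(schema: dict) -> list[dict]:
    """Flatten model and column descriptions from a parsed dbt schema.yml."""
    entries = []
    for model in schema.get("models", []):
        if model.get("description"):
            entries.append({"subject": model["name"],
                            "text": model["description"], "source": "dbt"})
        for col in model.get("columns", []):
            if col.get("description"):
                entries.append({"subject": f"{model['name']}.{col['name']}",
                                "text": col["description"], "source": "dbt"})
    return entries


def from_markdown(md: str) -> list[dict]:
    """Treat each heading as a subject and the text under it as the entry."""
    entries, subject, lines = [], None, []

    def flush():
        text = " ".join(l.strip() for l in lines if l.strip())
        if subject and text:
            entries.append({"subject": subject, "text": text, "source": "markdown"})

    for line in md.splitlines():
        heading = re.match(r"#+\s+(.*)", line)
        if heading:
            flush()
            subject, lines = heading.group(1).strip(), []
        elif subject is not None:
            lines.append(line)
    flush()
    return entries


# Illustrative inputs only.
dbt_entries = from_dbt({"models": [{
    "name": "orders",
    "description": "One row per order.",
    "columns": [{"name": "channel", "description": "Sales channel."}],
}]})
md_entries = from_markdown("# Active client\nA client with a session in the last 90 days.\n")
```

Once every source is reduced to the same entry shape, the on/off comparison does not need to know where the context came from.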

## From one-shot to continuous

The one-shot, run-locally eval delta is **free**. Continuous re-evaluation, drift detection, and
observability across a team are the hosted product —
[see enterprise evaluation](/enterprise/evaluation).
