K4 · live cryptanalysis

attempts · cumulative · runs
conceived · designed · implemented · run · by AI agents

architecture · how this is constructed

Versioned. Each tab is the architecture as it stood at that date — superseded versions stay readable so the project's structural evolution is auditable. Newest first.

What changed since v2

v2 documented an 11-agent system with one compute track (the brute-force runner). v3 adds a second, parallel compute track: Atelier— a creative AI layer that emits structural hypotheses for the runner to execute, inspired by David Stein's 1998 K1/K2/K3 attack which combined statistical rigor with calibrated guessing.

  • Agent count: 11 → 12 — added Atelier (the creative AI layer; Sonnet 4.6 default with Opus 4.7 reserve; triggered, schema-bound, output-bounded).
  • Postgres tables: 9 → 11 — added atelier_runs (per-invocation token + verdict log) and atelier_hypotheses (validated JSON payloads with status workflow).
  • New compute track: brute-force runner stays unchanged on Hetzner; Atelier runs on the same host, triggered a small number of times per day, each invocation is one structured artifact and ends.
  • New discipline: cost-bounded LLM-in-the-loop with hard daily, per-run, and monthly spend caps enforced by the trigger script. Operational details on /admin.
  • New methodology lens: hypotheses tagged with stein_lineage — which of six explicit patterns from Stein 1999 the model drew on (calibrated guess, wrong-but-right-enough, anomaly-as-signal, structural composition, sculpture-as-data, convergent failure).

The parallel-track model

The project now runs two compute tracks side-by-side, with distinct functions:

Track A · Brute-force runner

The discipline.

Hetzner ARM box. Iterates parameterized hypotheses exhaustively. Every reject observed; per-region IoC histograms accumulated; sweeps run for days at sub-cent cost. Produces clean negative results — what K4 is NOT, with statistical confidence. The thing 35 years of public attempts has done.

Track B · Atelier (NEW v3)

The imagination.

Same Hetzner host, ~250-line Python trigger script. Reads vendored corpus + 24h telemetry. Calls Opus 4.7 with prompt caching. Emits one schema-validated hypothesis per run. Hard $50/month cap. The thing 35 years of public attempts has mostly done as ad-hoc human-in-the-loop guessing — now agentic, audited, and logged.

Track A produces telemetry; Atelier reads it and produces hypotheses; Bombe / Scytale execute the surviving hypotheses as new Track A sweeps. The loop closes when a hypothesis either reaches the four-gate verification or gets demoted to famous_false_positives.md.

Atelier — the constraints, in one place

ConstraintValueEnforced where
Model (default)claude-sonnet-4-6atelier.py DEFAULT_MODEL constant
Model (reserve)claude-opus-4-7Polybius Phase 0 + monthly deep-dive only
Output bound≤ 5,000 tokensAPI max_tokens parameter
Output formatJSON, schema-validatedattacks/atelier/schema/{hypothesis,synthesis}.json + post-call validate
Writable surfaceattempts/atelier/ onlyatelier.py: write_attempt() target dir
No self-scoringfloor/ceiling are honest estimatesNull gate; corpus reminds the model in agents/atelier.md
No plaintext guessescipher-structure hypotheses onlyagents/atelier.md prompt + JSON schema 'structure' field
Spend capshard daily, per-run, monthlyatelier.py + Anthropic console billing limit; values on /admin

The five Null gates (non-negotiable)

  1. Atelier never self-scores. Confidence floor / ceiling are honest estimates, not advocacy. Null and Chi judge.
  2. Every hypothesis carries a falsification criterion. Exact data shape that kills it.
  3. attempts/atelier/_killed.md logs every rejected proposal with reason — institutional memory for an LLM that resets between runs.
  4. famous_false_positives.md is in the cached corpus every run. The mattklepp lesson is the most expensive lesson this project has paid for.
  5. Schema-bound output: JSON, validated againstattacks/atelier/schema/hypothesis.json. Free-form prose is rejected by the trigger script.

The 12 specialist agents (Atelier added)

Atelier joins the roster as the 12th agent. The other 11 carry forward unchanged from v2 (their inputs / outputs / constraints haven't shifted with the addition of Atelier — Atelier is additive, not substitutive). For brevity this version shows only the new entry; the full v2 roster remains canonical and is unchanged.

  1. atelier

    Atelier · Creative AI Layer

    Stein-inspired structural-hypothesis generator. Reads the project's full vendored corpus + last 24h of runner telemetry; emits one structural hypothesis or daily synthesis per trigger. Parallel to the brute-force runner, not a replacement.

    Inputs
    Cached corpus (~90K tok): K4 ciphertext + cribs, K1/K2/K3 reproductions, vendored Stein 1999 paper, Sanborn + Scheidt statements, prior_work, statistical baseline, famous_false_positives, k1/k2/k3 mechanics. Per-run delta (~10K tok): last 24h phase_distributions, recent rejected_samples, current attack_runs row, trigger context.
    Outputs
    JSON-validated hypothesis briefs to attempts/atelier/; daily synthesis reports; rows in atelier_runs (cost, tokens, verdict) and atelier_hypotheses (status, full payload). Public-readable.
    Constraints
    Model: claude-opus-4-7 only. Total budget: $50/mo, hard-capped in code. Per-run cap: $1.50. Daily cap: $1.65 (~2.5 runs/day). Output: ≤5K tokens, JSON-only, schema-validated. Writes only to attempts/atelier/. Never self-scores. Never proposes plaintext — only cipher-structure hypotheses. Cannot bypass the four-gate verification.
    Depends on
    Polybius (Phase 0 K1/K2/K3 reproduction gate before first K4 run). Chi (statistical pre-filter on every emitted hypothesis). Null (red-team review queue). Codex (cached corpus assembly). Bombe (deployment + cost monitoring).

Full v3 roster: Tabula, Mneme, Codex, Polybius, Chi, Scytale, Bombe, Gnomon, Null, Scribe, Sigma, Atelier. See the v2 tab for entries 1-11.

Diagram

Diagram still shows the v1 system view. Atelier's parallel track is documented in this page's text but not yet drawn. Diagram refresh is on the queue.

State layer · 11 Postgres tables (2 new)

The brute-force runner's 9 tables are unchanged. Atelier adds two:

TableWritten byRead byPurpose
atelier_runsnew v3atelier.py (service role)anon + admin (SELECT, realtime)One row per Atelier invocation. Cost, tokens, verdict, error.
atelier_hypothesesnew v3atelier.py (service role)anon + admin (SELECT, realtime)Validated hypothesis payloads. Status workflow: proposed → chi_killed | null_killed | queued → executing → concluded.

Public-read by design (anon SELECT enabled). Joy's credibility-signal call: Atelier outputs — including kills — are part of the project's externally-visible accountability log. Realtime publication enabled so the LIVE page streams new hypotheses as they emit.

Stein lineage — the six patterns

Each Atelier hypothesis tags which of six explicit patterns from David Stein's 1998 K1/K2/K3 attack it draws on. Tagged in the stein_lineage field of the JSON output.

  1. Calibrated guess — Stein bet T over Obecause T begins ~16 words per 100 vs. O's 7.2. Atelier bets on a structural feature when the data gives an edge but doesn't yet name the answer.
  2. Wrong-but-right-enough — Stein guessed THE as the first word of K2; was wrong (it was THEY) but the column work held. Atelier may propose imperfect hypotheses if their structure is testable.
  3. Anomaly-as-signal — Stein noticed X was punctuation, noticedUNDERGRUUND was misspelled, noticed the m=13 IoC peak in K1 AND K2. Anomalies in K4 telemetry are first-class hypothesis sources for Atelier.
  4. Structural composition — Stein's K3 was a single transposition with the algebraic identity (B−1)x + By = 336. K4 may be a composition (substitution over transposition, or vice versa). Atelier proposes specific compositions, not vague “multi-stage” claims.
  5. Sculpture-as-data — Stein walked out to feel carved letters. The sculpture has the lodestone, morse code, compass rose, Antipodes panel. All are candidate key sources for Atelier hypotheses.
  6. Convergent failure — Stein noted multiple wrong leads shared structural features. If many Atelier hypotheses fail in the same way, that pattern is itself the next hypothesis.

Why this will probably not crack K4

K4 has been chewed on by 35 years of public attempts. Many smart people have applied LLMs to K4 and failed. The structural challenge (97 chars, IoC near random, multi-stage cipher) is the same for Atelier as for everyone else. Atelier's leverage, if any, is in the workflow: anomaly-driven hypothesis generation grounded in vendored primary sources, with hard falsification gates, output-bounded for cost discipline. That's differently-shaped from “ask ChatGPT to solve K4 in three messages.” Whether different-shape produces different-outcomes is the empirical question this phase answers in 90 days.

Kill criterion: 90 days, zero hypotheses Tabula judges as non-trivially novel vs. prior_work.md. If killed, the post-mortem documents what an LLM-in-the-loop with hard guardrails produced and where it fell short. That's a more useful contribution than the typical “we tried ChatGPT on it” anecdote.