Cycle 451: Autonomous Release Notes — Per-PR “What’s Changed” Blog Posts in Mintlify
Priority: HIGH (pre-pilot DX investment, AI-native publishing showcase) Status: DONE Domain: infra Wave: 10 (Process & Spec-Driven Dev) Milestone: Pre-pilot DX Owner: @pj Dependencies: Cycle 214 (Mintlify platform foundation, `docs.json`, root `AGENTS.md`)
Issue: #451
Plan PR: #452
Product: Flux — AI-native hiring platform
Organization: Employ Inc. (employ-inc GitHub org)
Overview
Per-PR release communication today is limited to PR descriptions (uneven quality), commit messages (no narrative), and — once cycle 400 lands — a Slack digest of merged PRs (team-facing, daily, multi-PR). What’s missing is a single, polished, navigable changelog surface that evaluators, testers, employer customers, and external readers can use to see what shipped and click through to try it. Rapid SOTA teams (Linear, Resend, Knock, PostHog, Vercel) all publish blog-style changelogs, and almost all of them are human-written — which is why most rapid teams’ changelogs are stale, terse, or missing. Flux has an asymmetric advantage here: cycle docs already carry the “why” (motivation, scope, design), diffs are fully accessible, Playwright is already in the stack, and Mintlify (cycle 214) is the publishing surface. Feed those four inputs into a three-pass Claude pipeline — Facts → Narrative → Verifier, with a hard-coded slop-voice guard — and we can produce SOTA narrative changelog entries on every merge to `main` with no human in the loop, without sounding like AI slop.
This cycle ships the autonomous generator end to end: GitHub Action trigger, context assembly, Playwright screenshots, three-pass synthesis, Mintlify MDX output, and a graceful human-handoff failure mode. The post is the canonical artifact for “what shipped” — not the PR description, not the commit message, not the Slack digest.
This is also a deliberate showcase of Mintlify’s full AI-native surface: MDX, the `<Update>` component, section-level `AGENTS.md`, Autopilot as a secondary reviewer, the auto-generated MCP server (so AI agents can answer “what shipped this week?”), `llms.txt` for downstream agent grounding, and Mintlify’s contextual buttons for one-click handoff to Claude/Cursor.
Current State
What Works
- Cycle 214 has merged the spec-driven dev model and selected Mintlify as the docs platform. The `docs.json` config, root `AGENTS.md`, and Mintlify Autopilot are all part of the platform foundation.
- Cycle 400 (open implementation PR #418) ships a `/changelog` Claude Code skill that generates a Slack-formatted digest of merged PRs — different surface, different cadence, different audience.
- Cycle 380 (UI design skill) establishes Playwright screenshot conventions and `docs/design/cycle-{N}/` artifact patterns we can mirror.
- CI infrastructure: `gh api` access to PR metadata, check runs, and diff is already wired through GitHub Actions.
What’s Missing
- No per-PR changelog surface — readers (evaluators, testers, customers) have no canonical place to see what shipped, formatted for human consumption.
- No Mintlify changelog page — `docs/changelog/` does not exist.
- No autonomous publishing pipeline — Mintlify has Autopilot for spec drift, but no out-of-the-box “PR → narrative blog post” generator.
- No voice-quality enforcement — nothing prevents AI-generated marketing-speak slop from being published.
- No diff → screenshot pipeline — Playwright is wired for cycle 380’s design workflow but not for changelog automation.
Scope
In Scope (Phase 1 — this cycle)
1. GitHub Action autonomous-changelog.yml (~0.5 day)
Triggers (parity with cycle 400’s daily-changelog.yml):
- `on: pull_request: types: [closed]`, filtered to `merged == true` against `main` — primary path
- `on: workflow_dispatch:` with a `pr_number` input — manual regeneration / debugging / replaying a PR after prompt iteration
Permissions: `contents: write` (commit MDX), `pull-requests: write` (open follow-up PR on failure), `issues: write` (post status comment).
Concurrency: scoped per-PR to prevent duplicate generation on retry.
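A minimal sketch of what the workflow skeleton could look like, assuming the trigger, permission, and concurrency settings above; job and step names are illustrative, not final:

```yaml
name: autonomous-changelog

on:
  pull_request:
    types: [closed]
  workflow_dispatch:
    inputs:
      pr_number:
        description: "PR to (re)generate a changelog entry for"
        required: true

permissions:
  contents: write
  pull-requests: write
  issues: write

# One run per PR; a retry replaces the in-flight run instead of duplicating it
concurrency:
  group: changelog-pr-${{ github.event.pull_request.number || inputs.pr_number }}

jobs:
  generate:
    # The `closed` event also fires on unmerged closes; gate on merged + main
    if: github.event_name == 'workflow_dispatch' || (github.event.pull_request.merged == true && github.event.pull_request.base.ref == 'main')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/autonomous_changelog/pipeline.py --pr-number "${{ github.event.pull_request.number || inputs.pr_number }}"
```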
2. Context assembly module (~1 day)
scripts/autonomous_changelog/context_assembly.py. Pure function assemble_context(pr_number) -> ChangelogContext that gathers, in parallel:
- PR metadata — title, body, author, labels, conventional commit prefix from title
- Diff — `gh api repos/.../pulls/{n}/files` for file list; `git diff base...merge_sha -U3` for hunks
- Commits — full commit messages and bodies via `gh pr view --json commits`
- Cycle doc — resolved from branch name regex `cycle(\d+(\.\d+)?)/...` → `docs/roadmap/cycles/cycle{N}-*.md`. Falls back to PR body link extraction if the branch doesn’t match.
- CI results — `gh api repos/.../check-runs` for the merge SHA. Captures pass/fail summary per check, not full logs.
- Linked issues — parse `Closes #N` / `Fixes #N` from PR body; fetch issue titles for context.
- Preview/staging URL — read from the existing GH deployment status API (cycle 209.2 preview env) or fall back to staging.
The `ChangelogContext` Pydantic model is defined in `schemas.py`. Heavy diffs are truncated to the first 10,000 changed lines plus a one-line summary per truncated file, to bound token cost.
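The whole-file truncation rule can be sketched as a pure function. `FileDiff` and its field names are illustrative stand-ins for the real Pydantic models in `schemas.py`:

```python
from dataclasses import dataclass


@dataclass
class FileDiff:
    path: str
    hunks: str  # unified-diff text for this file

    @property
    def changed_lines(self) -> int:
        # Count added/removed lines, ignoring the "+++" / "---" file headers
        return sum(
            1
            for line in self.hunks.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        )


def truncate_diff(files: list[FileDiff], cap: int = 10_000) -> tuple[list[FileDiff], list[str]]:
    """Keep whole files until the changed-line cap; summarize each dropped file in one line."""
    kept: list[FileDiff] = []
    summaries: list[str] = []
    budget = cap
    for f in files:
        n = f.changed_lines
        if n <= budget:
            kept.append(f)
            budget -= n
        else:
            summaries.append(f"{f.path}: {n} changed lines (truncated)")
    return kept, summaries
```

Files are preserved whole or not at all, matching the “whole files preserved, tail dropped” rule in the cost section.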
3. Playwright screenshot pass (~1 day)
`scripts/autonomous_changelog/screenshot_runner.py`. Diff-driven route discovery:
- Identify changed user-facing routes by scanning `web/app/**/page.tsx` paths in the diff and mapping them to URL paths.
- Identify changed components by file path; for each, find a containing route via static analysis (best-effort).
- Spin up Playwright (Chromium, stable) against the already-deployed preview or main URL (no local app spin-up). Capture each route at desktop (1440×900) and mobile (390×844).
- Save to `docs/changelog/images/<slug>/<route-slug>-{desktop,mobile}.png`.
- Capture metadata: route, viewport, response status, capture timestamp.
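The route-discovery step could look like the following sketch, assuming Next.js app-router conventions under `web/app/` (the route-group handling and function name are assumptions, not the final implementation):

```python
import re


def routes_from_diff(paths: list[str]) -> list[str]:
    """Best-effort map of changed Next.js app-router pages to URL paths."""
    routes = []
    for p in paths:
        m = re.match(r"web/app/(.*)page\.tsx$", p)
        if not m:
            continue  # not a page file; component-to-route mapping handled separately
        segments = [
            s for s in m.group(1).split("/")
            # Route groups like "(portal)" affect layout, not the URL
            if s and not (s.startswith("(") and s.endswith(")"))
        ]
        routes.append("/" + "/".join(segments))
    return sorted(set(routes))
```

Dynamic segments (`[id]`) would need a sample entity to resolve, which is one reason the component-to-route mapping is flagged as best-effort.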
4. Three-pass Claude synthesis (~2 days)
scripts/autonomous_changelog/synthesis/. Implemented per the claude-api skill (Anthropic SDK, prompt caching, structured outputs). Detailed in Three-Pass Synthesis below.
5. Voice & anti-slop guardrails (~1 day)
`scripts/autonomous_changelog/synthesis/voice_guard.py`. Detailed in Voice & Anti-Slop Guardrails. Includes:
- Hard-coded regex blocklist (~30 phrases)
- Soft heuristics (sentence length, adjective density, opener patterns)
- Few-shot voice samples in `docs/changelog/_examples/`
- Voice guide in `docs/changelog/_voice-guide.md` (consumed by prompts)
6. Mintlify integration (~1 day)
- `scripts/autonomous_changelog/mintlify_writer.py` — emits per-PR MDX files using the `<Update>` component
- `docs/changelog/index.mdx` — landing page that aggregates entries (newest first, grouped by month)
- `docs/changelog/AGENTS.md` — section-level AI customization (immutability rules, MCP grounding instructions)
- `docs/docs.json` — adds a Changelog tab with auto-grouped pages via glob `changelog/*`
- Mintlify Autopilot is invoked as a secondary review on the generated MDX (catches markdown/component syntax errors before publish)
7. Failure mode + human handoff (~0.5 day)
`scripts/autonomous_changelog/failure_handoff.py`. When Pass 3 rejects, the slop guard fires, or any pass errors:
- Open a follow-up PR titled `chore(changelog): handoff for #{N} — {failure reason}` containing the draft MDX and a structured comment with flagged issues.
- Post a non-blocking PR comment on the original merged PR linking to the handoff PR.
- Never revert the original merge; never block CI on changelog generation.
8. Golden set + voice samples (~1 day)
- Hand-curate five reference posts in `docs/changelog/_examples/` covering: a frontend feature, a backend feature, a bug fix, a refactor with no user-facing change, and a complex multi-component cycle. These are the few-shot exemplars for Pass 2.
- Hand-write `docs/changelog/_voice-guide.md` (one page) — voice rules, what to avoid, what good looks like. Prompts cite this guide.
- Hand-curate the slop blocklist seed list (~30 phrases) from public Linear / Resend / Knock / PostHog / Vercel changelogs (positive examples) versus AI-generated marketing copy (negative examples).
9. Verification, observability, documentation (~0.5 day)
- Unit tests for voice guard (≥ 50 phrase test cases)
- Unit tests for context assembly (3 fixture PRs)
- Integration test: end-to-end on a known-good past PR (e.g., cycle 365 plan PR), output reviewed manually
- Token cost emitted to GitHub Action summary per run
- Failure mode tested by injecting a deliberate slop phrase into Pass 2 output
- Operator/contributor guide: `docs/guides/autonomous-changelog.md`
Out of Scope (Phase 2+)
- Slack notification on publish — adjacent to cycle 400; deferred to keep cycles separate.
- Weekly AI-synthesized “Shipped” roll-up post — a separate generator that consumes the per-PR posts.
- Eval harness auto-trigger from generated “Try it” steps — ties to cycle 209.7 (post-merge validation + evals).
- Internal-only “evaluator notes” section — role-gated content via Mintlify auth tiers.
- Customer email digest — monthly newsletter sourced from changelog.
- Multi-PR release-level summaries — group merged PRs in a release window into a single post.
- Author-edit loop — letting authors comment `/changelog edit` on a PR to trigger regeneration with hints. Phase 2 if friction emerges.
Architecture
Pipeline Flow
Repository Layout
Three-Pass Synthesis
The core IP of this cycle. Every detail matters because the difference between a great post and AI slop lives in the prompts, model choice, and verifier rigor.
Pass 1 — Facts
| Setting | Value |
|---|---|
| Model | claude-opus-4-7 (deepest reasoning for code-diff understanding) |
| Temperature | 0 |
| Tools | None |
| Output | Structured JSON, validated against FactList Pydantic schema |
| Caching | System prompt + voice guide cached (5-min TTL) |
Example facts:
- `claim: "JobGet channel adapter posts jobs to JobGet's /jobs API"`, `evidence: [{file: "backend/domains/hiring/distribution/channels/jobget.py", line: 42}]`, `user_facing: false`, `surface: "backend"`, `confidence: 0.95`
- `claim: "Candidate portal sidebar collapses to icon-only at <768px viewport"`, `evidence: [{file: "web/components/candidate/Sidebar.tsx", line: 87}]`, `user_facing: true`, `surface: "frontend"`, `confidence: 0.9`
- Empty fact list → abort, post a PR comment “diff too sparse to summarize” (e.g., dependency bumps with no behavior change).
- Output fails Pydantic validation → retry once with stricter schema reminder; second failure → abort with handoff.
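The fact shape above maps to a small schema. A dataclass sketch (the real `schemas.py` uses Pydantic; `validate_facts` is a hypothetical stand-in for its validation pass):

```python
from dataclasses import dataclass


@dataclass
class FileLineRef:
    file: str
    line: int


@dataclass
class Fact:
    claim: str
    evidence: list[FileLineRef]
    user_facing: bool
    surface: str       # "frontend" | "backend" | ...
    confidence: float  # model's self-reported confidence, 0.0 to 1.0


def validate_facts(facts: list[Fact]) -> list[Fact]:
    """Minimal structural checks mirroring the validation described above."""
    for f in facts:
        assert f.evidence, f"fact without evidence: {f.claim!r}"
        assert 0.0 <= f.confidence <= 1.0
    return facts
```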
Pass 2 — Narrative
| Setting | Value |
|---|---|
| Model | claude-opus-4-7 (voice + structure) |
| Temperature | 0.4 (some creativity within guardrails) |
| Tools | None |
| Inputs | FactList (Pass 1 output) + cycle doc text + voice guide + few-shot samples + screenshot URLs + linked issues |
| Output | Raw MDX (no frontmatter — writer adds frontmatter) |
| Caching | System prompt + voice guide + few-shot samples cached |
The Pass 2 system prompt:
- Cites `docs/changelog/_voice-guide.md` verbatim
- Includes the five `_examples/*.mdx` posts as in-context few-shot demonstrations
- Names the slop blocklist explicitly (“never use these phrases: …”)
- Instructs Claude to lead with the change (not an announcement), use specifics over abstractions, prefer active voice
- Tells Claude to use Mintlify components (`<Frame>`, `<CardGroup>`, `<Card>`, `<CodeGroup>`) where appropriate
- Requires a “Try it” section if `preview_url` is present
- Invalid MDX (component misuse, unmatched tag) → retry once with error feedback; second failure → handoff.
- Pass 2 ignores few-shot voice → caught by Pass 3 verifier or slop guard.
Pass 3 — Verifier
| Setting | Value |
|---|---|
| Model | claude-sonnet-4-6 (cheaper, faster, sufficient for cross-check) |
| Temperature | 0 |
| Tools | None |
| Inputs | FactList (Pass 1) + narrative MDX (Pass 2) + slop blocklist |
| Output | Structured JSON: Verdict |
- `verdict = "publish"` iff: `len(unsupported_claims) == 0` AND `len(slop_phrases_detected) == 0` AND `len(voice_concerns) <= 2`.
- Otherwise `verdict = "human_review"` and the failure handoff PR opens with the verdict JSON included for context.
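The publish rule is mechanical enough to sketch directly (a dataclass stand-in for the real Pydantic `Verdict` model):

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    unsupported_claims: list[str] = field(default_factory=list)
    slop_phrases_detected: list[str] = field(default_factory=list)
    voice_concerns: list[str] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        # Publish only when fully grounded, slop-free, and at most 2 soft concerns
        ok = (
            not self.unsupported_claims
            and not self.slop_phrases_detected
            and len(self.voice_concerns) <= 2
        )
        return "publish" if ok else "human_review"
```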
Cost & Caching Strategy
Per the claude-api skill, the implementation must use prompt caching:
- System prompt + voice guide + few-shot samples are cached (same Anthropic API key, 5-min TTL). Pass 2’s call hits the cache established by Pass 1. Pass 3 runs on a different model (Sonnet), and Anthropic prompt caches are model-scoped, so it warms its own cache on first use.
- Cycle doc is cached when present (used by Pass 2; also referenced by Pass 1’s reasoning).
- Diff is the only large per-PR input that cannot be cached — it changes every PR.
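Per Anthropic’s prompt-caching API, a `cache_control` breakpoint on the last stable system block caches everything up to and including that block; later calls with the same prefix pay the cheaper cache-read rate. A sketch of how the shared blocks might be assembled (function and argument names are assumptions):

```python
def cached_system_blocks(system_prompt: str, voice_guide: str, few_shot: str) -> list[dict]:
    """Stable content first; the cache_control marker on the final block
    caches the whole prefix (ephemeral cache, ~5-min TTL)."""
    return [
        {"type": "text", "text": system_prompt},
        {"type": "text", "text": voice_guide},
        # Breakpoint: everything above plus this block is cached as one prefix
        {"type": "text", "text": few_shot, "cache_control": {"type": "ephemeral"}},
    ]
```

The per-PR diff goes in the user message, after the cached prefix, so it never invalidates the cache.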
- Pass 1: ~15k input (mostly diff) + ~2k output, Opus 4.7 → ~$0.05
- Pass 2: ~5k input (mostly cached) + ~3k output, Opus 4.7 → ~$0.04
- Pass 3: ~5k input + ~1k output, Sonnet 4.6 → ~$0.01
- Total per post: ~$0.10
- Hard cap: 50k input tokens per pass. Diffs over the cap are truncated by file (whole files preserved, tail dropped) with a “(truncated)” marker.
- If the cap forces truncation of more than 30 % of the diff, the post adds an “Under the hood” disclaimer and links to the full diff on GitHub.
Voice & Anti-Slop Guardrails
This is the single most important section of this cycle. The whole pipeline fails to deliver value if the output reads like AI slop. Four layers of defense:
Layer 1 — Prompt-level (Pass 2 system prompt)
Voice rules embedded in the prompt:
- Lead with the change, not the announcement. Bad: “We’re excited to announce a new way to schedule interviews.” Good: “Interview scheduling now suggests time slots based on the candidate’s stated availability.”
- One specific over three abstractions. Bad: “powerful, intuitive, seamless experience.” Good: “creates a 30-minute slot in the next 48 hours that fits both calendars.”
- Show, don’t tell — screenshots beat adjectives. If you’d reach for an adjective (“clean”, “polished”, “intuitive”), reach for a screenshot instead.
- Active voice, present tense. Bad: “A new feature has been added that allows users to…” Good: “The candidate portal now shows pending interview requests at the top.”
- Names and numbers > generalities. Bad: “much faster”. Good: “p95 search latency dropped from 1.4 s to 240 ms.”
- Say what’s NEW, not what’s “now possible”. Bad: “It’s now possible to filter candidates by skill.” Good: “Candidate list has a Skill filter.”
- Don’t editorialize. No “we think this is going to be transformative.” Just say what shipped.
Layer 2 — Few-shot exemplars (Pass 2 in-context)
Five hand-curated reference posts in `docs/changelog/_examples/`:
| Example | Purpose |
|---|---|
| `frontend-feature.mdx` | A new user-facing feature with screenshots |
| `backend-feature.mdx` | A backend capability with no UI, but downstream impact |
| `bug-fix.mdx` | A reported bug, now fixed — terse, specific |
| `refactor.mdx` | An internal refactor with no behavior change — minimal post |
| `multi-component-cycle.mdx` | A cycle that touched 5+ surfaces — structured, with sections |
The exemplars passed to Pass 2 are selected to match the PR’s surface distribution.
Layer 3 — Deterministic slop guard
scripts/autonomous_changelog/synthesis/voice_guard.py runs after Pass 3, regex-only, no LLM.
Seed blocklist (sample — full list in code, ~30 entries):
Soft heuristics (merged into `Verdict.voice_concerns`):
- Sentence average length > 28 words
- Adjective density > 18% of tokens (per `nltk` POS tag)
- Opening sentence starts with “We ” (lead with the change, not the team)
- More than 2 marketing adjectives in any single sentence
- Use of em-dash chains (3+ in one paragraph — a known Claude tic)
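The deterministic layer reduces to regex plus token arithmetic. A sketch with hypothetical seed patterns (the real ~30-entry blocklist is curated separately):

```python
import re

# Hypothetical seed entries; not the curated production blocklist
SLOP_PATTERNS = [
    r"\bwe'?re (thrilled|excited) to announce\b",
    r"\bseamless(ly)?\b",
    r"\bgame.?chang(er|ing)\b",
]


def slop_hits(text: str) -> list[str]:
    """Deterministic regex pass; any hit routes the post to human handoff."""
    return [p for p in SLOP_PATTERNS if re.search(p, text, re.IGNORECASE)]


def avg_sentence_length(text: str) -> float:
    """Soft heuristic: average words per sentence (flagged when > 28)."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

Because this layer is regex-only, a hit is a hard stop regardless of what Pass 3 concluded, which is the point of defense in depth.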
Layer 4 — Sampling audit (post-publish)
Weekly: a human (rotating; owner = cycle owner this iteration) reads the last 5 published posts and rates each on:
- Specificity (1–5)
- Voice match to references (1–5)
- Would I publish this if I’d written it? (yes/no)
Ratings are logged to `docs/changelog/_audit-log.md`. When patterns emerge (e.g., posts about backend changes are too dry), the voice guide and few-shot examples are updated.
Mintlify Primitives Used (Showcase)
This cycle exercises the full Mintlify AI-native surface. This table is part of the cycle on purpose: the goal is not just “publish a changelog” — it is to demonstrate Mintlify’s AI-native publishing model end to end.

| Primitive | Usage in this cycle |
|---|---|
| MDX files | Native authoring surface — generator emits MDX directly, no transformation layer |
| `<Update>` component | Wraps each entry with a date label, description, and content slot — Mintlify’s first-class changelog primitive |
| `<Frame>`, `<CardGroup>`, `<Card>` | Hero images, “Under the hood” file links, “Try it” callouts |
| `<CodeGroup>`, `<Tabs>` | Multi-language code samples (rare in changelog, supported when needed) |
| Frontmatter | title, description, date, tags, pr, cycle, preview_url, authors — drives navigation, search, AI indexing |
| `docs.json` navigation | Adds a top-level Changelog tab; pages auto-grouped by month via glob pattern `changelog/2026-04-*` |
| Root `AGENTS.md` | Already configured by cycle 214; we extend it with a Changelog section |
| Section `AGENTS.md` | `docs/changelog/AGENTS.md` declares: entries are immutable; Autopilot must not edit them; MCP queries should treat changelog as the canonical “what shipped” source |
| Mintlify Autopilot | Runs as a secondary review on each generated MDX — catches markdown/component syntax errors before publish; if Autopilot rejects, the generator falls through to human handoff |
| Auto-generated MCP server | Evaluators ask Claude/Cursor “what shipped this week?” via specs.flux.employinc.io/mcp; changelog entries are first-class MCP resources |
| `llms.txt` / `llms-full.txt` | Auto-includes changelog entries; downstream agents (support bot, sales bot) can ground answers in shipped features without a separate KB |
| Contextual buttons | Each entry surfaces “Copy”, “Open in Claude”, “Open in Cursor”, “MCP” buttons (configured in `docs.json`) |
| AI traffic analytics | Mintlify dashboard reports which agents read which entries and where they 404 — feedback loop for entry quality |
| Tags + filtering | Domain tags (hiring, distribution, frontend, agents, etc.) drive Mintlify’s tag-filter UI; readers can scope to their area of interest |
| Search | Mintlify’s built-in search indexes entries; tagged for relevance boost on cycle-related queries |
| Bi-directional sync | Generator commits MDX to main; Mintlify auto-deploys within seconds; PMs/engineers can hand-edit a published entry via Mintlify’s web editor and the change syncs back to the repo |
Failure Modes & Recovery
| Stage | Failure | Behavior |
|---|---|---|
| Workflow trigger | Concurrent PR merges | Per-PR concurrency group; each PR processed independently |
| Workflow trigger | Re-fire on already-published PR (label change, manual workflow_dispatch, re-merge after revert) | Detect existing entry by PR number in frontmatter; overwrite only if both Pass 3 and slop guard pass on the new run; otherwise open a handoff PR with a diff-of-diffs explaining what changed |
| Context assembly | Cycle doc not found by branch regex | Continue without cycle doc; log warning; Pass 2 falls back to PR body for “why” |
| Context assembly | Diff is empty (revert, no-op merge) | Skip post entirely; post non-blocking PR comment “no changelog entry — no diff” |
| Context assembly | Diff > 5 000 lines / > 50 files | Generate post but flag as “large change — review recommended”; truncate diff input |
| Playwright | No user-facing routes detected | Generate post without screenshots (backend-only style) |
| Playwright | Browser crash / route 500 | Capture error-state screenshot; note in narrative; continue |
| Pass 1 — Facts | Empty fact list | Abort; PR comment “diff too sparse to summarize” |
| Pass 1 — Facts | JSON validation fails | Retry once with stricter schema reminder; second failure → handoff |
| Pass 2 — Narrative | Invalid MDX (parse fails) | Retry once with error feedback; second failure → handoff |
| Pass 2 — Narrative | Slop voice (caught by Pass 3) | Handoff PR opened with flagged phrases |
| Pass 3 — Verifier | Unsupported claims detected | Handoff PR opened with claim list and fact list for human review |
| Pass 3 — Verifier | Pass 3 itself errors | Default to handoff (fail closed) |
| Slop guard | Regex hit | Handoff PR opened with matched phrases highlighted |
| Mintlify writer | MDX file write fails | Retry; if persistent, handoff PR with content as artifact |
| Git commit/push | Push conflict | Pull latest, retry; on second failure, open handoff PR |
| Mintlify deploy | Mintlify Autopilot rejects MDX | Open handoff PR with Autopilot feedback included |
Quality Bar — Definition of “Not AI Slop”
A passing post must satisfy all of the following:
- Specificity — every benefit claim has a concrete artifact (screenshot, code link, number, named feature)
- Brevity — opening paragraph ≤ 3 sentences; full post ≤ 400 words for a typical PR (multi-component cycles get more)
- Voice — zero hits on the slop blocklist; ≤ 2 soft heuristic violations
- Grounding — every factual claim in the narrative maps to a fact from Pass 1 (verified by Pass 3)
- Visual — at least 1 screenshot if the diff touches user-facing routes
- Navigability — “Try it” link present and resolves (curl HEAD check at publish time)
- Cycle context — if a cycle doc exists, the “why” is reflected (verified by Pass 3 — the narrative must contain at least one phrase semantically aligned with the cycle doc’s overview)
This bar is tested in `tests/test_synthesis_e2e.py` against fixture PRs and enforced by Pass 3 + the slop guard at runtime.
Implementation Plan
Step 1 — Scaffold + GH Action skeleton (~0.5 day)
- Create `scripts/autonomous_changelog/` package with `__init__.py`, `pipeline.py` stub
- Create `.github/workflows/autonomous-changelog.yml` with trigger + Python setup, calling `pipeline.py --pr-number <N>`
- Wire dry-run mode (no commit, prints MDX to logs) for testing
- Permissions: `contents: write`, `pull-requests: write`, `issues: write`
- Concurrency group: `changelog-pr-${{ github.event.pull_request.number }}`
Step 2 — Context assembly module (~1 day)
- Pydantic schemas in `schemas.py` (`ChangelogContext`, `Fact`, `FactList`, `Verdict`, `FileLineRef`)
- `context_assembly.py` with parallel async fetches via `asyncio.gather`
- Branch-name → cycle doc resolution
- Diff truncation logic (whole-file preservation, tail-drop)
- Unit tests with 3 fixture PRs (frontend feature, backend-only, large refactor)
Step 3 — Playwright screenshot runner (~1 day)
- `screenshot_runner.py` with route discovery from diff paths
- Playwright Chromium (pinned version), desktop + mobile viewports
- Screenshot output to `docs/changelog/images/<slug>/`
- Failure tolerance: per-route try/except, never aborts the pass
- Skip when no user-facing routes touched
Step 4 — Three-pass synthesis (~2 days)
- Anthropic SDK with prompt caching (per claude-api skill)
- `synthesis/facts.py` (Pass 1) — Opus 4.7, structured output with Pydantic
- `synthesis/narrative.py` (Pass 2) — Opus 4.7, MDX output, few-shot from `_examples/`
- `synthesis/verifier.py` (Pass 3) — Sonnet 4.6, structured Verdict output
- Prompt files in `synthesis/prompts/` — reviewable, version-controlled
- Token cost emitted per pass to GH Action summary
- Integration test: end-to-end on cycle 365 plan PR fixture; manual review of output
Step 5 — Voice guard + slop blocklist (~1 day)
- `voice_guard.py` with seed regex blocklist (~30 patterns)
- Soft heuristics (sentence length, adjective density, opener pattern, em-dash chain)
- Unit tests with ≥ 50 phrase test cases (positive + negative)
- Integration: voice guard runs after Pass 3, results merged into Verdict
Step 6 — Mintlify integration (~1 day)
- `mintlify_writer.py` — emits MDX with frontmatter + `<Update>` wrapper
- `docs/changelog/index.mdx` — landing page with monthly grouping
- `docs/changelog/AGENTS.md` — section-level AI customization (immutability, MCP grounding)
- Edit `docs/docs.json` — add Changelog tab with auto-glob pages
- `_voice-guide.md` — voice rules (consumed by Pass 2 prompt)
- Verify Mintlify renders generated entries correctly (manual check on a deployed preview)
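The writer's core is string assembly: frontmatter plus the `<Update>` wrapper. A sketch (field names follow this cycle's frontmatter spec; the formatting details are assumptions):

```python
from datetime import date


def render_entry(title: str, description: str, pr: int, cycle: int,
                 published: date, body_mdx: str) -> str:
    """Emit an MDX entry: frontmatter block, then the <Update> wrapper."""
    frontmatter = "\n".join([
        "---",
        f'title: "{title}"',
        f'description: "{description}"',
        f"date: {published.isoformat()}",
        f"pr: {pr}",
        f"cycle: {cycle}",
        "immutable: true",  # per docs/changelog/AGENTS.md immutability rule
        "---",
    ])
    update = (
        f'<Update label="{published.strftime("%B %d, %Y")}" '
        f'description="PR #{pr}">\n{body_mdx}\n</Update>'
    )
    return f"{frontmatter}\n\n{update}\n"
```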
Step 7 — Failure mode + human handoff (~0.5 day)
- `failure_handoff.py` — open follow-up PR with draft MDX + structured comment
- PR comment integration on the original merged PR
- Test by injecting deliberate slop into Pass 2 output
Step 8 — Golden set + voice samples (~1 day)
- Hand-curate 5 reference posts in `_examples/`
- Hand-write `_voice-guide.md`
- Curate slop blocklist seed (~30 patterns from real changelog corpora)
Step 9 — Verification + observability + documentation (~0.5 day)
- Operator/contributor guide: `docs/guides/autonomous-changelog.md`
- Token cost monitoring (GH Action summary + Mintlify analytics)
- Sampling audit log: `docs/changelog/_audit-log.md` template
- Final E2E test: run pipeline on 3 historical PRs, manually review outputs
Verification Plan
- `.github/workflows/autonomous-changelog.yml` triggers on PR merge to `main` and only on merge (closed without merge does not fire)
- Workflow completes in < 10 minutes for a typical PR (≤ 1,000 changed lines)
- Generated MDX validates against Mintlify schema (Autopilot review passes or workflow rejects)
- Slop blocklist catches all 50 phrase test cases in `test_voice_guard.py`
- Pass 3 verifier catches injected hallucinations in 5 deliberate test cases
- Generated post for cycle 365 plan PR (test sample) passes voice review by a human
- Generated post for cycle 401 implementation PR (test sample) passes voice review by a human
- Generated post for a hypothetical “fixes typo” PR is either suppressed (per quality bar) or appropriately terse
- `docs/docs.json` Changelog tab navigates to entries; entries render with the `<Update>` wrapper
- `docs/changelog/AGENTS.md` is detected by Mintlify (verified in Mintlify dashboard)
- Mintlify MCP server returns changelog entries for the query “what shipped this week?”
- `llms.txt` includes changelog entries (verified at the deployed `/llms.txt` URL)
- Token cost per PR ≤ $0.20 (caching working — Pass 2 + 3 inputs largely cached)
- Token cost monitor reports per-pass token usage in GH Action summary
- Failure handoff opens a follow-up PR within 60 seconds of verifier rejection
- Failure handoff PR contains the draft MDX and a structured list of flagged issues
- Non-blocking PR comment posted on the original merged PR (link to either the published entry or the handoff PR)
- Sampling audit log template exists at `docs/changelog/_audit-log.md`
- Operator guide `docs/guides/autonomous-changelog.md` exists and explains: configuration, debugging, prompt iteration, audit cadence
- `make quality-gates` green (lint + format + typecheck + tests)
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Generated narratives still feel AI-written despite the three-layer guard | Defeats the whole purpose | Few-shot from real human-curated samples; verifier slop check; deterministic regex guard; weekly sampling audit; iterate prompts when patterns emerge from audit |
| Pass 3 verifier false positives block legitimate posts | Toil — every PR needs human polish | Calibrate verdict threshold against 20 hand-labeled fixtures before launch; track override rate as quality signal; allow author label changelog:approve-handoff to publish a handoff draft as-is |
| Pass 3 verifier false negatives let slop through | Quality leak | Sampling audit (weekly, last 5 posts); deterministic regex guard as defense in depth; voice guide updated quarterly based on audit findings |
| Cost per PR exceeds estimate (large diffs, many PRs) | Token spend | Hard 50k input cap per pass with truncation; Pass 3 uses cheaper Sonnet; cost emitted to GH Action summary; alert if weekly spend > $50 |
| Cycle doc not found for a branch (legacy or non-cycle work) | Loss of “why” context | Fallback to PR body and linked issues; over time cycle 381 enforces cycle docs per cycle; document the fallback in operator guide |
| Diff is too large or too unfocused to summarize meaningfully | Generic post | Skip post (diff > 5 000 lines or > 50 files) and open a non-blocking PR comment “large change — manual changelog recommended”; provide a starter template |
| Race condition: two PRs merge in same minute | Filename collision | Filename uses merge-commit SHA suffix on collision; fall back to YYYY-MM-DD-<slug>-<sha7>.mdx |
| Mintlify Autopilot accidentally edits historical posts | Loss of immutable record | docs/changelog/AGENTS.md declares entries immutable; entries’ frontmatter contains immutable: true; Autopilot configuration set to ignore the directory by default |
| Playwright dependency makes CI slow or flaky | Workflow latency / failure | Pin Playwright Docker image; per-route try/except so single bad route never aborts; investigate Mintlify preview screenshot service as a Phase 2 optimization once cycle 214’s Mintlify implementation lands |
| Author objects to autogenerated post about their PR | Process friction | changelog:skip label suppresses generation; changelog:edit label triggers handoff (draft only); Mintlify web editor lets author hand-edit a published entry, which syncs back |
| Voice guide drift — what feels SOTA today feels stale in 6 months | Long-term staleness | Quarterly review of _voice-guide.md and _examples/ against current SOTA changelogs (Linear, Resend, etc.); voice guide is version-controlled; refresh is a one-day chore |
| Generator publishes sensitive details (e.g., security fix details before disclosure) | Disclosure risk | security label on a PR routes to handoff PR (no auto-publish); Pass 1 prompt instructed to never describe vulnerability mechanics |
| First N posts will need iteration after launch | Early-life messiness | Reserve a follow-up cycle (after first 20 posts ship) for prompt + voice-guide iteration informed by audit |
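The filename-collision mitigation above is a few lines of slug arithmetic. A sketch (function name and the `existing` set argument are assumptions):

```python
import re
from datetime import date


def entry_filename(title: str, merged: date, merge_sha: str, existing: set[str]) -> str:
    """Date-slug filename, adding a short-SHA suffix only when two PRs collide."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"{merged.isoformat()}-{slug}.mdx"
    if name in existing:
        # Merge-commit SHAs differ even for same-minute merges of similar titles
        name = f"{merged.isoformat()}-{slug}-{merge_sha[:7]}.mdx"
    return name
```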
Phase 2 Roadmap (Future Cycles)
| Phase 2 Item | Description | Likely Cycle Number |
|---|---|---|
| Slack publish notification | Cross-post Mintlify URL + hero image to a Slack channel on publish | TBD — coordinate with cycle 400’s surface |
| Weekly “Shipped” roll-up | A separate generator that consumes the week’s per-PR posts and writes a narrative weekly summary for external readers | TBD |
| Eval harness auto-trigger | Generated “Try it” steps fed into eval scenario nominator | Ties to cycle 209.7 (post-merge validation + evals) |
| Internal-only “evaluator notes” section | Role-gated content via Mintlify auth tiers; technical detail for testers | TBD |
| Customer email digest | Monthly newsletter sourced from changelog tags; uses email service | TBD |
| Multi-PR release-level summaries | Group merged PRs in a release window into a single post (vs per-PR) | TBD |
| Author-edit loop | /changelog edit <hint> PR comment triggers regeneration with author hint | TBD if friction emerges |
| Translation (es/pt-BR for pilot regions) | Mintlify supports i18n; auto-translate via Claude as a fourth pass | TBD post-pilot |
| Visual diff comparison | Use the screenshot pass to capture before/after of the same route across the merge | TBD if value is demonstrated |
Relationship to Other Cycles
- Cycle 214 — Spec-driven dev with Mintlify (REQUIRED dependency). Provides the Mintlify platform, `docs.json` navigation pattern, root `AGENTS.md`, and Autopilot configuration that this cycle extends. This cycle cannot ship until 214’s Mintlify implementation PR has merged.
- Cycle 400 — `/changelog` Slack digest skill (ADJACENT, complementary). Slack-facing team digest of merged PRs. Different surface (Slack vs Mintlify), different cadence (daily cron vs per-PR merge), different audience (team vs external + evaluators). Phase 2 will tie the two together (publish notification cross-posts to Slack).
- Cycle 401 — `/standup` skill (ADJACENT, similar shape). Both cycle 400 and 401 are interactive Claude Code skills. Cycle 451 is fully autonomous (GH Action only) but shares conventions for `gh api` PR fetching and conventional-commit grouping.
- Cycle 209.7 — Post-merge validation + evals (PHASE 2 INTEGRATION TARGET). The “Try it” sections this cycle generates can feed back into eval scenario nomination — when the changelog says “candidate portal sidebar collapses at <768px”, that becomes a candidate eval fixture.
- Cycle 221 — Chief Engineer review (COMPLEMENTARY). CE reviews quality of code; this cycle reports quality of shipped product. Both feed the AI-native quality loop.
- Cycle 380 — UI design skill (REFERENCED). Screenshot conventions and `docs/design/cycle-{N}/` artifact patterns inform this cycle’s `docs/changelog/images/<slug>/` structure.
- Cycle 365 — Pilot evals harness (REFERENCED). LLM-as-judge model selection pattern (distinct judge model from agent model) informs Pass 3 verifier model choice (Sonnet for Pass 3 vs Opus for Pass 1/2).
- Cycle 381 — Issue-number cycle IDs (CONVENTION). This cycle follows the new convention; cycle number 451 = issue 451.
AI-Native Manifesto Alignment
| § Principle | How This Cycle Embodies It |
|---|---|
| §0 Uncompromising Quality | Three-pass synthesis with verifier; deterministic slop blocklist; sampling audit; no shortcuts on output quality. The whole cycle exists because terse PR descriptions are not SOTA enough. |
| §1 One Mind, Full Context | Generator reads full diff + cycle doc + CI results + linked issues + screenshots — holistic context per post, not file-by-file. |
| §3 Agentic Architecture | Three Claude passes are agents with distinct roles (Facts extractor, Narrative writer, Verifier judge), not LLM wrappers. Each pass observes (reads inputs), reasons (within prompt rules), acts (produces structured output). |
| §5 Observability-Native | Token cost per post tracked and emitted to GH Action summary; verifier pass-rate tracked; sampling audit results logged to _audit-log.md; Mintlify AI traffic analytics tracks consumption. |
| §8 100% AI-Generated Code with Safety Nets | Generator is itself AI-generated (this cycle and its implementation). Safety nets: deterministic slop guard, verifier pass, sampling audit, human handoff fallback. |
| §10 Spec-Driven Traceability | Cycle doc → implementation PR → autonomous changelog entry → Mintlify-published — full traceability from spec to public surface. The changelog entry frontmatter cites cycle and PR. |
| §11 Cross-Model Review | Pass 1 (Opus) extracts; Pass 2 (Opus, different temperature) writes; Pass 3 (Sonnet) reviews — cross-model verification within a single workflow. |
Notes
- This cycle is meta in a productive way: when the implementation PR for this cycle merges, the resulting changelog entry will be the first autonomous post, generated by the system describing the system. That’s the canonical validation — if Cycle 451’s own changelog post is good, the system works.
- The post is canonical; the PR description is not. Authors can put rough notes in PR bodies (or skip them) and trust the generator to polish the public-facing artifact. This should reduce PR-description toil over time.
- Treat the first 20 posts as a prototype run. Reserve a follow-up cycle for prompt and voice-guide iteration informed by sampling audit results — the slop blocklist will need expansion as new patterns emerge.
- `AGENTS.md` policies open the door for downstream agents (support bot, sales bot) to consume changelog entries via Mintlify’s MCP server to answer “does Flux do X?” — this turns the changelog into a queryable product knowledge base. Phase 2.
- The dependency on cycle 214’s Mintlify implementation PR is hard. If that PR is delayed, this cycle’s code phase waits. The plan PR (this doc) does not depend on it.
- Cost is not a meaningful constraint. The pipeline costs ~$0.10 per post, which is dwarfed by the human time saved. The only real budget concern is keeping verifier false-positive rates low so engineers don’t spend 10 minutes per handoff PR.