
Cycle 451: Autonomous Release Notes — Per-PR “What’s Changed” Blog Posts in Mintlify

Priority: HIGH (pre-pilot DX investment, AI-native publishing showcase)
Status: DONE
Domain: infra
Wave: 10 (Process & Spec-Driven Dev)
Milestone: Pre-pilot DX
Owner: @pj
Dependencies: Cycle 214 (Mintlify platform foundation, docs.json, root AGENTS.md)
Issue: #451
Plan PR: #452
Product: Flux — AI-native hiring platform
Organization: Employ Inc. (employ-inc GitHub org)

Overview

Per-PR release communication today is limited to PR descriptions (uneven quality), commit messages (no narrative), and — once cycle 400 lands — a Slack digest of merged PRs (team-facing, daily, multi-PR). What’s missing is a single, polished, navigable changelog surface that evaluators, testers, employer customers, and external readers can use to see what shipped and click through to try it.

Rapid SOTA teams (Linear, Resend, Knock, PostHog, Vercel) all publish blog-style changelogs, and almost all of them are human-written — which is why most rapid teams’ changelogs are stale, terse, or missing. Flux has an asymmetric advantage here: cycle docs already carry the “why” (motivation, scope, design), diffs are fully accessible, Playwright is already in the stack, and Mintlify (cycle 214) is the publishing surface. Feed those four inputs into a three-pass Claude pipeline — Facts → Narrative → Verifier, with a hard-coded slop-voice guard — and we can produce SOTA narrative changelog entries on every merge to main with no human in the loop, without sounding like AI slop.

This cycle ships the autonomous generator end to end: GitHub Action trigger, context assembly, Playwright screenshots, three-pass synthesis, Mintlify MDX output, and a graceful human-handoff failure mode. The post is the canonical artifact for “what shipped” — not the PR description, not the commit message, not the Slack digest.

This is also a deliberate showcase of Mintlify’s full AI-native surface: MDX, <Update> component, section-level AGENTS.md, Autopilot as a secondary reviewer, auto-generated MCP server (so AI agents can answer “what shipped this week?”), llms.txt for downstream agent grounding, and Mintlify’s contextual buttons for one-click handoff to Claude/Cursor.

Current State

What Works

  • Cycle 214 has merged the spec-driven dev model and selected Mintlify as the docs platform. The docs.json config, root AGENTS.md, and Mintlify Autopilot are all part of the platform foundation.
  • Cycle 400 (open implementation PR #418) ships a /changelog Claude Code skill that generates a Slack-formatted digest of merged PRs — different surface, different cadence, different audience.
  • Cycle 380 (UI design skill) establishes Playwright screenshot conventions and docs/design/cycle-{N}/ artifact patterns we can mirror.
  • CI infrastructure: gh api access to PR metadata, check runs, and diff is already wired through GitHub Actions.

What’s Missing

  • No per-PR changelog surface — readers (evaluators, testers, customers) have no canonical place to see what shipped, formatted for human consumption.
  • No Mintlify changelog page — docs/changelog/ does not exist.
  • No autonomous publishing pipeline — Mintlify has Autopilot for spec drift, but no out-of-the-box “PR → narrative blog post” generator.
  • No voice-quality enforcement — nothing prevents AI-generated marketing-speak slop from being published.
  • No diff → screenshot pipeline — Playwright is wired for cycle 380’s design workflow but not for changelog automation.

Scope

In Scope (Phase 1 — this cycle)

1. GitHub Action autonomous-changelog.yml (~0.5 day)

Triggers (parity with cycle 400’s daily-changelog.yml):
  • on: pull_request: types: [closed] filtered to merged == true against main — primary path
  • on: workflow_dispatch: with pr_number input — manual regeneration / debugging / replaying a PR after prompt iteration
Permissions: contents: write (commit MDX), pull-requests: write (open follow-up PR on failure), issues: write (post status comment). Concurrency: scoped per-PR to prevent duplicate generation on retry.

2. Context assembly module (~1 day)

scripts/autonomous_changelog/context_assembly.py. Pure function assemble_context(pr_number) -> ChangelogContext that gathers, in parallel:
  • PR metadata — title, body, author, labels, conventional commit prefix from title
  • Diff — gh api repos/.../pulls/{n}/files for file list; git diff base...merge_sha -U3 for hunks
  • Commits — full commit messages and bodies via gh pr view --json commits
  • Cycle doc — resolved from branch name regex cycle(\d+(\.\d+)?)/... → docs/roadmap/cycles/cycle{N}-*.md. Falls back to PR body link extraction if branch doesn’t match.
  • CI results — gh api repos/.../check-runs for the merge SHA. Captures pass/fail summary per check, not full logs.
  • Linked issues — parse Closes #N / Fixes #N from PR body; fetch issue titles for context.
  • Preview/staging URL — read from existing GH deployment status API (cycle 209.2 preview env) or fall back to staging.
Output: a single ChangelogContext Pydantic model (defined in schemas.py). Heavy diffs are truncated to the first 10 000 changed lines + a one-line summary per truncated file, to bound token cost.
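A minimal sketch of the parallel gather described above, using stdlib asyncio only. The fetcher bodies are placeholders (the real module shells out to gh api), and the plain dict stands in for the ChangelogContext Pydantic model, so this is a shape illustration rather than the implementation:

```python
import asyncio

# Placeholder fetchers; the real context_assembly.py calls `gh api` / git.
async def fetch_pr_metadata(pr: int) -> dict:
    return {"title": f"PR #{pr}", "labels": []}

async def fetch_diff(pr: int) -> str:
    return ""

async def fetch_ci_results(pr: int) -> dict:
    return {}

async def assemble_context(pr: int) -> dict:
    # All inputs are independent, so they can be gathered concurrently.
    meta, diff, ci = await asyncio.gather(
        fetch_pr_metadata(pr), fetch_diff(pr), fetch_ci_results(pr)
    )
    return {"pr_number": pr, "meta": meta, "diff": diff, "ci": ci}

ctx = asyncio.run(assemble_context(451))
```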

3. Playwright screenshot pass (~1 day)

scripts/autonomous_changelog/screenshot_runner.py. Diff-driven route discovery:
  • Identify changed user-facing routes by scanning web/app/**/page.tsx paths in the diff and mapping to URL paths.
  • Identify changed components by file path; for each, find a containing route via static analysis (best-effort).
  • Spin up Playwright (Chromium, stable) against the already-deployed preview or main URL (no local app spin-up). Capture each route at desktop (1440×900) and mobile (390×844).
  • Save to docs/changelog/images/<slug>/<route-slug>-{desktop,mobile}.png.
  • Capture metadata: route, viewport, response status, capture timestamp.
Failure tolerance: a single bad route never aborts the pass — log and continue. If no user-facing routes are detected (pure backend PR), skip screenshots entirely.

4. Three-pass Claude synthesis (~2 days)

scripts/autonomous_changelog/synthesis/. Implemented per the claude-api skill (Anthropic SDK, prompt caching, structured outputs). Detailed in Three-Pass Synthesis below.

5. Voice & anti-slop guardrails (~1 day)

scripts/autonomous_changelog/synthesis/voice_guard.py. Detailed in Voice & Anti-Slop Guardrails. Includes:
  • Hard-coded regex blocklist (~30 phrases)
  • Soft heuristics (sentence length, adjective density, opener patterns)
  • Few-shot voice samples in docs/changelog/_examples/
  • Voice guide in docs/changelog/_voice-guide.md (consumed by prompts)

6. Mintlify integration (~1 day)

  • scripts/autonomous_changelog/mintlify_writer.py — emits per-PR MDX files using the <Update> component
  • docs/changelog/index.mdx — landing page that aggregates entries (newest first, grouped by month)
  • docs/changelog/AGENTS.md — section-level AI customization (immutability rules, MCP grounding instructions)
  • docs/docs.json — adds a Changelog tab with auto-grouped pages via glob changelog/*
  • Mintlify Autopilot is invoked as a secondary review on the generated MDX (catches markdown/component syntax errors before publish)
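A hedged sketch of the MDX emission step in mintlify_writer.py: frontmatter plus an <Update> wrapper. The frontmatter keys mirror those named in this cycle (title, description, date, pr); the <Update> props shown (label, description) follow Mintlify's documented component but should be treated as assumptions here, as should the render_entry name:

```python
from datetime import date

def render_entry(title: str, description: str, pr: int, body_mdx: str,
                 day: date) -> str:
    """Emit one changelog entry: YAML frontmatter + <Update>-wrapped body."""
    frontmatter = "\n".join([
        "---",
        f"title: {title!r}",
        f"description: {description!r}",
        f"date: {day.isoformat()}",
        f"pr: {pr}",            # used to detect re-fires on the same PR
        "---",
    ])
    update = (f'<Update label="{day.isoformat()}" description={description!r}>\n'
              f"{body_mdx}\n</Update>")
    return f"{frontmatter}\n\n{update}\n"
```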

7. Failure mode + human handoff (~0.5 day)

scripts/autonomous_changelog/failure_handoff.py. When Pass 3 rejects, slop-guard fires, or any pass errors:
  • Open a follow-up PR titled chore(changelog): handoff for #{N} — {failure reason} containing the draft MDX and a structured comment with flagged issues.
  • Post a non-blocking PR comment on the original merged PR linking to the handoff PR.
  • Never revert the original merge; never block CI on changelog generation.

8. Golden set + voice samples (~1 day)

  • Hand-curate five reference posts in docs/changelog/_examples/ covering: a frontend feature, a backend feature, a bug fix, a refactor with no user-facing change, and a complex multi-component cycle. These are the few-shot exemplars for Pass 2.
  • Hand-write docs/changelog/_voice-guide.md (one page) — voice rules, what to avoid, what good looks like. Prompts cite this guide.
  • Hand-curate the slop blocklist seed list (~30 phrases) from public Linear / Resend / Knock / PostHog / Vercel changelogs (positive examples) versus AI-generated marketing copy (negative examples).

9. Verification, observability, documentation (~0.5 day)

  • Unit tests for voice guard (≥ 50 phrase test cases)
  • Unit tests for context assembly (3 fixture PRs)
  • Integration test: end-to-end on a known-good past PR (e.g., cycle 365 plan PR), output reviewed manually
  • Token cost emitted to GitHub Action summary per run
  • Failure mode tested by injecting a deliberate slop phrase into Pass 2 output
  • Operator/contributor guide: docs/guides/autonomous-changelog.md
Total: ~8.5 engineer-days (≈ 1.5 weeks)

Out of Scope (Phase 2+)

  • Slack notification on publish — adjacent to cycle 400; deferred to keep cycles separate.
  • Weekly AI-synthesized “Shipped” roll-up post — a separate generator that consumes the per-PR posts.
  • Eval harness auto-trigger from generated “Try it” steps — ties to cycle 209.7 (post-merge validation + evals).
  • Internal-only “evaluator notes” section — role-gated content via Mintlify auth tiers.
  • Customer email digest — monthly newsletter sourced from changelog.
  • Multi-PR release-level summaries — group merged PRs in a release window into a single post.
  • Author-edit loop — letting authors comment /changelog edit on a PR to trigger regeneration with hints. Phase 2 if friction emerges.

Architecture

Pipeline Flow

PR merged to main


┌──────────────────────────────────────────────────────────────┐
│ .github/workflows/autonomous-changelog.yml                    │
│   trigger: pull_request closed && merged == true              │
│   concurrency: per-PR (cancel-in-progress: false)             │
└──────────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────────┐
│ scripts/autonomous_changelog/pipeline.py (orchestrator)       │
└──────────────────────────────────────────────────────────────┘

    ├─▶ context_assembly.py       (PR + diff + cycle doc + CI)

    ├─▶ screenshot_runner.py      (Playwright on changed routes)


┌──────────────────────────────────────────────────────────────┐
│ synthesis/                                                    │
│   pass 1: facts.py        (Opus 4.7, T=0, structured JSON)    │
│   pass 2: narrative.py    (Opus 4.7, T=0.4, MDX output)       │
│   pass 3: verifier.py     (Sonnet 4.6, T=0, verdict JSON)     │
│   voice_guard.py          (regex blocklist + heuristics)      │
└──────────────────────────────────────────────────────────────┘

    ├─[verdict: publish]─▶ mintlify_writer.py
    │                          │
    │                          ▼
    │                      docs/changelog/YYYY-MM-DD-<slug>.mdx
    │                          │
    │                          ▼
    │                      git commit + push to main
    │                          │
    │                          ▼
    │                      Mintlify auto-deploys
    │                          │
    │                          ▼
    │                      PR comment: "Published → <Mintlify URL>"

    └─[verdict: human_review]─▶ failure_handoff.py


                               open follow-up PR with draft MDX


                               PR comment: "Handoff PR opened → #{M}"

Repository Layout

.github/workflows/
└── autonomous-changelog.yml          NEW — trigger workflow

scripts/autonomous_changelog/         NEW — Python module
├── __init__.py
├── pipeline.py                       Orchestrator (CLI entrypoint)
├── context_assembly.py               PR/diff/cycle/CI gather (parallel async)
├── screenshot_runner.py              Playwright route discovery + capture
├── mintlify_writer.py                MDX file generation, frontmatter, <Update>
├── failure_handoff.py                Open follow-up PR on rejection
├── schemas.py                        Pydantic models (ChangelogContext, FactList, Verdict)
├── synthesis/
│   ├── __init__.py
│   ├── facts.py                      Pass 1: extract verified facts from diff
│   ├── narrative.py                  Pass 2: write MDX in Linear voice
│   ├── verifier.py                   Pass 3: cross-check + slop check
│   ├── voice_guard.py                Regex blocklist + heuristics
│   └── prompts/
│       ├── facts.md                  System prompt for Pass 1
│       ├── narrative.md              System prompt for Pass 2 (cites voice guide)
│       └── verifier.md               System prompt for Pass 3

tests/autonomous_changelog/           NEW — root-level test suite (matches cycle 400 convention)
├── test_voice_guard.py
├── test_context_assembly.py
├── test_synthesis_e2e.py             Integration test on fixture PR
└── fixtures/
    ├── pr-frontend-feature.json
    ├── pr-backend-only.json
    └── pr-large-refactor.json

docs/changelog/                       NEW — Mintlify-published surface
├── index.mdx                         Landing page (newest first, by month)
├── AGENTS.md                         Section-level AI customization
├── _voice-guide.md                   Voice rules (read by Pass 2 prompt)
├── _examples/                        Few-shot voice samples (5 hand-curated)
│   ├── frontend-feature.mdx
│   ├── backend-feature.mdx
│   ├── bug-fix.mdx
│   ├── refactor.mdx
│   └── multi-component-cycle.mdx
├── images/                           Per-post screenshots (one dir per slug)
└── YYYY-MM-DD-<slug>.mdx             Per-PR posts (generated)

docs/docs.json                        EDIT — add Changelog tab
docs/guides/autonomous-changelog.md   NEW — operator/contributor guide

Three-Pass Synthesis

The core IP of this cycle. Every detail matters because the difference between a great post and AI slop lives in the prompts, model choice, and verifier rigor.

Pass 1 — Facts

| Setting | Value |
| --- | --- |
| Model | claude-opus-4-7 (deepest reasoning for code-diff understanding) |
| Temperature | 0 |
| Tools | None |
| Output | Structured JSON, validated against FactList Pydantic schema |
| Caching | System prompt + voice guide cached (5-min TTL) |
System prompt directs Claude to extract a flat list of factual claims from the diff. Each fact carries:
from pydantic import BaseModel
from typing import Literal

class Fact(BaseModel):
    claim: str                          # one-sentence factual statement
    evidence: list[FileLineRef]         # ≥1 file:line references; FileLineRef defined in schemas.py
    user_facing: bool                   # affects users vs. internal-only
    surface: Literal["backend", "frontend", "infra", "docs", "config", "test"]
    confidence: float                   # 0.0–1.0
Examples of good facts:
  • claim: "JobGet channel adapter posts jobs to JobGet's /jobs API", evidence: [{file: "backend/domains/hiring/distribution/channels/jobget.py", line: 42}], user_facing: false, surface: "backend", confidence: 0.95
  • claim: "Candidate portal sidebar collapses to icon-only at <768px viewport", evidence: [{file: "web/components/candidate/Sidebar.tsx", line: 87}], user_facing: true, surface: "frontend", confidence: 0.9
Failure modes:
  • Empty fact list → abort, post a PR comment “diff too sparse to summarize” (e.g., dependency bumps with no behavior change).
  • Output fails Pydantic validation → retry once with stricter schema reminder; second failure → abort with handoff.
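The validate-and-retry behavior above can be sketched as a small loop. Here call_model stands in for the Anthropic API call so the logic is testable offline, and the shape check is a stand-in for full FactList Pydantic validation; extract_facts is an illustrative name:

```python
import json

def extract_facts(call_model, max_attempts: int = 2) -> list[dict]:
    """One retry with a stricter schema reminder; second failure -> handoff."""
    prompt = "Extract facts from the diff as a JSON array."  # placeholder prompt
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            facts = json.loads(raw)
            # Stand-in for Pydantic validation against FactList.
            if isinstance(facts, list) and all("claim" in f for f in facts):
                return facts
        except json.JSONDecodeError:
            pass
        prompt += "\nReminder: output MUST be a JSON array of fact objects."
    raise RuntimeError("fact extraction failed twice -> human handoff")
```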

Pass 2 — Narrative

| Setting | Value |
| --- | --- |
| Model | claude-opus-4-7 (voice + structure) |
| Temperature | 0.4 (some creativity within guardrails) |
| Tools | None |
| Inputs | FactList (Pass 1 output) + cycle doc text + voice guide + few-shot samples + screenshot URLs + linked issues |
| Output | Raw MDX (no frontmatter — writer adds frontmatter) |
| Caching | System prompt + voice guide + few-shot samples cached |
System prompt:
  • Cites docs/changelog/_voice-guide.md verbatim
  • Includes the five _examples/*.mdx posts as in-context few-shot demonstrations
  • Names the slop blocklist explicitly (“never use these phrases: …”)
  • Instructs Claude to lead with the change (not an announcement), use specifics over abstractions, prefer active voice
  • Tells Claude to use Mintlify components (<Frame>, <CardGroup>, <Card>, <CodeGroup>) where appropriate
  • Requires a “Try it” section if preview_url is present
Output structure (target template, not enforced rigidly):
{Hero image — first screenshot, or cycle-doc diagram if backend-only}

{Opening paragraph — ≤3 sentences. Lead with the change. Specific.}

{Body — what changed, surfaced through screenshots / specifics / numbers.
 Inline screenshots via <Frame>. Avoid sub-headings unless the post is long.}

## Try it
{Link to preview URL or staging environment, with one-line "what to look at".}

## Under the hood
{Terse bullet list with file:line links to GitHub. For curious readers.}
Failure modes:
  • Invalid MDX (component misuse, unmatched tag) → retry once with error feedback; second failure → handoff.
  • Pass 2 ignores few-shot voice → caught by Pass 3 verifier or slop guard.

Pass 3 — Verifier

| Setting | Value |
| --- | --- |
| Model | claude-sonnet-4-6 (cheaper, faster, sufficient for cross-check) |
| Temperature | 0 |
| Tools | None |
| Inputs | FactList (Pass 1) + narrative MDX (Pass 2) + slop blocklist |
| Output | Structured JSON: Verdict |
from pydantic import BaseModel
from typing import Literal

class Verdict(BaseModel):
    verified_claims: list[ClaimMapping]      # narrative claim → fact ID; ClaimMapping defined in schemas.py
    unsupported_claims: list[str]            # claims with no fact backing
    slop_phrases_detected: list[str]         # blocklist hits
    voice_concerns: list[str]                # heuristic violations
    verdict: Literal["publish", "human_review"]
    reasoning: str                           # brief justification
Decision rule:
  • verdict = "publish" iff: len(unsupported_claims) == 0 AND len(slop_phrases_detected) == 0 AND len(voice_concerns) <= 2.
  • Otherwise verdict = "human_review" and the failure handoff PR opens with the verdict JSON included for context.
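The decision rule, transcribed directly into a small pure function over the Verdict fields:

```python
def decide(unsupported_claims: list[str], slop_phrases: list[str],
           voice_concerns: list[str]) -> str:
    """Publish only when nothing is unsupported, nothing is slop,
    and at most 2 soft voice concerns were raised."""
    if not unsupported_claims and not slop_phrases and len(voice_concerns) <= 2:
        return "publish"
    return "human_review"
```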
Slop guard runs as a deterministic regex pass after Pass 3 (defense in depth — Pass 3’s slop detection is LLM-judged, slop guard is regex-judged).

Cost & Caching Strategy

Per the claude-api skill, the implementation must use prompt caching:
  • System prompt + voice guide + few-shot samples are cached across all three passes (same Anthropic API key, 5-min TTL). Pass 2’s call hits the cache established by Pass 1; Pass 3 also hits it.
  • Cycle doc is cached when present (used by Pass 2; also referenced by Pass 1’s reasoning).
  • Diff is the only large per-PR input that cannot be cached — it changes every PR.
Estimated per-PR cost (with caching):
  • Pass 1: ~15k input (mostly diff) + ~2k output, Opus 4.7 → ~$0.05
  • Pass 2: ~5k input (mostly cached) + ~3k output, Opus 4.7 → ~$0.04
  • Pass 3: ~5k input + ~1k output, Sonnet 4.6 → ~$0.01
  • Total per post: ~$0.10
At 100 PRs/week, this costs ~$10/week. At 1,000 PRs/week (extreme), ~$100/week. Cheap relative to the human time saved.
Token budget enforcement:
  • Hard cap: 50k input tokens per pass. Diffs over the cap are truncated by file (whole files preserved, tail dropped) with a “(truncated)” marker.
  • If the cap forces truncation of more than 30 % of the diff, the post adds an “Under the hood” disclaimer and links to the full diff on GitHub.
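A sketch of the by-file truncation rule above (whole files preserved in order, tail replaced with the marker). A crude four-characters-per-token estimate stands in for a real tokenizer, and truncate_diff is an illustrative name:

```python
def truncate_diff(file_diffs: list[tuple[str, str]],
                  cap_tokens: int = 50_000) -> list[tuple[str, str]]:
    """Keep whole per-file diffs until the cap; mark everything past it."""
    kept: list[tuple[str, str]] = []
    used = 0
    for name, diff in file_diffs:
        est_tokens = len(diff) // 4 + 1   # rough chars-to-tokens estimate
        if used + est_tokens > cap_tokens:
            kept.append((name, "(truncated)"))  # whole file dropped, marked
            continue
        kept.append((name, diff))
        used += est_tokens
    return kept
```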

Voice & Anti-Slop Guardrails

This is the single most important section of this cycle. The whole pipeline fails to deliver value if the output reads like AI slop. Three layers of defense at generation time, plus a post-publish sampling audit:

Layer 1 — Prompt-level (Pass 2 system prompt)

Voice rules embedded in the prompt:
  1. Lead with the change, not the announcement. Bad: “We’re excited to announce a new way to schedule interviews.” Good: “Interview scheduling now suggests time slots based on the candidate’s stated availability.”
  2. One specific over three abstractions. Bad: “powerful, intuitive, seamless experience.” Good: “creates a 30-minute slot in the next 48 hours that fits both calendars.”
  3. Show, don’t tell — screenshots beat adjectives. If you’d reach for an adjective (“clean”, “polished”, “intuitive”), reach for a screenshot instead.
  4. Active voice, present tense. Bad: “A new feature has been added that allows users to…” Good: “The candidate portal now shows pending interview requests at the top.”
  5. Names and numbers > generalities. Bad: “much faster”. Good: “p95 search latency dropped from 1.4 s to 240 ms.”
  6. Say what’s NEW, not what’s “now possible”. Bad: “It’s now possible to filter candidates by skill.” Good: “Candidate list has a Skill filter.”
  7. Don’t editorialize. No “we think this is going to be transformative.” Just say what shipped.

Layer 2 — Few-shot exemplars (Pass 2 in-context)

Five hand-curated reference posts in docs/changelog/_examples/:
| Example | Purpose |
| --- | --- |
| frontend-feature.mdx | A new user-facing feature with screenshots |
| backend-feature.mdx | A backend capability with no UI, but downstream impact |
| bug-fix.mdx | A reported bug, now fixed — terse, specific |
| refactor.mdx | An internal refactor with no behavior change — minimal post |
| multi-component-cycle.mdx | A cycle that touched 5+ surfaces — structured, with sections |
Each example is reviewed by a human and considered the gold standard for that PR archetype. Pass 2’s prompt selects the closest archetype based on Pass 1’s surface distribution.

Layer 3 — Deterministic slop guard

scripts/autonomous_changelog/synthesis/voice_guard.py runs after Pass 3, regex-only, no LLM. Seed blocklist (sample — full list in code, ~30 entries):
import re

SLOP_PATTERNS: list[re.Pattern] = [
    re.compile(r"\bwe(?:'re| are) (?:excited|thrilled|delighted|pleased) to\b", re.I),
    re.compile(r"\bseamless(?:ly)?\b", re.I),
    re.compile(r"\bsupercharg(?:e|ed|ing)\b", re.I),
    re.compile(r"\bworld[- ]class\b", re.I),
    re.compile(r"\bleverag(?:e|es|ing|ed)\b", re.I),  # except proper noun "Leverage"
    re.compile(r"\bcutting[- ]edge\b", re.I),
    re.compile(r"\bgame[- ]chang(?:er|ing)\b", re.I),
    re.compile(r"\brobust\b", re.I),
    re.compile(r"\bunder the hood\b", re.I),  # except as section title — handled by structural exclusion
    re.compile(r"\bblazing(?:ly)? fast\b", re.I),
    re.compile(r"\bnext[- ]generation\b", re.I),
    re.compile(r"\brevolutioniz(?:e|es|ing|ed)\b", re.I),
    re.compile(r"\bempower(?:s|ing|ed)?\b", re.I),
    # … ~17 more
]
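A minimal runner over a blocklist like the one above, including the structural exclusion mentioned for "under the hood": heading lines are skipped so a section title never counts as a hit. Only two sample patterns are repeated here, and find_slop is an illustrative name:

```python
import re

# Two sample patterns standing in for the full ~30-entry blocklist.
PATTERNS = [
    re.compile(r"\bseamless(?:ly)?\b", re.I),
    re.compile(r"\bunder the hood\b", re.I),
]

def find_slop(mdx: str) -> list[str]:
    """Return blocklist hits in body text, skipping markdown headings."""
    hits: list[str] = []
    for line in mdx.splitlines():
        if line.lstrip().startswith("#"):   # structural exclusion: headings
            continue
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                hits.append(m.group(0))
    return hits
```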
Soft heuristics (warnings, not failures, surfaced in Verdict.voice_concerns):
  • Sentence average length > 28 words
  • Adjective density > 18% of tokens (per nltk POS tag)
  • Opening sentence starts with “We ” (lead with the change, not the team)
  • More than 2 marketing adjectives in any single sentence
  • Use of em-dash chains (3+ in one paragraph — a known Claude tic)

Layer 4 — Sampling audit (post-publish)

Weekly: a human (rotating, owner = cycle owner this iteration) reads the last 5 published posts and rates each on:
  • Specificity (1–5)
  • Voice match to references (1–5)
  • Would I publish this if I’d written it? (yes/no)
Results logged to docs/changelog/_audit-log.md. When patterns emerge (e.g., posts about backend changes are too dry), the voice guide and few-shot examples are updated.

Mintlify Primitives Used (Showcase)

This cycle exercises the full Mintlify AI-native surface. This table is part of the cycle on purpose: the goal is not just “publish a changelog” — it is to demonstrate Mintlify’s AI-native publishing model end to end.
| Primitive | Usage in this cycle |
| --- | --- |
| MDX files | Native authoring surface — generator emits MDX directly, no transformation layer |
| <Update> component | Wraps each entry with a date label, description, and content slot — Mintlify’s first-class changelog primitive |
| <Frame>, <CardGroup>, <Card> | Hero images, “Under the hood” file links, “Try it” callouts |
| <CodeGroup>, <Tabs> | Multi-language code samples (rare in changelog, supported when needed) |
| Frontmatter | title, description, date, tags, pr, cycle, preview_url, authors — drives navigation, search, AI indexing |
| docs.json navigation | Adds a top-level Changelog tab; pages auto-grouped by month via glob pattern changelog/2026-04-* |
| Root AGENTS.md | Already configured by cycle 214; we extend with a Changelog section |
| Section AGENTS.md | docs/changelog/AGENTS.md declares: entries are immutable; Autopilot must not edit them; MCP queries should treat changelog as canonical “what shipped” source |
| Mintlify Autopilot | Runs as a secondary review on each generated MDX — catches markdown/component syntax errors before publish; if Autopilot rejects, generator falls through to human handoff |
| Auto-generated MCP server | Evaluators ask Claude/Cursor “what shipped this week?” via specs.flux.employinc.io/mcp; changelog entries are first-class MCP resources |
| llms.txt / llms-full.txt | Auto-includes changelog entries; downstream agents (support bot, sales bot) can ground answers in shipped features without a separate KB |
| Contextual buttons | Each entry surfaces “Copy”, “Open in Claude”, “Open in Cursor”, “MCP” buttons (configured in docs.json) |
| AI traffic analytics | Mintlify dashboard reports which agents read which entries and where they 404 — feedback loop for entry quality |
| Tags + filtering | Domain tags (hiring, distribution, frontend, agents, etc.) drive Mintlify’s tag-filter UI; readers can scope to their area of interest |
| Search | Mintlify’s built-in search indexes entries; tagged for relevance boost on cycle-related queries |
| Bi-directional sync | Generator commits MDX to main; Mintlify auto-deploys within seconds; PMs/engineers can hand-edit a published entry via Mintlify’s web editor and the change syncs back to the repo |

Failure Modes & Recovery

| Stage | Failure | Behavior |
| --- | --- | --- |
| Workflow trigger | Concurrent PR merges | Per-PR concurrency group; each PR processed independently |
| Workflow trigger | Re-fire on already-published PR (label change, manual workflow_dispatch, re-merge after revert) | Detect existing entry by PR number in frontmatter; overwrite only if both Pass 3 and slop guard pass on the new run; otherwise open a handoff PR with a diff-of-diffs explaining what changed |
| Context assembly | Cycle doc not found by branch regex | Continue without cycle doc; log warning; Pass 2 falls back to PR body for “why” |
| Context assembly | Diff is empty (revert, no-op merge) | Skip post entirely; post non-blocking PR comment “no changelog entry — no diff” |
| Context assembly | Diff > 5 000 lines / > 50 files | Generate post but flag as “large change — review recommended”; truncate diff input |
| Playwright | No user-facing routes detected | Generate post without screenshots (backend-only style) |
| Playwright | Browser crash / route 500 | Capture error-state screenshot; note in narrative; continue |
| Pass 1 — Facts | Empty fact list | Abort; PR comment “diff too sparse to summarize” |
| Pass 1 — Facts | JSON validation fails | Retry once with stricter schema reminder; second failure → handoff |
| Pass 2 — Narrative | Invalid MDX (parse fails) | Retry once with error feedback; second failure → handoff |
| Pass 2 — Narrative | Slop voice (caught by Pass 3) | Handoff PR opened with flagged phrases |
| Pass 3 — Verifier | Unsupported claims detected | Handoff PR opened with claim list and fact list for human review |
| Pass 3 — Verifier | Pass 3 itself errors | Default to handoff (fail closed) |
| Slop guard | Regex hit | Handoff PR opened with matched phrases highlighted |
| Mintlify writer | MDX file write fails | Retry; if persistent, handoff PR with content as artifact |
| Git commit/push | Push conflict | Pull latest, retry; on second failure, open handoff PR |
| Mintlify deploy | Mintlify Autopilot rejects MDX | Open handoff PR with Autopilot feedback included |
Invariant: a changelog generation failure never blocks the original PR’s merge. The merge has already happened. Worst case is a follow-up PR for human polish.

Quality Bar — Definition of “Not AI Slop”

A passing post must satisfy all of the following:
  1. Specificity — every benefit claim has a concrete artifact (screenshot, code link, number, named feature)
  2. Brevity — opening paragraph ≤ 3 sentences; full post ≤ 400 words for a typical PR (multi-component cycles get more)
  3. Voice — zero hits on the slop blocklist; ≤ 2 soft heuristic violations
  4. Grounding — every factual claim in the narrative maps to a fact from Pass 1 (verified by Pass 3)
  5. Visual — at least 1 screenshot if the diff touches user-facing routes
  6. Navigability — “Try it” link present and resolves (curl HEAD check at publish time)
  7. Cycle context — if a cycle doc exists, the “why” is reflected (verified by Pass 3 — the narrative must contain at least one phrase semantically aligned with the cycle doc’s overview)
These are tested in tests/test_synthesis_e2e.py against fixture PRs and enforced by Pass 3 + slop guard at runtime.
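The "Try it" link check in the quality bar (the curl HEAD equivalent) can be sketched as a small helper. Here urlopen is injectable so the check can be exercised without network access; try_it_link_resolves is an illustrative name:

```python
import urllib.request

def try_it_link_resolves(url: str,
                         urlopen=urllib.request.urlopen) -> bool:
    """HEAD-check a preview/staging URL at publish time; any error fails soft."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status < 400
    except Exception:
        return False
```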

Implementation Plan

Step 1 — Scaffold + GH Action skeleton (~0.5 day)

  • Create scripts/autonomous_changelog/ package with __init__.py, pipeline.py stub
  • Create .github/workflows/autonomous-changelog.yml with trigger + Python setup, calling pipeline.py --pr-number <N>
  • Wire dry-run mode (no commit, prints MDX to logs) for testing
  • Permissions: contents: write, pull-requests: write, issues: write
  • Concurrency group: changelog-pr-${{ github.event.pull_request.number }}

Step 2 — Context assembly module (~1 day)

  • Pydantic schemas in schemas.py (ChangelogContext, Fact, FactList, Verdict, FileLineRef)
  • context_assembly.py with parallel async fetches via asyncio.gather
  • Branch-name → cycle doc resolution
  • Diff truncation logic (whole-file preservation, tail-drop)
  • Unit tests with 3 fixture PRs (frontend feature, backend-only, large refactor)

Step 3 — Playwright screenshot runner (~1 day)

  • screenshot_runner.py with route discovery from diff paths
  • Playwright Chromium (pinned version), desktop + mobile viewports
  • Screenshot output to docs/changelog/images/<slug>/
  • Failure tolerance: per-route try/except, never aborts the pass
  • Skip when no user-facing routes touched

Step 4 — Three-pass synthesis (~2 days)

  • Anthropic SDK with prompt caching (per claude-api skill)
  • synthesis/facts.py (Pass 1) — Opus 4.7, structured output with Pydantic
  • synthesis/narrative.py (Pass 2) — Opus 4.7, MDX output, few-shot from _examples/
  • synthesis/verifier.py (Pass 3) — Sonnet 4.6, structured Verdict output
  • Prompt files in synthesis/prompts/ — reviewable, version-controlled
  • Token cost emitted per pass to GH Action summary
  • Integration test: end-to-end on cycle 365 plan PR fixture; manual review of output

Step 5 — Voice guard + slop blocklist (~1 day)

  • voice_guard.py with seed regex blocklist (~30 patterns)
  • Soft heuristics (sentence length, adjective density, opener pattern, em-dash chain)
  • Unit tests with ≥ 50 phrase test cases (positive + negative)
  • Integration: voice guard runs after Pass 3, results merged into Verdict

Step 6 — Mintlify integration (~1 day)

  • mintlify_writer.py — emits MDX with frontmatter + <Update> wrapper
  • docs/changelog/index.mdx — landing page with monthly grouping
  • docs/changelog/AGENTS.md — section-level AI customization (immutability, MCP grounding)
  • Edit docs/docs.json — add Changelog tab with auto-glob pages
  • _voice-guide.md — voice rules (consumed by Pass 2 prompt)
  • Verify Mintlify renders generated entries correctly (manual check on a deployed preview)

Step 7 — Failure mode + human handoff (~0.5 day)

  • failure_handoff.py — open follow-up PR with draft MDX + structured comment
  • PR comment integration on the original merged PR
  • Test by injecting deliberate slop into Pass 2 output

Step 8 — Golden set + voice samples (~1 day)

  • Hand-curate 5 reference posts in _examples/
  • Hand-write _voice-guide.md
  • Curate slop blocklist seed (~30 patterns from real changelog corpora)

Step 9 — Verification + observability + documentation (~0.5 day)

  • Operator/contributor guide: docs/guides/autonomous-changelog.md
  • Token cost monitoring (GH Action summary + Mintlify analytics)
  • Sampling audit log: docs/changelog/_audit-log.md template
  • Final E2E test: run pipeline on 3 historical PRs, manually review outputs
Total: ~8.5 engineer-days (≈ 1.5 weeks)

Verification Plan

  • .github/workflows/autonomous-changelog.yml triggers only on PR merge to main; a PR closed without merging does not fire it
  • Workflow completes in < 10 minutes for a typical PR (≤ 1,000 changed lines)
  • Generated MDX validates against Mintlify schema (Autopilot review passes or workflow rejects)
  • Slop blocklist catches all 50 phrase test cases in test_voice_guard.py
  • Pass 3 verifier catches injected hallucinations in 5 deliberate test cases
  • Generated post for cycle 365 plan PR (test sample) passes voice review by a human
  • Generated post for cycle 401 implementation PR (test sample) passes voice review by a human
  • Generated post for a hypothetical “fixes typo” PR is either suppressed (per quality bar) or is appropriately terse
  • docs/docs.json Changelog tab navigates to entries; entries render with <Update> wrapper
  • docs/changelog/AGENTS.md is detected by Mintlify (verified in Mintlify dashboard)
  • Mintlify MCP server returns changelog entries to a query “what shipped this week?”
  • llms.txt includes changelog entries (verified at the deployed /llms.txt URL)
  • Token cost per PR ≤ $0.20 (caching working — Pass 2 + 3 inputs largely cached)
  • Token cost monitor reports per-pass token usage in GH Action summary
  • Failure handoff opens a follow-up PR within 60 seconds of verifier rejection
  • Failure handoff PR contains the draft MDX and a structured list of flagged issues
  • Non-blocking PR comment posted on the original merged PR (link to either published entry or handoff PR)
  • Sampling audit log template exists at docs/changelog/_audit-log.md
  • Operator guide docs/guides/autonomous-changelog.md exists and explains: configuration, debugging, prompt iteration, audit cadence
  • make quality-gates green (lint + format + typecheck + tests)
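
The merge-only trigger in the first checklist item can be sketched as a workflow skeleton. Only the trigger shape is the point; the job contents are illustrative:

```yaml
# .github/workflows/autonomous-changelog.yml (sketch)
name: Autonomous changelog
on:
  pull_request:
    types: [closed]
    branches: [main]
jobs:
  generate:
    # `closed` fires for both merged and unmerged PRs; this guard keeps
    # the job merge-only, per the verification plan.
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # context assembly, three-pass synthesis, and the Mintlify MDX
      # commit would follow here
```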

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Generated narratives still feel AI-written despite the three-layer guard | Defeats the whole purpose | Few-shot from real human-curated samples; verifier slop check; deterministic regex guard; weekly sampling audit; iterate prompts when patterns emerge from audit |
| Pass 3 verifier false positives block legitimate posts | Toil — every PR needs human polish | Calibrate verdict threshold against 20 hand-labeled fixtures before launch; track override rate as quality signal; allow author label changelog:approve-handoff to publish a handoff draft as-is |
| Pass 3 verifier false negatives let slop through | Quality leak | Sampling audit (weekly, last 5 posts); deterministic regex guard as defense in depth; voice guide updated quarterly based on audit findings |
| Cost per PR exceeds estimate (large diffs, many PRs) | Token spend | Hard 50k input cap per pass with truncation; Pass 3 uses cheaper Sonnet; cost emitted to GH Action summary; alert if weekly spend > $50 |
| Cycle doc not found for a branch (legacy or non-cycle work) | Loss of “why” context | Fallback to PR body and linked issues; over time cycle 381 enforces cycle docs per cycle; document the fallback in operator guide |
| Diff is too large or too unfocused to summarize meaningfully | Generic post | Skip post (diff > 5,000 lines or > 50 files) and open a non-blocking PR comment “large change — manual changelog recommended”; provide a starter template |
| Race condition: two PRs merge in same minute | Filename collision | Filename uses merge-commit SHA suffix on collision; fall back to YYYY-MM-DD-<slug>-<sha7>.mdx |
| Mintlify Autopilot accidentally edits historical posts | Loss of immutable record | docs/changelog/AGENTS.md declares entries immutable; entries’ frontmatter contains immutable: true; Autopilot configuration set to ignore the directory by default |
| Playwright dependency makes CI slow or flaky | Workflow latency / failure | Pin Playwright Docker image; per-route try/except so a single bad route never aborts; investigate Mintlify preview screenshot service as a Phase 2 optimization once cycle 214’s Mintlify implementation lands |
| Author objects to autogenerated post about their PR | Process friction | changelog:skip label suppresses generation; changelog:edit label triggers handoff (draft only); Mintlify web editor lets author hand-edit a published entry, which syncs back |
| Voice guide drift — what feels SOTA today feels stale in 6 months | Long-term staleness | Quarterly review of _voice-guide.md and _examples/ against current SOTA changelogs (Linear, Resend, etc.); voice guide is version-controlled; refresh is a one-day chore |
| Generator publishes sensitive details (e.g., security fix details before disclosure) | Disclosure risk | security label on a PR routes to handoff PR (no auto-publish); Pass 1 prompt instructed to never describe vulnerability mechanics |
| First N posts will need iteration after launch | Early-life messiness | Reserve a follow-up cycle (after first 20 posts ship) for prompt + voice-guide iteration informed by audit |
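
The collision fallback in the race-condition row above can be sketched as follows; the slug rule and function signature are this sketch's assumptions:

```python
import re
from datetime import date

def entry_filename(title: str, merged_on: date, sha: str,
                   existing: set[str]) -> str:
    """Default filename is date + slug; on collision (two PRs merged the
    same day with colliding titles) fall back to a 7-char SHA suffix."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    base = f"{merged_on.isoformat()}-{slug}"
    if f"{base}.mdx" not in existing:
        return f"{base}.mdx"
    return f"{base}-{sha[:7]}.mdx"
```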

Phase 2 Roadmap (Future Cycles)

| Phase 2 Item | Surface | Likely Cycle Number |
| --- | --- | --- |
| Slack publish notification | Cross-post Mintlify URL + hero image to a Slack channel on publish | TBD — coordinate with cycle 400’s surface |
| Weekly “Shipped” roll-up | A separate generator that consumes the week’s per-PR posts and writes a narrative weekly summary for external readers | TBD |
| Eval harness auto-trigger | Generated “Try it” steps fed into eval scenario nominator | Ties to cycle 209.7 (post-merge validation + evals) |
| Internal-only “evaluator notes” section | Role-gated content via Mintlify auth tiers; technical detail for testers | TBD |
| Customer email digest | Monthly newsletter sourced from changelog tags; uses email service | TBD |
| Multi-PR release-level summaries | Group merged PRs in a release window into a single post (vs per-PR) | TBD |
| Author-edit loop | /changelog edit <hint> PR comment triggers regeneration with author hint | TBD if friction emerges |
| Translation (es/pt-BR for pilot regions) | Mintlify supports i18n; auto-translate via Claude as a fourth pass | TBD post-pilot |
| Visual diff comparison | Use the screenshot pass to capture before/after of the same route across the merge | TBD if value is demonstrated |

Relationship to Other Cycles

  • Cycle 214 — Spec-driven dev with Mintlify (REQUIRED dependency). Provides the Mintlify platform, docs.json navigation pattern, root AGENTS.md, and Autopilot configuration that this cycle extends. This cycle cannot ship until 214’s Mintlify implementation PR has merged.
  • Cycle 400 — /changelog Slack digest skill (ADJACENT, complementary). Slack-facing team digest of merged PRs. Different surface (Slack vs Mintlify), different cadence (daily cron vs per-PR merge), different audience (team vs external + evaluators). Phase 2 will tie the two together (publish notification cross-posts to Slack).
  • Cycle 401 — /standup skill (ADJACENT, similar shape). Both cycle 400 and 401 are interactive Claude Code skills. Cycle 451 is fully autonomous (GH Action only) but shares conventions for gh api PR fetching and conventional-commit grouping.
  • Cycle 209.7 — Post-merge validation + evals (PHASE 2 INTEGRATION TARGET). The “Try it” sections this cycle generates can feed back into eval scenario nomination — when the changelog says “candidate portal sidebar collapses at <768px”, that becomes a candidate eval fixture.
  • Cycle 221 — Chief Engineer review (COMPLEMENTARY). CE reviews quality of code; this cycle reports quality of shipped product. Both feed the AI-native quality loop.
  • Cycle 380 — UI design skill (REFERENCED). Screenshot conventions and docs/design/cycle-{N}/ artifact patterns inform this cycle’s docs/changelog/images/<slug>/ structure.
  • Cycle 365 — Pilot evals harness (REFERENCED). LLM-as-judge model selection pattern (distinct judge model from agent model) informs Pass 3 verifier model choice (Sonnet for Pass 3 vs Opus for Pass 1/2).
  • Cycle 381 — Issue-number cycle IDs (CONVENTION). This cycle follows the new convention; cycle number 451 = issue 451.

AI-Native Manifesto Alignment

| § Principle | How This Cycle Embodies It |
| --- | --- |
| §0 Uncompromising Quality | Three-pass synthesis with verifier; deterministic slop blocklist; sampling audit; no shortcuts on output quality. The whole cycle exists because terse PR descriptions are not SOTA enough. |
| §1 One Mind, Full Context | Generator reads full diff + cycle doc + CI results + linked issues + screenshots — holistic context per post, not file-by-file. |
| §3 Agentic Architecture | Three Claude passes are agents with distinct roles (Facts extractor, Narrative writer, Verifier judge), not LLM wrappers. Each pass observes (reads inputs), reasons (within prompt rules), acts (produces structured output). |
| §5 Observability-Native | Token cost per post tracked and emitted to GH Action summary; verifier pass-rate tracked; sampling audit results logged to _audit-log.md; Mintlify AI traffic analytics tracks consumption. |
| §8 100% AI-Generated Code with Safety Nets | Generator is itself AI-generated (this cycle and its implementation). Safety nets: deterministic slop guard, verifier pass, sampling audit, human handoff fallback. |
| §10 Spec-Driven Traceability | Cycle doc → implementation PR → autonomous changelog entry → Mintlify-published — full traceability from spec to public surface. The changelog entry frontmatter cites cycle and PR. |
| §11 Cross-Model Review | Pass 1 (Opus) extracts; Pass 2 (Opus, different temperature) writes; Pass 3 (Sonnet) reviews — cross-model verification within a single workflow. |

Notes

  • This cycle is meta in a productive way: when the implementation PR for this cycle merges, the resulting changelog entry will be the first autonomous post, generated by the system describing the system. That’s the canonical validation — if Cycle 451’s own changelog post is good, the system works.
  • The post is canonical; the PR description is not. Authors can put rough notes in PR bodies (or skip them) and trust the generator to polish the public-facing artifact. This should reduce PR-description toil over time.
  • Treat the first 20 posts as a prototype run. Reserve a follow-up cycle for prompt and voice-guide iteration informed by sampling audit results — the slop blocklist will need expansion as new patterns emerge.
  • AGENTS.md policies open the door for downstream agents (support bot, sales bot) to consume changelog entries via Mintlify’s MCP server to answer “does Flux do X?” — turns the changelog into a queryable product knowledge base. Phase 2.
  • The dependency on cycle 214’s Mintlify implementation PR is hard. If that PR is delayed, this cycle’s code phase waits. The plan PR (this doc) does not depend on it.
  • Cost is not a meaningful constraint. The pipeline costs ~$0.10 per post, which is dwarfed by the human time saved. The only real budget concern is keeping verifier false-positive rates low so engineers don’t spend 10 minutes per handoff PR.