The first version of my blog's review pipeline ran four agents in parallel: tone, structure, technical accuracy, and naturalness. Everything came back at once. When a post was structurally a file-by-file code tour instead of an argument, no agent flagged it as such. What came back were surface findings on a structurally broken draft.
I polished the surface. Paragraph rhythm, word choices, a broken link. The post didn't get better. The structure was untouched, and the next review round returned fresh surface findings on the same broken foundation.
When nits are easier than structure
This failure mode has a recognisable shape:
- A function in the wrong module receives cleaner variable names.
- A UI built on the wrong workflow gets a better loading spinner.
- A denormalised table that should have been split gets a new index.
The surface improves. The foundation doesn't move.
Structural and surface feedback have a dependency order, and flat lists destroy it. Renaming a variable takes thirty seconds and feels productive. Reconsidering which module a function belongs in requires stepping back from the code entirely. When both sit in the same list, the thirty-second fix wins every time.
The verdict that blocks everything downstream
I split the pipeline into two phases. The first is a single agent evaluating one question: does this post have a transferable argument, or is it a code tour? The verdict is binary: PASS or REWRITE. A REWRITE blocks the surface review entirely. No tone feedback, no reference checks, no pattern scan. The author gets one finding: the structure doesn't hold.
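The two-phase flow can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: `review_post`, `Review`, and the callable types are hypothetical names, the agents are stand-in functions, and the surface reviewers run one after another here purely to keep the sketch short.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    PASS = "PASS"
    REWRITE = "REWRITE"


@dataclass
class Review:
    verdict: Verdict
    findings: List[str] = field(default_factory=list)


# Hypothetical signatures: the gate returns a verdict for a draft,
# and each surface reviewer returns a list of findings.
StructuralGate = Callable[[str], Verdict]
SurfaceReviewer = Callable[[str], List[str]]


def review_post(
    draft: str,
    gate: StructuralGate,
    surface_reviewers: List[SurfaceReviewer],
) -> Review:
    """Run the structural gate first; run surface review only on PASS."""
    verdict = gate(draft)
    if verdict is Verdict.REWRITE:
        # Surface reviewers are never called, so there are no nits
        # competing with the one structural finding.
        return Review(verdict, ["The structure doesn't hold."])

    findings: List[str] = []
    for reviewer in surface_reviewers:
        findings.extend(reviewer(draft))
    return Review(verdict, findings)
```

The point of the early return is that a REWRITE does not merely rank the structural finding first. The surface findings are never generated at all.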
The first post through the gate got a REWRITE. A file-by-file code tour, no transferable principle. No nits to hide behind this time. The structural problem was the only feedback, and it got fixed.
When structural and surface feedback arrive together, the surface wins. Not because reviewers lack discipline, but because the thirty-second fix is always right there. The gate works because it removes the option. The downstream work doesn't exist until the structure passes.