How parallel AI execution compressed an 8-to-12-hour backlog refactor into 70 minutes — a case study in agentic workflow leverage for operations leaders.

A week ago I needed to make a small policy decision stick across 8 repositories and 719 already-open tickets. The honest estimate to do it by hand was 8 to 12 hours. We did it in 70 minutes.

This post is about how — and why that gap matters for anyone running a business where execution capacity, not strategy, is the bottleneck.

The work — what does 719 tickets even mean?

Every business has a version of this problem. There’s a way you want things tagged, classified, named, formatted — a small consistency that pays off in real ways downstream. Reports stop lying. Searches stop missing things. New people stop having to ask. The catch is that the consistency only pays off when it’s retroactive — applied not just to the tickets you’ll create tomorrow, but to the 719 that already exist.

The specific decision in our case was a lifecycle classification: every open ticket across our engineering workspace was supposed to carry exactly one label answering “what kind of work is this?” — a next action, a multi-step project, something we’re waiting on someone else for, a someday/maybe, or pure reference material. Five buckets. One label per ticket. Mandatory. (For readers familiar with David Allen’s Getting Things Done methodology, this is the standard GTD lifecycle applied to a software backlog.)
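
For the engineering-curious, the “exactly one label” rule is easy to state as a check. The sketch below is illustrative only: the label strings, the `Lifecycle` enum, and the `lifecycle_label` helper are hypothetical names chosen for this example.

```python
from enum import Enum


class Lifecycle(Enum):
    """The five buckets described above. The string values are assumed names."""
    NEXT_ACTION = "next-action"
    PROJECT = "project"
    WAITING_FOR = "waiting-for"
    SOMEDAY_MAYBE = "someday-maybe"
    REFERENCE = "reference"


def lifecycle_label(ticket_labels):
    """Return the ticket's single lifecycle label, or raise if the rule is broken.

    `ticket_labels` is any iterable of label strings on a ticket; labels that
    are not lifecycle labels (for example "bug") are ignored.
    """
    valid = {bucket.value for bucket in Lifecycle}
    found = [label for label in ticket_labels if label in valid]
    if len(found) != 1:
        raise ValueError(f"expected exactly one lifecycle label, found {found}")
    return Lifecycle(found[0])
```

A ticket labeled `["bug", "waiting-for"]` passes; a ticket with no lifecycle label, or with two, raises an error and can be queued for classification.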

The classification itself isn’t hard. A senior person can read a ticket title and a one-line description and pick the right bucket in about 30 seconds. The hard part is doing it 719 times. Across 8 different repositories. Without burning out, drifting in judgment quality somewhere around ticket 300, or losing the will to keep going.

Most operations leaders I talk to have a backlog shaped exactly like this. Not because the work is hard. Because there’s so much of it.

Why work like this normally doesn’t get done — the execution sandwich

A senior person earning real money cannot spend a Tuesday afternoon picking from five labels for 719 tickets. The math doesn’t work, and they’d quietly hate every minute of it. So the work gets pushed to junior staff. But a junior person doesn’t yet have the judgment to make consistent calls on ambiguous tickets, which means the senior person ends up reviewing the output anyway, and the cost goes up, not down.

The other escape route is to give up on consistency for the existing tickets and just start doing it right on tomorrow’s tickets. This is what most teams pick. It feels pragmatic. It is also why every business eventually has search results full of unlabeled, half-categorized old work that nobody knows whether to trust. The retroactive job stays undone forever.

The version of this problem in your business probably isn’t tickets. It might be 4,000 customer records with inconsistent industry codes. Or 600 product photos that never got the right alt-text. Or three years of expense receipts where the GL category was guessed at the time and never corrected. Same shape, different surface. A senior judgment call, repeated thousands of times, that’s too expensive to pay senior wages for and too important to delegate to junior staff cold.

The reason this kind of work historically doesn’t get done is that the unit cost of human judgment is roughly fixed. You can’t pay someone less to make the 600th decision than you paid them for the first. So the budget runs out before the backlog does.

What an hour looked like — parallel work pretending to be serial

What changed is that the unit cost of some kinds of judgment is no longer fixed. Specifically: judgment that an AI can render well enough to be reviewed in batches instead of one at a time.

The shape we ran was a fan-out: dispatch the work to nine parallel workers, each handling a different slice. Five workers in the first wave, four in the second. Each worker got a written brief — three paragraphs describing exactly what to do, what counts as success, and what to skip. Each worker was given its own isolated copy of the workspace (a Git worktree, for the engineering-curious) so it could read and write without colliding with the others. Each worker reported back when it was done, with an audit trail of what it had touched.
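
For the engineering-curious, the shape of that fan-out can be sketched in a few lines. This is a minimal illustration, not the launcher we ran: `Brief`, `Report`, `run_worker`, the branch and path naming, and the use of threads are all assumptions, and the worker is a stand-in that only records what it was handed. The `git worktree add -b <branch> <path>` command is built but never executed here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Brief:
    """Written instructions for one worker. All field names are hypothetical."""
    worker_id: str
    task: str
    tickets: list


@dataclass
class Report:
    """What a worker hands back: its id, its worktree command, and an audit trail."""
    worker_id: str
    worktree_cmd: list
    touched: list = field(default_factory=list)


def worktree_command(worker_id):
    # `git worktree add -b <branch> <path>` creates an isolated checkout on a
    # new branch. The path and branch naming scheme here are assumptions.
    return ["git", "worktree", "add", "-b", f"fanout/{worker_id}", f"../wt-{worker_id}"]


def run_worker(brief):
    # Stand-in for the real agent: it only records which tickets it was given.
    # A real worker would run inside its worktree and apply labels there.
    report = Report(brief.worker_id, worktree_command(brief.worker_id))
    for ticket in brief.tickets:
        report.touched.append(ticket)
    return report


def run_waves(waves):
    """Run each wave's briefs in parallel; start a wave only after the previous one ends."""
    reports = []
    for wave in waves:
        with ThreadPoolExecutor(max_workers=max(1, len(wave))) as pool:
            reports.extend(pool.map(run_worker, wave))
    return reports
```

Calling `run_waves([first_five_briefs, last_four_briefs])` mirrors the five-then-four dispatch: the second wave starts only once every worker in the first has reported back, which is what lets a later worker audit an earlier one.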

The first wave kicked off at 9:44 AM. By 10:53, sixty-nine minutes later, every ticket across every repository was correctly classified, the supporting infrastructure was built and merged, and the change was a permanent part of the system: future tickets would self-classify at creation, and we’d never have to do this retroactive job again.

It wasn’t smooth. Something broke in the first ten minutes. The command that launched each worker into its own window had a quoting bug: an apostrophe in one of the instructions confused the shell, and several of the workers came up with mangled instructions instead of clean ones. We caught it on screen, fixed the launcher, and re-dispatched. The workers that had already finished kept their good output; the ones that hadn’t finished were restarted with corrected briefs. Total cost of the bug: about four minutes and one duplicate dispatch.
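
This class of bug is easy to reproduce. The sketch below is an assumption-laden illustration rather than our launcher: `launch-worker` and its `--brief` flag are hypothetical, and the brief text is invented. It shows why wrapping free text in single quotes fails and how Python’s standard `shlex.quote` avoids it.

```python
import shlex

brief = "Skip tickets that don't have a description."

# Naive launcher: wrapping the brief in single quotes breaks as soon as the
# brief itself contains an apostrophe. `launch-worker` is a hypothetical command.
naive = f"launch-worker --brief '{brief}'"

# Safer: let shlex.quote escape the text for a POSIX shell.
safe = f"launch-worker --brief {shlex.quote(brief)}"


def parses_cleanly(command):
    """Return True if shlex can tokenize `command`, i.e. its quotes are balanced."""
    try:
        shlex.split(command)
        return True
    except ValueError:
        return False
```

Here `parses_cleanly(naive)` is false because the apostrophe leaves a quote unclosed, while `shlex.split(safe)` hands the brief back intact as a single argument. Passing an argument list to `subprocess.run` without `shell=True` sidesteps shell quoting altogether.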

Another worker hit a safety gate partway through. Our setup has a guardrail that prevents the AI from modifying its own configuration without explicit human authorization, and this worker tripped it. The guardrail did its job: the worker stopped, surfaced what it was trying to do, and asked for a yes-or-no. I said yes; the worker resumed. That guardrail wasn’t decorative. The kind of work AI is now doing on your behalf needs friction in the right places, not just speed.
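
A gate like this can be very small. The following is a minimal sketch under stated assumptions, not a description of our actual guardrail: the protected path `.agent/config`, the `guarded_write` function, and the `approve` callback are all hypothetical.

```python
from pathlib import PurePosixPath

# Paths a worker may not change without a human saying yes. Hypothetical examples.
PROTECTED_PREFIXES = (PurePosixPath(".agent/config"),)


class ApprovalDenied(Exception):
    """Raised when a human declines a protected change."""


def is_protected(path):
    """True if `path` is a protected directory or lives anywhere beneath one."""
    candidate = PurePosixPath(path)
    return any(prefix == candidate or prefix in candidate.parents
               for prefix in PROTECTED_PREFIXES)


def guarded_write(path, content, write, approve):
    """Write `content` to `path`, pausing for human approval on protected paths.

    `write(path, content)` performs the change; `approve(message)` asks a human
    and returns True or False. Both are supplied by the caller.
    """
    if is_protected(path):
        message = f"Worker wants to modify protected file {path}. Allow?"
        if not approve(message):
            raise ApprovalDenied(path)
    write(path, content)
```

Ordinary files go straight through; anything under a protected prefix stops and waits for a yes-or-no, and a “no” leaves the file untouched.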

A third worker, on a different slice of the job, applied the wrong default label to about 17 tickets in a corner case its brief hadn’t anticipated. A later worker — auditing the first one’s output — caught it and corrected it. That’s not graceful failure handling. That’s the system working as designed. You don’t trust any single agent to be correct; you build the audit step into the next wave.
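
The audit step can be sketched as a second pass that forms its own opinion and logs every disagreement. This is an illustrative assumption about how such a pass might look: the `audit` function, the `recheck` callback, and the choice to let the auditor’s label win are hypothetical.

```python
def audit(first_pass, recheck):
    """Compare first-wave labels with an independent second opinion.

    `first_pass` maps ticket id -> label applied in wave one.
    `recheck(ticket_id)` returns the label the auditor would apply.
    Returns the corrected mapping plus a log of every change made.
    """
    corrected = dict(first_pass)  # copy, so the first wave's record is preserved
    log = []
    for ticket_id, label in first_pass.items():
        second_opinion = recheck(ticket_id)
        if second_opinion != label:
            corrected[ticket_id] = second_opinion
            log.append({"ticket": ticket_id, "was": label, "now": second_opinion})
    return corrected, log
```

The log matters as much as the correction: it is what lets a human review the handful of disagreements instead of re-reading every ticket.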

By the end of the 70 minutes, two infrastructure changes had been merged, all 719 tickets carried their lifecycle label, and the system had been permanently upgraded to prevent the retroactive problem from ever recurring. The bug was logged. The guardrail event was logged. The auditor’s correction was logged.

The serial-equivalent estimate — 719 tickets at roughly 30 seconds each, plus the infrastructure work, plus the context-switching tax of doing the same judgment call over and over for hours — comes in around 8 to 12 hours of focused senior time, almost certainly spread across two or three days. We didn’t take a day. We took an hour.
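
The arithmetic behind that estimate is worth making explicit. The snippet below uses only the figures already quoted (719 tickets, about 30 seconds each, 70 minutes elapsed, an 8-to-12-hour serial estimate); the variable and function names are just for illustration.

```python
TICKETS = 719
SECONDS_PER_TICKET = 30          # "about 30 seconds" per classification
ELAPSED_MINUTES = 70             # wall-clock time of the parallel run

classification_hours = TICKETS * SECONDS_PER_TICKET / 3600


def speedup(serial_hours, elapsed_minutes=ELAPSED_MINUTES):
    """Ratio of the serial estimate to the observed wall-clock time."""
    return serial_hours * 60 / elapsed_minutes


print(f"classification alone: {classification_hours:.1f} h")      # 6.0 h
print(f"speedup vs 8 h estimate: {speedup(8):.1f}x")              # 6.9x
print(f"speedup vs 12 h estimate: {speedup(12):.1f}x")            # 10.3x
```

Classification alone accounts for roughly six hours; the rest of the 8-to-12-hour range is the infrastructure work and the context-switching tax. Against 70 minutes, that is a speedup of roughly 7x to 10x.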

What this means for your business — the bottleneck isn’t where you think

The interesting question isn’t “how did the AI do it faster?” That answer is technically interesting but operationally cheap: it ran in parallel, it didn’t get tired, and the unit cost of its judgment is near zero. Anyone in your industry can buy the same baseline capability inside the next 12 months.

The interesting question is “why hasn’t this kind of leverage shown up in your operations yet?” And the honest answer, almost always, is that the shape of the work isn’t ready for it. The judgment isn’t written down anywhere reviewable. The data is scattered across systems that don’t talk. There’s no audit trail to catch a mistake. There’s no isolation between workers — they’d corrupt each other’s output if you ran them in parallel. The brief that tells a worker what “success” looks like has never been written, because the senior person doing the work for the last decade just knew.

Almost every business is going to discover, in the next year or two, that the bottleneck on AI-leveraged execution isn’t the AI. It’s the operational scaffolding around the AI. Briefs. Isolation. Audit. Guardrails. The unglamorous infrastructure that makes the difference between “the AI did a day’s work in an hour” and “the AI did a day’s worth of damage in an hour.”

The fan-out pattern we ran isn’t unique to engineering tickets. It travels to anything where the work is judgment-heavy, parallelizable, and has a clear definition of done. Categorize 4,000 records. Audit 600 photos. Reclassify three years of receipts. Review 1,200 lead notes for follow-up flags. The not-urgent-but-important quadrant of the classic Eisenhower matrix stops being where work goes to wait, because the cost of working that quadrant has collapsed. Every one of these is a 70-minute job dressed up as a multi-week project.

It only stays dressed up that way until somebody puts in the scaffolding. After that, the work disappears in an hour.

That’s the leverage. It’s not coming. It’s here. The question is whether your operations are shaped to receive it.

Related reading

Our automation services: the kind of operational scaffolding (briefs, isolation, audit trails) that makes AI fan-out safe to deploy on real business processes.
Our AI development services: when the scaffolding itself is the build (custom agents, RAG systems, multi-step workflows).
The three DNS records that decide if your email lands: sister piece in our knowledge base; same shape of post (unglamorous infrastructure, outsized impact) applied to email deliverability.

The Hour That Cleared the Backlog

Talk about your backlog.
