发布于2026年5月19日

I Almost Built a Custom AI Agent Framework. Then I Found Out Anthropic Already Shipped One.

A real story of building four layers of custom AI agent infrastructure for a multi-client agency, then discovering Anthropic and a half-dozen GitHub repos had already solved the same problem. The lesson cost a few weekends of code. The fix is hybrid adoption, not from-scratch reinvention.

作者：Frank Yao

摘要

Spent six months building custom infrastructure for an autonomous AI Chief of Staff across seven clients. Hit the honest gap — my system was reactive, not truly autonomous. Searched GitHub before building another layer. Found Anthropic's Claude Code /loop (production since March 2026), the anthropics/cwc-long-running-agents reference harness, and five community frameworks that already solve the persistent-agent-loop problem. The pragmatic answer is hybrid — native primitives plus your existing data layer plus selected community patterns. Don't build what's already shipped.

I Almost Built a Custom AI Agent Framework. Then I Found Out Anthropic Already Shipped One.

Quick Check

对还是错：AI 工具将在 2 年内完全取代 SEO 的需求。

This is a story about building the wrong thing for six months and the moment I stopped.

The setup

I run a small agency. Seven active clients, each with their own SEO, content, e-commerce, and operations stack. As the agency grew, the operational coordination load grew with it. I started building an AI Chief of Staff persona to absorb the scheduling, status-tracking, and routine execution work.

This Chief of Staff lives across multiple substrates. There's a local file system with structured event logs, a Neon Postgres database that mirrors those events, a Vercel-hosted dashboard with role-based access for our operator team, and a layer of cron jobs for routine work like blog publishing, client pulses, and inbox monitoring. It works — until it doesn't.

The thing it didn't do well: stay autonomous across days. I had built a scheduled-trigger system that fires Claude when a cron tick says so. What I had not built was something that thinks in between ticks.

The four-layer stack I had built

Before the realization, my system was:

Windows Task Scheduler entries for local polling — checking IMAP for client replies, running monthly scorecards. Reliable when my laptop is awake. Dies when it sleeps.
Vercel Cron jobs for cloud-side automation — weekly client digests, daily site-health checks, six-hour blog autopublish cycles. Always-on. Production-stable.
A JSONL event log plus a Neon mirror table — every decision, dispatch, delivery, and blocker appended in real time. Cross-session memory.
Session-start bootstrap files — every fresh Claude session reads the bootstrap, which pulls in commitments, pending asks, and recent events. No more amnesia at the start of every chat.

Four layers. Roughly six months of building. It works in the narrow sense — nothing gets lost, every routine task runs on schedule, and the operator side of the agency has clear visibility into what is happening.

The honest gap

The gap is what nobody talks about when they show you their AI workflow demo. My system was reactive. Every cron tick fires a function. Every function does its job and exits. Aaron does not think between ticks. There is no continuous attention pointed at the next-best-action question.

This matters most when work spans days. A reporter pitch goes out Monday. The reply comes Thursday. In between, no version of the agent is looking at the calendar and saying: if no reply by Wednesday, draft a follow-up. The cron either fires the reminder on the schedule I pre-baked, or it does not.

It also matters when work hits unexpected branches. A catalog rewrite I shipped this week assumed every product had structured spec data in tags. Some did not. A truly autonomous system would notice the gap, fall back to a different parser, log the gap as a finding, and email me only if the gap exceeds a threshold. My system did a binary success-or-fail and moved on.

The honest cost of this gap is the question my partner kept asking me in different forms: do I have to nag you about this?

The discovery moment

I sat down to spec out the next layer. The plan looked something like: build a per-client pending-items registry, run a master sweep twice a day to poll readiness and act on ripe items, define escalation thresholds that bubble stuck work back up to me by email.

Halfway through writing the spec, I asked the right question: is anyone in the open-source community already solving this? Twenty minutes of searching changed the plan.

What I found

1. Claude Code has a native autonomous loop

Anthropic shipped /loop as a built-in Claude Code feature in March 2026. The mechanic is exactly what I was about to build: tell Claude what to check, how often, for how long, and it runs the task on repeat. Default interval is ten minutes. It supports dynamic re-pacing — the agent can self-schedule its next wake-up based on what it found.

This is not a wrapper around cron. It is a first-class autonomous-agent primitive in the runtime. The Pragmatic Engineer survey of about a thousand engineers in March 2026 reported that 55 percent of developers are now running AI agents as part of their primary workflow, not just autocomplete. Claude Code is the leading AI coding tool, having overtaken both GitHub Copilot and Cursor over the prior eight months.

2. Anthropic publishes the reference architecture

anthropics/cwc-long-running-agents on GitHub

This repository is Anthropic's own example harness for the long-running-agent pattern, built directly on two research papers they published — Effective Harnesses for Long-Running Agents in November 2025 and Harness Design for Long-Running Application Development in March 2026. If you want to know how Anthropic thinks about agent durability, context-window management, and stateful work spanning days, this is the canonical reference.

3. The community has shipped multiple production-ready frameworks

seolcoding/nonstop-agent — a 24/7 autonomous agent harness for Claude, production-ready for continuous coding work across multiple sessions
Bande-a-Bonnot/Boucle-framework — loop runner plus persistent memory (Broca) plus MCP server for multi-agent memory sharing plus approval gates
eddiearc/long-running-harness — a Claude Code skill focused on maintaining continuity across context windows, based on Anthropic's research
shareAI-lab/learn-claude-code — a nano agent harness where agents schedule their own future tasks. This is the self-scheduling pattern in its purest form
ruvnet/ruflo — multi-agent orchestration platform with swarms, self-learning, and RAG integration

Why this matters

Six months of building got me roughly seventy percent of what /loop ships natively. The remaining thirty percent is the hard part — the agent loop itself, the dynamic re-pacing, the context-window continuity. These are not weekend builds. They are research-grade primitives that Anthropic has spent the last year designing, testing, and shipping.

The right read is not that I wasted six months. The data layer I built — event logs, cross-session memory, per-client structured state — still has real value because it is shaped to my agency. What I did waste was every hour spent designing the loop primitive itself when it was already on the shelf.

The hybrid architecture I am adopting

Going forward, the stack looks like this:

Claude Code /loop is the always-on agent heartbeat. Native primitive, no custom code. Fires once at session start, self-paces from there.
My existing per-client JSONL pending registry is the shared state across ticks. The loop reads this file on every wake-up and decides what to act on.
anthropics/cwc-long-running-agents patterns apply to checkpoint and handoff strategy on top of my existing handoff files.
eddiearc/long-running-harness contributes the context-window-pressure auto-summary pattern. When a tick gets close to context limit, the agent writes a handoff summary that the next tick reads on wake-up.
Bande-a-Bonnot/Boucle approval-gate pattern handles the cases where I must stay in the loop for sign-off — client-facing emails, spend over a threshold, anything destructive.

Result: native primitives for the parts Anthropic does best. Custom data layer for the parts my agency does uniquely. Community patterns for the connective tissue. No reinvented wheels.

Lessons — when to build, when to adopt

If this story has a generalizable lesson, it is the one I should have applied earlier: every time you are about to build infrastructure for an AI workflow, spend twenty minutes searching the open-source ecosystem first.

Three rules I am working with now:

If the primitive is research-grade — agent loop, memory architecture, planning algorithms — almost certainly adopt. The compounding research effort behind these is years, not weekends.
If the primitive is data-shaped to your business — client registries, vertical-specific schemas, your operator's actual workflow — almost certainly build. Nobody else has your shape.
If the primitive is connective tissue — webhooks between two systems, format adapters, simple polling loops — adopt the shape from open source but write the code yourself. The cost of debugging someone else's adapter often exceeds the cost of writing your own.

What I am shipping now

Tonight I committed to the hybrid plan and started the build. Three sessions of work to get there:

Session one — wire up Anthropic's /loop with my existing per-client pending registry. Run it on one client first to prove the integration.
Session two — apply the eddiearc/long-running-harness continuity pattern. Auto-generated handoffs at context-window pressure, replacing my manual session-start ritual.
Session three — adopt the Boucle approval-gate pattern. Roll out to all seven clients. The agent now runs continuously across the portfolio, interrupts me only when a client-facing or high-stakes decision lands.

After three sessions, the Chief of Staff is a real autonomous worker, not a scheduled reactive function. The nag burden on me drops to roughly zero. The decisions that genuinely need my input show up in my inbox in real time, not buried in a chat history I forgot to read.

The bigger picture

There is a wave of build-your-own-autonomous-AI-agent content circulating right now. Most of it is from people who have not yet read Anthropic's research papers. The harness work — the part that makes an agent stay coherent across days and not drift, not fabricate, not lose track of what it was doing — is genuinely hard. It is also already done.

If you are building AI workflows for your own work, your team, or your clients, the most leveraged hour you can spend this week is reading the two Anthropic research papers and skimming the cwc-long-running-agents repository. Then build the data layer that is unique to your business on top. That is the line between AI tools you use and AI systems that actually replace work.

My honest report card from this week: a hard lesson in restraint, a corrected plan, and an AI Chief of Staff who is finally on a path to genuinely autonomous behavior. Not bad for a Sunday.

Where Are You Right Now?

你的业务目前在 AI 方面最大的挑战是什么？

常见问题

What is Claude Code's /loop?

Claude Code's /loop is a built-in slash command Anthropic shipped in March 2026 that turns Claude Code from a reactive assistant into an autonomous agent. You tell it what task to perform, how often to check, and for how long, and it self-schedules its own next wake-up. Default interval is ten minutes. It supports dynamic re-pacing — the agent decides how long to wait before its next tick based on what it found on the current tick.

How is this different from a regular cron job?

A cron job runs a script on a fixed schedule and exits. Claude Code's /loop runs the same Claude model you use interactively, with access to your full context — files, tools, memory, prior decisions — and decides on each tick what to do next. It can change its own schedule, escalate to you when it hits a blocker, and remember what it did across many ticks. It is a stateful autonomous agent, not a stateless task runner.

Is this safe to use in production or for client work?

Yes, with the standard safety patterns. Run it under your existing approval rules — any spend over a threshold, any client-facing communication, any destructive operation should still gate through human approval. The loop itself is safe; the actions inside it need the same governance you would apply to any AI workflow. Anthropic's research papers cover the safety patterns in depth, and the community frameworks like Boucle ship explicit approval-gate primitives.

What does this cost in compute?

Every tick of /loop runs a Claude model call. At ten-minute intervals, that is 144 ticks per day per agent. Even with conservative token usage per tick, this adds up. The mitigation is to make each tick check first whether anything is ready to act on — if nothing is, the tick exits cheaply, and the loop continues. Pair /loop with a structured pending-items file and the cost stays bounded to the work that actually needs doing.

Where do I start if I want this for my own workflow?

Three steps. First, read Anthropic's two research papers — Effective Harnesses for Long-Running Agents and Harness Design for Long-Running Application Development. Second, clone the anthropics/cwc-long-running-agents repository and run the examples until the mental model clicks. Third, identify the smallest unit of routine work in your day that has a clear ready-when-X signal, and wire that as your first /loop task. Expand from there. The community frameworks are good references once the native primitive is grounded.

Contour Studio: Inside the AI-Powered Digital Build for a Vancouver Fashion Brand

RAG 系统开发：如何打造一个真正懂你业务的 AI

我如何在自己的 AI 流水线里抓到一次跨客户的博客质量漂移（并在一个下午内堵上）

准备好付诸行动?

让我们聊聊 AI 自动化和智能数字策略如何为你的业务带来实际成果。

免费咨询返回博客