What it takes to run a fleet in production, not just demo one.
One agent is an afternoon. The day after the demo you own ten of them, on real data, for real users, and somebody has to run them.
Naive and blackbox both stop at the demo. The wall is everything after: versions, permissions, traces. No builder does that part for you.
A demo proves the task can be done once. Answer these four per role and most of the architecture falls out on its own.
This one runs in production today. Same job for fifty people, each with private email, calendar, and files the others can never see.
The inbound prompt is one input among several: a task. Six parts decide what happens to it, and every one is config you own.
The role is a folder: skills, scripts, soul, seed memory. The yaml is the spec that wires it together. Diff it, version it, audit it, clone it.
role: executive_assistant autonomy_band: moderate model_tiers: classify: cheap draft: workhorse judge: frontier permissions: write_autonomous: [create_draft, label_or_file_email, ...] write_requires_approval: [send_email, schedule_or_move_meeting, ...] budgets: max_daily_cost_usd: 25 per_run_ceiling_usd: 1.50
Give it a real account: its own email, its own scoped tokens, its own rows in the audit log. Firing it means revoking tokens, same as offboarding a person.
One role file, bound per user at run time to their tokens, data, and memory. Fifty employees is one codebase and fifty config rows.
A trigger is how the work arrives: a message, a cron row, a webhook, another agent. All four enqueue the same job shape against the same role.
Planner, coder, tester: those could be skills inside one loop. A worker earns its own agent when it needs a fresh context window or its own credentials.
Make it a typed payload: the intent, the durable context B needs to act, a compressed summary of the rest. Everything else gets dropped on purpose.
The binding is data: which tokens this instance holds. The policy is code: which actions run alone and which wait in the approval queue. Neither lives in the prompt.
Instrument the runtime once and every role inherits it. Each run writes one trace: inputs, model, tool calls, cost, outcome. Replay reruns a failed step with the same config and context.
A deterministic router tries the cheap rungs first. A rule that works costs nothing. The loop only sees what nothing cheaper could handle.

Config on a shared runtime. Its own account and audit trail.
Chat, schedule, hook, agent. Role × trigger × task.
Identity binds what it sees. Policy binds what it does.
Every run recorded, or it's a slot machine.
One agent is a prompt. A fleet is an operating model. Get those four right and it scales. Get them wrong and you have fifty demos.
Questions are welcome. Stories are better.
Single loop, swarm, a builder, your own runtime.
The thing you'd ship again tomorrow.
The failure that taught you something.
Anything you'll go change this week.