Building with AI Agents, Building for AI Agents

I spent months building the Torrust Tracker Deployer without writing a single line of code directly — every line came from a GitHub Copilot agent. But this post is about something bigger: the shift in thinking that came from that experience and what it taught us about designing tools that AI agents can actually use well.

Jose Celano - 27/02/2026

Introduction

I spent several months building the Torrust Tracker Deployer without writing a single line of code directly. Every line of Rust, every configuration file, every refactoring plan — all of it came from a GitHub Copilot agent. I reviewed, directed, steered, and approved. But I did not type the code.

That experience turned out to teach me something more interesting than how to use agents effectively as a developer. It changed how we think about who our software is for.

This article tells two stories. The first is about building with AI agents: how I worked with them day to day, what went wrong with remote agents, and what I learned from micromanaging them in a complex infrastructure project. The second is about building for AI agents: the gradual realisation that agents — not humans — would become the primary operators of the deployer, and everything we changed in the product because of that.

If you're not yet familiar with the deployer itself, you may want to read the earlier articles in this series first: Introducing the Torrust Tracker Deployer and Deploying Torrust to Production.

Part 1 — Building the deployer with AI agents

How I worked with agents

I used a single tool throughout: GitHub Copilot, running inside Visual Studio Code. Never more than three agents active simultaneously. No Claude Code, no Cursor, no remote pipelines beyond what Copilot offers out of the box.

Over time I settled into three distinct working modes:

Mode 1: Pair programming for planning and design

When I needed to plan a feature, define an architecture, or design a refactoring strategy, I would work conversationally with a single agent. This is the mode that feels most like having a senior colleague to think out loud with. I describe the problem, the agent suggests approaches, we debate trade-offs, and eventually we arrive at a plan. The agent is not executing anything here — it is just helping me think.

This mode worked extremely well. Agents are patient, thorough, and will consider angles you might miss. Architecture sessions that would have taken me days of reading and thinking alone got compressed into focused conversations.

Mode 2: Local agents with limited permissions

For actual implementation tasks, I assigned work to local agents but deliberately limited what they were allowed to do autonomously. The agent could write code, run tests, and propose commits — but I reviewed every step before it proceeded. I treated the agent like a junior developer working in a pair: capable and fast, but not yet trusted with full autonomy.

This ended up being the natural working mode for two reasons. First, it is simply the easiest way to work — having the agent local means no external runner setup, no shared environment limitations, and no delays. Second, the nature of this project meant most tasks required constant steering regardless. The agent would drift, make a wrong assumption, or produce code that compiled but missed the intent. Staying close to the work and reviewing each step was not overhead — it was the actual job.

I did allow the agent to interact with LXD directly — creating and destroying virtual machines, running provisioning commands — but under supervision: I controlled which commands the agent was permitted to run. Because LXD runs locally on my development machine, VMs can be created and destroyed quickly with no cloud cost and no lasting consequences. That made it safe to grant the agent meaningful permissions while still retaining oversight.

Mode 3: Remote Copilot (tried, abandoned)

At the start of the project, I also tried GitHub Copilot's remote agent mode, where assigned tasks run on GitHub's shared runners without involving my machine. It seemed appealing as a way to parallelise work.

It did not work well for this project, for two distinct reasons.

Why remote Copilot did not work for this project

Problem 1: Nested virtualisation on shared runners

The integration tests launch real virtual machines using LXD. This requires nested virtualisation — running a hypervisor inside the already-virtualised environment that GitHub's shared runners provide. Shared runners do not support this. The LXD VMs would start but network connectivity inside the VM would fail, causing tests to error in ways that never occur locally.

This is a hard constraint, not a configuration problem. There is no way to make nested virtualisation work on standard shared runners without switching to self-hosted runners or a different test strategy.

Problem 2: The pre-commit timeout loop

The project has thorough pre-commit hooks: Rust compilation, clippy, tests, linters, spell checking. On my local machine these take over five minutes to complete. On the shared runners they take even longer.

GitHub Copilot's remote task timeout is shorter than that. The agent would trigger a commit, the pre-commit hooks would start running, the timeout would fire before they finished, and Copilot would conclude the commit had failed. It would then try again. And again. An infinite retry loop, each attempt interrupted mid-commit, making no forward progress.

The root cause is that the agent cannot distinguish a timeout from a genuine failure: when the timeout interrupts the pre-commit hooks, it reads the interruption as a failed commit and retries, rather than waiting longer or reporting the problem.

Lesson: Remote agents have real constraints for infrastructure projects. If your tests require specialised environments (VMs, hardware, unusual networking), or if your pre-commit checks are slow, local agents with human review will serve you better than remote pipelines. The alternative is self-hosted runners or your own agent infrastructure — but that is a significant investment in itself.

Part 2 — The shift: agents are the users

The project started in September 2025. The initial decision — to build a console CLI — was obvious. Infrastructure tooling lives in the terminal. You provision a server by running commands. A CLI is the natural fit.

Over the following months, something shifted in how we thought about the tool. Not in how it worked, but in who we imagined using it.

The realisation came gradually: in a world where people interact with their computers through AI agents, nobody is going to use this CLI directly. They will ask their agent to deploy the tracker. The human remains the end user — the person who wants a working tracker — but the operator of the deployer is increasingly likely to be an agent.

Once you frame it that way, a new question becomes urgent: what does an agent need from this tool? And more importantly: are we giving it those things?

We added a dedicated section to the project roadmap: section 11, "Improve AI agent experience". It was the first time we had explicitly committed to treating agents as a first-class user group. You can follow the full roadmap at GitHub issue #1.

Part 3 — Rethinking UX for AI agents

Before getting into the specific changes we made, it is worth explaining the thinking that shaped them. Some of our assumptions about UX turned out to be correct; others needed revising.

What human UX and agent UX have in common

The deployer's CLI was already built with careful attention to usability. Every command was designed to be:

  • Self-explanatory — commands and options describe what they do
  • Self-discoverable — the app tells you what you can do next
  • Observable — output shows what is happening at every step
  • Explanatory on failure — errors describe what went wrong and why
  • Helpful after failure — errors suggest how to continue
  • State-aware — the user always knows where they are in the workflow

All of these principles benefit agents just as much as humans. Clear errors, structured output, helpful hints — an agent consuming these is in a much stronger position than one left to guess from cryptic messages. As the Tinybird team put it later (more on this in Part 5): "When an AI agent runs your CLI and gets a cryptic error, it hallucinates a fix. That's bad for agents, and it's equally bad for humans debugging at 2am."

The cost of self-discovery

This is not actually a new problem. Software development has always had a high onboarding cost — for humans, the currency is time and cognitive load rather than tokens, but the tension is the same. Good engineering teams have always wrestled with two competing pressures:

  • Write documentation to reduce the time new contributors spend learning how things work.
  • Keep that documentation maintained — because stale docs that no longer reflect the code are worse than no docs at all.

The classic answer has always been: give people the exact context they need for the specific task they are about to do, rather than front-loading everything or leaving them to discover it themselves. Good onboarding is targeted, not exhaustive.

Agents make this tension sharper because the budget is explicit. Every round-trip — running a command to discover what options it takes, reading the man page, trying a flag — consumes tokens. If an agent has to explore the interface to learn how to use the app, it burns through context before doing any real work. The cost of self-discovery is measurable in a way that it never quite was for human developers.

The implication is the same one good engineering teams have always acted on: a compact, targeted reference loaded at the start of a task is more efficient than relying on self-discovery — whether the learner is a new hire or an AI agent.

The documentation maintenance problem, now worse

Working with AI agents is generating a new category of documentation — skills, prompt files, schema exports, context files — much of it written for agents to consume rather than humans to read. This creates a version of the old documentation problem, but potentially worse: documentation that humans struggle to review, produced faster than any team can meaningfully maintain.

If an agent can change code faster than any human, but nothing forces it to update the related documentation, we will end up exactly where we have always ended up: outdated docs that nobody trusts. The speed advantage of agents makes this worse, not better.

Some teams are now arguing that agents should learn to use software the same way a senior developer would — by reading the code itself, exploring the interface, building understanding from first principles rather than from documentation. There is something appealing about this: it sidesteps the maintenance problem entirely. No docs, no stale docs.

But this argument has never worked at scale for humans, and it is unlikely to work for agents either. A senior developer joining a large project does not learn it by reading every file — they get onboarded, they get context, they get pointed at the right parts. "Code over documentation" from the Agile Manifesto never meant no documentation. It meant: prefer working software over comprehensive documentation, and write documentation that earns its maintenance cost.

The right answer for agents is the same as for humans: write documentation that is targeted, close to the code, and easy to update alongside the code it describes. Skills stored in the repository and referenced from AGENTS.md are one concrete attempt at this — they live where the code lives, so the same pull request that changes a feature can update the skill that describes it.

The interface question: CLI, REST, or GraphQL?

We discussed whether a CLI was even the right kind of interface for agents to interact with. Three options were on the table:

  • CLI: natural fit — agents run in terminals, and the deployer already had one. But agents have to parse text output and infer meaning from exit codes.
  • REST API: a standard programmatic interface with structured JSON responses.
  • GraphQL: self-describing schema — an agent can query the schema itself to discover what the API can do, similar in spirit to an MCP server.

GraphQL's self-describing quality was attractive. But all three options share a fundamental limitation.

Interfaces do not convey workflows by default

A GraphQL schema and a REST API reference list available operations with no inherent ordering. A CLI --help output works the same way by default — but a CLI has more room to do better. A well-designed CLI can include a top-level help section that describes the deployment workflow explicitly, output state-aware hints after each command ("next step: run provision"), or fail with errors that name the missing prerequisite step. The deployer does this.

The deployer has a mandatory deployment sequence:

create → provision → configure → run

You cannot provision a server before creating it. You cannot run services before configuring them. The point is not that no interface can express this — it is that expressing sequential workflows requires deliberate design effort on top of whatever interface you choose. It does not come for free with GraphQL's self-describing schema, a REST reference, or a default --help output. An agent using any of these interfaces without that extra design work still needs separate documentation to understand the required sequencing.
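That deliberate design effort can even be pushed into a type system. As a sketch only (not the deployer's actual code), Rust's typestate pattern can encode the mandatory sequence so that an out-of-order call is a compile error rather than a runtime failure:

```rust
// Sketch: the create → provision → configure → run sequence encoded as
// typestates. Each step consumes the previous state, so the compiler
// rejects any out-of-order call.
use std::marker::PhantomData;

struct Created;
struct Provisioned;
struct Configured;
struct Running;

struct Environment<State> {
    name: String,
    _state: PhantomData<State>,
}

impl Environment<Created> {
    fn create(name: &str) -> Self {
        Environment { name: name.to_string(), _state: PhantomData }
    }
    // provision() only exists on a Created environment.
    fn provision(self) -> Environment<Provisioned> {
        Environment { name: self.name, _state: PhantomData }
    }
}

impl Environment<Provisioned> {
    fn configure(self) -> Environment<Configured> {
        Environment { name: self.name, _state: PhantomData }
    }
}

impl Environment<Configured> {
    fn run(self) -> Environment<Running> {
        Environment { name: self.name, _state: PhantomData }
    }
}

fn main() {
    // The legal order compiles:
    let env = Environment::<Created>::create("staging").provision().configure().run();
    println!("{} is running", env.name);
    // Environment::<Created>::create("x").run(); // does not compile: no such method
}
```

This is the same workflow knowledge an agent otherwise has to get from documentation, moved into the interface itself.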

Where agents genuinely differ: they can program

So far the picture has been mostly one of continuity: agents benefit from the same UX principles as humans, they share the same tension between documentation and self-discovery, and they need workflow context just as a new developer would. In many ways, designing for agents means applying good software design principles more rigorously, not inventing new ones.

But there is one capability that sets agents meaningfully apart from human users: they can parse structured output instantly, call APIs, and — most importantly — write and execute code as part of completing a task. A human using a CLI reads the output and decides what to do next. An agent can do that too, but it can also write a small program on the spot to automate the next ten steps, handle errors programmatically, and compose the tool with other systems — all without leaving the task.

When a human uses a CLI, they run commands one by one and read the output. When an agent needs to accomplish something, it often writes a small program — a bash script, a Python script, a Rust binary — to compose operations and handle results systematically. Some projects are even exploring REPL environments for agents: stateful shells with variables and memory, letting agents build up automation incrementally.

This capability changes what the most useful interface actually is. The natural conclusion: give agents an environment where they can interact with your application through a program. Don't just make your commands easier to parse — give them a library they can code against.

That is the core reason we built the SDK.

How do you "train" an agent to use your app?

During our weekly development meetings, this question came up repeatedly. The options we considered:

  • RAG (Retrieval-Augmented Generation): feed the agent relevant documentation at query time from a vector database
  • Custom knowledge bases: curate a structured corpus the agent queries

Both are valid approaches. But we kept returning to one observation: LLMs are evolving fast, and new models are released frequently. New models are likely being trained on public GitHub repositories — including ours.

The simpler bet: just wait for the next model. As long as your project is public and your documentation is good, the LLMs will eventually incorporate your changes through their regular training cycles. No custom pipeline required.

This works for us because:

  • The repository is public
  • Development pace on the deployer is measured — LLMs can keep up with our changes

It would break down if you were evolving faster than new model releases. But for this project, the cadence works in our favour.

Part 4 — What we actually built

The thinking in Part 3 led to a series of concrete changes. Here is what we built and why.

4.1 — Skills for focused context

As the project grew, AGENTS.md — the file that tells agents about the project — became too long. In long conversations, loading the entire file at the start pollutes the context window with information unrelated to the current task. Agent quality degrades.

We adopted the agentskills.io specification. The idea: move per-task instructions out of AGENTS.md and into individual skill files. When an agent starts a task, it loads only the skill relevant to that task — getting focused, actionable instructions without the noise of everything else.

For example: a skill for "add a new command", a skill for "write an integration test", a skill for "deploy using the SDK". Each one contains exactly what an agent needs for that specific task.

An honest caveat: it doesn't always work perfectly. Agents sometimes don't trigger the right skill, or forget to search for available skills before starting. Human steering is still required. But when it works, the quality improvement is noticeable.

4.2 — AI-discoverable headers in template files

There are three distinct scenarios in which an agent might interact with this project:

  1. Contributor agent: working on the deployer codebase itself — adding features, writing tests, refactoring
  2. Deployment agent: helping an end-user deploy the Torrust Tracker using the deployer
  3. Maintenance agent: helping an end-user maintain a server after the tracker has already been deployed

The third scenario is the tricky one. Once the deployer has run, it leaves behind rendered artefacts: Docker Compose files, environment files, Ansible configuration. A maintenance agent only has those files as context. It has no idea where they came from, what tool generated them, or where to find documentation.

Our solution: embed a small documentation header in every template file. When the deployer renders a template, the header is included in the output. The header tells the agent:

  • This file was generated by the Torrust Tracker Deployer
  • Where to find the source template
  • Where to find documentation and support

This allows an agent encountering a rendered file for the first time to discover the source, fetch the relevant documentation, and continue with full context — without any human having to explain the provenance.
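As an illustration only (the deployer's real header wording and template paths differ), a rendered Docker Compose file might begin with a comment block like:

```yaml
# This file was generated by the Torrust Tracker Deployer. Do not edit by hand:
# changes will be overwritten the next time the template is rendered.
# Source template: templates/compose.yml.tera (hypothetical path)
# Documentation and support: https://github.com/torrust/torrust-tracker-deployer
```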

4.3 — Giving agents tools to understand the configuration

The deployer's create command takes a JSON file as input: the full configuration for the tracker environment (server specs, domain, enabled services, networking, and so on). Getting that configuration right is the most demanding part of a deployment.

We built three complementary tools to help agents handle this:

Questionnaire skill

A structured skill that guides an agent through the right questions to ask the user before generating a configuration: what domain will the tracker run on? Which cloud provider? Which services should be enabled? The agent interviews the user, collects the answers, and then produces a valid config file — rather than guessing at values and generating something that fails at deployment time.

JSON schema for the configuration

The environment configuration has a formal JSON schema that can be generated and injected directly into the agent's context. The agent immediately knows which fields are required versus optional, what types each field expects, and what values are valid — without needing to read prose documentation or iterate through trial and error.
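To illustrate the idea (the field names here are invented, not the deployer's real schema), even a small JSON Schema excerpt gives an agent the required fields, types, and valid values at a glance:

```json
{
  "type": "object",
  "required": ["name", "provider"],
  "properties": {
    "name": { "type": "string", "description": "Environment name" },
    "provider": { "type": "string", "enum": ["lxd", "hetzner"] },
    "domain": { "type": "string", "description": "Optional public domain for the tracker" }
  }
}
```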

Machine-readable CLI documentation

The deployer uses the Clap crate for CLI argument parsing. We added a command to export the entire CLI documentation as structured JSON, generated automatically from the Clap definitions. An agent can load this JSON at the start of a conversation and immediately know every command, every subcommand, every flag and option — with no exploration required.

This solves the learning problem: injecting the JSON docs is cheaper in tokens than exploring the interface. But it does not solve the integration problem: the agent knows how to run commands, but cannot easily build pipelines that compose the deployer with other systems. For that, we needed the SDK.

4.4 — JSON output for all commands

Every command in the deployer now accepts a --json flag that switches its output from human-readable text to structured JSON. This covers create, provision, configure, show, list, run, test, and more.

For agents using the CLI path, this eliminates text parsing entirely. The agent gets a typed document it can read directly, not a wall of formatted output designed for human eyes.
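As a sketch of the shape such output might take (not the deployer's actual format), a status query in JSON mode could return something like:

```json
{
  "status": "success",
  "environment": "staging",
  "state": "provisioned",
  "next_step": "configure"
}
```

A field like `next_step` also carries the workflow hints discussed earlier in a form the agent can act on directly.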

4.5 — The SDK: a library for agents who write code

The SDK is the most significant addition for agent users. The design principle is stated directly in the pull request that introduced it:

"The CLI is designed for humans; the SDK is designed for programs and AI agents that need reliability, composability, and type safety."

Instead of shelling out to CLI commands and parsing text output, an agent can write a Rust program that calls the SDK directly. The benefits:

  • No text parsing — operations return typed Rust values, not strings
  • No exit code inference — errors are typed Result variants with domain-specific names like EnvironmentAlreadyExists and EnvironmentNotFound
  • Structured progress events — long-running operations like provision and configure emit step-by-step progress events via a CommandProgressListener, so the agent knows what is happening without scraping stdout
  • Compiler-enforced correctness — the type system catches mistakes before execution
  • Config builder — the EnvironmentCreationConfigBuilder guides the agent to a valid configuration through method chaining and type constraints, rather than requiring a correctly-shaped JSON file

The SDK also enables something the CLI cannot: integrations. Sometimes an agent does not just want to invoke the deployer — it wants to fetch information from another system and pipe it into a deployment workflow. With the Rust SDK, the agent can write a program that talks to multiple systems and composes them with the deployer using the full power of a typed programming language. That is impossible with a pure CLI approach.

An agent can read just the SDK's public interface — the types, structs, traits, and method signatures — and understand how to use the deployer in a few minutes. This is the same self-describing quality that made GraphQL attractive, but expressed through Rust's type system, with the additional benefit of compiler support.
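To make the benefit concrete, here is a hypothetical miniature of coding against an SDK of this shape; every name below is invented for illustration and does not match the real crate's API:

```rust
// Hypothetical sketch: typed errors let a program branch on variants
// instead of parsing stderr and guessing from exit codes.
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum CreateError {
    EnvironmentAlreadyExists(String),
    InvalidConfig(String),
}

struct Deployer {
    environments: HashSet<String>,
}

impl Deployer {
    fn new() -> Self {
        Deployer { environments: HashSet::new() }
    }

    // A typed Result instead of an exit code plus free-form text.
    fn create(&mut self, name: &str) -> Result<(), CreateError> {
        if name.is_empty() {
            return Err(CreateError::InvalidConfig("name must not be empty".into()));
        }
        if !self.environments.insert(name.to_string()) {
            return Err(CreateError::EnvironmentAlreadyExists(name.to_string()));
        }
        Ok(())
    }
}

fn main() {
    let mut deployer = Deployer::new();
    deployer.create("staging").expect("first create succeeds");

    // The agent decides how to recover based on the variant:
    match deployer.create("staging") {
        Err(CreateError::EnvironmentAlreadyExists(name)) => {
            println!("'{name}' already exists, reusing it");
        }
        Err(CreateError::InvalidConfig(reason)) => println!("bad config: {reason}"),
        Ok(()) => println!("created"),
    }
}
```

The recovery logic lives in ordinary code the agent writes, which is exactly the "users who can program" capability the CLI cannot exploit.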

The full picture for an agent starting from scratch:

  • JSON CLI docs: complete command reference, zero exploration needed
  • JSON schema for the config: exact structure and valid values for the create input file
  • Questionnaire skill: guidance on what to ask the user to produce a valid config
  • Rust SDK + config builder: typed programmatic control, enabling integrations across systems

4.6 — Typed, machine-readable errors

The deployer's error output for agents goes beyond a message and an exit code. Errors are typed — both in the Rust SDK (as enum variants like EnvironmentAlreadyExists and EnvironmentNotFound) and in the CLI's JSON output mode, where each error carries a machine-readable code alongside the human-readable explanation.

This matters because an agent that receives a typed error can make an informed decision about how to recover — retry with different parameters, report a specific problem to the user, or branch to an alternative path — without having to parse free-form text and guess at the cause. Structured errors are a first-class part of the agent-friendly interface, not an afterthought.
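A hypothetical example of the shape (the deployer's real field names may differ):

```json
{
  "error": {
    "code": "environment_not_found",
    "message": "Environment 'staging' does not exist",
    "hint": "Run the create command first"
  }
}
```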

4.7 — Deferred for post-v1

Some ideas were recognised as valuable but deferred to keep v1 scope manageable:

  • MCP server: expose the deployer's capabilities as MCP tools, letting any LLM interact with it natively without needing a CLI or SDK wrapper
  • Dry-run mode: already partially addressed by the existing validate and render commands

Part 5 — Others reached the same conclusions

While we were working through these ideas, other teams were independently arriving at the same place.

In February 2026, Tinybird published a post titled "We built our own AI coding agent. Here's why we're sunsetting it." The story is striking: they built a full-featured custom agent for their CLI — capable of schema design, query optimisation, testing, and deployment — and then deprecated it.

Their conclusion:

"We shouldn't be building an agent. We should be making Tinybird work with every agent."

And their final principle:

"Don't build custom agents. Make your platform work with all of them."

What they found actually works — and it maps almost exactly to our list:

  • A well-designed CLI with clear, machine-parseable error messages
  • A typed SDK with discoverable methods
  • Skills that encode domain expertise and load on demand
  • Documentation that is both human-readable and machine-parseable
  • An MCP server to expose platform capabilities to any LLM

The Next.js team arrived at the same place when they sunset their in-browser agent: treat agents as first-class users of your platform and meet them where they are.

The convergence matters. Multiple teams, working independently, went through the phase of building custom agent wrappers — and all concluded the same thing: invest in good primitives, not custom agents. Different paths, different lessons learned along the way — teams who built and sunset their own agents likely learned things we did not. But seeing the same conclusions validated from multiple directions independently is what gives those conclusions weight.

Conclusion

Two stories, one insight: building with agents teaches you how to build for them.

Using GitHub Copilot agents to build the deployer forced us to confront what it actually means for software to be agent-friendly. The friction we experienced as developers — the context window problems, the need for focused instructions, the struggle with opaque errors — was exactly the friction our users' agents would experience when trying to use the tool.

The primitives that matter are not exotic. They are the same things that make tools good for humans, with a few additions specific to how agents work: skills for focused context, machine-readable schemas and documentation, structured output, typed SDKs that encode workflows, and errors that explain rather than confuse.

The underlying principle is simple:

Agents are not a special case. They are users who can program. Design your tools accordingly.

Next steps for the deployer: an MCP server, expanded SDK coverage, and eval frameworks to measure how well agents perform end-to-end deployment tasks. If you are interested in contributing or following the progress, the roadmap lives at torrust/torrust-tracker-deployer.