I built AI DevKit because I wanted a workflow that makes AI coding feel less random and more efficient. But I also know from experience that AI looks great when you demo it. The problems only surface when you rely on it to ship something non-trivial.
I made a rule for myself. I would use AI DevKit to build the features inside AI DevKit. If the workflow breaks, the product is not ready.
About AI DevKit:
If you have not seen AI DevKit before, think of it as a way to work with AI without giving up control. Instead of treating AI as a black box that generates code, AI DevKit forces clear intent, explicit constraints, and tight feedback loops, so AI can execute well-defined work while humans stay responsible for direction and decisions. It works well with any AI coding agent you like, such as Cursor, Claude Code, Codex, Antigravity, OpenCode, etc.
This post is a build log of that experience, and what it taught me about where AI helps and where humans still need to steer.
My goal is to turn AI DevKit into a tool that makes building with AI more effective through clearer intent and sharper inputs. By reducing the amount of steering engineers need to do, we can progressively hand off execution to AI, while humans stay focused on deciding what to build and defining the right constraints.
The problem I was trying to solve
Every time I asked an agent to implement something, I retyped the same rules:
- Always return Response DTOs for APIs.
- Validate input at the boundary.
- Follow our folder structure.
- Avoid introducing new libraries without a reason.
I could paste these into prompts, or even add them as rules, but neither felt right.
Rules tend to trigger only in very specific contexts, often tied to certain files or patterns. You cannot realistically cover every case that way. Adding more rules also does not scale well. It increases complexity and still leaves gaps.
Some of these rules are also personal preferences. They make sense for how I work, but I would not want to enforce them at a project level. Others are not really rules at all. They are knowledge about the product, its constraints, or the tradeoffs we have already made.
Prompts and rules both fall short here. Prompts are ephemeral and disappear after each task. Rules are rigid and incomplete. Neither is a good place to store evolving engineering knowledge that needs judgment and context.
What I wanted was much more mechanical:
- store small, precise rules once
- retrieve them automatically when relevant
- apply them consistently across tasks
Not memory as chat history. Memory as engineering guidelines that actually get used.
That framing did not come from AI. It came from me being annoyed at repeating myself.
First assumption: memory is just storage and retrieval
At the beginning, I thought the feature would be straightforward.
- store something
- search it later
- inject it into context
But once I started writing the requirements, I hit the first hard question.
What exactly are we storing?
If I store everything, retrieval becomes noisy. If I store long explanations, context gets polluted. If I store vague advice, the agent behaves vaguely.
This is where human steering mattered.
The agent was very good at expanding the scope. Chat history, flexible memory, richer schemas, future extensibility. All of it sounded reasonable.
But it did not answer the real question I had while working: What do I actually want the agent to remember while I am coding?
So I forced an explicit model:
- each memory item is one actionable rule
- short title
- content with rationale and optional examples
- tags and scope to guide retrieval
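To make that concrete, here is a minimal sketch of that model in TypeScript. The field names are my own paraphrase for illustration, not the actual AI DevKit schema:

```ts
// Illustrative sketch of one memory item. Field names are paraphrased,
// not the real AI DevKit schema.
interface MemoryItem {
  id: string;
  title: string;   // short and specific, e.g. "Always return Response DTOs for APIs"
  content: string; // the rule plus its rationale, optionally a small example
  tags: string[];  // used to guide retrieval, e.g. ["api", "dto"]
  scope: "personal" | "project"; // where the rule should apply
}
```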
That was the moment where dogfooding mattered. I was not designing for a theoretical “memory feature”. I was designing for what I personally wanted the agent to recall in the middle of real work.
Second assumption: smarter retrieval is better
Once the memory model was clear, retrieval was the next decision.
The obvious answer these days is embeddings. The agent suggested it. It made sense technically.
But I kept asking a different question.
When a rule is retrieved, how do I know why it appeared?
If the system cannot explain itself, I cannot trust it. And if I cannot trust it, I will stop using it.
So I deliberately chose determinism over cleverness:
- SQLite with FTS5
- BM25 as the primary ranking signal
- title weighted higher than content
- predictable boosts from tags and scope
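As a rough illustration, this is the shape of that retrieval path, assuming better-sqlite3 as the driver. The library choice, weights, and boost values here are mine for the sketch, not necessarily what ships in AI DevKit:

```ts
import Database from "better-sqlite3";

const db = new Database("memory.db");

// FTS5 virtual table: title and content are indexed for full-text search,
// tags and scope are stored unindexed so they can drive deterministic boosts.
db.exec(`
  CREATE VIRTUAL TABLE IF NOT EXISTS memory_fts
  USING fts5(title, content, tags UNINDEXED, scope UNINDEXED);
`);

// bm25(memory_fts, 4.0, 1.0) weights the title column 4x higher than content.
// Lower bm25 values rank better, so we negate it, then add flat, predictable
// boosts for tag and scope matches instead of anything learned or fuzzy.
function search(query: string, tag?: string, scope?: string) {
  return db
    .prepare(
      `SELECT title, content, tags, scope,
              -bm25(memory_fts, 4.0, 1.0)
                + CASE WHEN ? IS NOT NULL AND tags LIKE '%' || ? || '%' THEN 2.0 ELSE 0 END
                + CASE WHEN scope = ? THEN 1.0 ELSE 0 END AS score
       FROM memory_fts
       WHERE memory_fts MATCH ?
       ORDER BY score DESC
       LIMIT 5`
    )
    .all(tag ?? null, tag ?? null, scope ?? null, query);
}
```

Running the same query against the same data always produces the same ranking, which is the whole point of choosing this over embeddings.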
This was a tradeoff.
I accepted that lexical search is not as “smart” as embeddings. But I gained something more important for this product stage: repeatability. If I run the same query tomorrow, I want similar results. That is how you build trust in a tool that runs inside your development loop.
That judgment does not come from prompts. It comes from knowing how it feels to work inside a tool every day.
Using the tool exposed quality problems
Once I started “using the memory service while building the memory service”, problems showed up.
The biggest one was quality.
If you let vague knowledge in, the agent pulls it out later and behaves vaguely. If you let duplicate rules in, ranking gets worse, and context gets messy.
So I moved quality checks to write time:
- knowledge must be specific and actionable
- generic advice is rejected
- deduplication by normalized title and scope
- deduplication by content hash
- strict content length limits
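A minimal sketch of what that write-time gate looks like, in TypeScript. The phrases, limits, and helper names are illustrative, not AI DevKit's actual rules, and the in-memory sets stand in for what would be database lookups in the real service:

```ts
import { createHash } from "node:crypto";

const MAX_CONTENT_LENGTH = 1000; // strict limit so items stay retrievable; value is illustrative
const GENERIC_PHRASES = ["write clean code", "follow best practices", "be careful"];

// Keys of items already accepted. In the real service these would be DB lookups.
const seenTitles = new Set<string>();  // normalized title + scope
const seenContent = new Set<string>(); // sha-256 of normalized content

function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns null if the item is accepted, or a rejection reason.
function validate(title: string, content: string, scope: string): string | null {
  if (content.length > MAX_CONTENT_LENGTH) return "content too long";
  if (GENERIC_PHRASES.some((p) => normalize(content).includes(p)))
    return "too generic to be actionable";

  const titleKey = `${scope}:${normalize(title)}`;
  if (seenTitles.has(titleKey)) return "duplicate title in this scope";

  const contentKey = createHash("sha256").update(normalize(content)).digest("hex");
  if (seenContent.has(contentKey)) return "duplicate content";

  seenTitles.add(titleKey);
  seenContent.add(contentKey);
  return null;
}
```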
This is boring engineering work. But it is the difference between “a memory feature exists” and “memory actually helps”.
AI will happily generate more. Humans need to decide what is allowed.
Some problems only appear when a human is in the loop
A good example is the MCP inspector breaking because dev scripts printed logs to stdout.
From a technical perspective, nothing was wrong. The code worked. The server ran. From a usage perspective, the tool was broken.
The agent could not tell me this mattered. It could help debug it once I pointed it out, but it did not experience the failure.
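The underlying issue is that stdio-based MCP servers use stdout for the JSON-RPC protocol itself, so anything else printed there corrupts the stream the inspector reads. A minimal sketch of the fix, assuming a Node-based dev script:

```ts
// Over a stdio transport, stdout carries the protocol messages,
// so stray console.log calls break the MCP inspector.
// Route all human-facing logs to stderr instead.
function log(...args: unknown[]): void {
  console.error("[dev]", ...args); // stderr, invisible to the protocol stream
}

log("server started");              // safe
// console.log("server started");   // would corrupt the stdio protocol
```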
Another example was integration tests failing because the SQLite connection was shared globally. The agent helped write the tests, but it did not anticipate how the global state would behave once everything was wired together.
Fixing this required stepping back, understanding how the system would actually be used, and redesigning test isolation. This was not about making the code more correct. It was about making the system safe to depend on. And that is a human responsibility.
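A rough sketch of the isolation pattern that fixes this, assuming better-sqlite3 and Vitest-style hooks (both are assumptions for illustration): each test builds and tears down its own in-memory database instead of touching a module-level connection.

```ts
import Database from "better-sqlite3";
import { afterEach, beforeEach, expect, test } from "vitest";

// Factory instead of a shared global: every test gets a fresh in-memory DB.
function openDb() {
  const d = new Database(":memory:");
  d.exec("CREATE VIRTUAL TABLE memory_fts USING fts5(title, content)");
  return d;
}

let db: ReturnType<typeof openDb>;

beforeEach(() => {
  db = openDb();
});

afterEach(() => {
  db.close();
});

test("stores and retrieves a rule", () => {
  db.prepare("INSERT INTO memory_fts (title, content) VALUES (?, ?)")
    .run("Return Response DTOs", "APIs must return Response DTOs, not entities.");

  const rows = db
    .prepare("SELECT title FROM memory_fts WHERE memory_fts MATCH ?")
    .all("response");
  expect(rows).toHaveLength(1);
});
```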
This is a recurring pattern with AI-assisted development. AI helps you move faster once you see the problem. But a human still needs to notice what actually feels broken.
What this build changed in how I think about AI tools
My view of AI tools has not changed since the day I wrote “AI changes the Tools, you still own the Craft”:
The best engineers of the future won’t just know how to use AI, they’ll know when not to.
So, as this era unfolds, don’t fear the tools. Learn them. Use them. Stretch them. Break them. But never forget: It’s your thinking, your judgment, your care that turns software into something worth shipping.
The craft is still yours. Own it.
By the end, the memory feature looked very different from my initial idea.
It was smaller, stricter, and more boring. No fancy tech was applied.
This build reinforced something I strongly believe.
AI can accelerate implementation. It cannot define intent, constraints, or quality on its own.
Humans still need to steer:
- what should exist
- what should be rejected
- which tradeoffs matter
- when predictability beats capability
Using AI DevKit to build AI DevKit features forced that discipline. It exposed assumptions I did not know I had. It turned fuzzy ideas into concrete constraints.
That is why I like dogfooding. It does not validate your ideas. It challenges them until they become real.
AI DevKit is open source. If this way of working with AI resonates with you, contributions, experiments, and critical feedback are very welcome.
If you want to read more about software engineering in the AI era, subscribe to my blog. I will keep sharing what I learn while building systems with AI in the loop.