After finding some motivation at the Pragmatic Summit last month, I decided to try my luck at pushing to level 4 on the AI coding scale. I currently spend most of my work days somewhere between 2 and 3, with one AI-free day per week to remind me what life was like in 2023.

The project I chose for this experiment looked like this:

  • K8s operator in Go
  • greenfield; written by AI & human (level 2/3 style)
  • strong automated testing: unit / envtest / Kind-based E2E (see the harness sketch below)
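
For context on the envtest layer: it boots a real kube-apiserver and etcd locally, so reconciler tests run against actual API machinery instead of fakes. A minimal harness looks roughly like this (the CRD path and scheme wiring are assumptions based on kubebuilder defaults, not this project’s actual layout):

```go
package controller_test

import (
	"path/filepath"
	"testing"

	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// TestReconcilerAgainstEnvtest boots a local kube-apiserver + etcd
// via envtest so the controller is exercised against real API machinery.
func TestReconcilerAgainstEnvtest(t *testing.T) {
	testEnv := &envtest.Environment{
		// Assumption: CRDs live under config/crd/bases (kubebuilder default).
		CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")},
	}

	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatalf("starting envtest: %v", err)
	}
	defer testEnv.Stop()

	k8sClient, err := client.New(cfg, client.Options{Scheme: scheme.Scheme})
	if err != nil {
		t.Fatalf("creating client: %v", err)
	}
	_ = k8sClient // create objects and drive the reconciler from here
}
```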

The project had gone through a lot of iterations trying out different scenarios, and cruft and inconsistencies had built up over time. So, let the bot clean it up!

The Flow

I opted for a fairly standard flow (as much as a “standard” is possible in this constantly changing area):

Agentic flow setup

In step one, I used a powerful model (claude-4.6-opus-max) to gather best practices and documentation around K8s operator design¹ and to run a deep code review. The outcome was a list of potential improvements. This list was, as expected, only semi-useful, as the AI had no context about priorities and real-world constraints.

Then I paired with the AI to go through each task one by one, adding details, weighing options, and deciding whether implementation was worth it. For all tasks deemed useful, the model created individual markdown task files within the repo. Each task was self-contained, with all details captured in its markdown file. This step took the most time.
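
To give a feel for the task files: each one read like a tiny, self-contained spec. The example below is illustrative only; the headings and the task itself are my invention, not a fixed format:

```markdown
# Task: Deduplicate status-condition helpers

## Context
Three controllers carry near-identical helpers for setting status conditions.

## Plan
- Extract a shared helper into internal/conditions
- Update all three reconcilers to use it

## Done when
- Unit and envtest suites pass
- No local copies of the helper remain
```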

Third, I used a cheaper model (composer-1.5) and Cursor’s subagent feature to run the implementation. The orchestrator agent analyzed the tasks to identify non-overlapping batches that could be run in parallel without leading to merge conflicts.
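
Conceptually, that batching step is just set partitioning: treat each task’s expected touched files as a set and greedily group tasks whose sets don’t intersect. A hypothetical sketch of what the orchestrator works out (the Task shape and file lists are made up for illustration):

```go
package main

import "fmt"

// Task is a hypothetical stand-in for one markdown task file,
// annotated with the files it is expected to touch.
type Task struct {
	Name  string
	Files []string
}

// batchTasks greedily groups tasks into batches whose file sets
// don't overlap, so each batch can run its tasks in parallel
// without producing merge conflicts.
func batchTasks(tasks []Task) [][]Task {
	var batches [][]Task
	var claimed []map[string]bool // files already claimed, per batch

	for _, t := range tasks {
		placed := false
		for i, files := range claimed {
			if !overlaps(files, t.Files) {
				batches[i] = append(batches[i], t)
				for _, f := range t.Files {
					files[f] = true
				}
				placed = true
				break
			}
		}
		if !placed {
			files := make(map[string]bool, len(t.Files))
			for _, f := range t.Files {
				files[f] = true
			}
			batches = append(batches, []Task{t})
			claimed = append(claimed, files)
		}
	}
	return batches
}

func overlaps(set map[string]bool, files []string) bool {
	for _, f := range files {
		if set[f] {
			return true
		}
	}
	return false
}

func main() {
	batches := batchTasks([]Task{
		{Name: "dedup-conditions", Files: []string{"internal/conditions.go"}},
		{Name: "fix-requeue", Files: []string{"controllers/app_controller.go"}},
		{Name: "rename-helpers", Files: []string{"internal/conditions.go"}},
	})
	for i, batch := range batches {
		fmt.Printf("batch %d: %v\n", i+1, batch)
	}
}
```

In this toy run, “dedup-conditions” and “fix-requeue” land in batch 1, while “rename-helpers” gets pushed to batch 2 because it shares a file with the first task.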

Then, batch by batch, the orchestrator kicked off one subagent per task to implement the change, verify it via the test suite, and finally delete the markdown file and commit the changes. The result was a clean git history. The orchestrator then ran a final validation pass and handed control back to me for review.

And it did work!

The Aftermath

While I was surprised by how smoothly everything went, I did run into a few rough edges.

Firstly, I had spent a lot of time setting up the testing infrastructure in this project, paying close attention to making it usable for both the human and the AI. The tasks themselves were small improvements & refactors, not big features. So the setup was perfect for an agentic workflow.

Secondly, some of my instructions didn’t make it through to the subagents. The orchestrator somehow “forgot” to pass them on, and I had to explicitly call out that rules like “verify changes via the test suite” should also apply to the subagents. Frustrating. Putting this flow into Skills and iterating on the details should resolve this and similar issues, though.
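
For what it’s worth, a skill for this could be as simple as a SKILL.md that spells the rules out once, so they survive the orchestrator-to-subagent handoff. A rough sketch, assuming the common SKILL.md frontmatter convention (the name and rules here are illustrative):

```markdown
---
name: implement-repo-task
description: Implement one markdown task file from the repo, verify, and commit.
---

When implementing a task file:
1. Read the whole task file before editing anything.
2. Verify every change via the test suite (unit + envtest); this rule
   applies to subagents as well, not only the orchestrator.
3. On success, delete the task file and commit with a message that
   references the task name.
```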

I also experimented with opening PRs instead of direct commits and with integrating Jira tickets, but that ballooned the scope and sucked the fun out of the whole exercise.

In the end, I felt a lot less in control, yet couldn’t point to a concrete drawback that irked me. Maybe with more experiments like this, it will feel more normal.

  1. context7 is helpful here.