Test Trees — April 22, 2026

My closest equivalent to “specs” are test trees.

Test trees are seriously battle-tested. They are already used to describe and test the behaviour of popular FinTech apps. I’ve personally spent thousands of hours working with test trees and other engineers to describe, reason about, and verify our software.

Test trees are a living, verifiable description of all behaviour. They are best explained by the test tree for test trees in my own Claude Code plugin, which is at the bottom of this post.

If you think test trees should work differently, you can propose a change:
> "then the tree must exist before implementation starts"
No, TDD is dumb! "then Claude should just go nuts and I'll figure it out later"

We can reason about changes to behaviour very easily this way, which is important when changing real products with millions of users over _years_.

Also, they are verifiable – they prove that the software has upheld them, because they map to tests that in turn _drove_ the implementation. Here I go again with the TDD 😂

test-trees-as-requirements (unit: test/test-trees-as-requirements.bats)
  when a project uses contree
    then CLAUDE.md identifies TEST_TREES.md as the definition of functional and cross-functional requirements
    and TEST_TREES.md defines functional requirements using EARS syntax
    and each behavioural unit has its own tree in TEST_TREES.md
    and trees are flat subsections — not grouped by kind or layer
    and every tree reifies exactly one test file
    and every test file reifies exactly one tree
    and every tree names its coverage in parenthesised labelled pairs on the tree-name line, covering the categories src, unit, integration, functional
    and gaps are declared explicitly — "none" for expected-but-uncovered categories, omission for not-applicable ones
    and the EARS rule is embedded in skills that use it
  when a behaviour change is needed
    then the tree must exist before implementation starts
  when implementation reveals new understanding
    then the tree is updated to reflect reality
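
The one-tree-per-test-file rule in the tree above is the kind of thing a script can check. Here is a rough sketch of such a check, not how contree actually does it. The tree-name line format is inferred from the single example above, and `parse_tree_line`, `check_one_to_one`, and the idea that exactly one of unit, integration, or functional holds the test file path are all my assumptions for illustration.

```python
import re

# Assumed tree-name line shape: "<name> (label: value, label: value)".
TREE_LINE = re.compile(r"^(?P<name>[\w-]+) \((?P<pairs>[^)]*)\)$")
CATEGORIES = ("src", "unit", "integration", "functional")
TEST_CATEGORIES = ("unit", "integration", "functional")


def parse_tree_line(line):
    """Return (tree name, coverage dict) or None if this is not a tree-name line."""
    match = TREE_LINE.match(line.strip())
    if match is None:
        return None
    coverage = {}
    for pair in match.group("pairs").split(","):
        label, _, value = pair.partition(":")
        coverage[label.strip()] = value.strip()
    return match.group("name"), coverage


def check_one_to_one(tree_lines, test_files):
    """Return a list of human-readable problems; an empty list means the mapping holds."""
    problems = []
    owner = {}  # test file path -> name of the tree that claims it
    for line in tree_lines:
        parsed = parse_tree_line(line)
        if parsed is None:
            continue  # a when/then/and line, not a tree-name line
        name, coverage = parsed
        unknown = sorted(set(coverage) - set(CATEGORIES))
        if unknown:
            problems.append(f"{name}: unknown categories {unknown}")
        # "none" is an explicitly declared gap, so it does not count as a file.
        paths = [coverage[c] for c in TEST_CATEGORIES
                 if coverage.get(c, "none") != "none"]
        if len(paths) != 1:
            problems.append(f"{name}: expected exactly one test file, found {len(paths)}")
        for path in paths:
            if path in owner:
                problems.append(f"{path}: claimed by both {owner[path]} and {name}")
            owner[path] = name
    for path in sorted(set(test_files) - set(owner)):
        problems.append(f"{path}: test file has no tree")
    for path in sorted(set(owner) - set(test_files)):
        problems.append(f"{path}: tree names a missing test file")
    return problems
```

Feed it the lines of TEST_TREES.md and a listing of the test directory. Any non-empty result is a tree or a test file that has drifted from the other.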

Tight harness improvement loops — April 16, 2026

If you can:
1. Watch the model's thinking stream
2. Detect errors in thinking
3. Correctly intuit in-harness solutions
4. Repeat on a tight loop
You have a big advantage.

We can improve our harnesses very quickly, but this requires great intuition, quick thinking, and an appetite for tight feedback loops. That’s a high bar – too high for many people and perhaps for current coding agents. But will it be too high for Claude 5, or ChatGPT 6?

I can imagine running a skill that analyses Claude Code session transcripts, Git changes, and code to identify systemic problems and make targeted harness improvements.

Trunk Sync for Vibe Coders — April 15, 2026

I realised Trunk Sync could help vibe coders *a lot*. I built it for parallel local agents mixed with a remote agent (that’s my setup lately) but it works well for many situations, including vibe coders who are super confused about Git.

Many people have their coding agents managing Git, which is a waste of tokens and time. It’s also dangerous!

If you know a vibe coder, *tell them about Trunk Sync*.

AI and “Brittle” Code — April 13, 2026

Your project is brittle because your coding agent knows too much!

At some point coding agents began gathering huge amounts of context. This reintroduced a wonderful human failure mode I wrote about here. At the time I called it behavioural coupling but in AI dev people call it “brittleness”.

When a coding agent (or any programmer!) can see beyond a unit’s declared interfaces, it will sometimes make implementation choices that implicitly and invisibly depend on another part of the system behaving just so. Eventually you can’t change anything without something else breaking.

Code Mak Naa is a 2-year-old research project that I polished up because its context management demonstrates a solution to brittleness that I wanted to share. (Visit here to see its other interesting ideas.)

Code is generated only with knowledge of what is strictly guaranteed, and then discarded when those guarantees change. Anything that is not guaranteed is unknowable to the generator, such that no relevant coupling is possible.

Concisely: The context boundary of the generator must match the invalidation boundary of the unit being generated.
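
One way to picture "discarded when those guarantees change" is a cache keyed by a fingerprint of whatever the generator was allowed to see. This is only a sketch of the idea, not Code Mak Naa's actual code. `fingerprint` and `GeneratedUnitCache` are hypothetical names, and representing the guarantees as a JSON-serialisable dict is my assumption.

```python
import hashlib
import json


def fingerprint(guarantees):
    """Stable hash of everything the generator was allowed to see (assumed JSON-serialisable)."""
    canonical = json.dumps(guarantees, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class GeneratedUnitCache:
    """Keeps generated code only while the guarantees it was built on are unchanged."""

    def __init__(self):
        self._entries = {}  # unit name -> (fingerprint, generated code)

    def put(self, unit, guarantees, code):
        self._entries[unit] = (fingerprint(guarantees), code)

    def get(self, unit, guarantees):
        entry = self._entries.get(unit)
        if entry is None:
            return None
        stored_fingerprint, code = entry
        if stored_fingerprint != fingerprint(guarantees):
            # A guarantee changed, so the old code may rely on something
            # that is no longer true: throw it away and regenerate.
            del self._entries[unit]
            return None
        return code
```

Because the fingerprint covers exactly what the generator saw, nothing outside that context can invalidate the unit, and nothing inside it can change without invalidating it.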

Code Mak Naa achieves this using outside-in graph traversal. Generators for each function only see what is guaranteed by the harness, working outside-in from a partial design, namely:

  • Their own specification.
  • Interfaces of any already-specified children (but not their implementations).
  • Implementations of any ancestors (the chain of functions that will actually consume them).
  • Current database schema.

This approach worked reliably all the way back at Claude 3.5 pre-reasoning, because the context boundary of the generator matches the invalidation boundary of the unit being generated.
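
The visibility rules listed above can be sketched as a small context builder. This is a toy illustration, not the project's real data model: `Unit`, `build_context`, and the field names are hypothetical, and I'm assuming "already-specified" simply means the child has a non-empty specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    name: str
    specification: str              # empty string means "not specified yet"
    interface: str                  # the declared signature only
    implementation: Optional[str] = None
    parent: Optional["Unit"] = None
    children: List["Unit"] = field(default_factory=list)


def build_context(unit, db_schema):
    """Collect only what the generator for `unit` is allowed to see."""
    ancestor_implementations = []
    node = unit.parent
    while node is not None:  # walk the chain of functions that will consume this unit
        if node.implementation is not None:
            ancestor_implementations.append(node.implementation)
        node = node.parent
    return {
        "specification": unit.specification,
        # Interfaces of specified children, never their implementations.
        "child_interfaces": [c.interface for c in unit.children if c.specification],
        "ancestor_implementations": ancestor_implementations,
        "db_schema": db_schema,
    }
```

Note what is missing: sibling units, child implementations, and anything else in the repository. The generator cannot couple to what it never sees.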

Real example of this approach to context management here.

Milla, MemPalace, and Gralkor — April 8, 2026

Like I’ve always said, more people should be comparing me to Milla Jovovich 😂

Ask your agent to compare MemPalace and Gralkor: “MemPalace https://lnkd.in/gqFDyAKK vs Gralkor https://lnkd.in/gM52faGe” (it is not close)

If you’re in these spaces, you are used to seeing several crazy new memory projects every day. People are diving in with their own experience, creativity, and quite often a lot of LLM-egged-on pseudoscience 😅 There are many good ideas, but you never see those projects because their creators are not usually celebrities.

Gralkor is not one of those projects. It incorporates my own experience and creativity, but critically it also leverages the latest research, builds thoughtfully atop prior work, and thereby pushes the frontier of agentic memory forward.

Put yourself in your software — April 7, 2026

I’ve mentioned my move away from autonomous software factories elsewhere.

You need to control the behaviour of the software you build right to the edges. It’s less about initial build speed and more about the long tail of adapting software to the market, and extending it in response to reality.

Also, you probably need to build software at a level of abstraction low enough to invest something unique about yourself and your business into it, otherwise you will lose to factories making “slop” – software built from a generic spec (perhaps very well!) and differentiated by nothing.

Gralkor — April 6, 2026

I needed a better memory plugin for OpenClaw, so I made one – Gralkor (https://lnkd.in/gQyn2HTA)

I don’t mean better than the default, I mean better than the top OpenClaw memory plugins.

I started with the best open source, temporally-aware memory available – Graphiti (https://lnkd.in/gpRn5SXC). I’ve worked with many graph and vector memory systems and Graphiti still amazes me. Graphiti’s strengths are perfect for a long-running personal agent – I really appreciate Zep sharing it with us.

On top of Graphiti, I’ve put a lot of myself and the latest research into Gralkor.

I was quite surprised at how other memory plugins work. Typically they just capture individual question and answer pairs – not much to extract context from! What about ideas that come together slowly over the course of a whole conversation?

Instead, I learned heaps about OpenClaw’s hooks and figured out how to ingest whole episodes that make sense as tasks and conversations. More context, richer extraction, deeper understanding.

Did you know that most memory plugins for OpenClaw only remember dialog? When your agent tells you it did this or that last week, it doesn’t remember doing it – it remembers saying it did. Ask how, and it will extrapolate confidently, and the error compounds in memory. Your agents mostly don’t remember what they thought either, including how they solved their last problem – I sure couldn’t work under those conditions!

Instead, I built a distillation process to ingest thoughts and actions in context with dialog, tuning for the highest fidelity possible without crowding the graph with tool call parameters.
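
To make the idea concrete, here is a toy sketch of distilling one episode so that thoughts and actions stay in order with the dialog while tool call parameters are dropped. This is not Gralkor's actual distillation, and it does not use OpenClaw's or Graphiti's APIs. The `Event` shape and `distill_episode` are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    kind: str                      # "user", "assistant", "thought", or "tool_call"
    text: str = ""
    tool: str = ""                 # tool name, only for "tool_call" events
    params: Optional[dict] = None  # deliberately never copied into the episode


def distill_episode(events):
    """Flatten a whole episode into text, keeping thoughts and actions next to dialog."""
    lines = []
    for event in events:
        if event.kind in ("user", "assistant"):
            lines.append(f"{event.kind}: {event.text}")
        elif event.kind == "thought":
            lines.append(f"assistant (thinking): {event.text}")
        elif event.kind == "tool_call":
            # Record that the action happened, not its parameters.
            lines.append(f"assistant (action): used {event.tool}")
    return "\n".join(lines)
```

The output is one episode rather than a pile of question and answer pairs, so the extractor sees what was done and why, not just what was said.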

Gralkor provides a simple platform to experiment with memory consolidation and learning. You’ve got cron, just add Thinker CLI and Gralkor to start your quest for recursive self-improvement. We can learn together – ask me for my reflection cron! This is showing up in research a lot now as ERL.

Finally, custom ontologies! You can define your own entities and relationships, using a configuration scheme designed for accurate classification.

You could focus on standard domain language, or structure your agent’s memory around your model of the world. This is another idea starting to come up in research.

So, enjoy Gralkor (https://lnkd.in/g79xCK2V). Star it, let me know what you think, tell your friends – all those nice things. Great trees need strong roots.

Coding agents as quality of life enablers — April 5, 2026

I’m really looking forward to an increase in software quality as every developer realises they can finally have the testing, refactoring, and other engineering practices they’ve been fighting for.

I also expect a great increase in quality-of-life improvements as businesses run out of ideas / operational capacity / marketing spend. We will have to fight harder for our users.

Harnesses to engineer harness engineers that engineer harnesses — April 4, 2026

Our harness engineering days are limited because – of course – it is not too difficult to automate. I don’t build project harnesses anymore – I write one Claude user plugin that harness-engineers for a living.

The harness engineer in turn has a harness which functionally tests how well it engineers harnesses. The title of this post is a real thing that I just made, and I can even have it self-improve on a loop.

Ultimately this will be less about dev and more about how fast we can figure out what to make, how to measure it, and how to sell it. If you’re harness engineering now, start reaching *up* into product management and strategy, *in* to the business, and *out* to your customers.

Coding agents as quality enablers — April 3, 2026

Incredible that I’m only now doing mutation testing regularly. If you code with AI, you should be automatically mutation testing your code and killing mutants. It’s crazy how much leverage AI is giving existing techniques like this.
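
If you haven't seen mutation testing before, here is a minimal hand-rolled sketch of the idea using only Python's standard `ast` module. Real tools generate many mutants automatically; this shows a single boundary mutant. The function under test and the two suites are made-up examples.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGreaterEqual(ast.NodeTransformer):
    """Mutation operator: turn every `>=` into `>` (a classic boundary mutant)."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its `is_adult` function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Never probes the boundary, so it cannot tell `>=` from `>`.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Adds the boundary case that kills the mutant.
    return weak_suite(fn) and fn(18) is True


def mutant_survives(suite):
    """True if the suite still passes against the mutated code."""
    original = load(ast.parse(SOURCE))
    assert suite(original), "suite must pass on the original code first"
    mutated_tree = FlipGreaterEqual().visit(ast.parse(SOURCE))
    mutant = load(ast.fix_missing_locations(mutated_tree))
    return suite(mutant)
```

A surviving mutant means a test is missing: `mutant_survives(weak_suite)` is true, while `mutant_survives(strong_suite)` is false because the boundary case kills it.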

People are worried about software quality with coding agents, but they can already produce great code with the right operator. Later, what we’re doing will be reciprocally trained into the models and facilitated by the harnesses so that _everybody_ can produce great code.