Notes from the dead letter box - Specifying Software for Teams and Agents

June 30, 2026 · 12 min read

Head of Intergalactic Mischief

How to describe an application?

Imagine the following: The dev team receives the envelope with the application specs from a dead letter box in the park at night. Two months later, they deliver the code. Their client deploys their app, and everything works as expected.

The question: What was in that envelope?

Product managers, engineers, and procurement teams have all tried to deliver this perfect blueprint for a software product. The ones that did not fail have mostly succeeded by sheer luck. After decades of engineering, there is still no commonly accepted standard for software requirements that guarantees the desired outcome when followed. There is ISO/IEC/IEEE 29148, but it carefully avoids specifying the how: it tells you a requirement must be unambiguous and verifiable, but not how to express it so that it actually is.

We faced this problem while working on Surveyor, a tool to extract specifications from running (legacy) systems: what is a good format for describing a system once you have discovered its features?

Surveyor is currently in early access testing (you can take part too), and we have already learned a lot during the first sessions applying it to other people's legacy software. This is a good moment to share what we have found out, and to ask for feedback. We have run analytics on our beta testers’ codebases, and here is what has worked.

The Premise

Our clandestine team needs freedom of choice. We don't want to force them to use TINYINT if their favorite database does not support it. If an implementation detail is important in a way that does not become clear from the context, we need to specify it in the tech choices, or describe the observable outcome in a spec. Otherwise, we should aim to specify independently of technology and solution. Here are our picks for doing that.

Architectural overview

The C4 model is a way to visualize software architecture at four levels of abstraction, created by Simon Brown. The "C4" refers to its four diagram types, each zooming in further than the last:

System Context shows your system as a single box surrounded by the users and external systems it interacts with. It answers "what does this system do and who uses it" without any internal detail.
Containers zooms into the system to show its major deployable or runnable pieces, like web apps, mobile apps, databases, APIs, and file systems, along with how they communicate. Here "container" means a separately running unit, not specifically a Docker container.
Components breaks a single container down into its main building blocks and their responsibilities, showing how the code is organized into logical groupings within that container.
Code is the most detailed level, but it is rarely drawn by hand; if it is used at all, it is usually generated from the code itself.

The core idea is that you start broad and progressively zoom in, so different audiences can engage at the level of detail that suits them, much like zooming in on a map.

Our beta testers have found these diagrams most helpful. Even engineers who have worked with their codebase for years welcomed this organised visualisation. Surveyor generates C4 diagrams in the form of Structurizr DSL.

This serves two purposes:

It offers a standard for generating charts in a well-known, systematic notation
It can serve as an input for subsequent steps
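
To make the second point concrete, here is a minimal Python sketch of how a discovered structure could be rendered as Structurizr DSL. The `System` and `Container` classes, the identifiers (`shop`, `api`, `db`) and the `to_structurizr` helper are made up for illustration; this is not how Surveyor itself is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    ident: str            # DSL identifier, e.g. "api" (hypothetical)
    name: str             # display name shown in the diagram
    technology: str = ""


@dataclass
class System:
    ident: str
    name: str
    containers: list[Container] = field(default_factory=list)
    # (source identifier, target identifier, description)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)


def to_structurizr(system: System) -> str:
    """Render the system as a Structurizr DSL workspace with one container view."""
    lines = ["workspace {", "    model {"]
    lines.append(f'        {system.ident} = softwareSystem "{system.name}" {{')
    for c in system.containers:
        # DSL form: container <name> [description] [technology]
        lines.append(f'            {c.ident} = container "{c.name}" "" "{c.technology}"')
    lines.append("        }")
    for src, dst, desc in system.relationships:
        lines.append(f'        {src} -> {dst} "{desc}"')
    lines.append("    }")
    lines += [
        "    views {",
        f"        container {system.ident} {{",
        "            include *",
        "            autolayout lr",
        "        }",
        "    }",
        "}",
    ]
    return "\n".join(lines)


shop = System(
    ident="shop",
    name="Shop",
    containers=[
        Container("api", "API", "Elixir"),
        Container("db", "Database", "PostgreSQL"),
    ],
    relationships=[("api", "db", "Reads and writes orders")],
)
print(to_structurizr(shop))
```

Because the output is plain text in a documented grammar, later steps can read it back without any knowledge of how it was produced.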

C the fourth

The fourth C is “Code”, but it’s not often used as code changes are frequent. Re-aligning UML diagrams each time the code changes adds friction we don’t want. Instead, we leave the diagram level and move on to describing the behavior of the system.

Describing Behavior

In 2006, a movement called Behaviour-Driven Development (BDD) emerged. The idea was to write specifications in natural language that could also be parsed and executed as tests. This enabled product teams to create specifications that engineers could use directly as automated tests. The engineers would start by creating the bindings between feature language and test code. The implementation would begin with a failing test (red), which the actual feature code would then make pass (green).

Gherkin is a plain-text language for writing executable specifications in a Given-When-Then format; Cucumber is the framework that maps these specifications to code (step definitions) so they run as automated tests.

A feature like the following:

Feature: Monster battle

  Scenario: Battle
    Given there is a monster
    When I attack it
    Then it should die

Would map to a binding like this:

# test/features/step_definitions/monster_steps.exs
defmodule MonsterSteps do
  use Cucumber.StepDefinition
  import ExUnit.Assertions

  step "there is a monster", context do
    Map.put(context, :monster, Monster.new())
  end

  step "I attack it", context do
    Map.put(context, :monster, Monster.take_hit(context.monster))
  end

  step "it should die", context do
    refute context.monster.alive?
    context
  end
end

BDD’s popularity peaked in the early 2010s, when teams worldwide embraced it and equivalents appeared in most major languages: Behave and pytest-bdd for Python, SpecFlow/Reqnroll for .NET, Behat for PHP, and Godog for Go.

Tooling is broad too, including IDE plugins for IntelliJ, Visual Studio, and VS Code offering syntax highlighting and step navigation, plus reporting tools and CI/CD integrations.

Interest cooled some years later because developers grew tired of maintaining large amounts of BDD code and bindings. Luckily, we can now leave that part to the LLMs.

The Case for BDD

We are revisiting BDD today because AI-assisted coding has a problem: plans and free formats like OpenSpec are not machine-parsable. That means they cannot be translated into automated tests directly. You need an agent step to build the tests, with all the uncertainty agentic coding brings.

The relatively rigid syntax of Gherkin is an advantage here: It maps mechanically to test code. And it can also be parsed to produce other output formats: Manual QA instructions, OpenSpec or Jira tickets.
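
As an illustration of that second point, the following Python sketch parses a small Gherkin subset (one Feature, plain Scenarios, Given/When/Then/And/But steps) and renders it as manual QA instructions. The function names and the "Set up / Do / Check" wording are our own assumptions for this example; a real tool would rely on a complete Gherkin parser that also handles backgrounds, outlines, tags and doc strings.

```python
import re

STEP = re.compile(r"^(Given|When|Then|And|But)\s+(.*)$")


def parse_feature(text: str) -> dict:
    """Parse a small Gherkin subset: one Feature, its Scenarios, and their steps."""
    feature = {"name": "", "scenarios": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("Feature:"):
            feature["name"] = line[len("Feature:"):].strip()
        elif line.startswith("Scenario:"):
            feature["scenarios"].append(
                {"name": line[len("Scenario:"):].strip(), "steps": []}
            )
        else:
            match = STEP.match(line)
            if match and feature["scenarios"]:
                feature["scenarios"][-1]["steps"].append(match.groups())
    return feature


def to_qa_checklist(feature: dict) -> str:
    """Render the parsed feature as numbered manual QA instructions."""
    labels = {"Given": "Set up", "When": "Do", "Then": "Check"}
    out = [f"QA checklist: {feature['name']}"]
    for scenario in feature["scenarios"]:
        out.append(f"Scenario: {scenario['name']}")
        label = "Do"
        for number, (keyword, text) in enumerate(scenario["steps"], start=1):
            # And/But inherit the meaning of the preceding keyword
            label = labels.get(keyword, label)
            out.append(f"  {number}. {label}: {text}")
    return "\n".join(out)


FEATURE = """\
Feature: Monster battle
  Scenario: Battle
    Given there is a monster
    When I attack it
    Then it should die
"""

print(to_qa_checklist(parse_feature(FEATURE)))
```

The same parsed structure could just as well feed a ticket template or a test skeleton; the point is that no agent is needed for this step.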

But for that, it’s missing the meta information that embeds a feature description into its application context.

The Assay format

We have used the C4/Structurizr DSL to express how our system interfaces with the outside world. We also used it to describe the inner composition of our application. With Gherkin alone, this context is lost. But we can preserve it with syntax-aware comments like this:

# ---
# component: orderLifecycle
# container: api
# schema: schemas/order_lifecycle.tsp
# workspace: architecture/workspace.dsl
# definitions:
#   a valid customer:
#     a Customer with status Active, verified email,
#     and a credit limit greater than zero
#   in stock:
#     the product has available quantity greater than
#     the requested amount
# invariants:
#   - total must not exceed customer credit limit
#   - at least one line item required
#   - cannot cancel an order that has shipped
# ---

With this, we can look up context information in the Structurizr DSL. This allows us to write parsers and plugins for coding agents that can not only read the features, but also access the context we gathered before. But what is the purpose of the schema: entry?
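
A parser for such a header can be small. The sketch below is a deliberately limited Python illustration: it strips the comment prefix and reads only top-level `key: value` entries and `- item` lists, skipping nested mappings such as `definitions`. The function names are hypothetical; since the uncommented block reads as YAML, a real plugin would more likely hand it to a YAML parser.

```python
def extract_header(feature_text: str) -> list[str]:
    """Return the uncommented lines between the two '# ---' markers."""
    body, inside = [], False
    for line in feature_text.splitlines():
        stripped = line.strip()
        if stripped == "# ---":
            if inside:
                return body
            inside = True
        elif inside and stripped.startswith("#"):
            # drop '#' and at most one following space, keep the indentation
            content = stripped[1:]
            body.append(content[1:] if content.startswith(" ") else content)
    return body


def parse_header(lines: list[str]) -> dict:
    """Parse top-level scalars and lists; nested mappings are left as None."""
    meta: dict = {}
    current_key = None
    for line in lines:
        if not line.strip():
            continue
        if not line.startswith(" "):
            key, _, value = line.partition(":")
            current_key = key.strip()
            meta[current_key] = value.strip() or None
        elif current_key is not None and line.strip().startswith("- "):
            items = meta[current_key] if isinstance(meta[current_key], list) else []
            items.append(line.strip()[2:])
            meta[current_key] = items
    return meta


SAMPLE = """\
# ---
# component: orderLifecycle
# container: api
# invariants:
#   - at least one line item required
#   - cannot cancel an order that has shipped
# ---
Feature: Order lifecycle
"""

meta = parse_header(extract_header(SAMPLE))
print(meta["component"], meta["container"], meta["invariants"])
```

With `component` and `container` in hand, a tool can look up the matching element in the architecture workspace and pass only that slice of context to an agent.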

Schema Definitions

Schemas work as a shared contract about the business objects of a component. Having them speeds up onboarding: developers don't need to scan through all features to find out which fields are needed, and agents are less likely to invent fields. It's a token-efficient way to inject exactly the ground truth into context.

For specifying schemas, three obvious candidates come to mind:

Protobuf — https://protobuf.dev — Google's schema and wire format. Compact and fast, with code generators across many languages, but its scalar types are physical commitments, so it presumes the most about storage and encoding.
```
syntax = "proto3";

message Post {
  string title = 1;
  int32 views = 2;
  bool published = 3;
  repeated string tags = 4;
}
```

JSON Schema — https://json-schema.org — A decade-plus standard for describing the shape and constraints of JSON. Stays loose where you want, needs no toolchain for consumers, and fits the premise most directly; the cost is verbosity. It can be authored in YAML and converted to JSON:

$schema: "https://json-schema.org/draft/2020-12/schema"
$id: "https://assay.example/post"
title: Post
type: object
required: [title]
additionalProperties: false
properties:
  title: { type: string }
  views: { type: integer }
  published: { type: boolean }
  tags:
    type: array
    items: { type: string }

TypeSpec — https://typespec.io — Microsoft's TypeScript-like language that compiles to JSON Schema, OpenAPI, and Protobuf from one source. Compact and readable, at the price of a build step and a single-vendor toolchain.
```
model Post {
  title: string;
  views?: integer;
  published?: boolean;
  tags?: string[];
}
```

Under the premise that the implementing team picks the field types and specifics, the three differ in how much they presume. Protobuf fits worst: its scalar types (int32, bytes) are physical commitments that pre-decide the storage details you meant to leave open. JSON Schema is the opposite: it describes shape and constraints without binding to any storage type, and hands the team a directly usable artifact with no toolchain; its cost is verbosity and $ref-wired files. For our purpose, TypeSpec hits the sweet spot: a compact, TypeScript-like authoring format that compiles to JSON Schema (or Protobuf later, if needed), at the price of a build step and a younger, single-vendor toolchain. That price matters little in this context, because we use the schema for documentation rather than for compilation, while keeping the option to cross-compile later.
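
To show what "less likely to invent fields" can mean in practice, here is a hand-rolled Python check against the Post shape from the examples above. It covers only the handful of keywords used there (`required`, `additionalProperties`, `type`, `items`) and is not a JSON Schema implementation; the `check` function is our own illustration, and a real project would use an established validator.

```python
POST_SCHEMA = {
    "type": "object",
    "required": ["title"],
    "additionalProperties": False,
    "properties": {
        "title": {"type": "string"},
        "views": {"type": "integer"},
        "published": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool, "array": list}


def check(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable violations; an empty list means the record fits."""
    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, value in record.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unknown field: {name}")
            continue
        expected = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for integers explicitly
        ok = isinstance(value, PYTHON_TYPES[expected]) and not (
            expected == "integer" and isinstance(value, bool)
        )
        if not ok:
            errors.append(f"{name}: expected {expected}")
        elif expected == "array":
            item_type = PYTHON_TYPES[props[name]["items"]["type"]]
            if not all(isinstance(item, item_type) for item in value):
                errors.append(f"{name}: items must be {props[name]['items']['type']}")
    return errors


print(check({"title": "Hello", "views": 3}, POST_SCHEMA))
print(check({"title": "Hello", "author": "someone"}, POST_SCHEMA))
```

The second call reports the invented `author` field, which is exactly the kind of drift a shared schema is meant to catch early.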

Architectural Decision

Until this point, we have avoided describing the “how” of our application: which database, which web framework, what data model. This is what Markdown Architectural Decision Records (MADR) cover. Their goal is to document not only the technical decision, but also which options were considered and what the problem statement was. Because MADRs contain that background, they allow a team to reconsider decisions effectively once the circumstances have changed.

They typically look like this:

## **Use PostgreSQL for primary data store**

* Status: accepted
* Date: 2026-06-29

### **Context**

We need a primary database for a new service with relational data and strong
consistency requirements.

### **Considered Options**

* PostgreSQL
* MongoDB

### **Decision**

Use **PostgreSQL** — the team has operational experience with it, and it
provides ACID transactions plus JSONB for flexible fields. MongoDB was rejected
due to weaker consistency guarantees and less team familiarity.

### **Consequences**

* Good: mature ecosystem, existing team expertise
* Bad: harder to scale horizontally than MongoDB

Non Functional Requirements

By now we have described our application well in terms of how it is structured, what it does, and what tools it uses (or should use). But we left out the quality of service.

A non-functional requirement (NFR) specifies how well a system must operate, as opposed to what it does (functional requirements). Where a functional requirement says "users can book an appointment," an NFR constrains the qualities of that behavior — performance, scalability, availability, reliability, security, maintainability, usability, and so on.

They are often written in a simple table format that pairs each functional requirement with its non-functional counterpart:

| Quality attribute | Functional requirement (what) | Non-functional requirement (how well) |
| --- | --- | --- |
| Performance | User searches for available slots | Results return in under 500 ms for 95% of requests |
| Scalability | Users book appointments | System handles 1,000 concurrent bookings without degradation |
| Availability | The booking system is accessible | 99.9% uptime, measured monthly (≈43 min downtime/month) |
| Reliability | A confirmed booking is saved | Zero confirmed bookings lost; durable before confirmation returns |
| Security | Users log in to their account | Passwords stored hashed; data encrypted in transit (TLS) and at rest |
| Usability | User completes a booking | A new user can book without instructions in under 2 minutes |

Each row shows the same feature twice: the middle column is what the system does, and the right column is the measurable constraint on how well it must do it, which is the NFR.
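
Because these constraints are measurable, they can be checked with a few lines of arithmetic. The Python sketch below covers the Performance and Availability rows; the nearest-rank percentile method, the 30-day month and the sample latencies are assumptions made for this example.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Allowed downtime over a period of `days` days at the given availability."""
    return (1 - availability_pct / 100) * days * 24 * 60


# Made-up search latencies in milliseconds, including one slow outlier
latencies_ms = [120, 180, 200, 210, 230, 250, 260, 300, 320, 340,
                350, 360, 380, 400, 410, 420, 450, 470, 480, 900]

p95 = percentile(latencies_ms, 95)
print(f"p95 = {p95} ms, meets the 500 ms target: {p95 < 500}")
print(f"99.9% uptime allows about {downtime_budget_minutes(99.9):.1f} min of downtime per 30 days")
```

Note how the single 900 ms outlier does not break the 95% target, which is why percentile-based wording is more useful than "always under 500 ms".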

The End-Boss: Design

Specifying UI is a daunting task because it happens at the end of the rendering pipeline, where the number of moving parts is highest. Most tools for the web therefore use web technologies themselves to avoid drift.

The Image

The simplest (and arguably the worst) medium for specifying design is the image:

Then the login page should match "fixtures/login.png"

This allows features to specify design alongside the functional feature requirements.

The code

If the frontend code uses components, and maybe even a component library, it might be worth carrying over. A connection to a design system in Figma or similar is the gold standard here, but might violate our "letter in envelope" constraint. Our go-to strategy at bitcrowd is Storybook. If the target system supports it, teams can work very effectively with it.

Visual Regression Testing

If you happen to have a legacy system with a decent design, visual regression testing with Playwright or similar might be your tool of choice. It gives teams a quick feedback loop when it comes to single scenarios, and the option to recheck the whole interface. The great thing in combination with Gherkin is that you can switch screenshot processing on and off. The difficulty with visual regression tests is their potential flakiness. However, image processing models have made this a lot more reliable.
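
The flakiness usually comes from comparing screenshots too strictly. As a simple illustration (plain tolerances, not the model-based processing mentioned above), the following Python sketch treats two images as matching when only a small fraction of pixels differs noticeably. The in-memory image representation, the function names and both thresholds are assumptions; real tools such as Playwright ship their own comparison options.

```python
Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def diff_ratio(expected: Image, actual: Image, channel_tolerance: int = 8) -> float:
    """Fraction of pixels that differ by more than the tolerance in any colour channel."""
    if len(expected) != len(actual) or any(
        len(row_e) != len(row_a) for row_e, row_a in zip(expected, actual)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in expected)
    changed = sum(
        1
        for row_e, row_a in zip(expected, actual)
        for px_e, px_a in zip(row_e, row_a)
        if any(abs(a - b) > channel_tolerance for a, b in zip(px_e, px_a))
    )
    return changed / total if total else 0.0


def matches(expected: Image, actual: Image, max_ratio: float = 0.01) -> bool:
    """True if at most max_ratio of the pixels changed noticeably."""
    return diff_ratio(expected, actual) <= max_ratio


# A 10x10 white baseline and a copy with slight anti-aliasing noise in one pixel
white: Pixel = (255, 255, 255)
baseline: Image = [[white] * 10 for _ in range(10)]
noisy: Image = [row[:] for row in baseline]
noisy[0][0] = (250, 250, 250)

print(matches(baseline, noisy))
```

The two thresholds separate rendering noise (small per-channel shifts) from real layout changes (many pixels changing at once), which is where most false alarms come from.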

So, what is in that envelope?

So far we have C4/Structurizr diagrams for how the system is built and how it interacts, Gherkin/Assay for its behaviour, TypeSpec for schemas, MADR for the technology choices, and NFRs for its quality of service. We go for Storybook for the visuals. Obviously, this is not the final solution for every stack. It would be hard to describe a pacemaker OS with it, or a web application with pristine UI effects. The final version of Surveyor will use plugins to cater for that, but for many of the projects bitcrowd has encountered, this gives a good start.

Fortunately, we have clients who enjoy talking to us, so we haven’t had to resort to espionage technology - yet. Let us know if you have some clandestine system you need to build in the dark, or a legacy application that needs a rewrite.


Christoph Beck
