
Notes from the dead letter box - Specifying Software for Teams and Agents



How to describe an application?

Imagine the following: The dev team receives the envelope with the application specs from a dead letter box in the park at night. Two months later, they deliver the code. Their client deploys their app, and everything works as expected.

The question: What was in that envelope?

Product managers, engineers, and procurement teams have tried to deliver this perfect blueprint for a software product. The ones that did not fail have mostly succeeded by sheer luck. After decades of engineering, there is still no commonly accepted standard for software requirements that guarantees the desired outcome when followed. There is ISO/IEC/IEEE 29148, but it carefully avoids specifying the how: it tells you a requirement must be unambiguous and verifiable, but not how to express it so that it actually is.

We faced this problem while working on Surveyor, a tool that extracts specifications from running (legacy) systems: what is a good format for describing a system once you have discovered its features?

Surveyor is currently in early access testing, and we have already learned a lot during the first sessions applying it to other people’s legacy software. This is a good moment to share what we have found out, and to ask for feedback. We have run analytics on our beta testers’ codebases, and here is what has worked:

Architectural overview

The C4 model is a way to visualize software architecture at four levels of abstraction, created by Simon Brown. The "C4" refers to its four diagram types, each zooming in further than the last:

  • System Context shows your system as a single box surrounded by the users and external systems it interacts with. It answers "what does this system do and who uses it" without any internal detail.
  • Containers zooms into the system to show its major deployable or runnable pieces, like web apps, mobile apps, databases, APIs, and file systems, along with how they communicate. Here "container" means a separately running unit, not specifically a Docker container.
  • Components breaks a single container down into its main building blocks and their responsibilities, showing how the code is organized into logical groupings within that container.
  • Code is the most detailed level. It is rarely drawn by hand and, where it is needed at all, is usually generated from the code itself.

The core idea is that you start broad and progressively zoom in, so different audiences can engage at the level of detail that suits them, much like zooming in on a map.

Our beta testers have found these diagrams most helpful. Even engineers who have worked with their codebase for years welcomed this organised visualisation. Surveyor generates C4 diagrams in the Structurizr DSL (structurizr.com).

This serves two purposes:

  1. It offers a standard way to generate charts with a well-known, systematic structure
  2. It can serve as an input for subsequent steps
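
For illustration, a workspace in that DSL might look like the following sketch. The system, its containers and all descriptions are made up for this example and are not Surveyor output. It only shows how the first three C4 levels nest inside one file:

```
workspace "Shop" "Hypothetical example workspace" {

    model {
        customer = person "Customer"

        shop = softwareSystem "Shop" {
            api = container "API" "Accepts and manages orders" "Elixir" {
                orderLifecycle = component "Order Lifecycle" "Creates, updates and cancels orders"
            }
            db = container "Database" "Stores orders" "PostgreSQL"
        }

        customer -> api "Places orders using"
        api -> db "Reads from and writes to"
    }

    views {
        systemContext shop "Context" {
            include *
            autolayout lr
        }
        container shop "Containers" {
            include *
            autolayout lr
        }
        component api "Components" {
            include *
            autolayout lr
        }
    }
}
```

Because the identifiers (`api`, `orderLifecycle`) are plain text, later steps can refer to them by name.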

C the fourth

The fourth C is “Code”, but it is rarely used because code changes frequently. Re-aligning UML diagrams each time the code changes adds friction we don’t want. Instead, we leave the diagram level and move on to describing the behavior of the system.

Describing Behavior

In 2006, a movement called Behaviour-Driven Development (BDD) emerged. The idea was to write specifications in natural language that could also be parsed and executed as tests. This enabled product teams to create specifications that engineers could use directly as automated tests. The engineers would start by creating the bindings between feature language and test code. The implementation would begin with a failing test (red), which the actual feature code would then make pass (green).

Gherkin is a plain-text language for writing executable specifications in a Given-When-Then format. Cucumber is the framework that maps these specifications to code (step definitions) so they run as automated tests.

A feature like the following:

Feature: Monster battle

  Scenario: Battle
    Given there is a monster
    When I attack it
    Then it should die

would map to a binding like this:

# test/features/step_definitions/monster_steps.exs
defmodule MonsterSteps do
  use Cucumber.StepDefinition
  import ExUnit.Assertions

  step "there is a monster", context do
    Map.put(context, :monster, Monster.new())
  end

  step "I attack it", context do
    Map.put(context, :monster, Monster.take_hit(context.monster))
  end

  step "it should die", context do
    refute context.monster.alive?
    context
  end
end
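
To complete the red/green cycle, these steps need a `Monster` module. The article does not show one, so the following is a hypothetical minimal version, assuming a monster dies from a single hit. It is just enough to turn the scenario green:

```elixir
defmodule Monster do
  # Hypothetical struct: a monster only knows whether it is alive.
  defstruct alive?: true

  # A freshly created monster is alive.
  def new, do: %__MODULE__{}

  # Assumption for this sketch: any hit is fatal.
  def take_hit(%__MODULE__{} = monster), do: %__MODULE__{monster | alive?: false}
end
```

With only `new/0` defined, the last step would fail first (red); adding `take_hit/1` makes it pass (green).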

BDD’s peak popularity came in the early 2010s, a period when teams worldwide embraced it and equivalents appeared in every major language with implementations like Behave and pytest-bdd for Python, SpecFlow/Reqnroll for .NET, Behat for PHP, and Godog for Go.

Tooling is broad too, including IDE plugins for IntelliJ, Visual Studio, and VS Code offering syntax highlighting and step navigation, plus reporting tools and CI/CD integrations.

Interest cooled a few years later because developers grew tired of maintaining large chunks of BDD code and bindings. Luckily, we can leave that part to the LLMs now.

The Case for BDD

We are revisiting BDD today because AI-assisted coding has a problem: plans and free formats like OpenSpec are not parsable. That means they cannot be translated to automated tests directly. You need an agent step to build the tests, with all the uncertainty that agent coding brings.

The relatively rigid syntax of Gherkin is an advantage here: It maps mechanically to test code. And it can also be parsed to produce other output formats: Manual QA instructions, OpenSpec or Jira tickets.
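
As a rough illustration of that second use, here is a sketch that turns scenarios into a numbered manual QA checklist. The module name `QaChecklist` and the output wording are assumptions made for this example. It handles only `Scenario:` lines and Given/When/Then/And/But steps, not the full Gherkin grammar (no backgrounds, outlines, tables or tags):

```elixir
defmodule QaChecklist do
  # Returns a list of checklist lines for the given feature text.
  def render(feature_text) do
    feature_text
    |> String.split("\n")
    |> Enum.map(&String.trim/1)
    |> Enum.reduce({[], 0}, &render_line/2)
    |> elem(0)
    |> Enum.reverse()
  end

  # A new scenario starts a new test case and resets the step counter.
  defp render_line("Scenario:" <> name, {lines, _step_count}) do
    {["Test case: " <> String.trim(name) | lines], 0}
  end

  # Step lines become numbered instructions; everything else is skipped.
  defp render_line(line, {lines, step_count}) do
    case Regex.run(~r/^(Given|When|Then|And|But)\s+(.+)$/, line) do
      [_, keyword, text] ->
        {["  #{step_count + 1}. #{label(keyword)} #{text}" | lines], step_count + 1}

      nil ->
        {lines, step_count}
    end
  end

  defp label("Given"), do: "Setup:"
  defp label("When"), do: "Action:"
  defp label("Then"), do: "Expect:"
  defp label(_and_or_but), do: "And:"
end
```

Fed the monster feature from above, `QaChecklist.render/1` yields one "Test case: Battle" line followed by three numbered steps. The same walk over the lines could emit ticket text instead.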

But for that, it’s missing the meta information that embeds a feature description into its application context.

The Assay format

We have used the C4/Structurizr DSL to express how our system interfaces with the outside world. We also used it to describe the inner composition of our application. When we describe behavior in Gherkin, this context is lost. We add it back to Gherkin in syntax-aware comments like this:

# ---
# component: orderLifecycle
# container: api
# schema: schemas/order_lifecycle.ex
# workspace: architecture/workspace.dsl
# definitions:
#   a valid customer:
#     a Customer with status Active, verified email,
#     and a credit limit greater than zero
#   in stock:
#     the product has available quantity greater than
#     the requested amount
# invariants:
#   - total must not exceed customer credit limit
#   - at least one line item required
#   - cannot cancel an order that has shipped
# ---

With this, we can access the information in the structurizr DSL to get context information. This allows us to write parsers and plugins for coding agents that can not only read the features, but also access context information we gathered before.
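
A minimal sketch of such a parser could look like this. It is not Surveyor's implementation: the module name `AssayHeader` is hypothetical, and it deliberately reads only the four top-level scalar keys and the `- ` list items between the two `# ---` markers. A real parser would probably hand the uncommented block to a YAML library to get the nested `definitions` as well:

```elixir
defmodule AssayHeader do
  # Top-level keys whose value sits on the same line.
  @scalar_keys ~w(component container schema workspace)

  # Returns a map with the scalar keys found and an "invariants" list.
  def parse(feature_text) do
    feature_text
    |> String.split("\n")
    |> Enum.map(&String.trim/1)
    |> Enum.drop_while(&(&1 != "# ---"))
    |> Enum.drop(1)
    |> Enum.take_while(&(&1 != "# ---"))
    |> Enum.map(&strip_comment/1)
    |> Enum.reduce(%{"invariants" => []}, &collect/2)
    |> Map.update!("invariants", &Enum.reverse/1)
  end

  defp strip_comment(line), do: line |> String.trim_leading("#") |> String.trim()

  # Assumption: every "- " item in the header belongs to the invariants list.
  defp collect("- " <> item, acc), do: Map.update!(acc, "invariants", &[item | &1])

  defp collect(line, acc) do
    case String.split(line, ":", parts: 2) do
      [key, value] when key in @scalar_keys -> Map.put(acc, key, String.trim(value))
      _other -> acc
    end
  end
end
```

The `component` and `container` values can then be looked up as identifiers in the file named by `workspace`, which is how a feature finds its place in the architecture.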

Architectural Decision

Until this point, we have avoided describing the “how” of our application: which database, which web framework, which model. This is what Markdown Architectural Decision Records (MADR) cover. Their goal is to document not only the technical decision, but also the problem statement and the options that were considered. Because a MADR contains that background, it allows a team to effectively reconsider a decision when the circumstances have changed.

They typically look like this:

## **Use PostgreSQL for primary data store**

* Status: accepted
* Date: 2026-06-29

### **Context**

We need a primary database for a new service with relational data and strong
consistency requirements.

### **Considered Options**

* PostgreSQL
* MongoDB

### **Decision**

Use **PostgreSQL** — the team has operational experience with it, and it
provides ACID transactions plus JSONB for flexible fields. MongoDB was rejected
due to weaker consistency guarantees and less team familiarity.

### **Consequences**

* Good: mature ecosystem, existing team expertise
* Bad: harder to scale horizontally than MongoDB

Non-Functional Requirements

By now we have described our application well in terms of how it is structured, what it does, and what tools it uses (or should use). But we left out the quality of service.

A non-functional requirement (NFR) specifies how well a system must operate, as opposed to what it does (functional requirements). Where a functional requirement says "users can book an appointment," an NFR constrains the qualities of that behavior — performance, scalability, availability, reliability, security, maintainability, usability, and so on.

They are often written in a simple table that pairs each functional requirement with its non-functional counterpart:

| Quality attribute | Functional requirement (what) | Non-functional requirement (how well) |
|---|---|---|
| Performance | User searches for available slots | Results return in under 500ms for 95% of requests |
| Scalability | Users book appointments | System handles 1,000 concurrent bookings without degradation |
| Availability | The booking system is accessible | 99.9% uptime, measured monthly (≈43 min downtime/month) |
| Reliability | A confirmed booking is saved | Zero confirmed bookings lost; durable before confirmation returns |
| Security | Users log in to their account | Passwords stored hashed; data encrypted in transit (TLS) and at rest |
| Usability | User completes a booking | A new user can book without instructions in under 2 minutes |

Each row shows the same feature twice: the left column is what the system does, the right is the measurable constraint on how well it must do it — the NFR.
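
Because these constraints are measurable, some of them can be checked mechanically. The following sketch covers the availability and performance rows. The module name `NfrCheck`, the 30-day month and the nearest-rank percentile method are assumptions chosen for this example:

```elixir
defmodule NfrCheck do
  # Allowed downtime in minutes for an availability target over `days` days.
  def downtime_budget_minutes(availability_percent, days \\ 30) do
    (100 - availability_percent) / 100 * days * 24 * 60
  end

  # Nearest-rank 95th percentile of a non-empty list of latencies in ms.
  def p95(latencies_ms) do
    sorted = Enum.sort(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding float rounding.
    rank = div(95 * length(sorted) + 99, 100)
    Enum.at(sorted, rank - 1)
  end

  # True if 95% of the measured requests finished under the limit.
  def fast_enough?(latencies_ms, limit_ms \\ 500), do: p95(latencies_ms) < limit_ms
end
```

For 99.9% over 30 days the budget is 0.001 × 30 × 24 × 60, about 43.2 minutes, which matches the ≈43 min in the table.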

So, what is in that envelope?

So far we have C4/Structurizr diagrams for how the system is built and how it interacts, Gherkin/Assay for its behaviour, MADR for the technological choices, and NFRs for its quality of service. Obviously, this is not the final solution for every stack. It would be hard to describe a pacemaker OS with it, or a web application with pristine UI effects. But for many of the projects bitcrowd has encountered, this gives a good start.

Fortunately, we have clients who enjoy talking to us, so we haven’t had to resort to espionage technology - yet. Let us know if you have some clandestine system you need to build in the dark, or a legacy application that needs a rewrite.


Christoph Beck

Head of Intergalactic Mischief
