
Reading a 200k-line codebase you didn't write: a field guide

Let's explore understanding a codebase like it's 1999, and compare it with how coding agents work.

· 26 min read

✨ This could be your product’s story! We bring together strategy, design, and development to launch products that perform. Do you have a similar idea? Wondering how this would work for your application? Let’s talk!

Let's do an exercise in understanding a codebase like it's 1999, before the rise of coding agents, and compare that with how coding agents do the same work. This is a field guide for reading inherited code alongside an agent.

Before you read any code

The instinct is to start opening files. Don't. Read the shape of the project first. Four things, in order: the directory tree, the dependency manifest, the CI config, and the git log. Thirty minutes here saves days.

The directory tree

tree -L 3 -I 'node_modules|.git|dist|build'

Glance at the output - you're noticing absences and asymmetries. A legacy/ directory, a missing tests/ folder, three directories that look like they do the same thing. These are facts about the project's history before you read any logic.

How a coding agent would read the directory tree

An agent asked "describe this repo" will summarize what's there. To get the interpretive layer, ask it the question you'd want a senior engineer to answer if they'd just glanced at the tree:

What's notable about this directory structure - not what's there, but what's missing, asymmetric, or unusual compared to a healthy project in this stack? What might each observation imply about the team or the codebase's history?

Behind the scenes, the agent will list directories recursively (LS + Glob), read the top-level manifest (package.json, Cargo.toml, mix.exs, pyproject.toml) to identify the stack, and compare the layout against the conventional shape of a healthy project in that ecosystem. That shape comparison - what's missing versus what's expected - is the work you're outsourcing.

What to look out for: The agent's hypotheses about history are speculation - useful as questions to verify, not as facts to build on.

Watch for: a legacy/ or old/ directory (migration started and abandoned?); empty or missing test directories; duplicate-purpose directories (utils/, helpers/, lib/); top-level files suggesting a rename or restructure.
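The checks in that list are mechanical enough to script. Here is a minimal sketch in Python, assuming you pass in the names of the top-level directories. The function name tree_smells and the three name sets are illustrative assumptions, not a standard; adjust them to your stack.

```python
# Hypothetical helper: flag the directory-level smells described above.
# The name sets are assumptions; extend them for your ecosystem.
LEGACY_NAMES = {"legacy", "old", "deprecated"}
TEST_NAMES = {"test", "tests", "spec", "__tests__"}
GRAB_BAG_NAMES = {"utils", "helpers", "lib", "common", "shared"}


def tree_smells(dirnames):
    """Return a list of human-readable observations about a directory listing."""
    names = {name.strip("/").lower() for name in dirnames}
    smells = []

    legacy = sorted(names & LEGACY_NAMES)
    if legacy:
        smells.append(f"legacy-looking directories: {', '.join(legacy)}")

    if not names & TEST_NAMES:
        smells.append("no top-level test directory")

    grab_bags = sorted(names & GRAB_BAG_NAMES)
    if len(grab_bags) > 1:
        smells.append(f"possible duplicate-purpose directories: {', '.join(grab_bags)}")

    return smells


for smell in tree_smells(["src", "legacy", "utils", "helpers", "lib"]):
    print(smell)
```

Treat the output the way you would treat the agent's hypotheses: as a list of questions to verify, not as findings.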

The dependency manifest

cat package.json
# or pyproject.toml, Gemfile, pom.xml, go.mod

Skim it. The framework version, anything forked, anything that doesn't belong with the rest of the stack - these are usually visible in under a minute and worth more than reading any individual source file.

How a coding agent would read the dependency manifest

An agent will list dependencies if asked, but the list isn't the point. The point is the interpretation:

For each dependency in this manifest, tell me: (1) whether the version is current, moderately outdated, or dangerously outdated; (2) any dependencies that look out of place given the rest of the stack; (3) any that suggest this codebase has been through a partial migration. Skip dependencies that are fine.

Behind the scenes, the agent will Read the manifest plus the lockfile, and for anything that looks stale it will hit npm / PyPI / Hex / crates.io via WebFetch to verify the actual latest version against its training cutoff. For forks pinned to a specific commit, it'll resolve the fork URL to see how far the fork has drifted from upstream. The judgement it can't make for you is whether a given dependency is appropriate for your business context - that part stays human.

What to look out for: the agent's knowledge cutoff is often the bottleneck here - verify "current" claims against the actual latest version on npm or PyPI for anything flagged.

Watch for: forks pinned to a specific commit (almost always load-bearing - someone fixed something upstream wouldn't accept); a dependency that doesn't fit the stack (a queue library in a project that "doesn't use queues" is often the entry point to a subsystem you didn't know existed).
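Spotting git-sourced dependencies is easy to automate for a package.json. The sketch below is an assumption-laden example, not a complete parser: it only looks at dependencies and devDependencies, and it guesses "pinned to a commit" from a 7 to 40 character hex string after the #. The function name pinned_forks is hypothetical.

```python
import json
import re

# npm-style git references: git URLs, "github:user/repo", or bare "user/repo#ref".
GIT_SPEC = re.compile(
    r"^(git\+|git://|github:|gitlab:|bitbucket:)|^[\w.-]+/[\w.-]+(#.+)?$"
)
# A 7-40 character hex string after '#' looks like a pinned commit (heuristic).
COMMIT_REF = re.compile(r"#([0-9a-f]{7,40})$")


def pinned_forks(manifest_text):
    """Return {dependency: (spec, kind)} for dependencies that point at a git source."""
    manifest = json.loads(manifest_text)
    found = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if GIT_SPEC.search(spec):
                kind = "commit" if COMMIT_REF.search(spec) else "branch-or-tag"
                found[name] = (spec, kind)
    return found
```

Anything this returns with kind "commit" is a candidate for the "someone fixed something upstream wouldn't accept" story; the commit history of the fork is where to look next.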

The CI config

ls .github/workflows/ .gitlab-ci.yml .circleci/ 2>/dev/null
cat .github/workflows/*.yml

Read it once. You're looking for what this team tests, what they skip, and how often they deploy. The CI config is one of the few places in a codebase where the team's actual habits, rather than their intentions, are recorded.

How a coding agent would read the CI config

An agent will summarize the pipeline accurately. The interpretive ask:

What does this team test, what do they skip, and what does the test-to-deploy pattern suggest about their confidence in different parts of the codebase? Treat the absence of test coverage as information, not an oversight.

Behind the scenes, the agent will Glob for workflow files across .github/workflows/, .gitlab-ci.yml, .circleci/config.yml, Jenkinsfile, etc., Read each one, then Grep for continue-on-error, conditional if: expressions, and deploy-step names. It'll cross-reference the test commands it finds against the actual test directories to flag mismatches - a test suite that CI never invokes is one of the most common findings. Skipped categories show up cleanly; what they mean for your project is still your call.

Watch for: jobs marked continue-on-error: true or conditionally skipped - usually a flaky test the team gave up on. Deploy steps that bypass the test suite. Environment matrices that test against runtime versions that don't match production. The gap between what CI runs and what production runs is often where the surprises live.
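The first two items on that list can be found with a plain line scan. This is a rough sketch that assumes GitHub Actions-style YAML and deliberately avoids a YAML parser, so it will miss multi-line expressions. The function name ci_red_flags is made up for illustration.

```python
import re

# Line-based scan; no YAML parser, so this stays dependency-free.
PATTERNS = {
    "continue-on-error": re.compile(r"^\s*continue-on-error:\s*true\b"),
    "conditional": re.compile(r"^\s*(-\s*)?if:\s*\S"),
}


def ci_red_flags(workflow_text):
    """Return (line_number, kind, stripped_line) for each suspicious line."""
    flags = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                flags.append((number, kind, line.strip()))
    return flags
```

A conditional is not a problem by itself. The scan only tells you where to read; whether a skipped job is a sensible guard or an abandoned flaky test is still your call.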

The git log

git log --oneline | wc -l
git log --format="%an" | sort | uniq -c | sort -rn | head -20
git log --since="12 months ago" --format="%ad" --date=format:%Y-%m | sort | uniq -c

Three small aggregations: total scale, who carried this, cadence over the last year. None of these require reading a single commit message.

How a coding agent would read the git log

The agent's role here is to read the aggregations as a pattern:

Here are aggregations from the git log of a legacy codebase. What does this suggest about its history? Who carried it, when did velocity change, and what's the current state of activity? Speculate where the data supports it.

Behind the scenes, the agent will run the aggregation commands above via its shell tool, plus a few more it thinks of (commits by directory, average diff size over time, longest gaps between commits per author). It can't see commits older than git log shows - if the repo was squashed or migrated from another VCS, the history is structurally incomplete and you should ask the agent to flag that. Suspiciously low commit counts in early years are usually where the real archaeology is buried.

Watch for: a single author with disproportionate commits who hasn't committed recently - that person was the system, and they're gone. A sharp cadence drop without a corresponding completion usually means the team got pulled to other work, not that they finished. Recent activity from authors who weren't historically involved often means the codebase was handed off to whoever was available.
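The first pattern, the author who carried the system and then left, can be computed from (author, date) pairs, which you could produce with something like git log --format="%an|%ad" --date=short. The sketch below is only an illustration: the function name and both thresholds (30% of commits, 365 quiet days) are arbitrary assumptions to tune.

```python
from collections import Counter
from datetime import date


def departed_carriers(commits, today, share=0.3, quiet_days=365):
    """commits: iterable of (author, date) pairs.

    Return authors who wrote at least `share` of all commits but have not
    committed in more than `quiet_days` days.
    """
    commits = list(commits)
    counts = Counter(author for author, _ in commits)

    # Most recent commit date per author.
    last_seen = {}
    for author, day in commits:
        if author not in last_seen or day > last_seen[author]:
            last_seen[author] = day

    total = len(commits)
    return sorted(
        author
        for author, count in counts.items()
        if count / total >= share and (today - last_seen[author]).days > quiet_days
    )


history = [("ana", date(2019, 1, 1))] * 6 + [("ben", date(2024, 5, 1))] * 3
print(departed_carriers(history, today=date(2024, 6, 1)))
```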

The pattern across all four: small commands, fast interpretation, a short list of things the interpretation might miss. Thirty minutes and you have a model of the project as an artifact - its shape, its history, its current state - before you've opened any source code.


The four entry points

Once you know the shape of the project, you start with its entry points. Every system has four kinds, and reading them in this order saves you from getting lost in the middle: the boundary, the data model, the configuration, the tests.

The boundary

# pick whichever applies to your stack
grep -rE "(get|post|put|patch|delete)\s*['\"]" --include="*.rb" --include="*.ex" .
grep -rE "@(Get|Post|Put|Patch|Delete)Mapping" --include="*.java" .
rg "router\.(get|post|put|patch|delete)" -t js -t ts

The boundary is where the outside world touches the system: HTTP routes, message queue consumers, scheduled jobs, CLI entry points, webhooks. You want the complete list, not a sample. The shape of the list - how many, how named, how distributed across modules - is the first real fact about what this system is.

How a coding agent would read the boundary list

Agents are good at enumerating routes when asked, but a flat list isn't the deliverable. The deliverable is the structure of the list:

Here's the complete list of HTTP routes for a legacy codebase. Group them by what they appear to do, flag any that look inconsistent with the rest (naming, verb usage, path structure), and tell me which groups suggest the system grew organically versus which look designed. Identify any routes that seem to do something unrelated to the others.

Behind the scenes, the agent will Grep for route definitions using framework-specific patterns (get '/...', @RequestMapping, router.post, live "/..."), Read matching files for context, and where the framework supports it, shell out to the route-listing command (rails routes, or rake routes on older Rails versions; mix phx.routes; python manage.py show_urls, which comes from django-extensions). Enumeration is mechanical. The second pass - grouping and pattern-matching across the full list - is the part you're actually asking it to do.

Eighty-seven routes with no consistent naming convention tells you the system grew without anyone owning it. Twelve clean REST routes tells you someone cared, at least once. A cluster of routes named legacy_* or v1_* next to a smaller modern cluster tells you a migration that didn't finish. The grouping matters more than any individual route.
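A first, crude version of that grouping is bucketing routes by their first path segment and flagging buckets whose names hint at an unfinished migration. This is a minimal sketch; the function names and the legacy/old/vN naming heuristic are assumptions, and real grouping by purpose needs more judgement than a prefix match.

```python
import re
from collections import defaultdict

# Heuristic: segment names that start with legacy, old, or a version like v1.
LEGACY_HINT = re.compile(r"^(legacy|old|v\d+)([_-]|$)", re.IGNORECASE)


def group_routes(paths):
    """Group route paths by their first path segment ('/' for the root)."""
    groups = defaultdict(list)
    for path in paths:
        segments = [segment for segment in path.split("/") if segment]
        key = segments[0] if segments else "/"
        groups[key].append(path)
    return dict(groups)


def legacy_groups(groups):
    """Return the group names that look like leftovers from a migration."""
    return sorted(key for key in groups if LEGACY_HINT.search(key))
```

The size of each bucket is worth a look too: one group with forty routes next to ten groups with two each is a shape, and the shape is the information.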

The data model

find . -name "schema.rb" -o -name "*.sql" -path "*/migrations/*"
ls db/migrate/ priv/repo/migrations/ migrations/

The data model is the system's memory of what it thinks is important, and it changes more slowly than code. That makes it the most reliable guide to what the system actually is - code can lie about intent, but a table either exists or it doesn't.

How a coding agent would read the schema and migrations

Here's the schema and migration history of a legacy codebase. Tell me: which tables are clearly the core of the domain; which look like they were added later or for one-off needs; any tables whose names suggest duplicate purpose (User vs Users vs Account); migrations that were partially applied or rolled back; columns marked deprecated, nullable-but-shouldn't-be, or with default values that look like workarounds.

Behind the scenes, the agent will Glob migration directories (db/migrate/, priv/repo/migrations/, migrations/), Read the schema file in order if there is one, and traverse migrations chronologically to reconstruct history. It'll Grep for add_column, drop_table, change_column_null, add_foreign_key to find places where a column's semantics changed mid-history. A capable agent will also cross-check foreign keys that exist in the schema but aren't reflected in the model code (or vice versa) - those gaps are the archaeological evidence.

Three tables called user, users, and accounts tells you a migration was abandoned. A column called legacy_id tells you a system this one replaced. Tables with no foreign keys to the rest of the schema are often integrations that were planned and never connected, or subsystems that were retired but not removed. The data model is where the system's archaeological layers are most visible.
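Finding the tables with no foreign keys to the rest of the schema is a small graph problem. The sketch below assumes you have already extracted, by whatever means fits your stack, a mapping from each table to the tables it references. The function name isolated_tables is hypothetical.

```python
def isolated_tables(foreign_keys):
    """foreign_keys maps table name -> collection of tables it references.

    Return tables that neither reference nor are referenced by any other
    table. Self-references do not count as a connection.
    """
    referenced = set()
    for table, targets in foreign_keys.items():
        referenced.update(target for target in targets if target != table)

    return sorted(
        table
        for table, targets in foreign_keys.items()
        if not (set(targets) - {table}) and table not in referenced
    )


schema = {
    "users": set(),
    "orders": {"users"},
    "accounts": set(),
    "audit_log": {"audit_log"},
}
print(isolated_tables(schema))
```

Keep in mind that some legacy schemas enforce relationships only in application code, so an isolated table here is a prompt to check the models, not proof that the table is orphaned.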

The configuration

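# brace lists are not expanded inside quotes, so each pattern is spelled out
find . \( -name "*.env*" -o -name "config*.yml" -o -name "config*.yaml" -o -name "config*.json" -o -name "config*.toml" -o -name "application.yml" -o -name "application.properties" \)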

Configuration is where you find out what's actually turned on in production, which is often not what the code suggests. A feature flag at 100% for two years is a fact about the system; the disabled branch is a fossil. An environment variable that's set in production but undocumented anywhere else is load-bearing by definition.

How a coding agent would read the configuration surface

Here's the configuration surface of a legacy codebase (env vars referenced in code, config files, feature flag definitions). Tell me: which settings look like they control major behavioral switches; which look like they're permanently on or permanently off; which appear in code but not in any committed config (suggesting they're set in production only); any flags or settings whose names suggest dead or abandoned features.

Behind the scenes, the agent will Grep for every env-var read pattern that matches your stack (ENV[, process.env., System.get_env, os.environ, @Value("${...}")), then cross-reference each hit against Glob'd config files (.env*, config/*.yml, application.properties, feature-flag definition files). The interesting output is the diff: variables read in code with no committed entry, or committed entries that nothing reads. That gap is what "set in production only" - and "configured but forgotten" - look like from the inside.

Watch for the gap between "config that exists in the repo" and "config that the code reads." Code that reads ENV["LEGACY_PAYMENT_MODE"] with no corresponding entry in any committed config file means someone, somewhere, has that variable set in production, and removing it will break something nobody can currently predict.
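That gap is a set difference, and you can compute a rough version of it yourself. The sketch below assumes the code reads variables with string literals in one of four styles and that committed config is in dotenv format; dynamic lookups and other config formats are out of scope. All function names are illustrative.

```python
import re

# One alternative per read style; each captures the variable name.
ENV_READS = re.compile(
    r"""ENV\[["'](\w+)["']\]"""             # Ruby: ENV["NAME"]
    r"""|process\.env\.(\w+)"""             # Node: process.env.NAME
    r"""|System\.get_env\(["'](\w+)["']"""  # Elixir: System.get_env("NAME")
    r"""|os\.environ\[["'](\w+)["']\]"""    # Python: os.environ["NAME"]
)


def vars_read(source_text):
    """Names of environment variables the source text reads."""
    return {
        next(group for group in match.groups() if group)
        for match in ENV_READS.finditer(source_text)
    }


def vars_defined(dotenv_text):
    """Names defined in dotenv-style text (NAME=value, optional 'export ')."""
    names = set()
    for line in dotenv_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            name = line.split("=", 1)[0].strip()
            if name.startswith("export "):
                name = name[len("export "):].strip()
            names.add(name)
    return names


def config_gap(source_text, dotenv_text):
    """Both directions of the diff between what is read and what is defined."""
    read, defined = vars_read(source_text), vars_defined(dotenv_text)
    return {
        "read_but_undefined": sorted(read - defined),
        "defined_but_unread": sorted(defined - read),
    }
```

Entries under read_but_undefined are the "set in production only" candidates; entries under defined_but_unread are the "configured but forgotten" ones.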

The tests

find . -path "*/test/*" -o -path "*/spec/*" -o -path "*/__tests__/*" | head -50
find . -path "*/test/*" -type f | wc -l

Tests aren't useful here for correctness - legacy tests are often wrong, stale, or skipped. They're useful for intent. A well-named test tells you what someone, at some point, thought the code was supposed to do. The gap between the test names and the current behavior is exactly the gap between the system's designed reality and its lived one.

How a coding agent would read the test inventory

Here are the test file names and test descriptions from a legacy codebase. Without reading the test bodies, tell me: which areas of the system are heavily tested versus barely tested; which test names suggest features that may no longer exist; any tests whose names contradict each other or describe overlapping responsibilities; areas where the test names suggest the team was unsure of expected behavior (lots of "should probably," "edge case," "fix for bug").

Behind the scenes, the agent will Glob for test files using your framework's convention, then Grep (or selectively Read with minimal scanning) just the describe / it / test / it_behaves_like strings without parsing the bodies. This is one of the few cases where you specifically don't want the agent to read the implementations - the labels are the closest thing to a behavioral spec the team ever wrote, and the bodies are noise at this stage.

Heavily tested modules are where the team was nervous. Barely tested modules are either trivial or terrifying - and you usually can't tell which without reading them. Test files for modules that no longer exist in the codebase are common and informative: someone removed the code but kept the test, or vice versa, and either way it's a story.
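Pulling the labels out without the bodies takes one regular expression. The sketch below assumes RSpec-, Jest- or Mocha-style tests where the label is a quoted string after describe, context, it, or test; other frameworks need a different pattern. The hedge-word list mirrors the examples in the prompt above plus two assumed additions, and the function names are hypothetical.

```python
import re

# Captures the quoted label after describe/context/it/test; bodies are ignored.
LABEL = re.compile(r"""\b(describe|context|it|test)\s*\(?\s*(["'])(.+?)\2""")
# Phrases suggesting the team was unsure; the last two are assumed additions.
HEDGES = re.compile(
    r"should probably|edge case|fix for bug|workaround|temporary", re.IGNORECASE
)


def extract_labels(test_source):
    """Return every describe/it/test label in the order it appears."""
    return [match.group(3) for match in LABEL.finditer(test_source)]


def unsure_labels(labels):
    """Return the labels that contain a hedge phrase."""
    return [label for label in labels if HEDGES.search(label)]
```

Counting labels per file gives you the heavily-tested versus barely-tested split; the hedge filter gives you the short list of places where the team itself was not sure.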

The pattern across the four: for each entry point, get the complete inventory, then ask for the shape of the inventory rather than its contents. The shape is the information; individual items are usually noise at this stage. You're still not reading code. You're building a map of what the code is for, which is the prerequisite for reading any of it usefully.


Reading code, finally

By now you have a map: the project's shape, its boundaries, its data model, its configuration, its test surface. You haven't read any logic. That changes here, but not in the way most engineers do it. The mistake is to open the most interesting-looking file and start reading downward. The better move is to pick a thread and follow it, with the agent walking alongside.

A thread starts at a boundary. Pick one HTTP route, one queue consumer, one scheduled job. Just one. The goal isn't to understand the whole system - it's to understand one path through it deeply enough that you can predict what the second path will look like before you trace it.

How a coding agent would walk a single thread

The opening prompt for a thread:

I'm tracing a single request through this codebase to understand how the system works. The entry point is [paste the route handler or job definition]. Walk me through what this code does, step by step, following the call graph. When you hit a database operation, an external API call, or a piece of logic that seems unusual, stop and tell me. Don't summarize - narrate.

Behind the scenes, the agent will Read the entry file, Grep for each called function to locate its definition, Read that file, and recurse. Expect 10–30 Read calls per thread depending on depth, plus opportunistic Greps when it encounters something polymorphic. When you ask it to "narrate, not summarize," you're explicitly trading tokens for resolution - forcing it to spend output describing each step instead of collapsing them. That's the trade you want, because the collapse is where understanding goes to die.

What to look out for: "Narrate, don't summarize" is the key instruction. Summaries flatten the system. A narration preserves the order of operations, which is where the load-bearing weirdness lives. You want to know that the request validates input, hits cache, falls through to the database, applies a transformation that doesn't appear in the route handler, and then - and this is the part that matters - does something the agent's narration glosses over.

The glossed-over parts are the whole game. Agents narrating code do a specific thing: when they hit something they don't fully understand, they describe it at a higher level of abstraction. "This function handles the legacy payment flow" is what you get when the agent doesn't actually understand what the function does. A real narration would say "this function checks if payment.legacy_mode is true, and if so, calls process_v1_payment instead of the normal path, passing the original request object rather than the parsed one." If the agent gives you the first version, that's a signal - not that the agent is failing, but that there's something in that function worth looking at directly.

The follow-up:

You said this function "handles the legacy payment flow." Show me the actual logic - what does it check, what does it call, and what's different from the non-legacy path?

Behind the scenes, the agent will Read the specific function you pointed at (it may not have been in context yet), Grep for related helpers and callees, and produce a line-by-line narration. If it still hedges after that, the function probably depends on runtime state or configuration the agent can't observe from source - feature flags, env vars, database content. That's your cue to either supply that context or accept that this branch needs the human eye on the live system.

What to look out for: This is the move that separates reading with an agent from being read to by one. You're not accepting the abstraction the agent reached for. You're forcing it down a level, and you keep forcing it down until either the explanation becomes concrete or the agent's narration breaks down in a way that tells you to read the file yourself.

A few specific patterns worth watching for as you trace:

Conditionals that look unremarkable but exist anyway. A line like if user.created_at < Date.new(2019, 4, 1) is almost never trivial. Run git blame on it. The commit message is often the entire story: "fix INC-1247." That's an incident in 2019 that nobody on the current team remembers, and the conditional exists because removing it broke production once. The agent will explain the conditional accurately - "this checks if the user was created before April 2019" - without flagging that the existence of the check is the interesting part.

Defensive code with no obvious threat. Try/catch around what looks like a pure function. Nil checks on values that are clearly never nil. Retries on calls that shouldn't fail. These are scars. The thing that caused the scar may or may not still exist, but the code is treating it as if it does.

How a coding agent would investigate defensive code

Ask the agent:

This function wraps [X] in a try/catch and falls back to [Y]. What scenarios would cause [X] to fail? Search the codebase for any test or comment that explains why this defensive code exists.

Behind the scenes, the agent will Read the function, Grep for tests that exercise the defensive path (including tests that mock [X] to throw), Grep for comments mentioning the function or its callees, and inspect git blame plus the commit history of the offending line. The answer is often in a 2019 commit message nobody remembers writing - and the agent is genuinely better at finding it than you are, because it can sweep the whole history, every nearby comment, and every related test in a single pass.

What to look out for: The agent searching the codebase for context is a much better use of its capabilities than asking it to speculate. Often there's a comment three files away that explains everything.

Dead-looking code that isn't dead. The function that looks unused. The module nobody imports. Before you assume it's safe to ignore, check.

A lot of "dead" legacy code is called by reflection, scheduled by a cron config nobody opened, or invoked by a feature flag that's set in production-only config.

How a coding agent would check whether code is really dead

Find all references to [function/module name] in this codebase, including: string-based references (reflection, dynamic dispatch, configuration files), references in test fixtures, references in commented-out code, and references in non-source files (YAML, JSON, migrations).

Behind the scenes, the agent will Grep not just for direct calls but also for stringified references (reflection, dynamic dispatch, polymorphic dispatch tables, config-driven invocation), references in YAML / JSON / cron files, and even references inside commented-out code or test fixtures. "Find every way this could be invoked, across heterogeneous file types" is the canonical case where the agent outperforms a human reading file by file - make it do the exhaustive sweep so you don't have to.

What to look out for: The agent's reach across the whole repo is genuinely better than yours here - use it.
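If you want a baseline to compare the agent's sweep against, a plain substring search over every file type is a few lines. This is a sketch under simple assumptions: it matches the literal name only, so names assembled at runtime will not show up, and the skipped directories are a guess at what counts as noise. The function name find_references is illustrative.

```python
import os

# Directories assumed to be noise; adjust for your project.
SKIP_DIRS = {".git", "node_modules", "dist", "build"}


def find_references(root, name):
    """Return sorted (relative_path, line_number) pairs for lines containing `name`.

    Searches every file regardless of extension, so YAML, JSON, cron files,
    fixtures and commented-out code are all included.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if name in line:
                            hits.append((os.path.relpath(path, root), number))
            except OSError:
                continue  # unreadable file: skip rather than abort the sweep
    return sorted(hits)
```

A hit in a cron file or a YAML config and nowhere in the source tree is exactly the "dead-looking code that isn't dead" case.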

How many threads to trace before stopping: three to five. Not the whole boundary. By the third thread, you should be able to predict what the fourth will look like before you trace it - same patterns, same idioms, same kinds of weirdness in roughly the same places. When you can do that, you've absorbed the system's grammar, and continuing to trace is procrastination dressed as preparation.

A heuristic that's worth more than its size suggests: read the bug fixes before the features. A six-line commit that fixes a production issue teaches you more about a system's actual behavior than a six-hundred-line commit that adds a feature. Bug fixes are where the system's observed reality intrudes on its designed reality. Find the last twenty bug-fix commits - git log -i --grep="fix\|bug\|incident" --oneline | head -20 (the -i makes the match case-insensitive, so "Fix" and "BUG" count too) - and read the diffs. You'll learn more in thirty minutes than you would in a day of reading feature code.


The artifacts you produce while reading

Reading without writing is forgetting. By the end of your first few days you should have three artifacts on disk, all rough, all evolving. None of them are deliverables in the formal sense - they're tools for your own thinking that happen to be useful to the next person too.

The map

A C4 model is what you want: a structured way of describing a system at four zoom levels - system context, containers, components, code. For legacy work, the top two levels are where the value is. The system context diagram names what this system talks to in the world; the container diagram names the major deployable parts and how they communicate. Both fit on one screen.

How a coding agent would help bootstrap the map

The agent can help you bootstrap it from what you've already gathered:

Based on the boundaries, data model, and configuration we've discussed, sketch out a high-level architecture diagram of this system in text form. Show the major modules, their dependencies, and where data flows between them. Mark anything that's unclear or speculative with a question mark.

Behind the scenes, the agent has already accumulated most of the inputs - boundaries, data model, configuration - in its prior context. The new work here is synthesis, not retrieval, though it may Read a few more files to fill in module-to-module relationships it didn't trace earlier. The output is usually ASCII boxes-and-arrows or a Mermaid block. Resist the urge to ask for a "real" diagram in some specific tool - the value is in your redraw, not its render.

What to look out for: Take its output and translate it into your own sketch - paper, whiteboard, Excalidraw, whatever. Don't accept the agent's diagram directly.

The act of coding the diagram yourself is what builds the model in your head. The Structurizr DSL is a good fit for it.

The questions list

Every "why does this exist," every "I don't understand this conditional," every "this can't be right" - write it down. Don't try to answer in the moment. The list itself is the deliverable for week one.

Half the questions will answer themselves by week two - you'll be reading something else and suddenly understand the thing that confused you on Monday. The half that don't are exactly what you need to ask the remaining humans about, in one focused conversation rather than forty Slack messages spread across two weeks.

Try to formulate your findings in BDD-style syntax (Given / When / Then). If you are able to write them down in that strict framework, you have a well-defined mental model of what is going on.
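As an illustration, here is what one finding might look like when written down that way, using a date cutoff of the kind discussed earlier. Everything in this sketch is hypothetical: uses_legacy_path stands in for whatever real function you are studying, and in practice you would import it rather than redefine it.

```python
from datetime import date

# Hypothetical stand-in for the behavior under study.
CUTOFF = date(2019, 4, 1)


def uses_legacy_path(created_at):
    """Users created before the cutoff take the legacy path."""
    return created_at < CUTOFF


def check_legacy_cutoff():
    # Given a user created the day before the cutoff
    created_at = date(2019, 3, 31)
    # When the system decides which path to take
    result = uses_legacy_path(created_at)
    # Then the legacy path is chosen
    assert result is True

    # Given a user created exactly on the cutoff
    created_at = date(2019, 4, 1)
    # When the system decides which path to take
    result = uses_legacy_path(created_at)
    # Then the normal path is chosen
    assert result is False


check_legacy_cutoff()
```

If you cannot fill in the Then line without guessing, the item stays on the questions list.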

How a coding agent would triage your questions list

The agent's role here is to triage, not to answer:

Here's my running list of questions about this codebase. Group them by: questions that are likely answerable from the code itself (and where in the code to look); questions that require historical context only a human would know; questions that suggest something might be wrong rather than just unfamiliar.

Behind the scenes, the agent will Grep and selectively Read to verify which questions are answerable directly from source. The third bucket - potential bugs - is where it does its most distinctive work: cross-referencing each suspicious pattern against tests, comments, and similar idioms elsewhere in the codebase to decide whether the smell is real or just unfamiliar. That's a comparison task at repo scale, which is the thing agents are genuinely good at.

What to look out for: The agent is decent at flagging which is which, because the third category usually has telltale shapes: defensive code with no defended-against condition, branches that can't be reached, types that don't line up.

The third category is the one to pay attention to. Questions that started as "I don't understand this" sometimes turn out to be "this might actually be a bug nobody's noticed."

The glossary

Every domain term, every internal name, every acronym. Legacy codebases are full of words used in non-obvious ways. "Account" and "User" and "Customer" mean three different things and the difference matters. "Order" might mean a purchase, or it might mean a sequence - depends on the module. Write them down as you encounter them, with where you saw them and what you currently think they mean. Update entries as you learn you were wrong.

How a coding agent would cross-reference glossary terms

Looking across this codebase, identify domain terms that appear to be used with different meanings in different modules. For each term, list the modules where it appears and how the usage differs. Flag any terms where the same word seems to refer to clearly different concepts.

Behind the scenes, the agent will Grep each candidate term across the codebase, Read the surrounding context for each hit, and cluster usages by module to detect semantic drift. This is one of the more token-expensive tasks on the list - for high-frequency terms like Account or Order you may want to scope the search to specific directories per pass. The output is most useful as input to a follow-up conversation with the team, not as the final glossary in itself.

What to look out for: This is one of the highest-leverage things the agent can do for you, because it requires looking across the whole repo and noticing inconsistency - a comparison task at scale, which is what agents are genuinely good at.

The output is often surprising. A codebase where Account means three different things in three different modules is a codebase where every conversation between teams has been slightly broken for years, and nobody noticed because everyone was using the same word.
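A cheap way to get a first look at that drift is to list, per top-level module, the identifiers that contain the term. The sketch below assumes you have the file contents in a dict keyed by forward-slash path and treats the first path segment as the module; both are simplifications, and the function name term_contexts is made up for this example.

```python
import re
from collections import defaultdict


def term_contexts(files, term):
    """files maps path -> source text.

    Return {top-level module: sorted identifiers containing `term`}, a rough
    proxy for how each module uses the word.
    """
    pattern = re.compile(rf"\w*{re.escape(term)}\w*", re.IGNORECASE)
    usage = defaultdict(set)
    for path, text in files.items():
        module = path.split("/", 1)[0]
        usage[module].update(match.group(0) for match in pattern.finditer(text))
    # Drop modules where the term never appears.
    return {module: sorted(names) for module, names in usage.items() if names}
```

When one module talks about billing accounts and another about login accounts, the identifier lists make the split visible before you have read either module.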

The glossary is the artifact that pays off latest but most. The map fades as you learn the system. The questions list shrinks as you answer them. The glossary keeps mattering forever, because every new person who joins the team needs it, and you're the only person who has just been through the experience of building it from scratch.


When to stop

The trap at the other end is reading forever. Engineers who enjoy this work can spend weeks "understanding the codebase" and never ship anything. The agent makes this worse, not better - there's always one more module to ask about, one more thread to trace, one more cross-reference to run. Infinite patience on the agent's side meets infinite curiosity on yours and the result is a beautifully understood system you haven't actually done any work on.

You're done when you can answer three questions without going back to the agent or grepping:

  • If a user reports a bug in feature X, where would I look first?
  • If I had to add a new field to entity Y, where are the four places I'd have to change?
  • If this system stopped working at 3am, what are the three things most likely to be wrong?

If you can answer those, you have enough.

Conclusion

Reading legacy code well is one of the most undertaught skills in this field. Engineers spend years learning to write code and approximately zero hours learning to read it.

It might have become clear that there are some aspects where agents have the advantage over humans: whenever there are many code paths to explore or to document, they never tire and never become bored (at least that's what we like to think).

This is why we started to work on Surveyor, which extracts specifications from legacy systems and turns them into runnable BDD tests. That's a different conversation if you're curious. The reading skill matters either way.


Christoph Beck

Head of Intergalactic Mischief
