The Problem: Why Your LLM Can't Just "Read" Your Repo
You have a big project with features you want to replicate in a new app.
How do you teach an AI coding assistant (like Cursor, Gemini, or ChatGPT) to reproduce some of those features?
Suppose you feed the whole project structure of the original into a prompt: you will hit the context limit and context rot.
Suppose you take just the snippet of the feature file and ask the AI to replicate that one file: you run into a couple of problems:
- You would need to know the file and all of its dependencies.
- You would also need to know all of the references to the file, in terms of file tree placement and data architecture.
In short, teaching an AI coding assistant about a giant project by pasting the entire codebase, or even just large snippets, fails predictably due to three constraints:
- Context Limit: The AI hits the maximum token limit, leaving most of your project unseen.
- Context Rot (Semantic Drift): Even if the code fits, the volume dilutes the signal, making the AI lose focus on the core task.¹
- Dependency Hell: A single file snippet is useless without all its dependencies, references, architectural patterns, and file tree placement.
What to do?
Never dump raw project context (a.k.a. repo2txt) unless the repo is small enough, which is almost never the case.
The scalable, industry-grade method is to create declarative "contracts" of knowledge, not monolithic source code snapshots.
Part 1: Formalizing Architectural Awareness
The "teaching Cursor" step should be reframed from a prompt problem to a knowledge encapsulation problem.
What this means is that you should define the system's intent and boundaries before showing the code. Let me show you what I mean by that:
1. Formalize the Feature Capsule
A feature should be taught to the AI as a high-level contract, not a raw implementation. This capsule carries the semantic boundaries without the raw code volume.
| Layer | Content | Analogy |
| --- | --- | --- |
| Interface Specification | The feature's inputs, outputs, and side effects. | An API specification. |
| Dependencies | What external modules, services, or libraries it calls. | The list of external services in a microservice contract. |
| Implementation Notes | A brief summary of why it behaves that way (e.g., "uses Redux for state," "implements optimistic locking"). | The design document. |
The resulting artifact should be stored as plain text, like /ai_docs/features/login.md, ready for AI consumption.
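To make the contract concrete, here is one possible shape for such a capsule, sketched as a TypeScript type. The field names (and the login examples in the comments) are illustrative assumptions, not a fixed schema:

```typescript
// A minimal sketch of a feature capsule's structure; field names are illustrative.
interface FeatureCapsule {
  name: string;                // e.g. "login"
  purpose: string;             // one-paragraph summary of what the feature does
  interface: {
    inputs: string[];          // e.g. ["email", "password"]
    outputs: string[];         // e.g. ["session token"]
    sideEffects: string[];     // e.g. ["sets auth cookie", "POST /api/login"]
  };
  dependencies: string[];      // external modules, services, or libraries it calls
  implementationNotes: string; // why it behaves that way ("uses Redux for state", ...)
  sourceFiles: string[];       // where the real implementation lives, for later retrieval
}
```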
2. Create a Code Map Manifest
OK, so how do you actually do that?
One way to do it is to use static analysis tools (e.g., depcruise for JS/TS) to generate a dependency graph out of your repo.
Then, pass the generated graph image to a multimodal AI:
PROMPT:
Convert this graph into a simplified JSON manifest that summarizes:
* Which modules import which (the structural relationships).
* Which files are pure utilities versus those that touch the DB or UI.
This manifest gives the AI context on placement logic and dependency paths without needing the full source code.
It is just an approximation: it will not capture every dependency, nor will it understand how important each dependency is to a given component or how it fits into your new project, but it is a start.
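For illustration, the kind of manifest you might ask for could look like the sketch below. The CodeMapManifest name and its fields are my own assumptions; adapt them to whatever the model actually returns:

```typescript
// Hypothetical shape for the simplified JSON manifest; not a standard format.
interface CodeMapManifest {
  modules: {
    path: string;      // e.g. "src/core/session.ts"
    imports: string[]; // which other modules this file imports
    kind: "pure-utility" | "db" | "ui" | "mixed";
  }[];
}

// A made-up example of what the AI's answer might be converted into.
const example: CodeMapManifest = {
  modules: [
    { path: "src/utils/format.ts", imports: [], kind: "pure-utility" },
    { path: "src/components/LoginForm.tsx", imports: ["src/core/session.ts"], kind: "ui" },
    { path: "src/core/session.ts", imports: ["src/db/client.ts"], kind: "db" },
  ],
};
```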
3. Extract Architectural Pattern Rules
Manually define your project's conventions in a single file, e.g. /architecture.md.
- Example Rules:
  - UI components live in /src/components.
  - Business logic is isolated in /src/core.
  - All features expose useFeatureX() entrypoints (see the sketch below).
This allows the AI to correctly infer where a new feature goes, preventing long-term maintenance problems (entropy from ad hoc additions).
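As an example of the entrypoint rule, a feature under this convention might look like the sketch below. The login feature, the /api/login endpoint, and the hook body are made up for illustration; only the useFeatureX() naming convention comes from the rules above:

```typescript
// src/core/login/useLogin.ts -- hypothetical feature entrypoint following the convention.
import { useState } from "react";

export function useLogin() {
  const [loading, setLoading] = useState(false);

  async function login(email: string, password: string): Promise<boolean> {
    setLoading(true);
    try {
      // The endpoint is a placeholder; the point is that UI components
      // under /src/components call this hook instead of hitting the API directly.
      const res = await fetch("/api/login", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ email, password }),
      });
      return res.ok;
    } finally {
      setLoading(false);
    }
  }

  return { login, loading };
}
```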
4. Encode Codegen Policies
Create a dedicated policy file (e.g., .cursorrules or .prompt-template) that injects declarative guidelines:
rules:
- prefer functional components with hooks
- follow existing folder conventions defined in architecture.md
- all features must have matching tests under /__tests__
context:
- architecture.md
- deps.json
This transforms the AI from a code copier into a procedural generator that adheres to your established internal standards.
Part 2: Sequential Extraction for Scale
Generating the knowledge base is itself a scaling problem. You can summarize everything manually, breaking the project down component by component for more control, but more experienced developers should look into automating the process with progressive decomposition.
1. Programmatic Boundary Detection
Instead of manual guesswork, use static analysis to detect cohesive module clusters. Basically, this goes back to the earlier suggestion of running depcruise on the repo.
npx depcruise src --output-type json > deps.json
# Run a script to cluster files based on intra-dependencies (e.g., 60% internal links).
Each resulting cluster represents a feature capsule candidate, marking a clear functional boundary.
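Here is a rough sketch of what that clustering script could look like. It assumes dependency-cruiser's JSON output shape (modules[].source and modules[].dependencies[].resolved; verify against your version) and a deliberately naive heuristic: each second-level folder under src/ is a cluster candidate, kept only if at least 60% of its outgoing edges stay inside it:

```typescript
// clusterDeps.ts -- naive boundary detection over a dependency-cruiser deps.json.
import { readFileSync, writeFileSync } from "node:fs";

type DepGraph = {
  modules: { source: string; dependencies: { resolved: string }[] }[];
};

// Candidate cluster = second-level folder, e.g. "src/features/login".
function clusterKey(file: string): string {
  return file.split("/").slice(0, 3).join("/");
}

const graph: DepGraph = JSON.parse(readFileSync("deps.json", "utf8"));
const clusters = new Map<string, { files: string[]; internal: number; total: number }>();

for (const mod of graph.modules) {
  const key = clusterKey(mod.source);
  const entry = clusters.get(key) ?? { files: [], internal: 0, total: 0 };
  entry.files.push(mod.source);
  for (const dep of mod.dependencies) {
    entry.total += 1;
    if (clusterKey(dep.resolved) === key) entry.internal += 1; // intra-cluster edge
  }
  clusters.set(key, entry);
}

// Keep only cohesive clusters: >= 60% of their edges point inside the cluster.
const candidates = [...clusters.entries()]
  .filter(([, c]) => c.total === 0 || c.internal / c.total >= 0.6)
  .map(([key, c]) => ({ slug: key.replace(/\//g, "-"), files: c.files }));

writeFileSync("clusters.json", JSON.stringify(candidates, null, 2));
```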
2. Auto-Summarize Each Cluster (The Sequential Loop)
For each detected cluster (a small, cohesive set of files), run an automated process:
- Isolate: Load only the files belonging to the current cluster.
- Extract: Pull top-level function signatures and module exports (the "interface surface").
- Summarize: Feed the extracted interface and a fixed template into a powerful LLM:
  Prompt: Summarize this module group into a Feature Capsule. Include the Name, Purpose, Main Input/Output, Key Dependencies, Side Effects, and an Example Usage Pattern.
- Store: Save the resulting markdown to /ai_docs/features/<slug>.md.
The intention is to let the LLM encode system knowledge without embedding the code itself. Generating code later is the easy part; the hard part is getting the system knowledge and design right, and within context limits.
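A sketch of that loop, under the same assumptions as before: a clusters.json from the previous step, summarizeWithLLM as a stub for whichever provider you use, and a crude export-matching regex standing in for real signature extraction:

```typescript
// generateCapsules.ts -- sequential loop: isolate, extract, summarize, store.
import { readFileSync, writeFileSync, mkdirSync } from "node:fs";
import { basename } from "node:path";

type Cluster = { slug: string; files: string[] };

// Crude "interface surface": keep only exported top-level declarations.
function extractInterface(source: string): string {
  return source
    .split("\n")
    .filter((line) =>
      /^\s*export\s+(default\s+)?(async\s+)?(function|const|class|interface|type)/.test(line)
    )
    .join("\n");
}

// Placeholder: plug in the LLM provider of your choice here.
async function summarizeWithLLM(prompt: string): Promise<string> {
  throw new Error("not implemented -- call your LLM provider");
}

async function main(): Promise<void> {
  const clusters: Cluster[] = JSON.parse(readFileSync("clusters.json", "utf8"));
  mkdirSync("ai_docs/features", { recursive: true });

  for (const cluster of clusters) {
    // Isolate: load only this cluster's files; Extract: keep the interface surface.
    const surface = cluster.files
      .map((f) => `// ${basename(f)}\n${extractInterface(readFileSync(f, "utf8"))}`)
      .join("\n\n");

    // Summarize: fixed template + interface surface, never the full source.
    const prompt =
      "Summarize this module group into a Feature Capsule. Include the Name, Purpose, " +
      "Main Input/Output, Key Dependencies, Side Effects, and an Example Usage Pattern.\n\n" +
      surface;

    // Store: one markdown capsule per cluster.
    const capsule = await summarizeWithLLM(prompt);
    writeFileSync(`ai_docs/features/${cluster.slug}.md`, capsule);
  }
}

main();
```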
3. Continuous Maintenance and Indexing
- Indexing: Generate an index.json file that maps each feature capsule ID to its corresponding source files and dependencies. This allows the AI tooling to quickly search and retrieve only the necessary capsules for any given task (see the sketch after this list).
- Verification: Manual verification passes are performed only on the most critical capsules (auth, billing). Mark these as verified: true in their frontmatter.
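One possible shape for that index.json, expressed as a TypeScript type; the field names are assumptions chosen to match the capsules described above, not a fixed schema:

```typescript
// Illustrative index.json entry; adjust the fields to your own capsule format.
interface CapsuleIndexEntry {
  id: string;            // e.g. "login"
  capsule: string;       // e.g. "ai_docs/features/login.md"
  sourceFiles: string[]; // files the capsule was generated from
  dependsOn: string[];   // ids of other capsules this one references
  verified: boolean;     // true only after a manual review pass (auth, billing, ...)
}

type CapsuleIndex = CapsuleIndexEntry[];
```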
CI/CD Integration: Integrate the capsule refresh process into your CI/CD pipeline.
jobs:
  update-capsules:
    runs-on: ubuntu-latest
    steps:
      - name: Generate Dependency Graph
        run: npx depcruise src --output-type json > deps.json
      - name: Update Feature Capsules
        run: node updateFeatureCapsules.js
This transforms your AI documentation layer into a dynamic knowledge graph that evolves automatically with your codebase, ready to be queried by an LLM at any time without overloading its context.
Disclaimer
AI was used to assist with the writing of this article. Its main purpose was to help me phrase descriptions more precisely for the audience of a technical article.
Certain practices here were suggested by AI (e.g., depcruise), and I have not personally run them on enough projects to know their technical limits. All I know is that they have helped me with my problem, and I'm writing this mainly as a reference for myself if I ever encounter the same problem again in the future.
Experiment to test out this method's power
If you want to truly test the power and limits of this method, my suggestion is to:
- Go to YouTube and search for an app tutorial.
- Go to the tutorial's final repo, download it, and follow the instructions suggested here to generate /ai_docs/features and architecture.md.
- From the AI-generated /ai_docs/features and architecture.md, attempt to reproduce the app WITHOUT referencing YouTube.
- Verify whether the vibe-coded app matches the original in functionality and ease of maintenance, compared to the app built by following the YouTube tutorial step by step.
Footnotes
¹ If you ever had any doubts in the past, computer science theory has now confirmed that multi-tasking is as unproductive for bots as it is for humans.