System Design: Local-First AI Coding Assistant
1. System Architecture & Methodology
This design outlines a comprehensive, local-first system for an AI coding assistant. The architecture prioritizes data privacy, low latency, and offline capability by running all components—inference, embedding, and storage—on the user's machine[1].
1.1 High-Level Architecture
The system follows a split-process architecture to decouple the user interface from heavy computational tasks. It consists of three primary subsystems[2][3]:
- IDE Client (Frontend): A lightweight extension (VS Code/JetBrains) that handles user interactions, renders the chat interface, and captures editor events (cursor movement, file changes).
- RAG & Analysis Engine (Middleware): A local server (often running as a sidecar or LSP) that manages the context. It handles file indexing, vector retrieval, and prompt engineering.
- Inference Layer (Backend): A dedicated model server (e.g., Ollama, specialized Python server) that hosts the LLMs for chat and autocomplete.
1.2 Detailed Data Flow
A. Indexing Pipeline (Background Process)
To achieve "codebase awareness," the system maintains a real-time index of the project.
- Trigger: File watcher detects changes (save/create/delete).
- Processing: The RAG Engine reads the file and parses it (AST-based parsing is preferred over regex for code)[4].
- Embedding: Code chunks are sent to a local embedding model (e.g., nomic-embed-text-v1.5 or bge-m3)[1].
- Storage: Vectors and metadata (filepath, line numbers, symbols) are upserted into the local vector store (e.g., LanceDB)[12].
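The four pipeline steps can be sketched end-to-end. This is a minimal illustration, not the production path: `embed` is a placeholder for a real embedding-model call, naive blank-line chunking stands in for AST-based splitting, and an in-memory dict stands in for LanceDB.

```python
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for a call to the local embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

# In-memory stand-in for the local vector store (LanceDB in the real system).
vector_store: dict[tuple[str, int], dict] = {}

def index_file(path: str, source: str) -> int:
    """Re-index one file on a watcher event: chunk, embed, upsert."""
    # Naive blank-line chunking; the real pipeline uses AST-based splitting.
    chunks = [c for c in source.split("\n\n") if c.strip()]
    for i, chunk in enumerate(chunks):
        vector_store[(path, i)] = {
            "vector": embed(chunk),
            "text": chunk,
            "path": path,
        }
    return len(chunks)
```

The file watcher would call `index_file` on every save/create event, replacing the file's previous entries.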
B. Retrieval-Augmented Generation (Chat Flow)
When a user asks a question about the codebase:
- Query Processing: The IDE sends the user's query to the RAG Engine.
- Retrieval: The engine converts the query to a vector and searches the local database for semantically relevant code snippets.
- Reranking (Optional): A lightweight local cross-encoder reranks results to filter out irrelevant matches.
- Prompt Construction: A prompt is built including:
- System instructions (persona, style guide).
- Retrieved context (snippets with file paths).
- Current open file and cursor position.
- Conversation history.
- Inference: The prompt is sent to the local Inference Server (e.g., DeepSeek-Coder-6.7B)[10].
- Response: The generated answer is streamed back to the IDE client.
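Steps 2-3 of this flow reduce to a nearest-neighbour search over the stored vectors. A minimal sketch, with a stub `embed` function and brute-force cosine similarity standing in for the vector database's index:

```python
import math

def embed(text: str) -> list[float]:
    """Stub: the real system calls the local embedding model here."""
    return [float(text.count(c)) for c in "aeiou"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, store: list[dict], k: int = 3) -> list[dict]:
    """Return the k snippets most similar to the query vector."""
    q = embed(query)
    ranked = sorted(store, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return ranked[:k]
```

The retrieved rows (snippet text plus file-path metadata) then feed directly into the prompt-construction step.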
C. Autocomplete (FIM Flow)
Autocomplete requires sub-50ms latency and typically bypasses the heavy RAG pipeline.
- Context: Uses a model trained for "Fill-In-the-Middle" (FIM) completion. The context window is populated with the text before (prefix) and after (suffix) the cursor.
- RAG-Light: Uses a "sliding window" of recently accessed files or imports for context, rather than a full vector search, to minimize latency.
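A FIM request is assembled as a single string of sentinel tokens around the prefix and suffix. The token names below follow the Qwen2.5-Coder convention; other FIM-capable models (StarCoder2, CodeLlama) use different sentinels, so treat the exact tokens as a model-specific assumption.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-In-the-Middle prompt; the model generates the 'middle'."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Example: the cursor sits between the function signature and the return.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result",
)
```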
1.3 Communication Layer: LSP vs. Sidecar Pattern
A critical design decision is how the IDE extension communicates with the local AI backend. While the Language Server Protocol (LSP) is the standard for code intelligence, it is ill-suited for the streaming, stateful nature of AI chat.
| Feature | Language Server Protocol (LSP) | HTTP/WebSocket Sidecar (Recommended) |
|---|---|---|
| Primary Use Case | Deterministic tasks (Go to Definition, Hover, Diagnostics)[16]. | Generative tasks (Chat, Streaming, Long-running Context)[17]. |
| State Management | Stateless (mostly); relies on opening/closing files. | Stateful; maintains conversation history and retrieval cache. |
| Streaming | Not natively supported; requires custom extensions. | Native; Server-Sent Events (SSE) or WebSocket streaming[18]. |
| Adoption | Used by LSP-AI (experimental). | Used by Continue.dev, GitHub Copilot (Chat). |
Recommendation: Use a hybrid approach. Use standard LSP for the "Autocomplete/Ghost Text" feature (where low latency <50ms is critical) to hook into the editor's native typing events. Use a dedicated HTTP Sidecar (running on localhost) for the Chat Interface to handle RAG processing, heavy retrievals, and LLM streaming without blocking the editor's main thread[17].
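On the sidecar path, streamed tokens typically arrive as Server-Sent Events. A minimal parser for the `data:` framing, independent of any particular HTTP client; the JSON payload shape (a `token` field) is an illustrative assumption modelled on common streaming APIs.

```python
import json

def parse_sse_chunk(raw: str) -> list[str]:
    """Extract token payloads from a block of SSE 'data:' lines."""
    tokens = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # SSE comments/blank keep-alive lines are ignored
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # common end-of-stream sentinel
            break
        tokens.append(json.loads(payload)["token"])
    return tokens

stream = 'data: {"token": "Hello"}\ndata: {"token": " world"}\ndata: [DONE]\n'
```

The IDE client appends each parsed token to the chat view as it arrives, giving the streaming effect without blocking the editor.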
1.4 Frontend Implementation Specifications (VS Code)
To bridge the high-level architecture with actual code, the VS Code extension should utilize the following specific APIs:
- Chat Interface: Use the `WebviewViewProvider` API to render a React/Vue application in the Primary Sidebar. Communication uses the `postMessage` protocol to send user queries to the Sidecar[29].
- Ghost Text: Use the `InlineCompletionItemProvider` API. This provider is triggered on every keystroke. It must return an array of `InlineCompletionItem` objects containing the `insertText` generated by the Autocomplete Model[30].
2. Codebase Awareness: Indexing Methodology
The system's ability to understand the entire repository ("codebase awareness") depends entirely on how code is parsed, chunked, and indexed. In 2024-2025, simple text processing is considered insufficient for production-grade coding assistants[6][7].
2.1 Comparison: AST-Based vs. Text-Based Splitting
| Feature | Text-Based Splitting (Legacy) | AST-Based Splitting (State-of-the-Art) |
|---|---|---|
| Methodology | Splits by fixed character/token count (e.g., every 512 tokens). | Parses code into an Abstract Syntax Tree (AST) to identify logical boundaries[4]. |
| Context Integrity | Often cuts through functions or classes, severing context. | Preserves complete semantic units (functions, classes, structs)[6]. |
| Retrieval Quality | Lower precision; retrieves fragmented code. | High precision; retrieves executable/logically complete blocks[7]. |
| Tooling | RecursiveCharacterTextSplitter (LangChain). | Tree-sitter, specialized language parsers. |
2.2 Recommended Approach: Structural & Graph-Based Indexing
For a robust local-first system, we recommend a hybrid approach that combines AST parsing with dependency graph analysis. This ensures that when a function is retrieved, its relevant context (what it calls, what calls it) is also understood.
A. The Parsing Pipeline (Tree-sitter)
Tree-sitter is the industry standard for this task due to its speed and incremental parsing capabilities. The pipeline works as follows[4][6]:
- Parse: Generate a concrete syntax tree for each file.
- Traverse: Walk the tree to extract nodes of interest (e.g., `function_definition`, `class_declaration`).
- Scope Resolution: Identify the parent scope for each node to attach metadata (e.g., "Function `process_data` inside class `DataHandler`").
B. Functional Chunking Strategy
Instead of arbitrary chunks, create "functional chunks":
- Granularity: One chunk per function or method.
- Augmentation: If a function is too small (<5 lines), merge it with its parent class context. If too large (>100 lines), use a "hierarchical summary"—embed the function signature and docstring, but keep the body as a separate retrieval target.
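The functional-chunking rule can be demonstrated with Python's stdlib `ast` module standing in for Tree-sitter (which the real, multi-language pipeline would use). Each top-level function or class becomes one chunk, carrying the symbol name and line range as metadata:

```python
import ast

def functional_chunks(source: str) -> list[dict]:
    """One chunk per top-level function/class, with line metadata attached."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```

The merge/summarize rules for very small or very large functions would be applied as a post-pass over this chunk list.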
C. Dependency Graphing (Advanced)
To solve the "missing context" problem where an LLM sees a function call but not its definition:
- Symbol Table: Maintain a lightweight local symbol table mapping `Symbol Name` -> `File Path/Line`[7].
- Graph Traversal: During retrieval, if the retrieved chunk contains calls to other internal functions, perform a secondary lookup in the symbol table to fetch their signatures or bodies.
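A minimal sketch of that secondary lookup, with the symbol table as a plain dict and call extraction via Python's `ast` module (again a stand-in for Tree-sitter queries); the entries in `SYMBOLS` are illustrative:

```python
import ast

# Symbol table built during indexing: name -> (file path, line).
SYMBOLS = {
    "helper": ("src/utils.py", 12),
    "process_data": ("src/handler.py", 40),
}

def resolve_calls(chunk_source: str) -> dict[str, tuple[str, int]]:
    """Find internal functions called in a chunk and resolve their locations."""
    calls = {
        node.func.id
        for node in ast.walk(ast.parse(chunk_source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # Names absent from the table (builtins, third-party calls) are dropped.
    return {name: SYMBOLS[name] for name in calls if name in SYMBOLS}
```

The resolved locations are then used to fetch signatures or bodies to append to the retrieved context.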
2.3 Tooling for Dependency Graphing
To implement the "Symbol Table" and "Graph Traversal" effectively without building a full compiler frontend, we recommend a dual-tool approach:
| Component | Tool | Role |
|---|---|---|
| Symbol Index | Universal Ctags | Generates a lightweight tags file (JSON/Exuberant format) mapping symbol names to file paths. It supports 40+ languages out of the box and is significantly faster than LSP indexing[27]. |
| Call Graph | Tree-sitter | Parses the current file to identify function calls. These calls are then looked up in the Ctags index to resolve their definitions[28]. |
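With `ctags --output-format=json --fields=+n`, Universal Ctags emits one JSON object per line, which maps directly onto the symbol table. A sketch of loading that output (the sample line is illustrative; field availability depends on the `--fields` flags used):

```python
import json

def load_ctags(json_lines: str) -> dict[str, tuple[str, int]]:
    """Build a symbol table from Universal Ctags JSON-lines output."""
    table = {}
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        tag = json.loads(line)
        if tag.get("_type") == "tag":  # skip pseudo-tags and metadata records
            table[tag["name"]] = (tag["path"], tag.get("line", 0))
    return table

sample = '{"_type": "tag", "name": "helper", "path": "src/utils.py", "line": 12, "kind": "function"}'
```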
3. Model Selection & Hardware Strategy
Running state-of-the-art coding models locally requires balancing model capability (parameters) with hardware constraints (VRAM). For a responsive experience in 2025, we recommend a dual-model approach: a larger "Instruct" model for chat and a smaller "Base" model for low-latency autocomplete.
3.1 Recommended Models (2024-2025)
A. Primary Chat Model (7B Class)
For the "brain" of the assistant (answering questions, refactoring code), Qwen2.5-Coder-7B-Instruct is currently the top performer in the sub-10B category, consistently outperforming previous leaders like DeepSeek-Coder-V2-Lite and StarCoder2 on benchmarks (HumanEval, MBPP)[8][9].
| Model | Parameters | Best For | VRAM (4-bit Quant) |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | 7B | Overall Best (Reasoning, Multi-lang)[8] | ~5.5 GB[10] |
| DeepSeek-Coder-V2-Lite | 16B (MoE) | Complex Logic (if hardware permits) | ~10 GB[10] |
| Llama-3.1-8B-Instruct | 8B | General Conversation + Code | ~6 GB |
B. Autocomplete Model (1B-3B Class)
Autocomplete ("ghost text") requires execution in under 50ms. Using a 7B model here often introduces noticeable lag on consumer GPUs. We recommend smaller, specialized models that support FIM (Fill-In-the-Middle).
- Recommendation: Qwen2.5-Coder-1.5B or StarCoder2-3B.
- Why: These run comfortably in <2GB VRAM, allowing them to coexist with the chat model, and deliver near-instant completions[9].
3.2 Hardware Requirements & Quantization
To run these models locally, Quantization (reducing precision from 16-bit to 4-bit) is essential for consumer GPUs. Modern "GGUF" format quantizations (e.g., Q4_K_M) retain ~95% of performance while halving memory usage[10][11].
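The VRAM figures below can be sanity-checked with back-of-the-envelope arithmetic. Q4_K_M averages roughly 4.8 bits per weight (an approximation; the exact ratio varies per model), and the published totals add KV cache and runtime overhead on top of the raw weights:

```python
def model_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# 7B model at ~4.8 bits/weight (Q4_K_M) -> roughly 3.9 GB of weights;
# the ~5.5 GB figure cited for Qwen2.5-Coder-7B includes KV cache and overhead.
weights_gb = model_vram_gb(7, 4.8)
```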
| Hardware Tier | VRAM | Recommended Configuration |
|---|---|---|
| Entry (Consumer) | 8 GB | Chat: Qwen2.5-7B (Q4_K_M); Autocomplete: Qwen2.5-1.5B (Q4_K_M). Note: tight fit; may need to offload some layers to CPU. |
| Mid-Range | 12-16 GB | Chat: Qwen2.5-7B (Q8_0) or DeepSeek-V2-Lite (Q4); Autocomplete: StarCoder2-3B (FP16) |
| High-End | 24 GB+ | Chat: Qwen2.5-32B (Q4) or Codestral-22B; Autocomplete: StarCoder2-7B |
3.3 Inference Concurrency Strategy
Running two distinct models (Chat and Autocomplete) on a single consumer GPU presents a scheduling challenge: a long-running chat generation can block the GPU, causing autocomplete latency to spike beyond the 50ms target.
A. Dual-Server Architecture
Instead of a single server process, we recommend running two separate instances of the inference engine (e.g., llama-server) on different ports:
- Port 8080 (Chat): Loads the 7B Instruct model. Configured with a higher context window (e.g., 8192).
- Port 8081 (Autocomplete): Loads the 1B FIM model. Configured with a minimal context (e.g., 2048) and a forced `f16` or `q4_0` KV cache to save VRAM.
B. Request Cancellation & Debouncing
To prevent the Autocomplete model from flooding the queue:
- Debounce: The IDE Client should wait for 300ms of user inactivity before sending a FIM request.
- AbortController: If the user types a new character while a request is pending, the client must immediately abort the in-flight HTTP request (via `AbortController`). The server must support request cancellation to free up the GPU slot immediately[26].
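Both rules can be sketched with `asyncio`: each keystroke cancels the pending request task and schedules a new one behind the debounce delay. The `_request` body is a stand-in for the real HTTP call to the FIM server.

```python
import asyncio

class CompletionScheduler:
    """Debounce keystrokes and cancel superseded FIM requests."""

    def __init__(self, delay: float = 0.3):
        self.delay = delay
        self.task: asyncio.Task | None = None
        self.completed = []  # requests that actually reached the model

    async def _request(self, text: str) -> None:
        await asyncio.sleep(self.delay)  # debounce window; cancellable
        self.completed.append(text)      # stand-in for the HTTP call

    def on_keystroke(self, text: str) -> None:
        if self.task and not self.task.done():
            self.task.cancel()           # abort the superseded request
        self.task = asyncio.get_running_loop().create_task(self._request(text))
```

With this in place, rapid typing produces at most one request: the one issued after the final keystroke.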
4. Data & Retrieval Layer
The retrieval pipeline connects the raw code to the LLM. It consists of the vector store, the embedding model, and the reranking step. All components must run locally.
4.1 Vector Database Selection
The vector database stores the embeddings of the codebase. For a local-first application, the database must be embedded (running in the same process as the application) to avoid the operational complexity of managing a separate server container (like Docker)[12].
| Feature | LanceDB (Recommended) | Chroma | SQLite-vec (Lightweight) |
|---|---|---|---|
| Architecture | Serverless, File-Based (Lance format)[12]. | Client-Server (often requires Docker/Python process)[13]. | SQLite Extension (C-based)[14]. |
| Performance | High (Zero-copy access, Rust-based). | Medium (Python overhead in local mode). | High (for small datasets)[15]. |
| Memory Usage | Low (Disk-based index, efficient caching)[13]. | High (Often loads collection to memory). | Very Low. |
| Hybrid Search | Native support (Vector + Full Text)[12]. | Limited. | Requires complex SQL queries. |
4.2 Recommendation: LanceDB
We recommend LanceDB for this system architecture. Its ability to perform vector search directly on disk-based files without loading the entire index into RAM makes it ideal for indexing large repositories on consumer hardware[12][13].
- Zero-Setup: It runs as a library within the RAG Engine (Python/Node.js), requiring no external background service.
- Multimodal: Future-proofs the system if you decide to index images (UI screenshots) alongside code.
4.3 Embedding Model Selection
The embedding model determines the retrieval quality. For code, general-purpose models often fail. We recommend models trained specifically on code or with large context windows.
| Model | Context Window | Size (Params) | Best For |
|---|---|---|---|
| nomic-embed-text-v1.5 (Recommended) | 8192 | 137M | Long Context; specific search_document prefix for retrieval[19]. |
| snowflake-arctic-embed-m | 512 | 110M | High Precision on benchmarks (MTEB)[20]. |
| all-MiniLM-L6-v2 | 512 | 22M | Extreme Speed (Legacy option, lower quality). |
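Note that nomic-embed-text-v1.5 expects task-specific prefixes: documents are embedded as `search_document: ...` and queries as `search_query: ...`. A small helper keeps the two sides of the index consistent:

```python
def prepare_for_embedding(text: str, is_query: bool) -> str:
    """Prepend the task prefix nomic-embed-text-v1.5 expects."""
    prefix = "search_query: " if is_query else "search_document: "
    return prefix + text
```

Code chunks get the `search_document` prefix at indexing time; user questions get `search_query` at retrieval time. Mixing these up silently degrades retrieval quality.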
4.4 Reranking Strategy (Local)
To improve precision without running a massive LLM, use a Cross-Encoder to re-score the top 20-50 results from the vector search. This step is computationally expensive compared to vector search, so the model must be lightweight.
- CPU-Only Setup: Use `cross-encoder/ms-marco-TinyBERT-L-2-v2`. It is extremely fast (<50ms on CPU) and provides a significant boost over raw vector similarity[21].
- GPU Setup: Use `BAAI/bge-reranker-base` (or its quantized GGUF version). It offers state-of-the-art reranking quality but requires ~200-300ms on CPU, making it better suited for GPU execution[22].
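The retrieve-then-rerank flow itself is simple; here it is with the cross-encoder replaced by a word-overlap stub (a real implementation would score each (query, candidate) pair with TinyBERT or bge-reranker at this point):

```python
def cross_encoder_score(query: str, doc: str) -> float:
    """Stub scorer: the real system runs a cross-encoder on the pair."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score the top vector-search candidates; keep only the best top_k."""
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                    reverse=True)
    return scored[:top_k]
```

Feeding the reranker 20-50 candidates and keeping the top 3-5 trades a little latency for markedly cleaner prompt context.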
5. Prompt Engineering & Context Management
Retrieving the right code is only half the battle; the retrieved context must be presented to the LLM in a format that maximizes adherence to instructions. In 2025, the standard for high-performance coding agents is XML-structured prompting, which clearly delineates instructions from data[23].
5.1 Context Construction Strategy
Naive concatenation of code files often confuses models about where one file ends and another begins. The recommended strategy is to wrap each retrieved snippet in semantic XML tags containing metadata.
5.2 Recommended XML System Prompt Template
The following structure is optimized for models like Qwen2.5-Coder and DeepSeek-Coder, which are trained to recognize structured data:
<system_instructions>
You are an expert software engineer specialized in Python and Rust.
Answer the user's request using the provided Context.
If the answer is not in the context, say so.
</system_instructions>
<context_data>
<retrieved_snippet index="1" score="0.89" path="src/main.rs">
fn main() {
println!("Hello, world!");
}
</retrieved_snippet>
<retrieved_snippet index="2" score="0.85" path="src/utils.rs">
pub fn helper() { ... }
</retrieved_snippet>
</context_data>
<user_query>
{USER_INPUT}
</user_query>
- Explicit Delimiters: Tags like `<retrieved_snippet>` prevent "prompt injection" from the codebase content itself[24].
- Metadata Injection: Including `path` and `score` helps the model cite its sources ("As seen in `src/main.rs`...")[25].
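The template above can be assembled programmatically. Escaping each snippet body guards against code that itself contains the delimiter tags; this is a simple precaution, not a complete injection defence.

```python
from xml.sax.saxutils import escape

def build_prompt(instructions: str, snippets: list[dict], user_query: str) -> str:
    """Render the XML-structured prompt from retrieved snippets."""
    parts = [f"<system_instructions>\n{instructions}\n</system_instructions>",
             "<context_data>"]
    for i, s in enumerate(snippets, start=1):
        parts.append(
            f'<retrieved_snippet index="{i}" score="{s["score"]:.2f}" path="{s["path"]}">\n'
            f'{escape(s["text"])}\n'
            f"</retrieved_snippet>"
        )
    parts.append("</context_data>")
    parts.append(f"<user_query>\n{escape(user_query)}\n</user_query>")
    return "\n".join(parts)
```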