System Design: Local-First AI Coding Assistant
1. System Architecture & Methodology
This design outlines a comprehensive, local-first system for an AI coding assistant. The architecture prioritizes data privacy, low latency, and offline capability by running all components—inference, embedding, and storage—on the user's machine[1].
1.1 High-Level Architecture
The system follows a split-process architecture to decouple the user interface from heavy computational tasks. It consists of three primary subsystems[2][3]:
- IDE Client (Frontend): A lightweight extension (VS Code/JetBrains) that handles user interactions, renders the chat interface, and captures editor events (cursor movement, file changes).
- RAG & Analysis Engine (Middleware): A local server (often running as a sidecar or LSP) that manages the context. It handles file indexing, vector retrieval, and prompt engineering.
- Inference Layer (Backend): A dedicated model server (e.g., Ollama, specialized Python server) that hosts the LLMs for chat and autocomplete.
1.2 Detailed Data Flow
A. Indexing Pipeline (Background Process)
To achieve "codebase awareness," the system maintains a real-time index of the project.
- Trigger: File watcher detects changes (save/create/delete).
- Processing: The RAG Engine reads the file and parses it (AST-based parsing is preferred over regex for code)[4].
- Embedding: Code chunks are sent to a local embedding model (e.g., nomic-embed-text-v1.5 or bge-m3)[1].
- Storage: Vectors and metadata (filepath, line numbers, symbols) are upserted into the local vector store (e.g., LanceDB)[12].
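The four pipeline steps can be sketched end-to-end. This is a minimal illustration, not the production path: `embed` is a placeholder for a real embedding-model call, naive blank-line chunking stands in for AST-based splitting, and an in-memory dict stands in for LanceDB.

```python
import hashlib

def embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for a call to the local embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

# In-memory stand-in for the local vector store (LanceDB in the real system).
vector_store: dict[tuple[str, int], dict] = {}

def index_file(path: str, source: str) -> int:
    """Re-index one file on a watcher event: chunk, embed, upsert."""
    # Naive blank-line chunking; the real pipeline uses AST-based splitting.
    chunks = [c for c in source.split("\n\n") if c.strip()]
    for i, chunk in enumerate(chunks):
        vector_store[(path, i)] = {
            "vector": embed(chunk),
            "text": chunk,
            "path": path,
        }
    return len(chunks)
```

The file watcher would call `index_file` on every save/create event, replacing the file's previous entries.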
B. Retrieval-Augmented Generation (Chat Flow)
When a user asks a question about the codebase:
- Query Processing: The IDE sends the user's query to the RAG Engine.
- Retrieval: The engine converts the query to a vector and searches the local database for semantically relevant code snippets.
- Reranking (Optional): A lightweight local cross-encoder reranks results to filter out irrelevant matches.
- Prompt Construction: A prompt is built including:
- System instructions (persona, style guide).
- Retrieved context (snippets with file paths).
- Current open file and cursor position.
- Conversation history.
- Inference: The prompt is sent to the local Inference Server (e.g., DeepSeek-Coder-6.7B)[10].
- Response: The generated answer is streamed back to the IDE client.
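Steps 2-3 of this flow reduce to a nearest-neighbour search over the stored vectors. A minimal sketch, with a stub `embed` function and brute-force cosine similarity standing in for the vector database's index:

```python
import math

def embed(text: str) -> list[float]:
    """Stub: the real system calls the local embedding model here."""
    return [float(text.count(c)) for c in "aeiou"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, store: list[dict], k: int = 3) -> list[dict]:
    """Return the k snippets most similar to the query vector."""
    q = embed(query)
    ranked = sorted(store, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return ranked[:k]
```

The retrieved rows (snippet text plus file-path metadata) then feed directly into the prompt-construction step.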
C. Autocomplete (FIM Flow)
Autocomplete requires sub-50ms latency and typically bypasses the heavy RAG pipeline.
- Context: Uses a model trained for "Fill-In-the-Middle" (FIM) completion. The context window is populated with the text before (prefix) and after (suffix) the cursor.
- RAG-Light: Uses a "sliding window" of recently accessed files or imports for context, rather than a full vector search, to minimize latency.
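A FIM request is assembled as a single string of sentinel tokens around the prefix and suffix. The token names below follow the Qwen2.5-Coder convention; other FIM-capable models (StarCoder2, CodeLlama) use different sentinels, so treat the exact tokens as a model-specific assumption.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-In-the-Middle prompt; the model generates the 'middle'."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Example: the cursor sits between the function signature and the return.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result",
)
```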
1.3 Communication Layer: LSP vs. Sidecar Pattern
A critical design decision is how the IDE extension communicates with the local AI backend. While the Language Server Protocol (LSP) is the standard for code intelligence, it is ill-suited for the streaming, stateful nature of AI chat.
| Feature | Language Server Protocol (LSP) | HTTP/WebSocket Sidecar (Recommended) |
|---|---|---|
| Primary Use Case | Deterministic tasks (Go to Definition, Hover, Diagnostics)[16]. | Generative tasks (Chat, Streaming, Long-running Context)[17]. |
| State Management | Stateless (mostly); relies on opening/closing files. | Stateful; maintains conversation history and retrieval cache. |
| Streaming | Not natively supported; requires custom extensions. | Native; Server-Sent Events (SSE) or WebSocket streaming[18]. |
| Adoption | Used by LSP-AI (experimental). | Used by Continue.dev, GitHub Copilot (Chat). |
Recommendation: Use a hybrid approach. Use standard LSP for the "Autocomplete/Ghost Text" feature (where low latency <50ms is critical) to hook into the editor's native typing events. Use a dedicated HTTP Sidecar (running on localhost) for the Chat Interface to handle RAG processing, heavy retrievals, and LLM streaming without blocking the editor's main thread[17].
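On the sidecar path, streamed tokens typically arrive as Server-Sent Events. A minimal parser for the `data:` framing, independent of any particular HTTP client; the JSON payload shape (a `token` field) is an illustrative assumption modelled on common streaming APIs.

```python
import json

def parse_sse_chunk(raw: str) -> list[str]:
    """Extract token payloads from a block of SSE 'data:' lines."""
    tokens = []
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # SSE comments/blank keep-alive lines are ignored
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # common end-of-stream sentinel
            break
        tokens.append(json.loads(payload)["token"])
    return tokens

stream = 'data: {"token": "Hello"}\ndata: {"token": " world"}\ndata: [DONE]\n'
```

The IDE client appends each parsed token to the chat view as it arrives, giving the streaming effect without blocking the editor.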
1.4 Frontend Implementation Specifications (VS Code)
To bridge the high-level architecture with actual code, the VS Code extension should utilize the following specific APIs:
- Chat Interface: Use the `WebviewViewProvider` API to render a React/Vue application in the Primary Sidebar. Communication uses the `postMessage` protocol to send user queries to the Sidecar[29].
- Ghost Text: Use the `InlineCompletionItemProvider` API. This provider is triggered on every keystroke. It must return an array of `InlineCompletionItem` objects containing the `insertText` generated by the Autocomplete Model[30].
2. Codebase Awareness: Indexing Methodology
The system's ability to understand the entire repository ("codebase awareness") depends entirely on how code is parsed, chunked, and indexed. In 2024-2025, simple text processing is considered insufficient for production-grade coding assistants[6][7].
2.1 Comparison: AST-Based vs. Text-Based Splitting
| Feature | Text-Based Splitting (Legacy) | AST-Based Splitting (State-of-the-Art) |
|---|---|---|
| Methodology | Splits by fixed character/token count (e.g., every 512 tokens). | Parses code into an Abstract Syntax Tree (AST) to identify logical boundaries[4]. |
| Context Integrity | Often cuts through functions or classes, severing context. | Preserves complete semantic units (functions, classes, structs)[6]. |
| Retrieval Quality | Lower precision; retrieves fragmented code. | High precision; retrieves executable/logically complete blocks[7]. |
| Tooling | RecursiveCharacterTextSplitter (LangChain). | Tree-sitter, specialized language parsers. |
2.2 Recommended Approach: Structural & Graph-Based Indexing
For a robust local-first system, we recommend a hybrid approach that combines AST parsing with dependency graph analysis. This ensures that when a function is retrieved, its relevant context (what it calls, what calls it) is also understood.
A. The Parsing Pipeline (Tree-sitter)
Tree-sitter is the industry standard for this task due to its speed and incremental parsing capabilities. The pipeline works as follows[4][6]:
- Parse: Generate a concrete syntax tree for each file.
- Traverse: Walk the tree to extract nodes of interest (e.g., `function_definition`, `class_declaration`).
- Scope Resolution: Identify the parent scope for each node to attach metadata (e.g., "Function `process_data` inside class `DataHandler`").
B. Functional Chunking Strategy
Instead of arbitrary chunks, create "functional chunks":
- Granularity: One chunk per function or method.
- Augmentation: If a function is too small (<5 lines), merge it with its parent class context. If too large (>100 lines), use a "hierarchical summary"—embed the function signature and docstring, but keep the body as a separate retrieval target.
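The functional-chunking rule can be demonstrated with Python's stdlib `ast` module standing in for Tree-sitter (which the real, multi-language pipeline would use). Each top-level function or class becomes one chunk, carrying the symbol name and line range as metadata:

```python
import ast

def functional_chunks(source: str) -> list[dict]:
    """One chunk per top-level function/class, with line metadata attached."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```

The merge/summarize rules for very small or very large functions would be applied as a post-pass over this chunk list.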
C. Dependency Graphing (Advanced)
To solve the "missing context" problem where an LLM sees a function call but not its definition:
- Symbol Table: Maintain a lightweight local symbol table mapping `Symbol Name` -> `File Path/Line`[7].
- Graph Traversal: During retrieval, if the retrieved chunk contains calls to other internal functions, perform a secondary lookup in the symbol table to fetch their signatures or bodies.
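A minimal sketch of that secondary lookup, with the symbol table as a plain dict and call extraction via Python's `ast` module (again a stand-in for Tree-sitter queries); the entries in `SYMBOLS` are illustrative:

```python
import ast

# Symbol table built during indexing: name -> (file path, line).
SYMBOLS = {
    "helper": ("src/utils.py", 12),
    "process_data": ("src/handler.py", 40),
}

def resolve_calls(chunk_source: str) -> dict[str, tuple[str, int]]:
    """Find internal functions called in a chunk and resolve their locations."""
    calls = {
        node.func.id
        for node in ast.walk(ast.parse(chunk_source))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # Names absent from the table (builtins, third-party calls) are dropped.
    return {name: SYMBOLS[name] for name in calls if name in SYMBOLS}
```

The resolved locations are then used to fetch signatures or bodies to append to the retrieved context.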
2.3 Tooling for Dependency Graphing
To implement the "Symbol Table" and "Graph Traversal" effectively without building a full compiler frontend, we recommend a dual-tool approach:
| Component | Tool | Role |
|---|---|---|
| Symbol Index | Universal Ctags | Generates a lightweight tags file (JSON/Exuberant format) mapping symbol names to file paths. It supports 40+ languages out of the box and is significantly faster than LSP indexing[27]. |
| Call Graph | Tree-sitter | Parses the current file to identify function calls. These calls are then looked up in the Ctags index to resolve their definitions[28]. |
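With `ctags --output-format=json --fields=+n`, Universal Ctags emits one JSON object per line, which maps directly onto the symbol table. A sketch of loading that output (the sample line is illustrative; field availability depends on the `--fields` flags used):

```python
import json

def load_ctags(json_lines: str) -> dict[str, tuple[str, int]]:
    """Build a symbol table from Universal Ctags JSON-lines output."""
    table = {}
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        tag = json.loads(line)
        if tag.get("_type") == "tag":  # skip pseudo-tags and metadata records
            table[tag["name"]] = (tag["path"], tag.get("line", 0))
    return table

sample = '{"_type": "tag", "name": "helper", "path": "src/utils.py", "line": 12, "kind": "function"}'
```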
3. Model Selection & Hardware Strategy
Running state-of-the-art coding models locally requires balancing model capability (parameters) with hardware constraints (VRAM). For a responsive experience in 2025, we recommend a dual-model approach: a larger "Instruct" model for chat and a smaller "Base" model for low-latency autocomplete.
3.1 Recommended Models (2024-2025)
A. Primary Chat Model (7B Class)
For the "brain" of the assistant (answering questions, refactoring code), Qwen2.5-Coder-7B-Instruct is currently the top performer in the sub-10B category, consistently outperforming previous leaders like DeepSeek-Coder-V2-Lite and StarCoder2 on benchmarks (HumanEval, MBPP)[8][9].
| Model | Parameters | Best For | VRAM (4-bit Quant) |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | 7B | Overall Best (Reasoning, Multi-lang)[8] | ~5.5 GB[10] |
| DeepSeek-Coder-V2-Lite | 16B (MoE) | Complex Logic (if hardware permits) | ~10 GB[10] |
| Llama-3.1-8B-Instruct | 8B | General Conversation + Code | ~6 GB |
B. Autocomplete Model (1B-3B Class)
Autocomplete ("ghost text") requires execution in under 50ms. Using a 7B model here often introduces noticeable lag on consumer GPUs. We recommend smaller, specialized models that support FIM (Fill-In-the-Middle).
- Recommendation: Qwen2.5-Coder-1.5B or StarCoder2-3B.
- Why: These run comfortably in <2GB VRAM, allowing them to coexist with the chat model, and deliver near-instant completions[9].
3.2 Hardware Requirements & Quantization
To run these models locally, Quantization (reducing precision from 16-bit to 4-bit) is essential for consumer GPUs. Modern "GGUF" format quantizations (e.g., Q4_K_M) retain ~95% of performance while halving memory usage[10][11].
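The VRAM figures below can be sanity-checked with back-of-the-envelope arithmetic. Q4_K_M averages roughly 4.8 bits per weight (an approximation; the exact ratio varies per model), and the published totals add KV cache and runtime overhead on top of the raw weights:

```python
def model_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# 7B model at ~4.8 bits/weight (Q4_K_M) -> roughly 3.9 GB of weights;
# the ~5.5 GB figure cited for Qwen2.5-Coder-7B includes KV cache and overhead.
weights_gb = model_vram_gb(7, 4.8)
```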
| Hardware Tier | VRAM | Recommended Configuration |
|---|---|---|
| Entry (Consumer) | 8 GB | Chat: Qwen2.5-7B (Q4_K_M); Autocomplete: Qwen2.5-1.5B (Q4_K_M). Note: tight fit; may need to offload some layers to CPU. |
| Mid-Range | 12-16 GB | Chat: Qwen2.5-7B (Q8_0) or DeepSeek-V2-Lite (Q4); Autocomplete: StarCoder2-3B (FP16) |
| High-End | 24 GB+ | Chat: Qwen2.5-32B (Q4) or Codestral-22B; Autocomplete: StarCoder2-7B |
3.3 Inference Concurrency Strategy
Running two distinct models (Chat and Autocomplete) on a single consumer GPU presents a scheduling challenge: a long-running chat generation can block the GPU, causing autocomplete latency to spike beyond the 50ms target.
A. Dual-Server Architecture
Instead of a single server process, we recommend running two separate instances of the inference engine (e.g., llama-server) on different ports:
- Port 8080 (Chat): Loads the 7B Instruct model. Configured with a higher context window (e.g., 8192).
- Port 8081 (Autocomplete): Loads the 1B FIM model. Configured with a minimal context (e.g., 2048) and a forced `f16` or `q4_0` KV cache to save VRAM.
B. Request Cancellation & Debouncing
To prevent the Autocomplete model from flooding the queue:
- Debounce: The IDE Client should wait for 300ms of user inactivity before sending a FIM request.
- AbortController: If the user types a new character while a request is pending, the client must immediately abort the in-flight HTTP request (via `AbortController`). The server must support request cancellation to free up the GPU slot immediately[26].
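Both rules can be sketched with `asyncio`: each keystroke cancels the pending request task and schedules a new one behind the debounce delay. The `_request` body is a stand-in for the real HTTP call to the FIM server.

```python
import asyncio

class CompletionScheduler:
    """Debounce keystrokes and cancel superseded FIM requests."""

    def __init__(self, delay: float = 0.3):
        self.delay = delay
        self.task: asyncio.Task | None = None
        self.completed = []  # requests that actually reached the model

    async def _request(self, text: str) -> None:
        await asyncio.sleep(self.delay)  # debounce window; cancellable
        self.completed.append(text)      # stand-in for the HTTP call

    def on_keystroke(self, text: str) -> None:
        if self.task and not self.task.done():
            self.task.cancel()           # abort the superseded request
        self.task = asyncio.get_running_loop().create_task(self._request(text))
```

With this in place, rapid typing produces at most one request: the one issued after the final keystroke.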
4. Data & Retrieval Layer
The retrieval pipeline connects the raw code to the LLM. It consists of the vector store, the embedding model, and the reranking step. All components must run locally.
4.1 Vector Database Selection
The vector database stores the embeddings of the codebase. For a local-first application, the database must be embedded (running in the same process as the application) to avoid the operational complexity of managing a separate server container (like Docker)[12].
| Feature | LanceDB (Recommended) | Chroma | SQLite-vec (Lightweight) |
|---|---|---|---|
| Architecture | Serverless, File-Based (Lance format)[12]. | Client-Server (often requires Docker/Python process)[13]. | SQLite Extension (C-based)[14]. |
| Performance | High (Zero-copy access, Rust-based). | Medium (Python overhead in local mode). | High (for small datasets)[15]. |
| Memory Usage | Low (Disk-based index, efficient caching)[13]. | High (Often loads collection to memory). | Very Low. |
| Hybrid Search | Native support (Vector + Full Text)[12]. | Limited. | Requires complex SQL queries. |
4.2 Recommendation: LanceDB
We recommend LanceDB for this system architecture. Its ability to perform vector search directly on disk-based files without loading the entire index into RAM makes it ideal for indexing large repositories on consumer hardware[12][13].
- Zero-Setup: It runs as a library within the RAG Engine (Python/Node.js), requiring no external background service.
- Multimodal: Future-proofs the system if you decide to index images (UI screenshots) alongside code.
4.3 Embedding Model Selection
The embedding model determines the retrieval quality. For code, general-purpose models often fail. We recommend models trained specifically on code or with large context windows.
| Model | Context Window | Size (Params) | Best For |
|---|---|---|---|
| nomic-embed-text-v1.5 (Recommended) | 8192 | 137M | Long Context; specific search_document prefix for retrieval[19]. |
| snowflake-arctic-embed-m | 512 | 110M | High Precision on benchmarks (MTEB)[20]. |
| all-MiniLM-L6-v2 | 512 | 22M | Extreme Speed (Legacy option, lower quality). |
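Note that nomic-embed-text-v1.5 expects task-specific prefixes: documents are embedded as `search_document: ...` and queries as `search_query: ...`. A small helper keeps the two sides of the index consistent:

```python
def prepare_for_embedding(text: str, is_query: bool) -> str:
    """Prepend the task prefix nomic-embed-text-v1.5 expects."""
    prefix = "search_query: " if is_query else "search_document: "
    return prefix + text
```

Code chunks get the `search_document` prefix at indexing time; user questions get `search_query` at retrieval time. Mixing these up silently degrades retrieval quality.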
4.4 Reranking Strategy (Local)
To improve precision without running a massive LLM, use a Cross-Encoder to re-score the top 20-50 results from the vector search. This step is computationally expensive compared to vector search, so the model must be lightweight.
- CPU-Only Setup: Use `cross-encoder/ms-marco-TinyBERT-L-2-v2`. It is extremely fast (<50ms on CPU) and provides a significant boost over raw vector similarity[21].
- GPU Setup: Use `BAAI/bge-reranker-base` (or its quantized GGUF version). It offers state-of-the-art reranking quality but requires ~200-300ms on CPU, making it better suited for GPU execution[22].
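The retrieve-then-rerank flow itself is simple; here it is with the cross-encoder replaced by a word-overlap stub (a real implementation would score each (query, candidate) pair with TinyBERT or bge-reranker at this point):

```python
def cross_encoder_score(query: str, doc: str) -> float:
    """Stub scorer: the real system runs a cross-encoder on the pair."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score the top vector-search candidates; keep only the best top_k."""
    scored = sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                    reverse=True)
    return scored[:top_k]
```

Feeding the reranker 20-50 candidates and keeping the top 3-5 trades a little latency for markedly cleaner prompt context.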
5. Prompt Engineering & Context Management
Retrieving the right code is only half the battle; the retrieved context must be presented to the LLM in a format that maximizes adherence to instructions. In 2025, the standard for high-performance coding agents is XML-structured prompting, which clearly delineates instructions from data[23].
5.1 Context Construction Strategy
Naive concatenation of code files often confuses models about where one file ends and another begins. The recommended strategy is to wrap each retrieved snippet in semantic XML tags containing metadata.
5.2 Recommended XML System Prompt Template
The following structure is optimized for models like Qwen2.5-Coder and DeepSeek-Coder, which are trained to recognize structured data:
<system_instructions>
You are an expert software engineer specialized in Python and Rust.
Answer the user's request using the provided Context.
If the answer is not in the context, say so.
</system_instructions>
<context_data>
<retrieved_snippet index="1" score="0.89" path="src/main.rs">
fn main() {
println!("Hello, world!");
}
</retrieved_snippet>
<retrieved_snippet index="2" score="0.85" path="src/utils.rs">
pub fn helper() { ... }
</retrieved_snippet>
</context_data>
<user_query>
{USER_INPUT}
</user_query>
- Explicit Delimiters: Tags like `<retrieved_snippet>` prevent "prompt injection" from the codebase content itself[24].
- Metadata Injection: Including `path` and `score` helps the model cite its sources ("As seen in `src/main.rs`...")[25].
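The template above can be assembled programmatically. Escaping each snippet body guards against code that itself contains the delimiter tags; this is a simple precaution, not a complete injection defence.

```python
from xml.sax.saxutils import escape

def build_prompt(instructions: str, snippets: list[dict], user_query: str) -> str:
    """Render the XML-structured prompt from retrieved snippets."""
    parts = [f"<system_instructions>\n{instructions}\n</system_instructions>",
             "<context_data>"]
    for i, s in enumerate(snippets, start=1):
        parts.append(
            f'<retrieved_snippet index="{i}" score="{s["score"]:.2f}" path="{s["path"]}">\n'
            f'{escape(s["text"])}\n'
            f"</retrieved_snippet>"
        )
    parts.append("</context_data>")
    parts.append(f"<user_query>\n{escape(user_query)}\n</user_query>")
    return "\n".join(parts)
```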