System Design: Local-First AI Coding Assistant

1. System Architecture & Methodology

This design outlines a comprehensive, local-first system for an AI coding assistant. The architecture prioritizes data privacy, low latency, and offline capability by running all components—inference, embedding, and storage—on the user's machine[1].

1.1 High-Level Architecture

The system follows a split-process architecture to decouple the user interface from heavy computational tasks. It consists of three primary subsystems[2][3]:

  1. IDE Client (Frontend): A lightweight extension (VS Code/JetBrains) that handles user interactions, renders the chat interface, and captures editor events (cursor movement, file changes).
  2. RAG & Analysis Engine (Middleware): A local server (often running as a sidecar or LSP) that manages the context. It handles file indexing, vector retrieval, and prompt engineering.
  3. Inference Layer (Backend): A dedicated model server (e.g., Ollama, specialized Python server) that hosts the LLMs for chat and autocomplete.

1.2 Detailed Data Flow

A. Indexing Pipeline (Background Process)

To achieve "codebase awareness," the system maintains a real-time index of the project.
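The indexing loop itself can be sketched with nothing but the standard library: hash every source file, compare against the hashes from the previous pass, and re-embed only what changed. The names below (plan_reindex, file_digest) and the extension filter are illustrative, not part of any particular tool:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash used to detect changed files between indexing passes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def plan_reindex(root: Path, previous: dict,
                 exts=(".py", ".rs", ".ts")) -> tuple:
    """Compare current file hashes against the previous pass.

    Returns (files to re-chunk/re-embed, new hash state to persist).
    """
    current = {}
    stale = []
    for path in sorted(root.rglob("*")):
        if path.suffix not in exts or not path.is_file():
            continue
        digest = file_digest(path)
        current[str(path)] = digest
        if previous.get(str(path)) != digest:
            stale.append(path)  # new or modified since the last pass
    return stale, current
```

In practice this runs in a background thread (or on file-watcher events), feeding only the stale files into the chunking and embedding stages.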

B. Retrieval-Augmented Generation (Chat Flow)

When a user asks a question about the codebase:

  1. Query Processing: The IDE sends the user's query to the RAG Engine.
  2. Retrieval: The engine converts the query to a vector and searches the local database for semantically relevant code snippets.
  3. Reranking (Optional): A lightweight local cross-encoder reranks results to filter out irrelevant matches.
  4. Prompt Construction: A prompt is built including:
    • System instructions (persona, style guide).
    • Retrieved context (snippets with file paths).
    • Current open file and cursor position.
    • Conversation history.
  5. Inference: The prompt is sent to the local Inference Server (e.g., DeepSeek-Coder-6.7B)[10].
  6. Response: The generated answer is streamed back to the IDE client.
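Steps 2 and 3 reduce to a nearest-neighbor search over stored embeddings. A minimal stdlib sketch of the retrieval step, with toy two-dimensional vectors standing in for the real embedding model and vector store:

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """index: list of (snippet_metadata, embedding) pairs; top-k by cosine."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [meta for meta, _ in ranked[:k]]
```

A real deployment delegates this to the vector database's ANN index; the brute-force version above is only meant to make the data flow concrete.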

C. Autocomplete (FIM Flow)

Autocomplete requires sub-50ms latency and typically bypasses the heavy RAG pipeline.
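Because FIM models see the code on both sides of the cursor, the client only needs to split the buffer and wrap the halves in the model's control tokens. The token names below follow Qwen2.5-Coder's FIM format; StarCoder-family models use `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` instead, so check the model card:

```python
def build_fim_prompt(document: str, cursor: int) -> str:
    """Wrap the text around the cursor in Fill-In-the-Middle control tokens.

    Token names assume a Qwen2.5-Coder-style FIM vocabulary; other FIM
    models use different sentinel tokens.
    """
    prefix, suffix = document[:cursor], document[cursor:]
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
```

The model then generates the "middle" until it emits a stop token, and the client renders that completion as ghost text.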

1.3 Communication Layer: LSP vs. Sidecar Pattern

A critical design decision is how the IDE extension communicates with the local AI backend. While the Language Server Protocol (LSP) is the standard for code intelligence, it is ill-suited for the streaming, stateful nature of AI chat.

| Feature | Language Server Protocol (LSP) | HTTP/WebSocket Sidecar (Recommended) |
|---|---|---|
| Primary Use Case | Deterministic tasks (Go to Definition, Hover, Diagnostics)[16]. | Generative tasks (Chat, Streaming, Long-running Context)[17]. |
| State Management | Stateless (mostly); relies on opening/closing files. | Stateful; maintains conversation history and retrieval cache. |
| Streaming | Not natively supported; requires custom extensions. | Native; Server-Sent Events (SSE) or WebSocket streaming[18]. |
| Adoption | Used by LSP-AI (experimental). | Used by Continue.dev, GitHub Copilot (Chat). |

Recommendation: Use a hybrid approach. Implement the Autocomplete/Ghost Text feature over standard LSP (where latency below 50 ms is critical) to hook into the editor's native typing events, and run a dedicated HTTP sidecar on localhost for the Chat Interface, handling RAG processing, heavy retrievals, and LLM streaming without blocking the editor's main thread[17].

1.4 Frontend Implementation Specifications (VS Code)

To bridge the high-level architecture with actual code, the VS Code extension should build on the editor's dedicated extension points, notably vscode.languages.registerInlineCompletionItemProvider for ghost-text completions and vscode.window.registerWebviewViewProvider for the chat panel.

2. Codebase Awareness: Indexing Methodology

The system's ability to understand the entire repository ("codebase awareness") depends entirely on how code is parsed, chunked, and indexed. In 2024-2025, simple text processing is considered insufficient for production-grade coding assistants[6][7].

2.1 Comparison: AST-Based vs. Text-Based Splitting

| Feature | Text-Based Splitting (Legacy) | AST-Based Splitting (State-of-the-Art) |
|---|---|---|
| Methodology | Splits by fixed character/token count (e.g., every 512 tokens). | Parses code into an Abstract Syntax Tree (AST) to identify logical boundaries[4]. |
| Context Integrity | Often cuts through functions or classes, severing context. | Preserves complete semantic units (functions, classes, structs)[6]. |
| Retrieval Quality | Lower precision; retrieves fragmented code. | High precision; retrieves executable/logically complete blocks[7]. |
| Tooling | RecursiveCharacterTextSplitter (LangChain). | Tree-sitter, specialized language parsers. |

2.2 Recommended Approach: Structural & Graph-Based Indexing

For a robust local-first system, we recommend a hybrid approach that combines AST parsing with dependency graph analysis. This ensures that when a function is retrieved, its relevant context (what it calls, what calls it) is also understood.

A. The Parsing Pipeline (Tree-sitter)

Tree-sitter is the industry standard for this task due to its speed and incremental parsing capabilities. The pipeline works as follows[4][6]:

  1. Parse: Generate a concrete syntax tree for each file.
  2. Traverse: Walk the tree to extract nodes of interest (e.g., function_definition, class_declaration).
  3. Scope Resolution: Identify the parent scope for each node to attach metadata (e.g., "Function process_data inside Class DataHandler").

B. Functional Chunking Strategy

Instead of arbitrary chunks, create "functional chunks":
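As an illustration of the chunk shape this produces, here is a stand-in using Python's stdlib ast module (the real pipeline would use Tree-sitter to cover many languages); each chunk carries the code of one function or class plus its scope metadata:

```python
import ast

def functional_chunks(source: str, path: str) -> list:
    """Split a Python file into one chunk per function/class.

    Stand-in for the Tree-sitter pipeline: stdlib `ast` only handles
    Python, but the chunk shape (code + scope metadata) is the same idea.
    """
    tree = ast.parse(source)
    chunks = []

    def visit(node, scope):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef,
                                  ast.ClassDef)):
                qualname = scope + [child.name]
                chunks.append({
                    "path": path,
                    "scope": ".".join(qualname),  # e.g. "DataHandler.process_data"
                    "code": ast.get_source_segment(source, child),
                })
                visit(child, qualname)
            else:
                visit(child, scope)

    visit(tree, [])
    return chunks
```

Each chunk is then embedded with its scope string prepended, so retrieval can distinguish a method from a free function with the same name.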

C. Dependency Graphing (Advanced)

To solve the "missing context" problem where an LLM sees a function call but not its definition:
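The idea can be sketched for Python with the stdlib ast module: find the functions a target calls, then bundle their definitions into the retrieved context. A production version would also resolve imports and method calls; this sketch handles only simple-name calls within one file:

```python
import ast

def callees(source: str, func_name: str) -> set:
    """Names of functions called inside `func_name` (simple-name calls only)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            return {c.func.id for c in ast.walk(node)
                    if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
    return set()

def expand_context(source: str, func_name: str) -> str:
    """Return the target function plus the definitions of what it calls."""
    tree = ast.parse(source)
    defs = {n.name: ast.get_source_segment(source, n)
            for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    extra = sorted(n for n in callees(source, func_name)
                   if n in defs and n != func_name)
    return "\n\n".join(defs[name] for name in [func_name] + extra
                       if name in defs)
```

Cross-file resolution is where the symbol index (Section 2.3) takes over: unresolved call names are looked up there instead of in the local AST.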

2.3 Tooling for Dependency Graphing

To implement the "Symbol Table" and "Graph Traversal" effectively without building a full compiler frontend, we recommend a dual-tool approach:

| Component | Tool | Role |
|---|---|---|
| Symbol Index | Universal Ctags | Generates a lightweight tags file (JSON/Exuberant format) mapping symbol names to file paths. It supports 40+ languages out of the box and is significantly faster than LSP indexing[27]. |
| Call Graph | Tree-sitter | Parses the current file to identify function calls. These calls are then looked up in the Ctags index to resolve their definitions[28]. |
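Universal Ctags can emit JSON lines (`ctags --output-format=json -R .`), which makes the symbol index trivial to load. A sketch of the loader; the field names (_type, name, path, kind) match current Universal Ctags output, but are worth verifying against your installed version:

```python
import json

def load_tags(json_lines: str) -> dict:
    """Index Universal Ctags JSON-lines output by symbol name.

    Each input line is one JSON object; only entries with _type == "tag"
    are real symbols (other line types carry program metadata).
    """
    index = {}
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("_type") == "tag":
            index.setdefault(entry["name"], []).append(
                {"path": entry["path"], "kind": entry.get("kind")})
    return index
```

Storing a list per name matters: symbols like `main` or `init` legitimately appear in many files, and the caller disambiguates by path or kind.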

3. Model Selection & Hardware Strategy

Running state-of-the-art coding models locally requires balancing model capability (parameters) with hardware constraints (VRAM). For a responsive experience in 2025, we recommend a dual-model approach: a larger "Instruct" model for chat and a smaller "Base" model for low-latency autocomplete.

3.1 Recommended Models (2024-2025)

A. Primary Chat Model (7B Class)

For the "brain" of the assistant (answering questions, refactoring code), Qwen2.5-Coder-7B-Instruct is currently the top performer in the sub-10B category, consistently outperforming previous leaders like DeepSeek-Coder-V2-Lite and StarCoder2 on benchmarks (HumanEval, MBPP)[8][9].

| Model | Parameters | Best For | VRAM (4-bit Quant) |
|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | 7B | Overall Best (Reasoning, Multi-lang)[8] | ~5.5 GB[10] |
| DeepSeek-Coder-V2-Lite | 16B (MoE) | Complex Logic (if hardware permits) | ~10 GB[10] |
| Llama-3.1-8B-Instruct | 8B | General Conversation + Code | ~6 GB |

B. Autocomplete Model (1B-3B Class)

Autocomplete ("ghost text") requires execution in under 50ms. Using a 7B model here often introduces noticeable lag on consumer GPUs. We recommend smaller, specialized models that support FIM (Fill-In-the-Middle).

3.2 Hardware Requirements & Quantization

To run these models locally, Quantization (reducing weight precision from 16-bit to 4-bit) is essential for consumer GPUs. Modern "GGUF" format quantizations (e.g., Q4_K_M) retain ~95% of model quality while cutting weight memory to roughly a quarter of FP16[10][11].
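The VRAM figures in the table below can be sanity-checked with back-of-envelope arithmetic: weights at the quantized bit-width (Q4_K_M averages roughly 4.5-5 bits per weight) plus an allowance for KV cache and activations. The flat 1 GB overhead here is a coarse assumption, not a measured value:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM need: quantized weights plus a flat allowance for the
    KV cache and activations (the 1 GB default is a coarse assumption)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 bytes = GB
    return round(weight_gb + overhead_gb, 1)
```

For a 7B model at ~4.85 bits/weight this gives about 5.2 GB, consistent with the ~5.5 GB quoted above for Qwen2.5-Coder-7B; at full FP16 the same model needs ~15 GB, which is why quantization is non-negotiable on 8 GB cards.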

| Hardware Tier | VRAM | Recommended Configuration |
|---|---|---|
| Entry (Consumer) | 8 GB | Chat: Qwen2.5-7B (Q4_K_M); Autocomplete: Qwen2.5-1.5B (Q4_K_M). Note: tight fit; may need to offload some layers to CPU. |
| Mid-Range | 12-16 GB | Chat: Qwen2.5-7B (Q8_0) or DeepSeek-V2-Lite (Q4); Autocomplete: StarCoder2-3B (FP16) |
| High-End | 24 GB+ | Chat: Qwen2.5-32B (Q4) or Codestral-22B; Autocomplete: StarCoder2-7B |

3.3 Inference Concurrency Strategy

Running two distinct models (Chat and Autocomplete) on a single consumer GPU presents a scheduling challenge: a long-running chat generation can block the GPU, causing autocomplete latency to spike beyond the 50ms target.

A. Dual-Server Architecture

Instead of a single server process, we recommend running two separate instances of the inference engine (e.g., llama-server) on different ports:
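A sketch of such a setup with llama.cpp's llama-server; the model file names and port numbers are arbitrary placeholders, and -ngl 99 offloads all layers to the GPU:

```shell
# Chat model: large context window, long generations
llama-server -m qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  --port 8080 -c 8192 -ngl 99 &

# Autocomplete model: small, low-latency, short context
llama-server -m qwen2.5-coder-1.5b-q4_k_m.gguf \
  --port 8081 -c 2048 -ngl 99 &
```

The RAG sidecar routes chat traffic to port 8080 and FIM requests to port 8081; because the two processes hold separate weight allocations, a long chat generation cannot evict or stall the autocomplete model.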

B. Request Cancellation & Debouncing

To prevent the Autocomplete model from flooding the queue:
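A minimal asyncio sketch of the debounce-and-cancel pattern; request_fn is a placeholder for the actual HTTP call to the autocomplete server, and the 150 ms default debounce window is an arbitrary starting point to tune:

```python
import asyncio

class CompletionScheduler:
    """Debounce keystrokes and cancel any in-flight completion request.

    `request_fn` stands in for the real call to the autocomplete server.
    """

    def __init__(self, request_fn, debounce_s: float = 0.15):
        self.request_fn = request_fn
        self.debounce_s = debounce_s
        self._task = None

    async def _debounced(self, prompt: str):
        await asyncio.sleep(self.debounce_s)  # wait for the typing burst to pause
        return await self.request_fn(prompt)

    def on_keystroke(self, prompt: str):
        """Called on every edit; only the last request in the window survives."""
        if self._task is not None and not self._task.done():
            self._task.cancel()  # drop the now-stale request
        self._task = asyncio.create_task(self._debounced(prompt))
        return self._task
```

The same cancellation signal should propagate to the inference server (e.g., by closing the HTTP connection) so the GPU stops generating tokens nobody will read.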

4. Data & Retrieval Layer

The retrieval pipeline connects the raw code to the LLM. It consists of the vector store, the embedding model, and the reranking step. All components must run locally.

4.1 Vector Database Selection

The vector database stores the embeddings of the codebase. For a local-first application, the database must be embedded (running in the same process as the application) to avoid the operational complexity of managing a separate server process or Docker container[12].

| Feature | LanceDB (Recommended) | Chroma | SQLite-vec (Lightweight) |
|---|---|---|---|
| Architecture | Serverless, File-Based (Lance format)[12]. | Client-Server (often requires Docker/Python process)[13]. | SQLite Extension (C-based)[14]. |
| Performance | High (Zero-copy access, Rust-based). | Medium (Python overhead in local mode). | High (for small datasets)[15]. |
| Memory Usage | Low (Disk-based index, efficient caching)[13]. | High (Often loads collection to memory). | Very Low. |
| Hybrid Search | Native support (Vector + Full Text)[12]. | Limited. | Requires complex SQL queries. |

4.2 Recommendation: LanceDB

We recommend LanceDB for this system architecture. Its ability to perform vector search directly on disk-based files without loading the entire index into RAM makes it ideal for indexing large repositories on consumer hardware[12][13].

4.3 Embedding Model Selection

The embedding model determines the retrieval quality. For code, general-purpose models often fail. We recommend models trained specifically on code or with large context windows.

| Model | Context Window | Size (Params) | Best For |
|---|---|---|---|
| nomic-embed-text-v1.5 (Recommended) | 8192 | 137M | Long Context; specific search_document prefix for retrieval[19]. |
| snowflake-arctic-embed-m | 512 | 110M | High Precision on benchmarks (MTEB)[20]. |
| all-MiniLM-L6-v2 | 512 | 22M | Extreme Speed (Legacy option, lower quality). |

4.4 Reranking Strategy (Local)

To improve precision without running a massive LLM, use a Cross-Encoder to re-score the top 20-50 results from the vector search. This step is computationally expensive compared to vector search, so the model must be lightweight.

5. Prompt Engineering & Context Management

Retrieving the right code is only half the battle; the retrieved context must be presented to the LLM in a format that maximizes adherence to instructions. In 2025, the standard for high-performance coding agents is XML-structured prompting, which clearly delineates instructions from data[23].

5.1 Context Construction Strategy

Naive concatenation of code files often confuses models about where one file ends and another begins. The recommended strategy is to wrap each retrieved snippet in semantic XML tags containing metadata.

5.2 Recommended XML System Prompt Template

The following structure is optimized for models like Qwen2.5-Coder and DeepSeek-Coder, which are trained to recognize structured data:

<system_instructions>
You are an expert software engineer specialized in Python and Rust.
Answer the user's request using the provided Context.
If the answer is not in the context, say so.
</system_instructions>

<context_data>
<retrieved_snippet index="1" score="0.89" path="src/main.rs">
fn main() {
    println!("Hello, world!");
}
</retrieved_snippet>
<retrieved_snippet index="2" score="0.85" path="src/utils.rs">
pub fn helper() { ... }
</retrieved_snippet>
</context_data>

<user_query>
{USER_INPUT}
</user_query>
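The template above can be assembled mechanically from the retrieval results. A sketch, assuming each retrieved snippet carries path, score, and code fields (this design's convention, not a library API); note the escaping, which keeps angle brackets in code from breaking the tag structure:

```python
from xml.sax.saxutils import escape

def build_prompt(system: str, snippets: list, user_query: str) -> str:
    """Assemble the XML-structured prompt from retrieved snippets.

    Each snippet dict is assumed to carry 'path', 'score', and 'code'
    keys, the shape produced by the retrieval layer in this design.
    """
    parts = ["<system_instructions>", system, "</system_instructions>",
             "<context_data>"]
    for i, s in enumerate(snippets, start=1):
        parts.append(f'<retrieved_snippet index="{i}" '
                     f'score="{s["score"]:.2f}" path="{s["path"]}">')
        parts.append(escape(s["code"]))  # escape <, >, & inside code bodies
        parts.append("</retrieved_snippet>")
    parts += ["</context_data>", "<user_query>", escape(user_query),
              "</user_query>"]
    return "\n".join(parts)
```

Keeping the builder in one place also makes it easy to enforce a token budget: snippets are appended in score order and dropped once the prompt would exceed the model's context window.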

Sources

[1] Building a Local-First RAG Engine for AI Coding Assistants (Accessed: January 26, 2026)
[2] Building a Local RAG-Powered AI Coding Assistant (Accessed: January 26, 2026)
[3] Architecture and middleware for RAG assistants (Accessed: January 26, 2026)
[4] Enhancing LLM Code Generation with RAG and AST-Based Chunking (Accessed: January 26, 2026)
[5] Codebase Indexing | Roo Code Documentation (Accessed: January 26, 2026)
[6] Building code-chunk: AST Aware Code Chunking (Accessed: January 26, 2026)
[7] How Cursor Indexes Codebases Fast (Accessed: January 26, 2026)
[8] Qwen2.5-Coder Technical Report (Accessed: January 26, 2026)
[9] DeepSeek-Coder-V2 vs. StarCoder Comparison (Accessed: January 26, 2026)
[10] Best AI Models for 8GB RAM (Accessed: January 26, 2026)
[11] Choosing the Best Ollama Model for Your Coding Projects (Accessed: January 26, 2026)
[12] LanceDB | Vector Database for RAG, Agents & Hybrid Search (Accessed: January 26, 2026)
[13] LanceDB vs Deep Lake on Vector Search Capabilities (Accessed: January 26, 2026)
[14] Vectorlite: a fast vector search extension for SQLite (Accessed: January 26, 2026)
[15] Comparison to other approaches · sqlite-vec (Accessed: January 26, 2026)
[16] LSP vs Sidecar for AI Assistants (Accessed: January 26, 2026)
[17] How Continue.dev Works (Accessed: January 26, 2026)
[18] VS Code Chat API Architecture (Accessed: January 26, 2026)
[19] Nomic Embeddings vs MiniLM (Accessed: January 26, 2026)
[20] Snowflake Arctic Embed Introduction (Accessed: January 26, 2026)
[21] MS MARCO TinyBERT Model (Accessed: January 26, 2026)
[22] BGE Reranker Base (Accessed: January 26, 2026)
[23] Prompt Engineering for RAG (Accessed: January 26, 2026)
[24] Structured Prompting Techniques (Accessed: January 26, 2026)
[25] Code Context Management (Accessed: January 26, 2026)
[26] Untitled (Accessed: January 26, 2026)
[27] Untitled (Accessed: January 26, 2026)
[28] Untitled (Accessed: January 26, 2026)
[29] Untitled (Accessed: January 26, 2026)
[30] Untitled (Accessed: January 26, 2026)