AI Engineering · Memory Systems · RAG

Your AI Agent Has Amnesia. ByteRover Fixes It Without a Vector Database

How I built a customer support agent that remembers everything — using plain markdown files, BM25 search, and zero embeddings.

Altaf Shaikh · Backend Developer · 2025 · 12 min read
01 — The Problem

RAG promised AI memory. It lied.

We've been building RAG systems the same way for two years. Chunk your docs. Embed them into vectors. Store in a vector database. At query time, compute similarity and retrieve the top-k chunks. Feed to LLM. Hope for the best.

It works. Until it doesn't. And when it stops working, you have no idea why.

I've built RAG pipelines with ChromaDB, Pinecone, and Weaviate. I've implemented HyDE, reranking, query rewriting, MMR, and hybrid search. Every optimisation added complexity and every deployment added new failure modes I couldn't debug. The core problem was never the retrieval algorithm — it was that the knowledge store was a black box.

You cannot cat an embedding vector and understand what your agent knows. You cannot open Pinecone in VS Code and read what it stored. When retrieval fails silently, you are left guessing.

Three real pain points that drove me to look for something different:

Traditional RAG Pain Points
① Similarity Search Degrades Silently
As your knowledge base grows, cosine similarity loses precision. A query about "refund policy" starts returning chunks about "return window" and "shipping" because everything is semantically close. You don't get an error — you get subtly wrong answers.
② Your Agent Forgets Between Sessions
Traditional RAG has no persistent agent memory. Every session starts cold. The agent doesn't remember what it learned last time, what conventions it figured out, or what the customer told it yesterday.
③ The Knowledge Store is a Black Box
Vector databases store opaque numerical representations. When retrieval fails you cannot inspect what was stored, how it was chunked, or why a query matched the wrong document. Debugging means adding more logs, not reading the data.

Then I came across the PageIndex approach — a deceptively simple idea that turned my thinking upside down. Instead of embedding documents, what if an LLM simply read them, reasoned about them, and stored the knowledge in a human-readable hierarchy? No similarity search. No vectors. Just files.

ByteRover productionises exactly this idea. And after using it for a few weeks, I'm not going back.

02 — The Idea

No embeddings. No similarity search. Just files.

The insight is embarrassingly simple. When you want to find information in a technical manual, you don't compute cosine similarity between your query and every page. You open the table of contents, navigate to the right chapter, then the right section.

ByteRover makes your AI agent do the same thing — but for any knowledge base you give it.

Vector RAG vs ByteRover — Side by Side

Traditional RAG:
📄 Raw Document → ✂️ Chunk into 512 tokens → 🔢 Compute Embeddings → 🗄️ Store in Vector DB → 🔍 Query arrives → 📐 Cosine Similarity (⚠️ black box) → 📦 Top-k chunks (maybe wrong) → 🤖 LLM generates answer

ByteRover:
📄 Raw Document → 🧠 LLM reads & reasons → 🗂️ Organises into hierarchy → 📁 Stored as .md files → 🔍 Query arrives → ⚡ BM25 + Cache (transparent) → 📄 Right markdown file → 🤖 LLM generates answer

The critical difference: everything ByteRover stores is human-readable. You can open .brv/context-tree/ in your file explorer and read exactly what your agent knows. You can git diff what changed after a curation. You can delete a file if the knowledge is wrong. It's just files.
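
For example, after a curation pass you can inspect and diff the tree with nothing but standard tools (assuming you keep .brv/ under version control):

terminal
# Read exactly what the agent knows about refund timelines
cat .brv/context-tree/payments/refunds/refund_timelines.md

# See what the last curation changed
git diff .brv/context-tree/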

Key Insight

ByteRover is not a retrieval optimisation on top of vector search. It is a fundamentally different approach: LLM-curated, file-based knowledge that you can read, edit, version-control, and debug without any special tooling.

03 — How It Works

The architecture, explained simply

ByteRover has two main operations: Curation (writing to memory) and Query (reading from memory). Both run entirely on your local machine. No cloud required.

The Context Tree — Local File Structure

All knowledge lives in .brv/context-tree/ as a three-level hierarchy. Every node is a folder with a context.md describing its purpose. Every piece of knowledge is a plain markdown file.

Context Tree Structure
.brv/context-tree/
├── orders/                              # Domain
│   ├── context.md                       # auto-generated overview
│   ├── _index.md                        # auto-generated summary
│   ├── cancellation/                    # Topic
│   │   ├── context.md
│   │   └── order_cancellation_policy.md
│   └── wrong-items/
│       └── wrong_item_resolution.md
├── payments/
│   ├── context.md
│   ├── payment-methods/
│   │   └── supported_payment_methods.md
│   └── refunds/
│       ├── refund_timelines.md
│       └── cod_refund_policy.md
├── shipping/
│   ├── context.md
│   └── delivery-timelines/
│       └── city_delivery_windows.md
└── returns/
    ├── context.md
    └── return-window/
        └── return_eligibility.md

Curation — Writing to Memory

When you call brv curate -d docs/, ByteRover's LLM runs in a sandbox with a ToolsSDK. It reads your documents, reasons about the content, and writes structured knowledge using five atomic operations:

Curation Operations

ADD — Create a new entry
UPDATE — Modify an existing entry
UPSERT — Add or update
MERGE — Combine entries
DELETE — Remove an entry

The LLM gets feedback on every operation. If a MERGE fails, it sees the error and retries with a corrected approach.
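
To make those operations concrete, here is the rough shape of one as I picture it (a hypothetical illustration with made-up field names, not ByteRover's actual ToolsSDK interface):

curation op (hypothetical)
// Hypothetical curation operation; the field names are my own guesses.
const op = {
  type: "UPSERT", // one of ADD / UPDATE / UPSERT / MERGE / DELETE
  path: "payments/refunds/refund_timelines.md", // target file in the context tree
  content: "# Refund Timelines\n\nUPI refunds take 2–3 business days...",
};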

Query — The 5-Tier Retrieval Pipeline

This is where ByteRover really shines. Every query goes through a tiered strategy that starts with the cheapest possible path and escalates only when needed. Most queries resolve in under 200ms — without touching an LLM at all.

ByteRover 5-Tier Query Strategy

TIER 0 — Exact Cache (~0ms): MD5 fingerprint match — same query, cached result (60s TTL)
TIER 1 — Fuzzy Cache (~50ms): ≥60% token similarity to a recent query
TIER 2 — BM25 Direct (~200ms): Full-text search — if the top result scores high, return immediately. No LLM.
TIER 3 — LLM Pre-fetch (<5s): BM25 results injected as context for a single LLM synthesis call
TIER 4 — Agentic Loop (8–15s): Full multi-step reasoning: reads files, follows relations, iterates
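
Tier 1's fuzzy match is plain token overlap, which is easy to picture. A minimal sketch of such a check (my own illustration; ByteRover's exact similarity measure isn't shown here, so treat this as the idea, not the implementation):

similarity sketch (illustrative)
// Rough token-overlap similarity: 1.0 means identical token sets.
function tokenSimilarity(a, b) {
  const tokens = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const A = tokens(a);
  const B = tokens(b);
  const overlap = [...A].filter((t) => B.has(t)).length;
  return overlap / Math.max(A.size, B.size);
}

// tokenSimilarity("UPI refund policy", "refund policy for UPI") → 0.75, a Tier 1 hit (≥ 0.6)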

Compound score = (0.6 × BM25) + (0.25 × importance) + (0.15 × recency) × tierBoost
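
My reading of that formula as a runnable sketch (two assumptions on my part: tierBoost scales the whole weighted sum, and all three inputs are normalised to 0–1):

score sketch (illustrative)
// Illustrative compound ranking score, not ByteRover's source code.
// bm25, importance and recency are assumed normalised to [0, 1].
function compoundScore({ bm25, importance, recency, tierBoost = 1 }) {
  return (0.6 * bm25 + 0.25 * importance + 0.15 * recency) * tierBoost;
}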

Session Learning — Memory That Grows

After every session, ByteRover automatically extracts durable knowledge from the conversation — patterns you used, decisions you made, preferences you expressed — and persists them as agent memories. Your agent gets smarter over time without any extra work.

Two Layers of Memory

ByteRover Memory — what ShopEase policies say. Persists forever across all sessions.
Session History — what this customer said today. Lives in the process, resets between sessions.

04 — Building the Agent

A customer support agent with ByteRover memory

Let's build a real thing. A customer support agent for ShopEase — a fictional Indian e-commerce platform. The agent will answer questions about orders, payments, shipping, and refunds using ByteRover as its memory layer and Groq (Llama 3.3 70B) for responses.

No vector database. No embeddings. The entire retrieval happens through brv query — a subprocess call that usually returns the right markdown context in under a second.

Agent Architecture

👤 Customer Question — "How long does a UPI refund take?"
   │ spawn subprocess
   ▼
🧠 ByteRover Memory — brv query → 5-tier retrieval → returns markdown context
   │ context + history
   ▼
Groq · Llama 3.3 70B — system prompt + retrieved context + conversation history
   │ grounded answer
   ▼
💬 Support Response — "UPI refunds take 2–3 business days…"

Project Structure

project structure
support-agent/
├── docs/
│   ├── faq.md                # ShopEase FAQ
│   ├── shipping-policy.md    # Shipping rules
│   └── refund-policy.md      # Refund rules
├── src/
│   ├── curate.js             # Feeds docs → ByteRover
│   ├── query.js              # Queries ByteRover memory
│   └── agent.js              # Main chat loop
├── .env.example
└── package.json

Step 1 — Install ByteRover

terminal
# Install ByteRover CLI (runs a local daemon)
curl -fsSL https://www.byterover.dev/install.sh | sh

# Verify it's running
brv status

# Install project dependencies
npm install

Step 2 — The Query Helper

This is the core of the integration. brv query runs as a subprocess against the local ByteRover daemon. It returns plain text — the retrieved context from your markdown knowledge tree. Twenty lines of code. No SDK. No API key.

src/query.js
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

/**
 * Queries ByteRover's local context tree.
 * ByteRover's 5-tier strategy handles routing:
 *   Tier 0/1 → cache hit     ~0ms
 *   Tier 2   → BM25 direct   ~200ms  (no LLM)
 *   Tier 3/4 → LLM reasoning <15s
 */
export async function queryMemory(question) {
  try {
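    // Escape double quotes for the shell. For untrusted input,
    // child_process.execFile("brv", ["query", question]) would avoid
    // shell interpolation entirely.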
    const sanitized = question.replace(/"/g, '\\"');
    const { stdout, stderr } = await execAsync(
      `brv query "${sanitized}"`,
      { timeout: 30000 } // 30s covers Tier 4 agentic loop
    );

    if (stderr && !stdout) {
      console.error("[ByteRover] Error:", stderr);
      return null;
    }

    return stdout.trim();
  } catch (err) {
    console.error("[ByteRover] Query failed:", err.message);
    return null;
  }
}
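
A quick sanity check of the helper, in a scratch file of your own (assuming the daemon is running and your docs are already curated):

scratch file (example)
import { queryMemory } from "./query.js";

// Top-level await works because the project uses ES modules
const context = await queryMemory("How long does a UPI refund take?");
console.log(context ?? "no context returned");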

Step 3 — Curate Your Docs

Run this once before starting the agent. ByteRover reads all files in docs/, reasons about the content, and organises everything into the context tree. After this, you can cat .brv/context-tree/payments/refunds/refund_timelines.md and read exactly what your agent knows.

src/curate.js
import { exec } from "child_process";
import { promisify } from "util";
import path from "path";
import { fileURLToPath } from "url";

const execAsync = promisify(exec);
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const DOCS_DIR = path.resolve(__dirname, "../docs");

async function curateDocs() {
  console.log("🔄 Curating support docs into ByteRover memory...");

  const { stdout } = await execAsync(
    // -d flag: curate an entire directory at once (path quoted in case it contains spaces)
    `brv curate "ShopEase support docs: FAQs, shipping, refunds" -d "${DOCS_DIR}"`,
    { timeout: 120000 } // 2 min — LLM reasoning over all files
  );

  console.log(stdout);
  console.log("✅ Done. Run `npm run chat` to start the agent.");
  console.log("   Tip: run `brv vc status` to see what was stored.");
}

curateDocs().catch(console.error);

Step 4 — The Main Agent

src/agent.js
import Groq from "groq-sdk";
import readline from "readline";
import { queryMemory } from "./query.js";
import "dotenv/config";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const MODEL = "llama-3.3-70b-versatile";

// Session memory — what this customer said today
const conversationHistory = [];

const SYSTEM_PROMPT = `You are a helpful customer support agent for ShopEase.
Use ONLY the retrieved context to answer. Never invent policies.
If context doesn't cover the question, direct to 1800-123-4567.`;

async function chat(userMessage) {
  // Step 1 — Query ByteRover for relevant context
  process.stdout.write("\n🔍 Querying ByteRover memory...");
  const context = await queryMemory(userMessage);
  process.stdout.write(" done\n\n");

  if (!context) return "Knowledge base unavailable. Call 1800-123-4567.";

  // Step 2 — Inject context into the message
  const enrichedMessage = `RETRIEVED CONTEXT:
---
${context}
---
CUSTOMER QUESTION: ${userMessage}`;

  // Keep only the raw question in history; otherwise the injected
  // context accumulates and bloats every later turn
  const messages = [
    { role: "system", content: SYSTEM_PROMPT },
    ...conversationHistory,
    { role: "user", content: enrichedMessage },
  ];
  conversationHistory.push({ role: "user", content: userMessage });

  // Step 3 — Call Groq with full conversation history
  const response = await groq.chat.completions.create({
    model: MODEL,
    messages,
    temperature: 0.3, // Low — factual support, not creative
    max_tokens: 512,
  });

  const reply = response.choices[0].message.content;

  // Store reply in history (without injected context)
  conversationHistory.push({ role: "assistant", content: reply });
  return reply;
}

async function main() {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  console.log("ShopEase Support Agent (ByteRover + Groq Llama 3.3 70B)");
  console.log('Type your question. Type "exit" to quit.\n');

  const ask = () => {
    rl.question("You: ", async (input) => {
      const msg = input.trim();
      if (!msg) return ask();
      if (msg === "exit") { rl.close(); return; }

      const reply = await chat(msg);
      console.log(`Agent: ${reply}\n`);
      ask();
    });
  };
  ask();
}

main();

Running It

terminal
# 1. Curate docs into ByteRover memory (once)
npm run curate

# 2. Inspect what ByteRover stored (optional but satisfying)
brv vc status
brv query "refund policy UPI"

# 3. Start the agent
npm run chat

The wow moment: After running npm run curate, open .brv/context-tree/ in your file explorer. You'll see your support docs organised into readable markdown files by domain and topic — exactly what your agent knows, readable by a human, editable with any text editor.

05 — Does It Actually Work?

92.8% accuracy. 1.6s latency. Flash model beats Pro.

ByteRover was evaluated on LongMemEval-S — an ICLR 2025 benchmark that buries 1–3 relevant sessions inside 48 distractor sessions per question, across a context tree of 23,867 documents. GPT-4o given the full conversation history scores 60.6%. ByteRover scores 92.8%.

92.8% — Overall Accuracy (LongMemEval-S)
1.6s — p50 Latency (cold, no cache)
98.7% — Knowledge Update (tracking accuracy)
96.1% — LoCoMo Overall (vs 89.6% Hindsight)

The most interesting finding: Run 1 used Gemini Flash (cheapest). Run 2 used Gemini Pro (expensive). Run 1 won — 92.8% vs 92.2%. The architecture does the heavy lifting, not the model. This is the entire thesis of ByteRover: build the retrieval right and you don't need to throw expensive models at the problem.

06 — Honest Tradeoffs

When to use it. When not to.

ByteRover is not a drop-in replacement for all RAG use cases. Be honest about the fit.

✓ Use ByteRover when

Your agent needs persistent memory across sessions
You want to onboard a codebase your AI coding agent should understand
You're working with private/regulated data — local-first, no cloud dependency
You want to debug why retrieval failed — just open the markdown file
Your knowledge base has clear structure (policies, docs, conventions)

✗ Skip it when

You need pure semantic similarity search over millions of unstructured docs
One-shot queries with no repeated patterns — curation overhead not worth it
Real-time data that changes faster than you can re-curate
Multi-session synthesis is your primary use case — ByteRover is still maturing here

07 — Wrapping Up

The model is interchangeable. The memory architecture isn't.

I've been building RAG systems for two years. The assumption I carried throughout — that embeddings and vector similarity search are the only serious approach — turned out to be wrong.

ByteRover showed me a different model: curate knowledge once, query it intelligently forever. Files you can read. Structure you can understand. Memory that persists across sessions. A retrieval system that escalates through tiers rather than blindly computing similarity over everything.

The support agent we built in this post has three files of actual code. The memory system is handled by ByteRover. The response generation is handled by Groq. The only thing you write is the glue — and the glue is twenty lines.

The bottom line: When your agent makes a mistake, you shouldn't have to guess why. You should be able to open a markdown file and read what it knew. ByteRover makes that possible.

Try it yourself. Install ByteRover, point it at any folder of docs, run brv query, and see what comes back. The context tree it builds will tell you everything.

get started
# Install ByteRover
curl -fsSL https://www.byterover.dev/install.sh | sh

# Clone this article's code
git clone https://github.com/altafshaikh/byterover-support-agent

# Curate your own docs
brv curate -d ./your-docs/