Building Reliable AI Agent Memory: RAG, Vector Stores, and Persistent Context in 2026
How to build AI agents that actually remember — from working memory and RAG pipelines to vector stores, knowledge graphs, and persistent context management.
TL;DR: Most AI agents are amnesiac by default. Every session starts cold, every tool call forgets what happened two turns ago, and the "personalization" users expect never materializes. The fix is not just plugging in a vector store and calling it RAG. It requires a coherent memory architecture — working memory inside the context window, episodic memory in a retrievable store, semantic memory as structured knowledge, and procedural memory baked into your prompts. This post is the engineering-level guide to building agents that actually remember: what to store, where to store it, how to retrieve it without polluting the context window, and when a stateless agent is the right answer anyway.
Here is what a production agent failure actually looks like.
A user is working with a research agent. Day one: they spend 45 minutes telling the agent their preferences — they want primary sources, they prefer academic papers over news, they work in biotech, and they care about CRISPR specifically. The agent delivers excellent results. Day two: they come back. The agent asks: "How can I help you today?" Fresh start. Forty-five minutes of context, gone.
Or consider a customer support agent. It gathers the user's account details, walks through three troubleshooting steps, the user drops off, comes back two hours later. The agent starts the entire troubleshooting flow again. The user abandons the product.
These are not edge cases. They are the default behavior of every agent built without deliberate memory design. And they are why AI agent products struggle to retain users even when the underlying model is excellent.
The problem has three layers:
The context window problem. Models do not have memory. They have a context window — a fixed block of tokens they can attend to in a single forward pass. Once a session ends, that window is gone. Even within a session, very long agent traces can overflow the window and cause silent truncation that degrades behavior.
The retrieval problem. Even if you persist everything to a database, you cannot inject all of it back into the context window — that defeats the purpose. You need intelligent retrieval: knowing what memories are relevant to the current task and pulling only those. Getting this wrong means either missing critical context (recall failure) or drowning the model in irrelevant history (precision failure).
The consistency problem. When agents operate over long time horizons and collect heterogeneous information, memories conflict. A user told you their company has 50 employees in January. They told you it has 120 in March. Which is true? How do you resolve it? Most implementations do not have an answer.
This post solves all three. Let's start with the architecture.
Before writing a single line of code, you need a mental model for the kinds of memory an agent can use. Cognitive science gives us a useful taxonomy that maps almost perfectly to engineering primitives.
block-beta
columns 3
block:AGENT["Agent Memory Architecture"]:3
WM["Working Memory\n─────────────\nContext window\nActive tool calls\nCurrent task state\n\nStorage: In-context\nLife: Single session"]
EM["Episodic Memory\n─────────────\nPast conversations\nAction history\nUser interactions\n\nStorage: Vector DB\nLife: Persistent"]
SM["Semantic Memory\n─────────────\nFacts about world\nUser preferences\nDomain knowledge\n\nStorage: KG / DB\nLife: Persistent"]
end
block:PROC["Procedural Memory"]:3
PM["Procedural Memory\n─────────────\nHow to do tasks\nTool usage patterns\nWorkflow templates\n\nStorage: System prompts / Fine-tuning\nLife: Model weights / Config"]
end

Working Memory is what the model attends to right now: the current context window. This is the only memory the model has natively. Everything else is external infrastructure that you selectively inject into this window.
Episodic Memory is the record of what happened: past conversations, actions taken, outcomes observed. Think of it as a diary. It answers "what did we talk about in our last session?" and "what did this agent try last time it hit this error?"
Semantic Memory is structured knowledge about the world or the user: facts, preferences, entities, relationships. It answers "what do I know about this user?" and "what is the capital of France?" This is typically the target of RAG pipelines.
Procedural Memory is knowledge about how to do things: which tools to call in what order, how to handle specific error conditions, workflow templates. This typically lives in system prompts or, for specialized agents, in fine-tuned weights.
Each type requires different storage backends, different retrieval strategies, and different update policies. Most production agent failures happen because engineers conflate these types and apply a single solution (usually a vector store) to all four.
In early 2025, a 200K token context window was considered large. By 2026, frontier models ship with context windows of a million tokens or more, and several open models have been extended to comparable lengths via techniques like ring attention. The question "does the whole conversation fit in context?" is increasingly answered yes.
But "fits in context" and "agent performs well in context" are different things. Research consistently shows that models struggle with retrieval from very long contexts — the so-called "lost in the middle" problem. Information at the start and end of a context window is recalled reliably; information buried in the middle degrades. A 1M token window does not give you 1M tokens of reliable working memory.
This has practical implications for how you structure your context window.
Think of the context window as a budget with four line items:
System prompt: ~2,000 tokens (instructions, persona, tool definitions)
Conversation history: ~10,000 tokens (recent turns, compressed older turns)
Retrieved context: ~5,000 tokens (RAG results, injected memories)
Task workspace: ~3,000 tokens (current task state, tool results)
Reserve: ~2,000 tokens (model output space)
──────────────────────────────────────────────────────
Total used: ~22,000 tokens
This is a conservative budget. For short-lived tasks, you can push further. For agents that run for hours and accumulate tool results, staying disciplined about this budget is the difference between consistent behavior and degrading behavior.
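A budget like this is easy to enforce mechanically. The sketch below uses a rough 4-characters-per-token estimate (swap in a real tokenizer for production); the line-item names and numbers mirror the budget above and are assumptions you should tune:

```python
# Rough context-budget enforcement. Token counts are estimated at
# ~4 characters per token — use a real tokenizer in production.

BUDGET = {
    "system": 2_000,
    "history": 10_000,
    "retrieved": 5_000,
    "workspace": 3_000,
    "reserve": 2_000,  # left free for model output
}

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_budget(sections: dict[str, str], budget: dict[str, int] = BUDGET) -> dict[str, bool]:
    """Return, per section, whether it is within its line-item budget."""
    return {
        name: estimate_tokens(text) <= budget.get(name, 0)
        for name, text in sections.items()
    }
```

Checking each section against its own line item (rather than only the total) catches the common failure mode where retrieved context quietly crowds out conversation history.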
flowchart TD
A([New message arrives]) --> B{Estimate new\ncontext size}
B --> C{Within\nbudget?}
C -- Yes --> D[Inject raw history\ninto context]
C -- No --> E{How far over\nbudget?}
E -- Slightly over --> F[Sliding window:\ndrop oldest N turns]
E -- Significantly over --> G[Summarize\nolder turns]
E -- Way over --> H[Full compression:\nsummarize + extract\nentities to KV store]
F --> I[Assemble context\nand run model]
G --> I
H --> I
D --> I
I --> J{Agent done?}
J -- No --> A
J -- Yes --> K[Persist episodic\nmemory to store]

Sliding window: Keep only the last N turns in context. Simple, predictable, but loses information about older turns entirely. Good for tasks where recency dominates (customer support, live coding sessions).
Summary compression: Periodically summarize older conversation history into a compact representation, inject the summary instead of the raw turns. The model retains semantic content but loses verbatim history. Good for long research sessions.
Entity extraction: Rather than summarizing, extract structured facts from older turns (user preferences, decisions made, entities mentioned) and store them in a key-value store. Inject only the relevant facts for the current turn. This is the highest-fidelity approach but requires the most infrastructure.
Here is a practical implementation of sliding window with summary fallback:
from anthropic import Anthropic
from dataclasses import dataclass, field
from typing import Optional

client = Anthropic()

@dataclass
class ConversationMemory:
    messages: list[dict] = field(default_factory=list)
    summary: Optional[str] = None
    max_tokens: int = 8000
    summary_threshold: int = 6000

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._maybe_compress()

    def _token_count(self) -> int:
        # Approximate (~1.3 tokens per word) — use tiktoken for production
        return int(sum(len(m["content"].split()) * 1.3 for m in self.messages))

    def _maybe_compress(self):
        if self._token_count() > self.summary_threshold:
            self._compress_history()

    def _compress_history(self):
        # Summarize the oldest 50% of messages
        split = len(self.messages) // 2
        old_messages = self.messages[:split]
        self.messages = self.messages[split:]
        summary_prompt = (
            "Summarize this conversation history concisely, preserving all "
            "key facts, decisions, and user preferences:\n\n"
            + "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
        )
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": summary_prompt}]
        )
        new_summary = response.content[0].text
        self.summary = (
            f"{self.summary}\n\n{new_summary}" if self.summary else new_summary
        )

    def build_context(self) -> list[dict]:
        """Build messages list for API call, injecting summary if present."""
        if not self.summary:
            return self.messages
        summary_message = {
            "role": "user",
            "content": f"[Previous conversation summary: {self.summary}]"
        }
        summary_ack = {
            "role": "assistant",
            "content": "Understood. I have context from our previous discussion."
        }
        return [summary_message, summary_ack] + self.messages

memory = ConversationMemory()
memory = ConversationMemory()
Episodic memory answers temporal questions about the agent's history. "When did we last discuss pricing?" "What did the agent try when the API call failed yesterday?" "What did this user ask about six sessions ago?"
The key insight is that episodic memory is inherently time-ordered and needs to be retrievable by both recency and semantic similarity. This is different from pure semantic memory, which is a-temporal.
A practical episodic memory store has three components: an event schema, similarity-based retrieval, and recency-based retrieval:
import uuid
from datetime import datetime, timezone
from dataclasses import dataclass, asdict

# Assume a vector store client (Qdrant, Pinecone, etc.) is initialized
# from your_vector_store import vector_client

@dataclass
class EpisodicEvent:
    id: str
    session_id: str
    timestamp: str
    role: str  # "user", "assistant", "tool"
    content: str
    tool_name: str | None = None
    tool_result: str | None = None
    metadata: dict | None = None

    @classmethod
    def create(cls, session_id: str, role: str, content: str, **kwargs):
        return cls(
            id=str(uuid.uuid4()),
            session_id=session_id,
            timestamp=datetime.now(timezone.utc).isoformat(),
            role=role,
            content=content,
            **kwargs
        )

class EpisodicMemoryStore:
    def __init__(self, vector_client, embed_fn, collection_name: str = "episodes"):
        self.vector_client = vector_client
        self.embed_fn = embed_fn
        self.collection_name = collection_name

    def store(self, event: EpisodicEvent) -> None:
        embedding = self.embed_fn(event.content)
        self.vector_client.upsert(
            collection_name=self.collection_name,
            points=[{
                "id": event.id,
                "vector": embedding,
                "payload": asdict(event)
            }]
        )

    def retrieve_similar(self, query: str, top_k: int = 5) -> list[EpisodicEvent]:
        query_embedding = self.embed_fn(query)
        results = self.vector_client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding,
            limit=top_k
        )
        return [EpisodicEvent(**r.payload) for r in results]

    def retrieve_recent(self, session_id: str, n: int = 10) -> list[EpisodicEvent]:
        # Filter by session_id and sort by timestamp descending
        results = self.vector_client.scroll(
            collection_name=self.collection_name,
            scroll_filter={"must": [{"key": "session_id", "match": {"value": session_id}}]},
            limit=n,
            order_by={"key": "timestamp", "direction": "desc"}
        )
        return [EpisodicEvent(**r.payload) for r in results[0]]

    def format_for_context(self, events: list[EpisodicEvent]) -> str:
        lines = ["[Relevant past interactions:]"]
        for e in sorted(events, key=lambda x: x.timestamp):
            ts = e.timestamp[:10]
            lines.append(f"  [{ts}] {e.role}: {e.content[:200]}")
        return "\n".join(lines)
The critical design decision here is when to inject episodic memory into context. Injecting it on every turn is wasteful. The right pattern is conditional retrieval: retrieve only when the current query seems to benefit from historical context.
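One cheap way to implement conditional retrieval is a trigger heuristic: only query the episodic store when the message plausibly references past interactions. The keyword patterns below are illustrative assumptions — a small classifier call is more robust, but this version costs nothing per turn:

```python
import re

# Heuristic trigger for episodic retrieval: hit the store only when the
# query plausibly references past interactions. The pattern list is an
# illustrative assumption, not an exhaustive taxonomy.
TEMPORAL_PATTERNS = [
    r"\blast (time|session|week|conversation)\b",
    r"\b(previously|earlier|before|again)\b",
    r"\bremember\b",
    r"\bwe (discussed|talked|decided)\b",
]

def should_retrieve_episodic(query: str) -> bool:
    q = query.lower()
    return any(re.search(p, q) for p in TEMPORAL_PATTERNS)
```

When the heuristic fires, run `retrieve_similar` plus `retrieve_recent` and merge; when it does not, skip the roundtrip entirely and save both latency and context tokens.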
Semantic memory is the agent's knowledge base — structured facts about the world, the user, the domain. Unlike episodic memory (what happened), semantic memory represents what is true. It has no inherent time-ordering.
For agents, semantic memory typically includes user preferences, profile facts, entity relationships, and domain knowledge.
Semantic memory is what most people mean when they say "we added a knowledge base." The RAG pipeline for semantic retrieval is covered in the next section. What matters here is the structure of what you're storing.
For user-specific semantic facts, a simple structured schema beats unstructured text for retrieval accuracy:
from pydantic import BaseModel
from typing import Literal

class UserFact(BaseModel):
    user_id: str
    fact_type: Literal["preference", "profile", "context", "constraint"]
    key: str
    value: str
    confidence: float  # 0.0–1.0, decreases with age
    source: str  # "explicit" (user stated) or "inferred"
    created_at: str
    expires_at: str | None = None

# Examples:
UserFact(
    user_id="usr_123",
    fact_type="preference",
    key="response_style",
    value="concise_technical",
    confidence=0.9,
    source="explicit",
    created_at="2026-03-10T12:00:00Z"
)

UserFact(
    user_id="usr_123",
    fact_type="profile",
    key="company_size",
    value="120",
    confidence=1.0,
    source="explicit",
    created_at="2026-03-12T09:30:00Z"
)
For domain knowledge (documents, policies, product specs), you need a proper RAG pipeline.
RAG (Retrieval-Augmented Generation) is well understood for static Q&A: embed a document corpus, embed a user question, retrieve the top-k most similar chunks, inject them into the prompt, generate an answer.
Agent RAG is fundamentally different. Agents are action-oriented, not question-answering. They generate multi-step plans, call tools, observe results, and iterate. This changes what you need from your retrieval pipeline.
flowchart LR
subgraph INPUT["Input Processing"]
A([Agent receives task]) --> B[Query decomposition\nExtract retrieval intents]
B --> C{What types of\nmemory needed?}
end
subgraph RETRIEVAL["Multi-Source Retrieval"]
C -- "Domain knowledge" --> D[(Vector Store\nDocument chunks)]
C -- "Past episodes" --> E[(Episodic Store\nConversation history)]
C -- "User profile" --> F[(KV Store\nStructured facts)]
C -- "Procedures" --> G[(Prompt templates\nTool definitions)]
end
subgraph FUSION["Context Fusion"]
D --> H[Reranker\nMMR / Cross-encoder]
E --> H
F --> I[Structured injection\nNo reranking needed]
G --> J[System prompt\nmerge]
H --> K[Fused context\nToken budget check]
I --> K
J --> K
end
subgraph EXECUTION["Agent Execution"]
K --> L[Model call\nwith full context]
L --> M{Action needed?}
M -- "Tool call" --> N[Execute tool]
N --> O[Observe result]
O --> P[Update working memory]
P --> L
M -- "Final answer" --> Q([Store to episodic\nmemory])
end

Key differences between Q&A RAG and agent RAG:
Multi-hop retrieval. Agents often need to retrieve information across multiple steps of a task. The query at step 1 ("find all customers in healthcare") generates tool results that inform the query at step 2 ("find their contract renewal dates"). Your retrieval pipeline must support these chained retrievals, not just a single up-front lookup.
Tool-aware retrieval. Agents have tools. Sometimes the right "retrieval" is calling a live API, not searching a vector store. Your memory architecture needs to decide: is this information best served from a cached vector store, from a structured database query, or from a live tool call? Freshness requirements drive this.
Observation injection. After every tool call, the agent observes a result. This observation may itself need to be stored and retrievable in future turns. A pure RAG system does not account for this — you need episodic memory integration at the tool call level.
Plan-level context. A multi-step agent plan has context that spans multiple tool calls. "We are on step 3 of a 7-step plan to onboard a new customer" is working memory context that needs to persist across tool calls but not necessarily across sessions.
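The multi-hop pattern above can be sketched as a retrieval loop in which each hop's results seed the next query. `retrieve` and `derive_next_query` are hypothetical stand-ins for your vector search and query-reformulation steps:

```python
from typing import Callable, Optional

# Chained (multi-hop) retrieval sketch: each hop's results can inform the
# next hop's query. The callables are stand-ins — in practice `retrieve`
# wraps a vector store and `derive_next_query` is often an LLM call.
def multi_hop_retrieve(
    initial_query: str,
    retrieve: Callable[[str], list[str]],
    derive_next_query: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    context: list[str] = []
    query: Optional[str] = initial_query
    for _ in range(max_hops):
        if query is None:
            break
        results = retrieve(query)
        context.extend(results)
        query = derive_next_query(query, results)
    return context
```

The `max_hops` cap matters: without it, a query-reformulation step that never returns `None` will loop until it exhausts your token and latency budget.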
Choosing a vector store is the first practical decision most teams make. Here is an opinionated assessment based on building production agent systems.
Pinecone is the managed vector database that most teams reach for first, and for good reason: zero infrastructure overhead, a clean API, and solid performance at scale. The serverless tier is genuinely useful for prototyping.
The limitations matter for agent use cases. Pinecone is metadata-limited: you can filter on metadata fields, but complex relational queries ("find all episodes where tool_name is 'search' and timestamp > last week and user is from the enterprise tier") require careful metadata schema design. It does not support hybrid search (BM25 + vector) natively without workarounds.
Use Pinecone when: you want a managed service, your retrieval is primarily semantic similarity, and you can tolerate vendor lock-in.
Weaviate is the most feature-complete open-source option. It supports hybrid search (BM25 + vector in a single query), multi-tenancy (critical for building agent products with per-user memory), and has a rich schema system for structured metadata.
The multi-tenancy support is particularly important for agent systems: you can keep each user's memories isolated in their own tenant while maintaining a shared collection for global knowledge, with a single query that retrieves from both.
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_WEAVIATE_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY")
)

# Hybrid search combining BM25 keyword and vector similarity
episodes = client.collections.get("EpisodicMemory")
results = episodes.query.hybrid(
    query="pricing discussion enterprise",
    alpha=0.5,  # 0=pure BM25, 1=pure vector, 0.5=balanced
    limit=5,
    filters=Filter.by_property("user_id").equal("usr_123")
)
Use Weaviate when: you need hybrid search, multi-tenancy, or a rich schema system.
Qdrant has become the performance benchmark in the open-source vector store space. It is written in Rust, supports payload filtering with a powerful filter language, and has excellent support for sparse vectors (enabling hybrid search via SPLADE or BM25 alongside dense vectors).
For agent systems, Qdrant's named vectors feature is particularly useful: you can store multiple embeddings per point (e.g., an embedding of the raw text plus an embedding of an extracted summary) and query against either.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Search with a payload filter — `query_embedding` is the embedded query
results = client.search(
    collection_name="episodes",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="user_id", match=MatchValue(value="usr_123")),
            FieldCondition(key="role", match=MatchValue(value="user"))
        ]
    ),
    limit=5
)
Use Qdrant when: you need high throughput, advanced filtering, or prefer self-hosted with excellent performance.
ChromaDB is the developer-experience champion. Embedded mode (runs in-process, no server) makes it ideal for local development and testing. The Python API is minimal and intuitive.
For production agent systems, ChromaDB's limitations surface quickly: no multi-tenancy, limited metadata filtering, and the persistent mode is not designed for high-concurrency writes. It is excellent for single-user local agents, prototypes, and evaluation pipelines.
Use ChromaDB when: you are building locally, prototyping, or building a single-user agent.
pgvector is the PostgreSQL extension for vector similarity search. Its killer advantage is colocating vector search with your existing relational data. If you are already storing user profiles, conversation sessions, and application state in Postgres, pgvector means zero additional infrastructure — and you get full SQL for complex queries.
-- Find top-5 most similar episodes for a user, with full SQL filter
SELECT id, content, timestamp, cosine_distance(embedding, $1) AS distance
FROM episodic_memory
WHERE user_id = 'usr_123'
AND timestamp > NOW() - INTERVAL '30 days'
ORDER BY embedding <=> $1 -- cosine distance operator
LIMIT 5;
Performance degrades on very large collections (10M+ vectors) without careful index tuning (use ivfflat or hnsw indexes). For most agent applications with per-user stores under 1M vectors, pgvector is more than fast enough.
Use pgvector when: you are on Postgres already, you want relational + vector in one query, or you want to minimize infrastructure complexity.
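If you do hit index-tuning territory, an HNSW index on the `embedding` column of the `episodic_memory` table from the query above looks roughly like this — the `m` and `ef_construction` values are starting-point assumptions to tune, not recommendations:

```sql
-- HNSW index for cosine distance (pgvector >= 0.5).
-- Higher m / ef_construction = better recall, slower builds.
CREATE INDEX episodic_memory_embedding_idx
ON episodic_memory
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

The operator class must match the query operator: `vector_cosine_ops` pairs with `<=>`, so the indexed `ORDER BY embedding <=> $1` query above can use it.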
Three patterns cover 95% of production agent memory needs. They are not mutually exclusive — most robust systems combine all three.
Keep the last N turns verbatim. Simple, predictable, zero infrastructure beyond your existing message store.
def get_sliding_window_context(messages: list[dict], max_turns: int = 20) -> list[dict]:
    """Return the last max_turns messages, preserving role alternation."""
    if len(messages) <= max_turns:
        return messages
    # Always keep the first message (usually sets context)
    return [messages[0]] + messages[-(max_turns - 1):]
When to use it: short-lived task-focused agents, customer support, coding assistants where recency is everything and history beyond ~10 turns is rarely relevant.
Maintain a rolling summary of conversation history, updated as new turns are added. Inject the summary as a "system context" block at the start of each request.
This is what most "long-term memory" features in consumer products actually do. Claude's built-in memory feature (which we cover in the Anthropic section below) uses a variant of this pattern.
The key engineering challenge is summary freshness: the summary must be updated promptly so it reflects the most recent meaningful turns, but running a summarization call on every turn is wasteful. A trigger-based approach works well:
class ProgressiveSummaryMemory:
    def __init__(self, summarize_every_n_turns: int = 10):
        self.raw_recent: list[dict] = []
        self.summary: str = ""
        self.unsummarized_count: int = 0
        self.summarize_every_n = summarize_every_n_turns

    def add(self, role: str, content: str):
        self.raw_recent.append({"role": role, "content": content})
        self.unsummarized_count += 1
        if self.unsummarized_count >= self.summarize_every_n:
            self._update_summary()

    def _update_summary(self):
        # Summarize the oldest half of raw_recent into self.summary
        split = len(self.raw_recent) // 2
        to_summarize = self.raw_recent[:split]
        self.raw_recent = self.raw_recent[split:]
        self.unsummarized_count = 0
        turns_text = "\n".join(
            f"{m['role'].upper()}: {m['content']}" for m in to_summarize
        )
        prompt = (
            f"Existing summary:\n{self.summary}\n\n"
            f"New conversation turns to integrate:\n{turns_text}\n\n"
            "Update the summary to include the new turns. Preserve all key facts, "
            "decisions, preferences, and unresolved questions. Be concise."
        )
        # Run summarization (use a fast/cheap model for this)
        # self.summary = call_model(prompt)

    def get_context_prefix(self) -> str:
        if not self.summary:
            return ""
        return f"[Conversation context: {self.summary}]"
The most structured approach: after each turn, extract entities and facts from the conversation and store them in a structured key-value or graph store. At retrieval time, pull the relevant entities for the current query.
This is the most powerful pattern for long-running agents with complex state. It is also the most expensive (requires an LLM call for extraction) and the most complex to implement correctly (entity resolution, conflict handling).
import json
from anthropic import Anthropic

def extract_entities_from_turn(turn_content: str, client: Anthropic) -> list[dict]:
    """Extract structured entities/facts from a conversation turn."""
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        system=(
            "Extract entities and facts from this text as JSON. "
            "Return a JSON array of objects with keys: "
            "'entity_type' (person/company/preference/fact/decision), "
            "'key', 'value', 'confidence' (0-1). "
            "Return only the JSON array, no explanation."
        ),
        messages=[{"role": "user", "content": turn_content}]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return []
For agents that operate over months and need to understand complex entity relationships, a knowledge graph outperforms a flat vector store for certain queries.
Consider a sales agent. It accumulates facts over time: "Alice is the VP of Engineering at Acme Corp." "Acme Corp is a customer since January 2024." "Alice has been interested in the enterprise tier." "Acme Corp uses AWS." "Alice mentioned her team is 40 engineers."
In a flat vector store, these are five independent chunks. In a knowledge graph, they are connected nodes: Alice → (employee of) → Acme Corp → (uses) → AWS. This connectivity enables traversal queries that vector similarity cannot handle: "What do we know about all contacts at companies that use AWS and have 30+ engineers?"
graph TD
A[Alice Chen] -- "VP Engineering at" --> B[Acme Corp]
A -- "Interested in" --> C[Enterprise Tier]
A -- "Team size" --> D[40 engineers]
B -- "Customer since" --> E[Jan 2024]
B -- "Uses" --> F[AWS]
B -- "Industry" --> G[Fintech]
H[Bob Smith] -- "Account Exec for" --> B
H -- "Last contact" --> I[2026-03-10]
C -- "Price point" --> J[$50k/yr]

For most teams, building a full knowledge graph is over-engineering. The pragmatic approach is a hybrid: use a relational database for structured entity facts (user profiles, company data, explicit preferences) and a vector store for unstructured episodic memories. The knowledge graph becomes relevant when you need multi-hop traversal and your entity graph has tens of thousands of nodes.
If you do need a knowledge graph, LlamaIndex's property graph support gives you a practical path without building from scratch. For heavier production use, Neo4j remains the standard.
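To make the traversal argument concrete, here is a toy sketch over an in-memory edge list — a stand-in for a real graph store, with hypothetical entity names borrowed from the sales-agent example above:

```python
# Why traversal beats flat similarity for relational queries: the answer
# requires joining two relations, which vector search cannot express.
# Edge list is illustrative; a real system would use Neo4j or similar.
EDGES = [
    ("Alice Chen", "employee_of", "Acme Corp"),
    ("Acme Corp", "uses", "AWS"),
    ("Alice Chen", "team_size", "40"),
    ("Bob Smith", "employee_of", "Initech"),
    ("Initech", "uses", "GCP"),
]

def contacts_at_companies_using(tech: str, edges=EDGES) -> list[str]:
    """Find people employed at companies that use the given tech."""
    companies = {s for s, rel, o in edges if rel == "uses" and o == tech}
    return [s for s, rel, o in edges if rel == "employee_of" and o in companies]
```

A vector store would return chunks *about* AWS or *about* Alice; only the join over the `employee_of` and `uses` edges answers the combined question.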
This is the problem almost no one has solved well. When memories conflict, what do you do?
The naive answer is "last write wins." This is usually wrong. A user who said "my company has 50 employees" in January and "we just raised our Series B and hired to 180 people" in March is not giving you conflicting data — they are giving you an update. The memory should reflect the current state.
But consider: "I prefer detailed explanations" (stated explicitly) versus the observation that the user consistently skips to the bottom of long responses (inferred from behavior). These conflict. Which is true? Probably both, in different contexts.
flowchart TD
A([New fact arrives:\nuser_id, key, value]) --> B[(Look up existing\nfacts with same key)]
B --> C{Existing fact\nfound?}
C -- No --> D[Store new fact\nconfidence=source_confidence]
C -- Yes --> E{Same value?}
E -- Yes --> F[Update timestamp\nIncrease confidence]
E -- No --> G{What type\nof conflict?}
G -- "Temporal update\ne.g. team size changed" --> H[Replace old fact\nKeep history in log]
G -- "Contradictory\ne.g. explicit vs inferred" --> I{Which is more\ntrusted source?}
I -- "Explicit > inferred" --> J[Explicit wins\nFlag inferred for review]
I -- "Both explicit" --> K[Store both with\nconflict flag\nResolve at query time]
J --> L[(Updated memory store)]
H --> L
K --> L
D --> L
F --> L

A practical conflict resolution policy:
Source hierarchy: Explicit user statements outrank inferred behavior. Recent statements outrank older ones (with a configurable staleness threshold). High-confidence observations outrank low-confidence ones.
Temporal updates: For facts with an inherent temporal nature (company size, job title, location), always update rather than conflict. Keep the old value in a history log for audit purposes.
Coexisting contradictions: For genuinely contradictory preferences ("user wants concise responses" vs. "user always asks follow-up questions wanting more detail"), store both with context tags and resolve at query time based on the current situation.
Explicit resolution: For high-stakes facts (billing information, user permissions), do not resolve automatically — surface the conflict to the user for explicit resolution.
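A minimal sketch of this policy in code, assuming facts shaped like the `UserFact` schema from earlier; the `TEMPORAL_KEYS` set and the confidence bump are illustrative choices, not fixed rules:

```python
# Conflict resolution sketch over UserFact-shaped dicts.
# TEMPORAL_KEYS is an assumed config: keys whose values change over time
# and should always be replaced rather than treated as contradictions.
TEMPORAL_KEYS = {"company_size", "job_title", "location"}

def resolve(existing: dict, incoming: dict) -> tuple[dict, list[dict]]:
    """Return (winning fact, entries to append to the history log)."""
    if existing["value"] == incoming["value"]:
        # Same value observed again: refresh timestamp, bump confidence
        merged = {**existing,
                  "created_at": incoming["created_at"],
                  "confidence": min(1.0, existing["confidence"] + 0.1)}
        return merged, []
    if incoming["key"] in TEMPORAL_KEYS:
        # Temporal update: replace, keep the old value for audit
        return incoming, [existing]
    if incoming["source"] == "explicit" and existing["source"] == "inferred":
        return incoming, [existing]  # explicit wins
    if incoming["source"] == "inferred" and existing["source"] == "explicit":
        return existing, [incoming]  # explicit wins, log the inferred one
    # Both explicit and contradictory: flag for query-time resolution
    flagged = {**incoming, "conflict_with": existing["value"]}
    return flagged, []
```

High-stakes keys (billing, permissions) should bypass this function entirely and go to the user, per the explicit-resolution rule above.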
Anthropic announced a consumer memory feature for Claude in early 2026 that is worth understanding from an architectural standpoint, both as a product pattern and as a signal about where the industry is going.
The feature, as described in their announcement, stores a compressed representation of user preferences and facts across sessions and injects it into the system prompt at the start of each conversation.
For builders, this matters because it validates the market for agent memory as a product-level feature, and because users will increasingly expect agents to remember them. The baseline expectation is shifting.
Session state is the bridge between working memory and episodic memory. It is the mutable context that persists across turns within a session but does not need to be retrieved — it is always available.
For multi-step agents, session state typically includes the task plan, the current step, accumulated facts from tool results, and decisions the user has made along the way:
from enum import Enum
from pydantic import BaseModel, Field

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    FAILED = "failed"

class AgentStep(BaseModel):
    step_id: str
    description: str
    status: TaskStatus = TaskStatus.PENDING
    tool_calls: list[dict] = Field(default_factory=list)
    result: str | None = None
    error: str | None = None
    retry_count: int = 0

class AgentSessionState(BaseModel):
    session_id: str
    user_id: str
    task_description: str
    plan: list[AgentStep] = Field(default_factory=list)
    current_step_index: int = 0
    accumulated_context: dict = Field(default_factory=dict)  # key facts from tool results
    user_decisions: list[dict] = Field(default_factory=list)
    created_at: str
    last_updated: str

    @property
    def current_step(self) -> AgentStep | None:
        if self.current_step_index < len(self.plan):
            return self.plan[self.current_step_index]
        return None

    def advance_step(self, result: str):
        if self.current_step:
            self.current_step.status = TaskStatus.COMPLETED
            self.current_step.result = result
            self.current_step_index += 1

    def to_context_string(self) -> str:
        completed = [s for s in self.plan if s.status == TaskStatus.COMPLETED]
        current = self.current_step
        remaining = self.plan[self.current_step_index + 1:]
        lines = [f"Task: {self.task_description}"]
        if completed:
            lines.append(f"Completed steps ({len(completed)}):")
            for s in completed[-3:]:  # Show last 3 completed
                lines.append(f"  ✓ {s.description}: {(s.result or '')[:100]}")
        if current:
            lines.append(f"Current step: {current.description}")
        if remaining:
            lines.append(f"Remaining: {len(remaining)} steps")
        if self.accumulated_context:
            lines.append("Key context:")
            for k, v in list(self.accumulated_context.items())[:5]:
                lines.append(f"  {k}: {v}")
        return "\n".join(lines)
The session state is serialized to your persistence layer (Redis for active sessions, Postgres or S3 for archive) and deserialized at the start of each turn. Unlike episodic memory, it is not retrieved by similarity — it is always injected verbatim because it is always relevant.
This is the section most agent memory posts skip. Stateless agents are often the right answer, and persistent memory is frequently cargo-culted without justification.
You probably do NOT need persistent memory if:
Your agent tasks are self-contained. A code review agent, a data analysis agent, a document summarization agent — these are atomic tasks. The user provides the full context with each request. There is nothing to remember across sessions.
Your context is already explicit. If the user includes all relevant context in their request ("Here is our API schema. Here is the error. Fix it."), then there is nothing to retrieve. Injecting stale memories from previous sessions can actually degrade performance by polluting the context with irrelevant old information.
Your users do not return. If retention is low (say, many one-time users), building a memory system is infrastructure investment with no return. Measure retention before building memory.
Your trust surface is high-risk. Memory systems that automatically inject facts from past sessions can be exploited. If an attacker can poison your memory store (via prompt injection in previous turns, for example), they can influence all future behavior. For high-stakes agents (financial operations, access control decisions), stateless is safer.
Your task is latency-sensitive. Retrieval adds latency. If your agent needs to respond in under 500ms, the retrieval roundtrip to a vector store may be prohibitive. Profile before adding memory.
The right question to ask is not "should our agent have memory?" but "what specific user problem does memory solve?" If you cannot point to a concrete friction in user journeys that memory removes, do not build it.
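The checklist above can be encoded as a simple gate. This is a sketch, not a policy engine; the inputs and the assumed retrieval latency are placeholders you would replace with your own measurements:

```python
def needs_persistent_memory(
    tasks_self_contained: bool,
    users_return: bool,
    high_risk_trust_surface: bool,
    latency_budget_ms: int,
    retrieval_latency_ms: int = 150,  # assumed typical vector-store roundtrip
) -> bool:
    """Default to stateless unless memory solves a concrete, safe,
    affordable user problem."""
    if tasks_self_contained or not users_return:
        return False  # nothing to remember, or no one comes back for it
    if high_risk_trust_surface:
        return False  # memory-poisoning risk outweighs personalization
    if latency_budget_ms < retrieval_latency_ms:
        return False  # the retrieval roundtrip would blow the budget
    return True
```

If this function returns `False` for your product, stop here: a stateless agent is the right answer.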
Putting this all together, here is the decision logic for every agent turn:
flowchart TD
A([New agent turn\nuser message received]) --> B[Parse intent\nClassify query type]
B --> C{Is this a\nnew session?}
C -- Yes --> D[Load user semantic profile\nfrom KV store]
C -- No --> E[Load active session state]
D --> F{Does query reference\npast events?}
E --> F
F -- "Yes (temporal,\nreference to 'last time',\nexplicit history)" --> G[Retrieve episodic memory\nTop-K by similarity + recency]
F -- "No" --> H[Skip episodic retrieval]
G --> I{Does query need\ndomain knowledge?}
H --> I
I -- Yes --> J[RAG retrieval from\ndocument/knowledge store]
I -- No --> K[Skip RAG]
J --> L[Rerank & deduplicate\nApply token budget]
K --> L
L --> M[Assemble final context:\n1. System prompt\n2. User profile\n3. Session state\n4. Retrieved memories\n5. Recent history\n6. Current message]
M --> N[Model call]
N --> O[Store turn to\nepisodic memory]
O --> P{New facts\nextracted?}
P -- Yes --> Q[Update semantic\nmemory / KV store]
P -- No --> R([Done])
Q --> R
This decision tree is not theoretical — it is the exact conditional logic you need to implement in the "memory manager" component that sits between your user-facing API and your model calls.
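A self-contained sketch of that memory manager, with in-memory dicts standing in for the real stores and keyword matching standing in for real retrieval heuristics — every class and method name here is illustrative:

```python
class MemoryManager:
    """One agent turn following the flowchart: load state, retrieve
    conditionally, assemble context in a fixed order, call the model,
    then write back. `model` is any callable over the context string."""

    PAST_CUES = ("last time", "previously", "earlier", "yesterday")

    def __init__(self, model):
        self.model = model
        self.profiles: dict[str, dict] = {}        # semantic memory (KV store)
        self.sessions: dict[str, str] = {}         # working memory / session state
        self.episodic: dict[str, list[str]] = {}   # episodic memory per user

    def _references_past(self, message: str) -> bool:
        return any(cue in message.lower() for cue in self.PAST_CUES)

    def run_turn(self, user_id: str, session_id: str, message: str) -> str:
        profile = self.profiles.get(user_id, {})
        session = self.sessions.get(session_id, "")

        memories: list[str] = []
        if self._references_past(message):
            # Placeholder for top-K retrieval by similarity + recency.
            memories = self.episodic.get(user_id, [])[-3:]

        # Fixed assembly order: profile, session state, memories, message.
        parts = []
        if profile:
            parts.append(f"User profile: {profile}")
        if session:
            parts.append(f"Session state: {session}")
        if memories:
            parts.append("Relevant past turns:\n" + "\n".join(memories))
        parts.append(f"User: {message}")
        reply = self.model("\n\n".join(parts))

        # Write-back: every turn becomes an episodic event.
        self.episodic.setdefault(user_id, []).append(f"{message} -> {reply}")
        self.sessions[session_id] = f"last message: {message}"
        return reply
```

The RAG branch and the fact-extraction step from the flowchart are omitted for brevity; they slot in as two more conditional blocks before and after the model call.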
How much does vector storage actually cost at scale?
For agent memory, the volumes are typically small. A user with 100 sessions, 10 turns per session, generates roughly 1,000 episodic events. At 1,536 dimensions (OpenAI's text-embedding-3-small), that's 1,000 vectors × 6KB each = 6MB per user. For 10,000 users, that's 60GB of vector data — well within the free tiers of managed services. Memory is not a cost problem until you have very large user bases or very long conversations.
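The arithmetic behind those figures, using float32 vectors and decimal units (which lands at ~61 GB, consistent with the ~60 GB ballpark above):

```python
DIMS = 1536                  # text-embedding-3-small
BYTES_PER_FLOAT = 4          # float32
events_per_user = 100 * 10   # 100 sessions x 10 turns

bytes_per_vector = DIMS * BYTES_PER_FLOAT                 # 6,144 B ~ 6 KB
per_user_mb = events_per_user * bytes_per_vector / 1e6    # ~6 MB per user
fleet_gb = per_user_mb * 10_000 / 1e3                     # ~61 GB for 10k users

print(f"{bytes_per_vector} B/vector, {per_user_mb:.1f} MB/user, {fleet_gb:.0f} GB fleet")
```

Note this counts raw vectors only; metadata, indexes, and replication typically add a small constant factor, but not enough to change the conclusion.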
Should I embed the whole conversation turn or chunk it?
For episodic memory (individual turns), embed the full turn — they are already small. For semantic/document memory, chunk into 200-400 token segments with 20% overlap, embed each chunk, and store the full document reference in metadata. At query time, retrieve chunks and reconstruct context.
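The chunking rule above can be sketched as a sliding window. For simplicity this operates on a pre-tokenized list (whitespace-split words stand in here; production code would use a real tokenizer such as tiktoken):

```python
def chunk_tokens(tokens: list[str], size: int = 300,
                 overlap_frac: float = 0.2) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with fractional overlap.
    size=300, overlap_frac=0.2 -> each chunk shares 60 tokens with the next."""
    step = max(1, int(size * (1 - overlap_frac)))  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already reached the end of the document
    return chunks
```

Store the parent document ID and chunk offset in each chunk's metadata so retrieved chunks can be mapped back to their source at query time.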
How do I handle memory for multi-user shared agents (e.g., team agents)?
Use separate namespaces or tenants per user, plus a shared global namespace for team-level knowledge. At retrieval time, merge results from both namespaces and rerank. The user namespace gets a higher recency boost; the team namespace gets a higher domain relevance boost.
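A sketch of that merge-and-rerank step. Hits are `(score, timestamp, payload)` tuples as they might come back from two namespace queries; the boost weights and decay curve are illustrative, not tuned values:

```python
import time

def merge_and_rerank(user_hits, team_hits, now=None, top_k=5,
                     user_recency_boost=0.2, team_relevance_boost=0.1):
    """Merge results from a per-user and a shared team namespace, then
    rerank: user hits get a recency boost, team hits a flat domain boost."""
    now = now or time.time()
    scored = []
    for score, ts, payload in user_hits:
        age_days = (now - ts) / 86400
        recency = 1.0 / (1.0 + age_days)  # decays toward 0 as memories age
        scored.append((score + user_recency_boost * recency, payload))
    for score, ts, payload in team_hits:
        scored.append((score + team_relevance_boost, payload))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [payload for _, payload in scored[:top_k]]
```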
Can I use the same vector store for both episodic and semantic memory?
Yes, but use different collections or namespaces. The retrieval logic, update frequency, and data schemas are different enough that mixing them in the same collection creates confusion and degrades retrieval quality.
What embedding model should I use?
For general-purpose agent memory in 2026, OpenAI's text-embedding-3-small (1,536 dims, fast, cheap) is the practical default. For multilingual or domain-specific applications, consider text-embedding-3-large or a fine-tuned model. Do not overthink this early — embedding model quality matters less than retrieval architecture quality.
How do I evaluate whether my memory system is actually helping?
Measure: (1) context hit rate — how often does a retrieved memory actually influence the agent's response? (2) recall precision — when users reference past events, does the agent respond correctly? (3) context pollution rate — how often does injected memory cause confusion or incorrect behavior? A/B test memory vs. no-memory on these metrics before assuming memory is beneficial.
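Those three metrics reduce to simple ratios over logged turns. A sketch, assuming each turn is logged as a dict with illustrative field names (`memory_influenced` and `caused_confusion` would in practice come from an LLM grader or human labels):

```python
def memory_metrics(turns: list[dict]) -> dict:
    """Compute context hit rate, recall precision, and context pollution
    rate from per-turn logs."""
    with_mem = [t for t in turns if t["memories_injected"] > 0]
    past_refs = [t for t in turns if t["referenced_past"]]
    return {
        "context_hit_rate": (
            sum(t["memory_influenced"] for t in with_mem) / len(with_mem)
            if with_mem else 0.0),
        "recall_precision": (
            sum(t["recalled_correctly"] for t in past_refs) / len(past_refs)
            if past_refs else 0.0),
        "context_pollution_rate": (
            sum(t["caused_confusion"] for t in with_mem) / len(with_mem)
            if with_mem else 0.0),
    }
```

Run this over both arms of the A/B test; memory earns its keep only if hit rate and precision rise while pollution stays near zero.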
What about the MCP protocol and external memory tools?
Anthropic's Model Context Protocol allows agents to access external memory stores as tools rather than injecting everything into context. This is an increasingly important pattern for long-running agents: instead of preemptively retrieving memories, the agent can call a search_memory tool when it determines retrieval is needed. This reduces context pollution but adds latency and requires the agent to be good at knowing when to search.
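The tool-based pattern looks roughly like this. The tool name, description, and parameters are illustrative; only the overall shape (name, description, JSON Schema input) follows the MCP tool-definition convention:

```python
# A memory-search tool the agent calls on demand, instead of having
# memories injected preemptively into every turn's context.
SEARCH_MEMORY_TOOL = {
    "name": "search_memory",
    "description": (
        "Search the user's episodic memory store. Call this only when the "
        "current request references past sessions or prior decisions."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to look for"},
            "top_k": {"type": "integer", "default": 5},
            "after": {"type": "string", "description": "ISO date lower bound"},
        },
        "required": ["query"],
    },
}
```

The description field is doing real work here: it is the only lever you have over when the model decides to search, so write it as a policy, not a label.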
My agent is replacing a SaaS workflow — how much of the existing app's data should become agent memory?
Start narrow. Map the specific decisions the agent needs to make, then work backwards to what data those decisions require. Ingesting all historical data because "maybe it'll be useful" typically degrades performance. The voice of customer data on what users actually ask about is the best guide for what to index.
Memory is infrastructure, not an afterthought. The agents that retain users are the agents that feel like they know you — and that requires deliberate design at every layer: what you store, where you store it, how you retrieve it, and when you discard it.
The technical primitives are available today: vector stores are mature, embedding APIs are cheap, context windows are large enough for most use cases. What is still rare is the architectural discipline to combine them correctly. That is the real competitive edge.
If you are building an agent business, the memory architecture decisions you make in the next six months will determine whether your product feels intelligent or frustrating eighteen months from now. Start with the taxonomy, instrument your retrieval quality, and iterate on what actually helps users — not what looks impressive in a demo.
Related reading: The AI Agent Startup Opportunity — how agent products create defensible businesses. AI Agents Replacing SaaS — why agents are eating point solutions. MCP Integration for SaaS — the protocol layer connecting agents to tools.