EverMemory is the memory backbone of the Ever project suite — a set of systems for building persistent-memory D&D NPCs. It provides episodic memory retrieval with emotional salience weighting and temporal awareness. See also: EverTavern (the multi-agent NPC system) and EverTraining (fine-tuning on fantasy dialog).
Motivation
A believable NPC must recall memories appropriate to their current point in the narrative. Emotionally significant events — a betrayal, a rescue, a declaration of love — should be recalled more vividly and more readily than routine interactions. Base LLMs have no native long-term memory: without retrieval augmentation, an NPC forgets everything between context windows.
Four Approaches Compared
We benchmark four retrieval strategies using A Princess of Mars (Edgar Rice Burroughs) as a test corpus — 638 narrative events spanning 28 chapters, condensed into 34 episodes.
| Approach | Retrieval Method | Temporal Awareness |
|---|---|---|
| Baseline | No retrieval; raw LLM knowledge | None — always “knows” the full story |
| Static RAG | ~300-token chunks, top-k cosine similarity | None — retrieves by relevance only |
| GraphRAG | Entity/relationship extraction, NetworkX graph, Louvain community detection, kNN on embeddings | None — graph is a snapshot |
| Episodic Memory | Scene-bounded episodes with salience weighting, temporal filtering, and cognitive appraisals | Yes — filters by sequence number |
Episode Construction
Raw narrative events are segmented into episodes using boundary detection:
- Scene transition patterns (regex): phrases like “you arrive”, “the next morning”, “hours later”
- Time gaps: >10 minutes between events triggers a new episode
- Participant shifts: less than 30% entity overlap with recent events signals a scene change
- Size cap: maximum 25 events per episode
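The boundary rules above can be sketched as a single predicate. This is a minimal illustration, not the project's actual code: the `Event` representation (a dict with `text`, `timestamp` in seconds, and an `entities` set) and the 5-event overlap window are assumptions.

```python
import re

# Scene-transition phrases from the rules above (illustrative subset)
SCENE_PATTERNS = re.compile(r"\b(you arrive|the next morning|hours later)\b", re.IGNORECASE)

def is_boundary(event, episode):
    """Return True if `event` should open a new episode.

    `event` is a dict with `text`, `timestamp` (seconds), and `entities`
    (a set of entity IDs); `episode` is the running list of such events.
    """
    if SCENE_PATTERNS.search(event["text"]):
        return True
    if event["timestamp"] - episode[-1]["timestamp"] > 600:  # >10-minute gap
        return True
    # <30% entity overlap with the recent events in the episode
    recent = set().union(*(e["entities"] for e in episode[-5:]))
    union = event["entities"] | recent
    if union and len(event["entities"] & recent) / len(union) < 0.3:
        return True
    return len(episode) >= 25  # size cap
```

The first rule that fires wins; in practice the regex list would be much longer than the three phrases shown.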
Once a boundary is detected, two LLM calls extract structured metadata:
Call 1 — Episode metadata (GPT-4o, JSON output):
- Title, gist (1-2 sentences), detail (2-4 sentences, first-person)
- Location, participants (hashed to entity IDs via spaCy NER)
- Arousal (0-1), valence (-1 to +1), emotional tags, themes
Call 2 — Cognitive appraisal (following Lazarus’s appraisal theory):
- Primary appraisal: relevance (irrelevant/benign/stressful), goal congruence (-1 to +1)
- Secondary appraisal: coping potential (high/moderate/low/helpless), coping strategy
- Causal attribution, norm compatibility, beliefs formed
- State deltas: relationship direction changes, belief evolution, knowledge gained
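Taken together, the appraisal call might return a payload shaped like this. The structure follows the lists above, but the exact wire format and all concrete values are illustrative assumptions:

```python
# Illustrative appraisal payload; structure follows the lists above,
# concrete values (including "entity_42") are invented for the example.
APPRAISAL_EXAMPLE = {
    "primary": {
        "relevance": "stressful",          # irrelevant | benign | stressful
        "goal_congruence": -0.8,           # -1 to +1
    },
    "secondary": {
        "coping_potential": "moderate",    # high | moderate | low | helpless
        "coping_strategy": "problem-focused",
    },
    "causal_attribution": "other-caused",
    "norm_compatibility": -0.5,
    "beliefs_formed": ["this captor can be reasoned with"],
    "state_deltas": {
        "relationship_changes": {"entity_42": "wary -> cautious trust"},
        "belief_evolution": [],
        "knowledge_gained": ["local customs forbid open displays of emotion"],
    },
}
```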
Both outputs are embedded with text-embedding-3-large (OpenAI) and stored in Elasticsearch.
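A plausible Elasticsearch index mapping for these documents might look like the following. The field names are assumptions based on the metadata listed above; the vector dimensionality is a fact of the model (text-embedding-3-large emits 3072-dimensional vectors).

```python
# Hypothetical Elasticsearch 8.x mapping for stored episodes.
EPISODE_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "gist": {"type": "text"},
            "detail": {"type": "text"},
            "participants": {"type": "keyword"},   # hashed entity IDs
            "emotional_tags": {"type": "keyword"},
            "arousal": {"type": "float"},
            "valence": {"type": "float"},
            "salience": {"type": "float"},
            "sequence": {"type": "integer"},       # enables temporal filtering
            "gist_embedding": {
                "type": "dense_vector",
                "dims": 3072,                      # text-embedding-3-large
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}
```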
Salience Dynamics
Each episode receives an initial salience score:
salience = 0.4 * arousal + 0.2 * |valence| + 0.2 * novelty + 0.2 * personal_relevance

where novelty = 1 − max cosine similarity to the 5 most recent episodes, and personal_relevance = 1.0 if the NPC is a participant, 0.3 otherwise. An inhibitory suppression effect (Richter-Levin & Akirav, 2003) penalizes calm episodes that follow high-arousal ones.
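As a minimal sketch of this formula (NumPy assumed; the embeddings are the gist vectors from the previous section, and the inhibitory suppression effect is omitted):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def initial_salience(arousal, valence, embedding, recent_embeddings, is_participant):
    """Initial salience per the weighted formula above."""
    # novelty = 1 - max cosine similarity to the most recent episodes
    novelty = 1.0 - max((cosine(embedding, e) for e in recent_embeddings), default=0.0)
    personal_relevance = 1.0 if is_participant else 0.3
    return 0.4 * arousal + 0.2 * abs(valence) + 0.2 * novelty + 0.2 * personal_relevance
```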
Over time, salience decays at a configurable rate with a floor proportional to arousal — ensuring emotionally intense memories persist longer. Each retrieval applies a rehearsal boost, incrementing salience and reinforcing the memory. Episodes that fall below a consolidation threshold lose their detailed representation, fading to gist-only recall.
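One way to sketch these dynamics; the decay rate, rehearsal boost, and floor coefficient are assumed configuration values, not taken from the source:

```python
def decay_step(salience, arousal, decay_rate=0.05, floor_coef=0.3):
    """One decay tick. The floor is proportional to arousal, so
    emotionally intense memories never fade below floor_coef * arousal."""
    return max(salience * (1.0 - decay_rate), floor_coef * arousal)

def rehearse(salience, boost=0.1):
    """Each retrieval reinforces the memory (capped at 1.0)."""
    return min(salience + boost, 1.0)
```

A calm episode (arousal 0) decays toward zero, while a high-arousal one settles at its floor; repeated retrieval counteracts decay for memories the NPC keeps revisiting.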
Retrieval Modes
Four composable retrieval modes can be combined per query:
- Entity-triggered: Elasticsearch terms query on participant entity IDs
- Situation-triggered (kNN): Cosine similarity on gist embeddings, scored as cosine_sim * (0.5 + 0.5 * salience)
- Emotional: Filter by emotional tags and minimum arousal threshold
- Temporal: Last N episodes by sequence number within a session
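The situation-triggered score can be expressed directly, alongside a sketch of the kNN request it would re-rank. The `gist_embedding` field name and the `num_candidates` heuristic are assumptions:

```python
def situation_score(cosine_sim, salience):
    """Salience-weighted similarity for the situation-triggered mode."""
    return cosine_sim * (0.5 + 0.5 * salience)

def knn_request(query_vector, k=5):
    """Hypothetical Elasticsearch 8.x kNN request body; re-scoring by
    situation_score happens client-side after the hits come back."""
    return {
        "knn": {
            "field": "gist_embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        }
    }
```

Note the scoring floor: even a zero-salience episode keeps half of its raw similarity, so faded memories can still surface when they are an exact situational match.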
Each retrieved episode also pulls in its ±1 adjacent neighbors (temporal contiguity, following the EM-LLM pattern from Fountas et al., ICLR 2025), and results are assembled into a token-budgeted context block (3,000 tokens). High-salience episodes (>=0.4) use their vivid detail; faded episodes use their gist.
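Assembly under these rules can be sketched as follows. The episode fields and the 4-characters-per-token cost estimate are assumptions; a real implementation would use a proper tokenizer for the budget:

```python
def assemble_context(retrieved, all_episodes, budget_tokens=3000):
    """Expand hits with +/-1 temporal neighbors, then pack into a token
    budget, using vivid detail for high-salience episodes (>= 0.4) and
    gist for faded ones. Assumes `sequence` indexes into `all_episodes`."""
    indices = set()
    for ep in retrieved:
        i = ep["sequence"]
        indices.update({i - 1, i, i + 1})   # temporal contiguity
    block, used = [], 0
    for i in sorted(i for i in indices if 0 <= i < len(all_episodes)):
        ep = all_episodes[i]
        text = ep["detail"] if ep["salience"] >= 0.4 else ep["gist"]
        cost = len(text) // 4               # rough chars-per-token estimate
        if used + cost > budget_tokens:
            break
        block.append(text)
        used += cost
    return "\n\n".join(block)
```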
Evaluation
Five dimensions are evaluated across three narrative time points (early, mid, late):
| Dimension | What It Tests |
|---|---|
| Identity | Self-model consistency (“Who am I?”) |
| Relationships | Entity knowledge + relationship descriptions |
| Emotion | Emotional episode retrieval by tags/arousal |
| Temporal | Sequence ordering and chain integrity |
| Fidelity | Scene-specific detail recall |
Key result — “Who is Dejah Thoris?” at three time points:
- Early (before meeting): Episodic memory correctly responds “I have not yet encountered anyone by that name.” Baseline, Static RAG, and GraphRAG all describe the full relationship arc regardless of time point.
- Mid (growing bond): Episodic memory retrieves the rescue and moonlit walk episodes.
- Late (married): Episodic memory includes the full trajectory — rescue, sacrifice, union.
This temporal understanding is a core advantage. The other approaches always “know” the ending, even at the start.
Negative-knowledge test — “When did you realize the Therns were manipulating events?” (The Therns do not appear in A Princess of Mars.)
- Baseline: Confidently fabricates a detailed answer about the Therns from its training data, describing their “control over the River Iss pilgrimage” and “false divinity.”
- Episodic Memory: Correctly responds “I have no knowledge of the Therns” at all three time points.
The full probe results across all four retrieval approaches and six probe questions are available in the probe report.
References
- Pink et al. (2025) — Properties of episodic memory desirable for AI agents
- Fountas et al. (ICLR 2025) — EM-LLM: surprise-based episode boundaries, temporal contiguity retrieval
- McGaugh (2004) — Emotional arousal strengthens memory consolidation
- Richter-Levin & Akirav (2003) — Emotional tagging and inhibitory phase hypothesis
- Lazarus (1991) — Cognitive appraisal theory (primary/secondary appraisal)
- Scherer (2009) — Agency and norm-compatibility in emotion appraisal