EverMemory is the memory backbone of the Ever project suite — a set of systems for building persistent-memory D&D NPCs. It provides episodic memory retrieval with emotional salience weighting and temporal awareness. See also: EverTavern (the multi-agent NPC system) and EverTraining (fine-tuning on fantasy dialog).
Motivation
A believable NPC must recall memories appropriate to their current point in the narrative. Emotionally significant events — a betrayal, a rescue, a declaration of love — should be recalled more vividly and more readily than routine interactions. Base LLMs have no native long-term memory: without retrieval augmentation, an NPC forgets everything between context windows.
Four Approaches Compared
We benchmark four retrieval strategies using A Princess of Mars (Edgar Rice Burroughs) as a test corpus — 638 narrative events spanning 28 chapters, condensed into 34 episodes.
| Approach | Retrieval Method | Temporal Awareness |
|---|---|---|
| Baseline | No retrieval; raw LLM knowledge | None — always “knows” the full story |
| Static RAG | ~300-token chunks, top-k cosine similarity | None — retrieves by relevance only |
| GraphRAG | Entity/relationship extraction, NetworkX graph, Louvain community detection, kNN on embeddings | None — graph is a snapshot |
| Episodic Memory | Scene-bounded episodes with salience weighting, temporal filtering, and cognitive appraisals | Yes — filters by sequence number |
Episode Construction
Raw narrative events are segmented into episodes using boundary detection:
- Scene transition patterns (regex): phrases like “you arrive”, “the next morning”, “hours later”
- Time gaps: >10 minutes between events triggers a new episode
- Participant shifts: less than 30% entity overlap with recent events signals a scene change
- Size cap: maximum 25 events per episode
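The boundary rules above can be sketched as a single predicate. This is a minimal illustration, not the project's actual code: the `Event` representation (a dict with `text`, `timestamp` in seconds, and an `entities` set) and the 5-event overlap window are assumptions.

```python
import re

# Scene-transition phrases from the rules above (illustrative subset)
SCENE_PATTERNS = re.compile(r"\b(you arrive|the next morning|hours later)\b", re.IGNORECASE)

def is_boundary(event, episode):
    """Return True if `event` should open a new episode.

    `event` is a dict with `text`, `timestamp` (seconds), and `entities`
    (a set of entity IDs); `episode` is the running list of such events.
    """
    if SCENE_PATTERNS.search(event["text"]):
        return True
    if event["timestamp"] - episode[-1]["timestamp"] > 600:  # >10-minute gap
        return True
    # <30% entity overlap with the recent events in the episode
    recent = set().union(*(e["entities"] for e in episode[-5:]))
    union = event["entities"] | recent
    if union and len(event["entities"] & recent) / len(union) < 0.3:
        return True
    return len(episode) >= 25  # size cap
```

The first rule that fires wins; in practice the regex list would be much longer than the three phrases shown.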
Once a boundary is detected, two LLM calls extract structured metadata:
Call 1 — Episode metadata (GPT-4o, JSON output):
- Title, gist (1-2 sentences), detail (2-4 sentences, first-person)
- Location, participants (hashed to entity IDs via spaCy NER)
- Arousal (0-1), valence (-1 to +1), emotional tags, themes
Call 2 — Cognitive appraisal (following Lazarus’s appraisal theory):
- Primary appraisal: relevance (irrelevant/benign/stressful), goal congruence (-1 to +1)
- Secondary appraisal: coping potential (high/moderate/low/helpless), coping strategy
- Causal attribution, norm compatibility, beliefs formed
- State deltas: relationship direction changes, belief evolution, knowledge gained
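Taken together, the appraisal call might return a payload shaped like this. The structure follows the lists above, but the exact wire format and all concrete values are illustrative assumptions:

```python
# Illustrative appraisal payload; structure follows the lists above,
# concrete values (including "entity_42") are invented for the example.
APPRAISAL_EXAMPLE = {
    "primary": {
        "relevance": "stressful",          # irrelevant | benign | stressful
        "goal_congruence": -0.8,           # -1 to +1
    },
    "secondary": {
        "coping_potential": "moderate",    # high | moderate | low | helpless
        "coping_strategy": "problem-focused",
    },
    "causal_attribution": "other-caused",
    "norm_compatibility": -0.5,
    "beliefs_formed": ["this captor can be reasoned with"],
    "state_deltas": {
        "relationship_changes": {"entity_42": "wary -> cautious trust"},
        "belief_evolution": [],
        "knowledge_gained": ["local customs forbid open displays of emotion"],
    },
}
```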
Both outputs are embedded with text-embedding-3-large (OpenAI) and stored in Elasticsearch.
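A plausible Elasticsearch index mapping for these documents might look like the following. The field names are assumptions based on the metadata listed above; the vector dimensionality is a fact of the model (text-embedding-3-large emits 3072-dimensional vectors).

```python
# Hypothetical Elasticsearch 8.x mapping for stored episodes.
EPISODE_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "gist": {"type": "text"},
            "detail": {"type": "text"},
            "participants": {"type": "keyword"},   # hashed entity IDs
            "emotional_tags": {"type": "keyword"},
            "arousal": {"type": "float"},
            "valence": {"type": "float"},
            "salience": {"type": "float"},
            "sequence": {"type": "integer"},       # enables temporal filtering
            "gist_embedding": {
                "type": "dense_vector",
                "dims": 3072,                      # text-embedding-3-large
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}
```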
Salience Dynamics
Each episode receives an initial salience score:
salience = 0.4 * arousal + 0.2 * |valence| + 0.2 * novelty + 0.2 * personal_relevance

where novelty = 1 − max cosine similarity to the 5 most recent episodes, and personal_relevance = 1.0 if the NPC is a participant, 0.3 otherwise. An inhibitory suppression effect (Richter-Levin & Akirav, 2003) penalizes calm episodes that follow high-arousal ones.
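As a minimal sketch of this formula (NumPy assumed; the embeddings are the gist vectors from the previous section, and the inhibitory suppression effect is omitted):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def initial_salience(arousal, valence, embedding, recent_embeddings, is_participant):
    """Initial salience per the weighted formula above."""
    # novelty = 1 - max cosine similarity to the most recent episodes
    novelty = 1.0 - max((cosine(embedding, e) for e in recent_embeddings), default=0.0)
    personal_relevance = 1.0 if is_participant else 0.3
    return 0.4 * arousal + 0.2 * abs(valence) + 0.2 * novelty + 0.2 * personal_relevance
```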
Over time, salience decays at a configurable rate with a floor proportional to arousal — ensuring emotionally intense memories persist longer. Each retrieval applies a rehearsal boost, incrementing salience and reinforcing the memory. Episodes that fall below a consolidation threshold lose their detailed representation, fading to gist-only recall.
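One way to sketch these dynamics; the decay rate, rehearsal boost, and floor coefficient are assumed configuration values, not taken from the source:

```python
def decay_step(salience, arousal, decay_rate=0.05, floor_coef=0.3):
    """One decay tick. The floor is proportional to arousal, so
    emotionally intense memories never fade below floor_coef * arousal."""
    return max(salience * (1.0 - decay_rate), floor_coef * arousal)

def rehearse(salience, boost=0.1):
    """Each retrieval reinforces the memory (capped at 1.0)."""
    return min(salience + boost, 1.0)
```

A calm episode (arousal 0) decays toward zero, while a high-arousal one settles at its floor; repeated retrieval counteracts decay for memories the NPC keeps revisiting.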
Retrieval Modes
Four composable retrieval modes can be combined per query:
- Entity-triggered: Elasticsearch terms query on participant entity IDs
- Situation-triggered (kNN): Cosine similarity on gist embeddings, scored as cosine_sim * (0.5 + 0.5 * salience)
- Emotional: Filter by emotional tags and minimum arousal threshold
- Temporal: Last N episodes by sequence number within a session
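The situation-triggered score can be expressed directly, alongside a sketch of the kNN request it would re-rank. The `gist_embedding` field name and the `num_candidates` heuristic are assumptions:

```python
def situation_score(cosine_sim, salience):
    """Salience-weighted similarity for the situation-triggered mode."""
    return cosine_sim * (0.5 + 0.5 * salience)

def knn_request(query_vector, k=5):
    """Hypothetical Elasticsearch 8.x kNN request body; re-scoring by
    situation_score happens client-side after the hits come back."""
    return {
        "knn": {
            "field": "gist_embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        }
    }
```

Note the scoring floor: even a zero-salience episode keeps half of its raw similarity, so faded memories can still surface when they are an exact situational match.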
Each retrieved episode also pulls in its ±1 adjacent neighbors (temporal contiguity, following the EM-LLM pattern from Fountas et al., ICLR 2025), and results are assembled into a token-budgeted context block (3,000 tokens). High-salience episodes (>=0.4) use their vivid detail; faded episodes use their gist.
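Assembly under these rules can be sketched as follows. The episode fields and the 4-characters-per-token cost estimate are assumptions; a real implementation would use a proper tokenizer for the budget:

```python
def assemble_context(retrieved, all_episodes, budget_tokens=3000):
    """Expand hits with +/-1 temporal neighbors, then pack into a token
    budget, using vivid detail for high-salience episodes (>= 0.4) and
    gist for faded ones. Assumes `sequence` indexes into `all_episodes`."""
    indices = set()
    for ep in retrieved:
        i = ep["sequence"]
        indices.update({i - 1, i, i + 1})   # temporal contiguity
    block, used = [], 0
    for i in sorted(i for i in indices if 0 <= i < len(all_episodes)):
        ep = all_episodes[i]
        text = ep["detail"] if ep["salience"] >= 0.4 else ep["gist"]
        cost = len(text) // 4               # rough chars-per-token estimate
        if used + cost > budget_tokens:
            break
        block.append(text)
        used += cost
    return "\n\n".join(block)
```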
Evaluation
Five dimensions are evaluated across three narrative time points (early, mid, late):
| Dimension | What It Tests |
|---|---|
| Identity | Self-model consistency (“Who am I?”) |
| Relationships | Entity knowledge + relationship descriptions |
| Emotion | Emotional episode retrieval by tags/arousal |
| Temporal | Sequence ordering and chain integrity |
| Fidelity | Scene-specific detail recall |
Key result — “Who is Dejah Thoris?” at three time points:
- Early (before meeting): Episodic memory correctly responds “I have not yet encountered anyone by that name.” Baseline, Static RAG, and GraphRAG all describe the full relationship arc regardless of time point.
- Mid (growing bond): Episodic memory retrieves the rescue and moonlit walk episodes.
- Late (married): Episodic memory includes the full trajectory — rescue, sacrifice, union.
This temporal understanding is a core advantage. The other approaches always “know” the ending, even at the start.
Negative-knowledge test — “When did you realize the Therns were manipulating events?” (The Therns do not appear in A Princess of Mars.)
- Baseline: Confidently fabricates a detailed answer about the Therns from its training data, describing their “control over the River Iss pilgrimage” and “false divinity.”
- Episodic Memory: Correctly responds “I have no knowledge of the Therns” at all three time points.
The full probe results across all four retrieval approaches and six probe questions are available in the probe report.
References
- Pink et al. (2025) — Properties of episodic memory desirable for AI agents
- Fountas et al. (ICLR 2025) — EM-LLM: surprise-based episode boundaries, temporal contiguity retrieval
- McGaugh (2004) — Emotional arousal strengthens memory consolidation
- Richter-Levin & Akirav (2003) — Emotional tagging and inhibitory phase hypothesis
- Lazarus (1991) — Cognitive appraisal theory (primary/secondary appraisal)
- Scherer (2009) — Agency and norm-compatibility in emotion appraisal