
EverTraining: Fine-Tuning GPT Models on Fantasy Dialog

Tags: LLM · Fine-tuning · OpenAI · Python

EverTraining explores whether fine-tuning small models on fantasy dialog can produce better NPC responses than prompting larger models. It’s part of the Ever project suite — see also EverTavern (the multi-agent NPC system) and EverMemory (episodic memory RAG).

The LIGHT Dataset

LIGHT (Facebook AI Research, Urbanek et al. 2019) contains 17,000+ crowd-sourced fantasy dialog exchanges where human annotators role-played fantasy characters given a persona and setting. Each example includes character name, persona description, setting, and a multi-turn conversation.

The dataset captures how people naturally role-play: terse, personality-driven responses (typically 1-3 sentences) rather than the verbose, expository style of base LLMs.

A note on data sourcing: because this project operates in a domain full of stories and personally crafted characters, the provenance of the training data mattered. Rather than scraping pre-existing D&D campaigns, LIGHT was chosen because it is an explicitly released research dataset. Similarly, the memory evaluation uses A Princess of Mars (public domain) and D&D rules from the SRD/Open5e (explicitly free to use).

Training Setup

LIGHT dialogs are converted to OpenAI chat completion format:

System: You are {character_name} in a fantasy text adventure game.
        Your persona: {character_persona}
        Setting: {setting_name} - {setting_description}
        Stay in character and respond naturally.

User:      {partner's utterance}
Assistant: {character's response}
...
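A conversion along these lines can be sketched as follows. This is a minimal illustration, not the project's actual script: the field names on the LIGHT example dict (`character_name`, `turns`, etc.) are assumptions for the sketch.

```python
import json

def light_to_chat(example):
    """Convert one LIGHT dialog (hypothetical field names) to OpenAI chat format."""
    system = (
        f"You are {example['character_name']} in a fantasy text adventure game.\n"
        f"Your persona: {example['character_persona']}\n"
        f"Setting: {example['setting_name']} - {example['setting_description']}\n"
        "Stay in character and respond naturally."
    )
    messages = [{"role": "system", "content": system}]
    # The character's own turns become assistant messages; the partner's become user messages.
    for speaker, utterance in example["turns"]:
        role = "assistant" if speaker == example["character_name"] else "user"
        messages.append({"role": role, "content": utterance})
    return {"messages": messages}

def write_jsonl(examples, path):
    """Write one chat-format example per line, as the fine-tuning API expects."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(light_to_chat(ex)) + "\n")
```

The resulting JSONL file (one `{"messages": [...]}` object per line) is what gets uploaded as the fine-tuning training file.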

Two models are fine-tuned: GPT-4.1-nano and GPT-4.1-mini, then evaluated against their base versions and frontier GPT-4.1 (five models in total) across 3 NPC archetypes (grizzled blacksmith, mysterious merchant, jovial tavern keeper) x 3 prompt engineering levels (minimal, moderate, heavy) x 2 player inputs = 18 queries per model, 90 total.
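The evaluation grid is a straight Cartesian product of those dimensions; a sketch below reproduces the 90-query count. The model labels and the second player input are placeholders, not the project's actual strings.

```python
from itertools import product

# Hypothetical labels for the five evaluated models.
MODELS = ["nano-ft", "nano-base", "mini-ft", "mini-base", "frontier"]
ARCHETYPES = ["grizzled blacksmith", "mysterious merchant", "jovial tavern keeper"]
PROMPT_LEVELS = ["minimal", "moderate", "heavy"]
PLAYER_INPUTS = [
    "I need a sword. Something that can kill a dragon.",
    "What do you have for sale?",  # placeholder second input
]

# 5 models x 3 archetypes x 3 prompt levels x 2 inputs = 90 test queries.
grid = list(product(MODELS, ARCHETYPES, PROMPT_LEVELS, PLAYER_INPUTS))
```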

Key Findings

Finding 1 — Base models leak training data. With minimal prompting (just “You are Tormund.”), base models break character — referencing “Valyrian steel” and “dragonglass” from their training data. Fine-tuned models stay in role and produce terse responses.

Finding 2 — Fine-tuned responses read like game dialog. Responses like “Aye, a mighty struggle that would be. You want one forged, or found?” feel closer to an in-game NPC. Base models produce multi-paragraph exposition inappropriate for rapid back-and-forth.

Finding 3 — Heavy prompt engineering closes the gap. With detailed style instructions, base models produce comparable quality to fine-tuned models. The trade-off is token cost: every request must carry the full prompt context.
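The token-cost trade-off is simple arithmetic: a heavy system prompt is paid for on every request, while fine-tuned behavior rides along for free. A sketch, with entirely hypothetical numbers:

```python
def extra_prompt_cost(extra_tokens, requests, price_per_1m_tokens):
    """Cost of carrying extra prompt tokens on every request.

    All inputs are hypothetical; plug in your model's actual input-token price.
    """
    return extra_tokens * requests * price_per_1m_tokens / 1_000_000

# e.g. a 500-token style guide on 1M NPC requests at $0.10 / 1M input tokens
# adds a fixed per-request overhead that fine-tuning avoids.
overhead = extra_prompt_cost(500, 1_000_000, 0.10)
```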

Finding 4 — Fine-tuning introduces occasional errors. The LIGHT dataset’s crowdsourced nature includes tangents and non-sequiturs. Fine-tuned models sometimes produce off-topic responses (“Let’s go hunting for a dragon!” from a blacksmith asked to forge a sword) or conflate character backstories.

| Metric | Nano (FT) | Nano (base) | Mini (FT) | Mini (base) | Frontier |
|---|---|---|---|---|---|
| Avg response length | 74 chars | 298 chars | 84 chars | 319 chars | 473 chars |
| Character consistency (minimal prompt) | High | Low | High | Low | Low |
| Character consistency (heavy prompt) | High | High | High | High | High |
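The average-length row is a plain aggregation over saved responses; a minimal sketch, assuming a hypothetical results schema of `{"model": ..., "response": ...}` records:

```python
from collections import defaultdict

def avg_response_length(results):
    """Mean response length in characters, grouped by model.

    `results` is a list of {"model": str, "response": str} dicts (hypothetical schema).
    """
    totals = defaultdict(lambda: [0, 0])  # model -> [char_sum, count]
    for r in results:
        totals[r["model"]][0] += len(r["response"])
        totals[r["model"]][1] += 1
    return {model: char_sum / count for model, (char_sum, count) in totals.items()}
```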

The contrast between models is stark with minimal prompting. When asked “I need a sword. Something that can kill a dragon,” the base nano model references “Valyrian steel” and “dragonglass” (Game of Thrones leakage from training data), while the fine-tuned nano model responds simply “I have a sword that can do such things!” — staying in its original fantasy world.

Interactive Viewer

The full evaluation results across all 90 test queries are available in an interactive viewer that allows filtering by NPC type, prompt level, and model.

Takeaways

When heavy prompting is feasible and token cost isn’t a concern, frontier models match or exceed fine-tuned quality without the dataset’s noise. Fine-tuning shines when you need consistent style from small, cheap models with minimal prompting — exactly the use case for high-throughput NPC dialog in a game system.