How well can AI and humans work together? Scientists are turning to Dungeons & Dragons to find out

Researchers are now using Dungeons & Dragons — a game once blamed for corrupting kids — to test how artificial intelligence plans, cooperates and role-plays alongside humans. Their early results hint at how future AI teammates might behave in real-world crises, factories and even your living room.

Why Dungeons & Dragons is becoming a serious AI test

At the NeurIPS 2025 conference in San Diego, a team led by University of California, San Diego computer scientists unveiled a research framework called “D&D Agents.” The idea is simple: throw cutting-edge language models into Dungeons & Dragons (D&D) fights and see how they cope.

The choice of game is deliberate. D&D forces players to mix creativity with strict rules, long-term planning with snap decisions, and storytelling with tactics. Every move must be described in natural language, but it also has concrete mechanical consequences, like dice rolls and hit points.

D&D gives AI models a rare mix: a clear rule set, open-ended storytelling and the need for tight teamwork.

That blend makes it an attractive playground for testing “long-horizon” skills — the kind of thinking needed to plan several steps ahead, adapt to changing conditions and coordinate with others.

How the D&D AI experiments actually worked

The researchers didn’t ask AI to run sprawling year-long campaigns. Instead, they focused on tightly controlled fight scenes taken from the classic starter adventure “Lost Mine of Phandelver.”

Each simulation followed the same structure:

  • One Dungeon Master (DM), who controls the world and monsters
  • Four hero characters, such as fighters, wizards or clerics
  • Three preset combat scenarios pulled from the adventure
  • Characters configured at low, medium or high power levels
  • A fixed length of 10 turns before the encounter ended
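The setup above can be sketched as a small configuration object. This is a hypothetical illustration of the structure the article describes, not the D&D Agents codebase; all names, fields and the example scenario are assumptions.

```python
from dataclasses import dataclass
from typing import Literal

Role = Literal["llm", "human"]  # any seat can be filled by a model or a person

@dataclass
class Character:
    name: str
    char_class: str                         # e.g. "fighter", "wizard", "cleric"
    power: Literal["low", "medium", "high"]
    controller: Role = "llm"

@dataclass
class Encounter:
    scenario: str          # one of three preset fights from the adventure
    dm_controller: Role    # the Dungeon Master seat
    party: list[Character]
    max_turns: int = 10    # fixed encounter length used in the study

# Example: an AI DM running a mixed human/AI party (names invented).
encounter = Encounter(
    scenario="Goblin Ambush",
    dm_controller="llm",
    party=[
        Character("Sildar", "fighter", "medium"),
        Character("Mira", "wizard", "medium"),
        Character("Torv", "cleric", "medium", controller="human"),
        Character("Ash", "rogue", "medium"),
    ],
)
```

Keeping the whole setup in one declarative object is what lets the same test bed swap humans and models into any seat without changing the simulation loop.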

Crucially, any role could be filled by a language model, a human, or a mix of both. In some runs, a single model played both the DM and all four heroes. In others, humans shared the table with AI companions, or an AI DM presided over a squad of human adventurers.

Because everything unfolds through dialogue, the same test bed can measure strategy, rule-following and human–AI interaction at once.


To judge performance, the team tracked combat success, resource management, consistency of role-play, and how well multiple AI agents coordinated as a team.
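Those four evaluation axes might be recorded per encounter roughly like this. The field names and 0–1 scales are assumptions for illustration; the paper's actual metrics are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class EncounterResult:
    combat_success: bool   # did the party win within the turn limit?
    resources_spent: float # fraction of spells/items consumed (0-1)
    acting_quality: float  # in-character consistency score (0-1)
    coordination: float    # how well agents supported each other (0-1)

def summarize(results: list[EncounterResult]) -> dict:
    """Average each axis across a batch of simulated encounters."""
    n = len(results)
    return {
        "win_rate": sum(r.combat_success for r in results) / n,
        "avg_resource_use": sum(r.resources_spent for r in results) / n,
        "avg_acting": sum(r.acting_quality for r in results) / n,
        "avg_coordination": sum(r.coordination for r in results) / n,
    }

runs = [
    EncounterResult(True, 0.8, 0.7, 0.6),
    EncounterResult(False, 0.4, 0.9, 0.5),
]
print(summarize(runs)["win_rate"])  # 0.5
```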

Which AI models went adventuring?

Three major large language models were put through their paces in the D&D Agents framework:

  • Claude Haiku 3.5 — Strengths in the study: efficient in combat, especially in difficult fights; strong character-specific speech. Noted weaknesses: still occasionally conservative with resources in simple encounters.
  • GPT-4 — Strengths in the study: solid overall performance; balanced narrative and tactical language. Noted weaknesses: less distinctive character voices than Claude; slightly behind in hard fights.
  • DeepSeek-V3 — Strengths in the study: energetic first-person combat “barks” and taunts. Noted weaknesses: struggled in tougher scenarios; reused voices and weaker coordination.

The focus wasn’t just “who won more fights.” The team wanted to know how these systems behave under pressure, when resources run low, and when cooperation or bold play really matters.

What combat taught researchers about AI decision-making

One major test was how the models handled limited resources. In D&D, spell slots, special abilities and healing potions are finite. Players usually ration them, saving powerful moves for the moments that really count.

Because these simulations were isolated encounters rather than full campaigns, there was almost no reason to hoard resources for later. Spending big early often meant the best outcome.
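A toy expected-value calculation makes the incentive concrete. The probabilities below are invented for illustration; the point is only that a saved resource is worth something in a campaign but nothing in a one-off encounter.

```python
def encounter_value(win_prob: float, slot_saved: bool, future_value: float) -> float:
    """Value = chance of winning this fight, plus whatever a saved
    spell slot is worth in later encounters (zero if spent)."""
    return win_prob + (future_value if slot_saved else 0.0)

# In a campaign, a saved slot has future value: spending vs saving
# can be a genuine trade-off...
campaign_spend = encounter_value(win_prob=0.75, slot_saved=False, future_value=0.25)
campaign_save  = encounter_value(win_prob=0.50, slot_saved=True,  future_value=0.25)

# ...but in an isolated encounter the future value is zero,
# so spending big strictly dominates.
oneoff_spend = encounter_value(0.75, False, 0.0)
oneoff_save  = encounter_value(0.50, True,  0.0)

print(campaign_spend, campaign_save)  # 0.75 0.75 -> a real trade-off
print(oneoff_spend, oneoff_save)      # 0.75 0.5  -> spending wins
```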

In harder fights, Claude Haiku 3.5 was more willing to burn through valuable abilities, and that aggression paid off.

Claude generally achieved the best results in challenging scenarios, trading away long-term caution for immediate survival and team success. GPT-4 took a similar approach but lagged slightly behind in efficiency. DeepSeek-V3 tended to struggle most when the difficulty ramped up.

In easier battles, differences shrank. All three models conserved spells and items at similar rates, suggesting they defaulted to a cautious style unless strongly pushed.

Acting, not just calculating: keeping characters in character

The researchers also cared about role-play itself. They introduced an “Acting Quality” metric that measured how well each model stayed true to its character while speaking, and how many distinct voices it maintained when juggling multiple roles.
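The paper's actual Acting Quality formula isn't given here, but one crude proxy for the "distinct voices" component could be counting how often identical lines are reused across different characters. Everything below, including the function name, is a hypothetical sketch.

```python
from collections import defaultdict

def voice_reuse(lines: list[tuple[str, str]]) -> float:
    """lines: (character, utterance) pairs. Returns the fraction of
    utterances that also appear verbatim under another character's name
    (higher = less distinct voices)."""
    speakers = defaultdict(set)
    for who, what in lines:
        speakers[what].add(who)
    reused = sum(1 for _, what in lines if len(speakers[what]) > 1)
    return reused / len(lines)

log = [
    ("goblin", "Get them!"),
    ("rogue", "Get them!"),      # same line, different character
    ("paladin", "Stand firm, friends."),
]
print(voice_reuse(log))  # 2 of 3 lines are shared across characters
```

A real metric would likely compare style and vocabulary rather than exact strings, but the idea is the same: distinct characters should not sound interchangeable.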

DeepSeek-V3 produced lots of short, punchy lines in first person — things like “I dart left!” or “Get them!” This gave the fights an arcade feel but often reused the same tone regardless of the character.

Claude Haiku 3.5 leaned harder into persona. A holy paladin sounded formal and righteous, while a druid’s speech reflected a nature-loving outlook. GPT-4 landed between the two, mixing in-character narration with more meta comments about tactics and probabilities.

Some of the most vivid lines came from monsters, with goblins taunting heroes mid-battle: “Heh — shiny man’s gonna bleed!”

That emergent personality, especially from non-human characters, hints at how AI might shape the emotional atmosphere of future games, teaching tools or training simulations.

Why this matters beyond geek culture

As playful as it sounds, this work touches on serious questions: can AI systems coordinate over many steps, keep track of complex rules and act independently without constant human supervision?

The same skills needed to run a fictional battle also map onto real-world tasks. Examples the team highlighted include:

  • Coordinating supply chains where multiple agents manage inventory, shipping and production
  • Planning manufacturing lines that must react to delays and equipment failures
  • Simulating disaster response, where teams coordinate rescue, medical support and logistics
  • Search-and-rescue operations using fleets of drones or robots, each with partial information

In all these settings, models must remember what just happened, share information, respect constraints and act in a way humans can understand. D&D’s structured chaos offers a way to benchmark that without risking real lives or money.

Human–AI teamwork at the table and beyond

Because D&D is social at its core, it also serves as a test for mixed teams of human and artificial players. An AI DM can guide human adventurers. AI-controlled party members can support human teammates, or vice versa.

This raises new design questions: should AI party members be ultra-practical, or should they sometimes make flawed, human-like choices to keep things fun? How much autonomy should an AI DM have to surprise players?

The researchers see D&D as a way to study how much independence people are comfortable giving AI collaborators.

That comfort level will matter in future workplaces. Imagine an AI logistics coordinator suggesting route changes for lorries, or a “copilot” running parts of a hospital scheduling system. Trust will depend on predictable behaviour, clear communication and a sense that the system is working with people, not around them.

Next step: full campaigns and creative pressure

So far, the framework focuses on combat. The team now wants to stress-test models with entire campaigns, where story decisions, social encounters and improvisation matter just as much as tactics.

That shift would require AI to juggle multiple plot threads, maintain continuity over many sessions, and handle unexpected player choices without breaking the fiction. It also demands more subtle social reasoning: reading intentions, negotiating, bluffing and resolving conflicts between characters.

As these experiments grow, they may reveal where current models hit their limits: perhaps keeping track of long-running story arcs, or managing multiple human players with different goals and play styles.

Key concepts worth unpacking

A few terms used in this research are already leaking into everyday AI discussions:

  • Long-horizon planning: making decisions that only pay off several steps later, such as using a rare spell now to stop a fight spiralling out of control.
  • Multi-agent systems: situations where several AI models work together, like party members in a D&D group or robot teams in a warehouse.
  • Tool use: AIs calling external systems — anything from dice-rolling functions to mapping software or databases — as part of solving a problem.
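The "tool use" idea from the list above can be shown with the simplest D&D tool there is: a dice roller the model calls instead of inventing numbers. This is a minimal sketch, not the framework's actual tooling; the parser handles only simple "NdM+K" expressions.

```python
import random

def roll(expr: str, rng: random.Random) -> int:
    """Evaluate a dice expression like '1d20+5' or '2d6'.
    In a tool-use setup, the model emits this string and the
    harness executes it, returning the result to the model."""
    dice, _, bonus = expr.partition("+")
    n, sides = dice.split("d")
    total = sum(rng.randint(1, int(sides)) for _ in range(int(n)))
    return total + (int(bonus) if bonus else 0)

rng = random.Random(42)       # seeded for reproducible tests
result = roll("1d20+5", rng)  # always between 6 and 25
```

Delegating the roll to a trusted function matters: it keeps the model honest about randomness and gives every player, human or AI, the same verifiable mechanics.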

D&D bundles these ideas into a format people instinctively understand. That makes it useful for testing, but also for teaching people how AI thinks, where it fails, and how it might complement human judgement instead of replacing it.

Future scenarios: from fantasy taverns to real emergencies

Imagine an emergency-management exercise run like a D&D session. AI agents control virtual fire crews, medical teams and traffic systems. Human decision-makers issue high-level instructions, while the AI fills in the granular steps and chatter in real time.

The same underlying mechanics already being tested with goblins and paladins could underpin those simulations. Success would mean smoother cooperation between human leaders and AI helpers when real disasters strike.

There are risks alongside the benefits. Over-reliance on AI “party members” could make humans less practiced at strategic thinking. Poorly designed agents might coordinate too well with each other while ignoring human input. Studies like D&D Agents give researchers a safe space to spot those failure modes early.

For now, the battlefield is a fantasy cave, not a flooded city. But every time an AI goblin cackles or a digital paladin risks a precious spell for the team, researchers gain a little more insight into what shared decision-making with machines might look like in the years ahead.
