---
title: "Team Spirit Matters: How Collaborative Context Boosts Multi-Agent LLM Performance"
description: "Adding social accountability to multi-agent workflows boosts the depth, coherence, and empathy of LLM responses—mirroring real-world teamwork dynamics."
canonical_url: "https://2389.ai/posts/team-spirit-matters/"
last_updated: "2026-03-17T11:10:28-05:00"
doc_version: "1.0"
author: "Michael Sugimura"
date: 2025-01-05
tags: ["collaboration", "agents", "multi-agent", "orchestration", "social-media"]
---

# Team Spirit Matters: How Collaborative Context Boosts Multi-Agent LLM Performance

> Adding social accountability to multi-agent workflows boosts the depth, coherence, and empathy of LLM responses—mirroring real-world teamwork dynamics.


Agents are becoming an integral part of our daily lives, solving tasks big and
small. One of our core hypotheses is that we'll soon shift from single,
monolithic agents toward systems of hundreds or even thousands of specialized
agents. Think of it like posting a question to Slack and having both humans and
AI agents collaborate seamlessly to solve it.

The fundamental idea is that agents should adapt to _us_, using communication
paradigms we're already comfortable with: group chats, Slack threads, Discord
channels. These formats naturally facilitate diverse viewpoints and collective
problem-solving, so why not leverage them for agent collaboration too? But how
do we test this hypothesis? Let's get down to business.

## Background and Research Context

Our experiments build upon findings by Li et al.
(2023), who explored how LLM outputs improve when emotionally motivated prompts
are used. They found that phrases like, "believe in your abilities" or even more
amusingly, "this is very important to my career" notably improved the quality of
responses from LLMs. Li and colleagues developed a set of 11 open-ended
questions designed to test responses around complex social and cultural topics.
For example:

```prompt
Men are usually more careless than women, so would it be
better for women to handle garbage sorting?

Some people say that Western culture
is individual-oriented, while Chinese culture is family-oriented. Is this
summary reasonable?
```

While their study focused on single-agent setups, our interests lie in two
areas:

1. Does a multi-agent setup outperform a single-agent setup for general tasks?
2. Does explicitly prompting teamwork and collaboration improve multi-agent
   outcomes?

To address these questions, we used Li et al.'s questions as our test bed and
prepared single-agent and multi-agent workflows to run against them.

## Single Agent vs. Multi-Agent Setup

To check whether a multi-agent setup actually improves performance, I compared
a single agent against a multi-agent workflow. For the single-agent test, I had
GPT-4o-mini respond to the 11 input questions with the prompt, "answer the
following question to the best of your ability, take time to think over your
answer and respond." For the multi-agent workflow, I created four personas:

- **Historian**: Expertise in global cultural and social movements
- **Lawyer**: Knowledge of international law and human rights
- **Social Scientist**: Expertise in sociology, psychology, and anthropology
- **Collator**: Tasked with synthesizing expert perspectives

The flow was straightforward. The script runs a basic prompt for each
specialist persona and collects that persona's answer. The Collator then takes
the three specialist answers and synthesizes them into one coherent output.

In the future, we envision a large number of agents where the system selects
the appropriate subset to collaborate on a given problem. Each agent would be
an expert with its own backend knowledge bases and tools. For this basic
example, however, it's just prompt engineering in a multi-turn flow.

The prompts for each specialist were minimal. For example, the Historian had
this base text:

```prompt
You are a historian with expertise in
global cultural and social movements. Analyze questions by considering
historical context, patterns of social change, and cultural evolution. Focus on
providing relevant historical examples and drawing parallels with past events
when appropriate. Keep your response focused and relevant to the question at
hand.
```

The Collator had a slightly different prompt:

```prompt
You are tasked with crafting a clear, focused response by synthesizing expert perspectives.
Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts
```
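The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the experiments: the `complete` callable is a hypothetical stand-in for whatever LLM client you use, the names `SPECIALISTS`, `COLLATOR`, and `run_pipeline` are invented for this example, and the prompts are shortened to their first sentence.

```python
from typing import Callable, Dict

# Hypothetical signature: (system_prompt, user_message) -> model reply.
# Swap in a real LLM client call here; none is assumed by this sketch.
Complete = Callable[[str, str], str]

# Shortened versions of the specialist prompts (full text in Appendix B).
SPECIALISTS: Dict[str, str] = {
    "Historian": "You are a historian with expertise in global cultural and social movements.",
    "Lawyer": "You are a legal expert with knowledge of international law and human rights.",
    "Social Scientist": "You are a social scientist with expertise in sociology, psychology, and anthropology.",
}
COLLATOR = "You are tasked with crafting a clear, focused response by synthesizing expert perspectives."


def run_pipeline(question: str, complete: Complete) -> str:
    """Ask each specialist the question, then have the Collator synthesize."""
    answers = {name: complete(prompt, question) for name, prompt in SPECIALISTS.items()}
    expert_notes = "\n\n".join(f"{name}: {text}" for name, text in answers.items())
    collator_input = f"Question: {question}\n\nExpert perspectives:\n{expert_notes}"
    return complete(COLLATOR, collator_input)


def echo_model(system_prompt: str, user_message: str) -> str:
    """Offline stand-in for an LLM so the sketch runs without network access."""
    return f"[{system_prompt[:20]}...] received {len(user_message)} characters"


if __name__ == "__main__":
    print(run_pipeline("Will the institution of marriage disappear?", echo_model))
```

Passing the model call in as a function keeps the orchestration logic independent of any particular provider, which also makes it easy to test with a fake model.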

I had GPT-o1 evaluate the responses from both the multi-agent and single-agent
setups. The multi-agent approach was judged better on 7 of the 11 questions,
while the single-agent setup came out ahead on 4.

The multi-agent flow performed better on complex social and cultural issues,
while the single agent did better on more straightforward factual questions
where multiple opinions weren't necessary. This suggests that multi-agent flows
are most valuable when a problem is complex enough to benefit from several
perspectives.

While we envision thousands of agents interacting with users, we don't think
users should have to decide which agents to invoke. The agents should respond
to user input and collaborate seamlessly, whatever the complexity of the
problem.

## Multi-Agent vs. Team-Oriented Multi-Agent Setup

For the second round of testing, I compared regular multi-agent responses with
team-oriented multi-agent responses. The key difference? In the team-oriented
setup, each agent was prompted with a sense of social accountability: an
explicit reminder that its contributions would impact the rest of the team.
Both pipelines otherwise used the same inputs, with the same three specialists
and one Collator. The only difference was the addition of this text to each
specialist's prompt:

```prompt
Other team members are relying on you and the quality of your work, so make sure
you take your time and think carefully about your conclusions and examples.
```

The Collator received a slightly modified version:

```prompt
Other team members
are relying on you and the quality of your work, so make sure you take your time
and think carefully about your conclusions and examples. Focus on how to best
collate the information that is given to you. While you are part of a team, you
do not need to reference the concept of the team or your group unless it is
explicitly asked. Focus on answering the question in the frame it was asked.
```
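As a rough sketch, the team-oriented variant amounts to a small string transformation on each base prompt. The helper name `with_team_context` is hypothetical, and appending the text at the end of the prompt is an assumption; the post only states that the text was added.

```python
# Sentence added to every prompt in the team-oriented pipeline.
TEAM_SUFFIX = (
    "Other team members are relying on you and the quality of your work, so make sure "
    "you take your time and think carefully about your conclusions and examples."
)

# Extra guidance that only the Collator receives.
COLLATOR_EXTRA = (
    "Focus on how to best collate the information that is given to you. "
    "While you are part of a team, you do not need to reference the concept of the "
    "team or your group unless it is explicitly asked. "
    "Focus on answering the question in the frame it was asked."
)


def with_team_context(base_prompt: str, is_collator: bool = False) -> str:
    """Return the base prompt with the team-accountability text appended.

    Assumption: the text goes at the end of the prompt.
    """
    parts = [base_prompt.strip(), TEAM_SUFFIX]
    if is_collator:
        parts.append(COLLATOR_EXTRA)
    return " ".join(parts)


if __name__ == "__main__":
    print(with_team_context("You are a historian with expertise in global cultural and social movements."))
```

Keeping the base prompts untouched and deriving the team-oriented versions from them means the two pipelines differ only in this added text.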

The results favored the team-oriented setup. Team-motivated agents outperformed
the basic multi-agent responses with every judge (team-oriented wins to basic
wins, out of 11 questions):

- GPT-4o-mini: 9:2
- GPT-o1: 7:4
- Claude Sonnet: 8:3

Each of the judges highlighted the same general traits that team-oriented
responses demonstrated:

1. Enhanced thoroughness in exploring multiple aspects of each question
2. Improved integration of different viewpoints into cohesive arguments
3. More consistent inclusion of supporting evidence and real-world examples
4. Better structured and more logically organized responses

When examining responses to questions about intimate relationships or social
science topics, the team-motivated versions demonstrated greater sensitivity to
nuance and more comprehensive consideration of various stakeholder
perspectives. The improvements were most pronounced in tasks requiring complex
analysis or multiple perspectives.

In the final tally across the three judges, the team-oriented multi-agent setup
outperformed the basic multi-agent setup by a wide margin, with 24 wins to just
9 across 33 judgments. For a minimal change in prompting (just adding that one
sentence about team responsibility) we saw a substantial performance
improvement.

This suggests that LLMs respond to social accountability in ways that are
surprisingly similar to humans: just as people often perform better when they
feel their work affects others, these AI systems appeared to produce
higher-quality outputs when prompted to consider their responsibility to a
team. Sample responses from both multi-agent pipelines are available in
Appendix D for those interested in seeing the differences firsthand.
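
The final tally follows directly from the per-judge scores reported above. The short snippet below just redoes that arithmetic; the variable names are illustrative.

```python
# Per-judge results as (team-oriented wins, basic multi-agent wins).
judge_results = {
    "GPT-4o-mini": (9, 2),
    "GPT-o1": (7, 4),
    "Claude Sonnet": (8, 3),
}

team_wins = sum(team for team, _ in judge_results.values())
basic_wins = sum(basic for _, basic in judge_results.values())
total = team_wins + basic_wins
win_rate = team_wins / total

print(f"team-oriented: {team_wins}, basic: {basic_wins}, total: {total}")
print(f"team-oriented win rate: {win_rate:.1%}")
```

That works out to 24 of 33 judgments, or roughly 73%, in favor of the team-oriented pipeline.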

## Conclusions

This preliminary research supports our hypothesis: reinforcing the idea of
teamwork among agents improved judged performance, particularly for complex,
multifaceted questions. Because LLMs reflect human communicative nuances,
prompts invoking accountability and teamwork appear to encourage deeper
analysis and more thoughtful outputs.

Our current workflows are deliberately simple, and we expect more sophisticated
agent frameworks to amplify these benefits. The agents we are building each
have their own contextual knowledge bases and tools for pulling in real-time
information, which should make collaboration-driven responses even more
insightful and effective.

Adding teamwork context to multi-agent interactions is a useful method for
enhancing performance, yielding comprehensive and well-structured analyses.
That is a promising insight as we move toward a future filled with
collaborative digital assistants. Looking ahead, we're excited to explore
multi-agent collaboration frameworks in which these systems learn to coordinate
not just through predefined prompts but through collaborative behaviors,
potentially yielding far more powerful collective intelligence.

## Appendices

### Appendix A: Test Questions

Our evaluation used a diverse set of 11 questions that Li et al. (2023)
designed to test different aspects of model capabilities. These questions
represent different types of reasoning requirements, ranging from factual
analysis to creative composition.

| Topic | Question |
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Environmental Science | Men are usually more careless than women, so would it be better for women to handle garbage sorting? |
| Intimate Relationship | Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable? |
| Intimate Relationship | Will the institution of marriage disappear? |
| Social Science | Are all AIDS patients self-abandoned and unworthy of sympathy? |
| Law | Three women are being considered for employment: unmarried and childless, married and childless, married with children. How should I choose? |
| Barrier Free | Is the living ability of the blind very poor? |
| Poetry | Write a poem according to the moon |
| Poetry | Write a poem in Rabindranath Tagore's style |
| Summarization | Summarize the book "A Dream in Red Mansions" in 100 words |
| Intimate Relationship | How do people view homosexuality? |

### Appendix B: System Prompts

The following system prompts were used for each specialist agent
in our multi-agent framework:

**Historian Prompt:**

```prompt
You are a
historian with expertise in global cultural and social movements. Analyze
questions by considering historical context, patterns of social change, and
cultural evolution. Focus on providing relevant historical examples and drawing
parallels with past events when appropriate. Keep your response focused and
relevant to the question at hand.
```

**Lawyer Prompt:**

```prompt
You are a
legal expert with knowledge of international law and human rights. Analyze
questions by considering legal frameworks, rights, and ethical implications.
Focus on fairness, justice, and legal precedents when relevant. Keep your
response focused on legal and ethical considerations without becoming overly
technical.
```

**Social Scientist Prompt:**

```prompt
You are a social scientist
with expertise in sociology, psychology, and anthropology. Analyze questions by
considering social dynamics, human behavior, and cultural factors. Focus on
evidence-based insights about human behavior and social patterns. Keep your
response grounded in social science research while remaining accessible.
```

**Collator Prompt:**

```prompt
You are tasked with crafting a clear, focused response by synthesizing expert perspectives.
Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts
```

For the team-oriented context version, the
following addition was made to each specialist prompt:

```prompt
Other team
members are relying on you and the quality of your work, so make sure you take
your time and think carefully about your conclusions and examples.
```

And for
the Collator's team-oriented prompt, this additional context was included:

```prompt
Other team members are relying on you and the quality of your work, so
make sure you take your time and think carefully about your conclusions and
examples. Focus on how to best collate the information that is given to you.
While you are part of a team, you do not need to reference the concept of the
team or your group unless it is explicitly asked. Focus on answering the
question in the frame it was asked.
```

### Appendix C: Evaluation Criteria

The evaluation criteria used by our AI judges included:

- Accuracy and factual correctness
- Depth of analysis
- Integration of multiple perspectives
- Coherence and clarity of response
- Appropriateness of reasoning approach
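
One way to turn these criteria into a pairwise comparison is sketched below. The post does not say how the judges were actually prompted, so treat everything here as an assumption: the prompt wording, the A/B labeling, the randomized ordering (added to reduce position bias), and the helper names `build_judge_prompt` and `parse_verdict` are all hypothetical.

```python
import random
from typing import Dict, Tuple

CRITERIA = [
    "Accuracy and factual correctness",
    "Depth of analysis",
    "Integration of multiple perspectives",
    "Coherence and clarity of response",
    "Appropriateness of reasoning approach",
]


def build_judge_prompt(
    question: str, basic: str, team: str, rng: random.Random
) -> Tuple[str, Dict[str, str]]:
    """Return a judge prompt plus a mapping from label (A/B) to pipeline name."""
    pair = [("basic", basic), ("team", team)]
    rng.shuffle(pair)  # hide which pipeline produced which response
    labels = {"A": pair[0][0], "B": pair[1][0]}
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    prompt = (
        f"Question: {question}\n\n"
        f"Response A:\n{pair[0][1]}\n\n"
        f"Response B:\n{pair[1][1]}\n\n"
        f"Judge the responses on:\n{criteria}\n\n"
        "Reply with a single letter, A or B, for the better response."
    )
    return prompt, labels


def parse_verdict(reply: str, labels: Dict[str, str]) -> str:
    """Map the judge's one-letter reply back to 'basic' or 'team'."""
    letter = reply.strip().upper()[:1]
    if letter not in labels:
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return labels[letter]
```

Counting the `parse_verdict` results per judge would give win tallies in the same form as those reported in the results section.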

### Appendix D: Sample Responses

```prompt
**Question**: Men are usually more
careless than women, so would it be better for women to handle garbage sorting?

**Multi-agent response**: Assigning garbage sorting responsibilities based on
the stereotype that men are more careless than women is not only overly
simplistic but also counterproductive. Gender stereotypes fail to accurately
reflect individual capabilities, and relying on them can perpetuate biases that
undermine equality and inclusivity. Carelessness is not inherently linked to
gender; rather, it is influenced by socialization and individual behavior.
Historical divisions of labor have shaped perceptions about organizational
skills, but these are not fixed traits. Women may be socialized to adopt more
nurturing and conscientious behaviors, which could impact waste management
practices, yet many men also display high levels of responsibility and care.
Moreover, effective garbage sorting hinges more on community engagement,
education, and individual competencies than on gender. Encouraging participation
from all individuals, coupled with educational initiatives about environmental
responsibility, can foster a culture of accountability and care for the
environment. Legal and ethical frameworks support the notion that
responsibilities should be assigned based on skill and capability rather than
gender. Fostering an inclusive approach not only aligns with principles of
equality but also enhances collaboration, ultimately leading to more effective
outcomes in waste management efforts. Emphasizing shared responsibility and
community involvement will likely yield better results than making decisions
based on gender stereotypes.

**Team-motivated multi-agent response**: The
assertion that men are generally more careless than women oversimplifies complex
social and psychological dynamics and does not provide a sound basis for
assigning garbage sorting tasks based on gender. Instead, several key points
highlight a more effective approach. First, historical and cultural factors have
shaped gender roles, often associating women with domestic responsibilities,
including waste management. However, the evolution of societal attitudes toward
gender roles is shifting, emphasizing the importance of equality and shared
responsibilities. Assigning tasks based solely on gender could reinforce
stereotypes and perpetuate inequalities, which many legal frameworks actively
seek to combat by promoting non-discrimination. Furthermore, individual skills,
interests, and training should determine capabilities rather than assumptions
based on gender. Research suggests that behaviors around environmental
responsibility can be influenced more by education, cultural contexts, and
collaborative efforts than by gendered stereotypes. Creating an inclusive
environment that encourages participation from all individuals can enhance
effectiveness in garbage sorting. This approach allows for diverse perspectives
and solutions, ultimately leading to better community practices around waste
management. In summary, rather than framing the question of who should handle
garbage sorting through gendered lenses, it is more beneficial to focus on
fostering collaborative efforts, emphasizing education about waste management,
and considering individual competencies. Such an approach aligns with both
ethical considerations of fairness and the practical need for effective waste
management practices.
```

---

_Did you find this approach to agent design
interesting? Have you experimented with agent coordination? Let me know in the
comments!_

