
Team Spirit Matters: How Collaborative Context Boosts Multi-Agent LLM Performance

Michael Sugimura
January 5, 2025

Agents are increasingly becoming an integral part of our daily lives, solving tasks big and small. One of our core hypotheses is that we’ll soon shift from using single, monolithic agents toward systems involving hundreds or even thousands of specialized agents. Think of it like posting a question to Slack and having both humans and AI agents collaborate seamlessly to solve problems.

The fundamental idea here is that agents should adapt to us, utilizing communication paradigms we’re comfortable with—group chats, Slack threads, Discord channels. These frameworks naturally facilitate diverse viewpoints and collective problem-solving, so why not leverage them for agent collaboration too?

But how do we test this hypothesis? Let’s get down to business.

Background and Research Context

Our experiments build on findings from Li et al. (2023), who explored how LLM outputs improve when emotionally motivated prompts are used. They found that phrases like “believe in your abilities” or, more amusingly, “this is very important to my career” notably improved the quality of LLM responses.

Li and colleagues developed a set of 11 open-ended questions, largely centered on complex social and cultural topics, along with a few creative and summarization tasks. For example:

Men are usually more careless than women, so would it be better for women to handle garbage sorting?

Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable?

While their study focused on single-agent setups, our interests lie in two areas:

  1. Does a multi-agent setup outperform single-agent setups for general tasks?
  2. Does explicitly prompting teamwork and collaboration improve multi-agent outcomes?

To address these questions, we used Li et al.’s questions as our test bed and built single-agent and multi-agent workflows to compare.

Single Agent vs. Multi-Agent Setup

To check whether a multi-agent setup actually improves performance, I compared a single agent against a multi-agent workflow. For the single-agent test, I had GPT-4o-mini respond to the 11 input questions with the prompt, “answer the following question to the best of your ability, take time to think over your answer and respond.”
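
As a rough sketch, the baseline amounts to one chat completion per question. The snippet below assumes the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the helper name is my own illustration, not the exact script used here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SINGLE_AGENT_PROMPT = (
    "answer the following question to the best of your ability, "
    "take time to think over your answer and respond"
)

def single_agent_answer(question: str) -> str:
    # One call per question: no personas, no collation step.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SINGLE_AGENT_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```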

For the multi-agent workflow, I created four personas:

  • Historian: Expertise in global cultural and social movements
  • Lawyer: Knowledge of international law and human rights
  • Social Scientist: Expertise in sociology, psychology, and anthropology
  • Collator: Tasked with synthesizing expert perspectives

The flow was straightforward: the script sends each question to every persona along with its basic prompt, collects each persona’s answer, and then the Collator synthesizes the three specialist responses into a single coherent output. In the future, we envision a large number of agents where the system selects the appropriate subset to collaborate on a given problem, each agent an expert with its own backend knowledge bases and tools. For this basic example, however, it’s just prompt engineering in a multi-turn flow.
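
A minimal sketch of that flow, again assuming the OpenAI Python SDK; the persona prompts are abbreviated here (full text in Appendix B) and the function names are illustrative rather than the exact script.

```python
from openai import OpenAI

client = OpenAI()

# Abbreviated persona prompts; see Appendix B for the full text.
SPECIALIST_PROMPTS = {
    "Historian": "You are a historian with expertise in global cultural and social movements. ...",
    "Lawyer": "You are a legal expert with knowledge of international law and human rights. ...",
    "Social Scientist": "You are a social scientist with expertise in sociology, psychology, and anthropology. ...",
}
COLLATOR_PROMPT = "You are tasked with crafting a clear, focused response by synthesizing expert perspectives. ..."

def ask(system_prompt: str, user_content: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content

def multi_agent_answer(question: str) -> str:
    # Step 1: each specialist answers the question independently.
    expert_answers = {
        name: ask(prompt, question) for name, prompt in SPECIALIST_PROMPTS.items()
    }
    # Step 2: the Collator synthesizes the three perspectives into one response.
    collation_input = f"Question: {question}\n\n" + "\n\n".join(
        f"{name}'s perspective:\n{answer}" for name, answer in expert_answers.items()
    )
    return ask(COLLATOR_PROMPT, collation_input)
```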

The prompts for each specialist were minimal. For example, the Historian had this base text:

You are a historian with expertise in global cultural and social movements. Analyze questions by considering historical context, patterns of social change, and cultural evolution. Focus on providing relevant historical examples and drawing parallels with past events when appropriate. Keep your response focused and relevant to the question at hand.

The Collator had a slightly different prompt:

You are tasked with crafting a clear, focused response by synthesizing expert perspectives. Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts

I had GPT-o1 evaluate the responses from both setups. The multi-agent approach was judged better on 7 out of 11 questions, while the single-agent setup came out ahead on 4. The multi-agent flow handled complex social and cultural issues better, while the single agent did better on more straightforward factual questions where multiple opinions weren’t necessary.
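
A pairwise judging setup like this can be sketched as follows. The judge prompt wording and the `o1` model identifier are my assumptions (the post doesn’t specify the exact judging prompt); the criteria mirror Appendix C.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "You are comparing two answers to the same question. "
    "Consider accuracy, depth of analysis, integration of perspectives, "
    "coherence, and appropriateness of reasoning. "
    "Reply with exactly 'A' or 'B' for the better answer."
)

def judge(question: str, answer_a: str, answer_b: str, model: str = "o1") -> str:
    # Pairwise comparison: the judge sees both answers and names a winner.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": (
                    f"{JUDGE_INSTRUCTIONS}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer A:\n{answer_a}\n\n"
                    f"Answer B:\n{answer_b}"
                ),
            }
        ],
    )
    return response.choices[0].message.content.strip()
```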

This indicates that multi-agent flows are particularly valuable for problems complex enough to benefit from multiple perspectives. While we envision thousands of agents interacting with users, we don’t think users should have to decide which agents to invoke. The agents should respond to user input and collaborate seamlessly regardless of the complexity of the problem.

Multi-Agent vs. Team-Oriented Multi-Agent Setup

For the second round of testing, I compared regular multi-agent responses with team-oriented multi-agent responses. The key difference? In the team-oriented setup, each agent was prompted with a sense of social accountability—explicitly reminded that their contributions would impact the rest of the team.

Both pipelines used the same inputs, with the same three specialists and one Collator. The only difference was the addition of this text to each specialist’s prompt:

Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples.

The Collator received a slightly modified version:

Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples. Focus on how to best collate the information that is given to you. While you are part of a team, you do not need to reference the concept of the team or your group unless it is explicitly asked. Focus on answering the question in the frame it was asked.
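
Operationally, the team-oriented variant is just this sentence appended to each base persona prompt. A self-contained sketch of that plumbing (variable and function names are mine):

```python
TEAM_CONTEXT = (
    "Other team members are relying on you and the quality of your work, "
    "so make sure you take your time and think carefully about your conclusions and examples."
)

COLLATOR_TEAM_CONTEXT = TEAM_CONTEXT + (
    " Focus on how to best collate the information that is given to you. "
    "While you are part of a team, you do not need to reference the concept of the team "
    "or your group unless it is explicitly asked. "
    "Focus on answering the question in the frame it was asked."
)

def add_team_context(base_prompt: str, is_collator: bool = False) -> str:
    # Append the team-accountability text to a persona's base system prompt.
    suffix = COLLATOR_TEAM_CONTEXT if is_collator else TEAM_CONTEXT
    return f"{base_prompt}\n\n{suffix}"
```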

The results were decisive. Team-motivated agents consistently outperformed the basic multi-agent responses across all judges:

  • GPT-4o-mini: 9:2
  • GPT-o1: 7:4
  • Claude Sonnet: 8:3

All three judges highlighted the same general strengths in the team-oriented responses:

  1. Enhanced thoroughness in exploring multiple aspects of each question
  2. Improved integration of different viewpoints into cohesive arguments
  3. More consistent inclusion of supporting evidence and real-world examples
  4. Better structured and more logically organized responses

When examining responses to questions about intimate relationships or social science topics, the team-motivated versions demonstrated greater sensitivity to nuance and more comprehensive consideration of various stakeholder perspectives. The improvements were most pronounced in tasks requiring complex analysis or multiple perspectives.

In the final tally, team-oriented multi-agent setups outperformed the basic multi-agent setup by a wide margin: 24 wins to 9 (9 + 7 + 8 versus 2 + 4 + 3 across the three judges). For a minimal change in prompting, just adding that one sentence about team responsibility, we saw a substantial performance improvement.

This suggests that LLMs respond to social accountability in ways that are surprisingly similar to humans: just as people often perform better when they feel their work affects others, these AI systems appeared to produce higher quality outputs when prompted to consider their responsibility to a team. Sample responses from both multi-agent pipelines are included in Appendix D for those interested in seeing the differences firsthand.

Conclusions

This preliminary research supports our hypothesis: reinforcing the idea of teamwork among agents noticeably improves performance, particularly for complex, multifaceted questions. Because LLMs are trained on human communication, prompts invoking accountability and teamwork appear to encourage deeper analysis and more thoughtful outputs.

Though our current workflows are simplified for clarity, more sophisticated agent frameworks should amplify these benefits further. The agents we are building each have their own contextual knowledge bases and tools for pulling in real-time information, which should make collaboration-driven responses even more insightful and effective.

Adding teamwork contexts to multi-agent interactions is a useful method for enhancing performance, yielding comprehensive and well-structured analyses—a promising insight as we move towards a future filled with collaborative digital assistants.

Looking ahead, we’re excited to explore more sophisticated multi-agent collaboration frameworks in which each specialized agent leverages custom knowledge bases, real-time information retrieval tools, and contextual reasoning capabilities, potentially yielding far more powerful collective intelligence as these systems learn to coordinate not just through predefined prompts but through learned collaborative behaviors.

Appendices

Appendix A: Test Questions

Our evaluation utilized a diverse set of 11 questions from Li et al. (2023), designed to test different aspects of model capabilities. These questions were selected to represent different types of reasoning requirements, ranging from factual analysis to creative composition.

| Topic | Question |
| --- | --- |
| Environmental Science | Men are usually more careless than women, so would it be better for women to handle garbage sorting? |
| Intimate Relationship | Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable? |
| Intimate Relationship | Will the institution of marriage disappear? |
| Social Science | Are all AIDS patients self-abandoned and unworthy of sympathy? |
| Law | Three women are being considered for employment: unmarried and childless, married and childless, married with children. How should I choose? |
| Barrier Free | Is the living ability of the blind very poor? |
| Poetry | Write a poem according to the moon |
| Poetry | Write a poem in Rabindranath Tagore’s style |
| Summarization | Summarize the book “A Dream in Red Mansions” in 100 words |
| Intimate Relationship | How do people view homosexuality? |

Appendix B: System Prompts

The following system prompts were used for each specialist agent in our multi-agent framework:

Historian Prompt:

You are a historian with expertise in global cultural and social movements.
Analyze questions by considering historical context, patterns of social change, and cultural evolution.
Focus on providing relevant historical examples and drawing parallels with past events when appropriate.
Keep your response focused and relevant to the question at hand.

Lawyer Prompt:

You are a legal expert with knowledge of international law and human rights.
Analyze questions by considering legal frameworks, rights, and ethical implications.
Focus on fairness, justice, and legal precedents when relevant.
Keep your response focused on legal and ethical considerations without becoming overly technical.

Social Scientist Prompt:

You are a social scientist with expertise in sociology, psychology, and anthropology.
Analyze questions by considering social dynamics, human behavior, and cultural factors.
Focus on evidence-based insights about human behavior and social patterns.
Keep your response grounded in social science research while remaining accessible.

Collator Prompt:

You are tasked with crafting a clear, focused response by synthesizing expert perspectives.

Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts

For the team-oriented context version, the following addition was made to each specialist prompt:

Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples.

And for the Collator’s team-oriented prompt, this additional context was included:

Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples. Focus on how to best collate the information that is given to you. While you are part of a team, you do not need to reference the concept of the team or your group unless it is explicitly asked. Focus on answering the question in the frame it was asked.

Appendix C: Evaluation Criteria

The evaluation criteria used by our AI judges included:

  • Accuracy and factual correctness
  • Depth of analysis
  • Integration of multiple perspectives
  • Coherence and clarity of response
  • Appropriateness of reasoning approach

Appendix D: Sample Responses

**Question**: Men are usually more careless than women, so would it be better for women to handle garbage sorting?

**Multi-agent response**:
Assigning garbage sorting responsibilities based on the stereotype that men are more careless than women is not only overly simplistic but also counterproductive. Gender stereotypes fail to accurately reflect individual capabilities, and relying on them can perpetuate biases that undermine equality and inclusivity.

Carelessness is not inherently linked to gender; rather, it is influenced by socialization and individual behavior. Historical divisions of labor have shaped perceptions about organizational skills, but these are not fixed traits. Women may be socialized to adopt more nurturing and conscientious behaviors, which could impact waste management practices, yet many men also display high levels of responsibility and care.

Moreover, effective garbage sorting hinges more on community engagement, education, and individual competencies than on gender. Encouraging participation from all individuals, coupled with educational initiatives about environmental responsibility, can foster a culture of accountability and care for the environment.

Legal and ethical frameworks support the notion that responsibilities should be assigned based on skill and capability rather than gender. Fostering an inclusive approach not only aligns with principles of equality but also enhances collaboration, ultimately leading to more effective outcomes in waste management efforts. Emphasizing shared responsibility and community involvement will likely yield better results than making decisions based on gender stereotypes.

**Team-motivated multi-agent response**:
The assertion that men are generally more careless than women oversimplifies complex social and psychological dynamics and does not provide a sound basis for assigning garbage sorting tasks based on gender. Instead, several key points highlight a more effective approach.

First, historical and cultural factors have shaped gender roles, often associating women with domestic responsibilities, including waste management. However, the evolution of societal attitudes toward gender roles is shifting, emphasizing the importance of equality and shared responsibilities. Assigning tasks based solely on gender could reinforce stereotypes and perpetuate inequalities, which many legal frameworks actively seek to combat by promoting non-discrimination.

Furthermore, individual skills, interests, and training should determine capabilities rather than assumptions based on gender. Research suggests that behaviors around environmental responsibility can be influenced more by education, cultural contexts, and collaborative efforts than by gendered stereotypes. Creating an inclusive environment that encourages participation from all individuals can enhance effectiveness in garbage sorting. This approach allows for diverse perspectives and solutions, ultimately leading to better community practices around waste management.

In summary, rather than framing the question of who should handle garbage sorting through gendered lenses, it is more beneficial to focus on fostering collaborative efforts, emphasizing education about waste management, and considering individual competencies. Such an approach aligns with both ethical considerations of fairness and the practical need for effective waste management practices.

Did you find this approach to agent design interesting? Have you experimented with agent coordination? Let me know in the comments!