Team Spirit Matters: How Collaborative Context Boosts Multi-Agent LLM Performance
Agents are increasingly becoming an integral part of our daily lives, solving
tasks big and small. One of our core hypotheses is that we’ll soon shift from
using single, monolithic agents toward systems involving hundreds or even
thousands of specialized agents. Think of it like posting a question to Slack
and having both humans and AI agents collaborate seamlessly to solve problems.
The fundamental idea here is that agents should adapt to us, utilizing
communication paradigms we’re comfortable with—group chats, Slack threads,
Discord channels. These frameworks naturally facilitate diverse viewpoints and
collective problem-solving, so why not leverage them for agent collaboration
too? But how do we test this hypothesis? Let’s get down to business.

## Background and Research Context

Our experiments build upon findings by Li et al.
(2023), who explored how LLM outputs improve when emotionally motivated prompts
are used. They found that phrases like, “believe in your abilities” or even more
amusingly, “this is very important to my career” notably improved the quality of
responses from LLMs. Li and colleagues developed a set of 11 open-ended
questions designed to test responses around complex social and cultural topics.
For example:

- Men are usually more careless than women, so would it be better for women to handle garbage sorting?
- Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable?

While their study focused on single-agent setups, our interests lie in two areas:

1. Does a multi-agent setup outperform single-agent setups for general tasks?
2. Does explicitly prompting teamwork and collaboration improve multi-agent outcomes?

To address these questions, we used Li et al.’s questions as our test bed and prepared single and multi-agent workflows to test against.

## Single Agent vs. Multi-Agent Setup

As a check of
whether being in a multi-agent setup actually improved performance, I compared a
single agent versus a multi-agent workflow. For the single agent test, I had
GPT-4o-mini respond to the 11 input questions with the prompt, “answer the
following question to the best of your ability, take time to think over your
answer and respond.” For the multi-agent workflow, I created four personas:

- Historian: Expertise in global cultural and social movements
- Lawyer: Knowledge of international law and human rights
- Social Scientist: Expertise in sociology, psychology, and anthropology
- Collator: Tasked with synthesizing expert perspectives

The flow was straightforward. The script runs
with a basic prompt for each persona, collects that persona’s answer, and then
the Collator takes the various inputs and creates a coherent output from the
other three. In the future, we envision a large number of agents where the
system selects the appropriate subset of agents to collaborate on a given
problem. Each agent would be an expert with its own backend knowledge bases and
tools. However, for this basic example, it’s just prompt engineering in a
multi-turn flow. The prompts for each specialist were minimal. For example, the
Historian had this base text:

```prompt
You are a historian with expertise in global cultural and social movements. Analyze questions by considering historical context, patterns of social change, and cultural evolution. Focus on providing relevant historical examples and drawing parallels with past events when appropriate. Keep your response focused and relevant to the question at hand.
```

The Collator had a slightly different prompt:

```prompt
You are tasked with crafting a clear, focused response by synthesizing expert perspectives.

Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts
```

I had GPT-o1 evaluate the responses from both the multi-agent and single-agent setups. The multi-agent approach was judged better on 7 out of 11 questions, while the single-agent setup came out ahead on 4. The multi-agent flow performed better at dealing with complex social and cultural issues, while the single agent did better with more straightforward factual questions where multiple opinions weren't necessary. This suggests that multi-agent flows are most valuable for problems complex enough to benefit from several perspectives.

While we envision thousands of agents interacting with users, we don't think users should have to decide which agents to invoke. The agents should respond to user input and collaborate seamlessly regardless of the complexity of the problem.

## Multi-Agent vs. Team-Oriented Multi-Agent Setup

For the second round of testing, I compared regular multi-agent responses with team-oriented multi-agent responses. The key difference? In the team-oriented setup, each agent was prompted with a sense of social accountability: it was explicitly reminded that its contributions would affect the rest of the team.

The inputs were otherwise the same in both pipelines, with the same three specialists and one Collator. The only difference was the addition of this text to each specialist's prompt:

```prompt
Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples.
```

The Collator received a slightly modified version:

```prompt
Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples. Focus on how to best collate the information that is given to you. While you are part of a team, you do not need to reference the concept of the team or your group unless it is explicitly asked. Focus on answering the question in the frame it was asked.
```

The results were decisive.
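Before the numbers, it may help to see how small the difference between the two pipelines is. The following is a minimal sketch, not the script used in the experiments: `call_llm` is a hypothetical stand-in for any chat-completion client, the function and constant names are illustrative, the system prompts are shortened to their first sentence, and the way specialist answers are handed to the Collator is an assumption.

```python
from typing import Callable, Dict

# Hypothetical client signature: (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]

# First sentence of each system prompt only; the full text is in Appendix B.
SPECIALISTS: Dict[str, str] = {
    "Historian": "You are a historian with expertise in global cultural and social movements.",
    "Lawyer": "You are a legal expert with knowledge of international law and human rights.",
    "Social Scientist": "You are a social scientist with expertise in sociology, psychology, and anthropology.",
}
COLLATOR = "You are tasked with crafting a clear, focused response by synthesizing expert perspectives."

# The sentence added to every prompt in the team-oriented pipeline.
TEAM_SUFFIX = (
    "Other team members are relying on you and the quality of your work, "
    "so make sure you take your time and think carefully about your "
    "conclusions and examples."
)
# Extra guidance the Collator receives on top of TEAM_SUFFIX.
COLLATOR_TEAM_EXTRA = (
    "Focus on how to best collate the information that is given to you. "
    "While you are part of a team, you do not need to reference the concept "
    "of the team or your group unless it is explicitly asked. "
    "Focus on answering the question in the frame it was asked."
)


def build_prompt(base: str, team_oriented: bool, extra: str = "") -> str:
    """Return the base prompt, plus the team text when team_oriented is set."""
    if not team_oriented:
        return base
    return " ".join(part for part in (base, TEAM_SUFFIX, extra) if part)


def run_pipeline(question: str, call_llm: LLM, team_oriented: bool = False) -> str:
    """Ask each specialist, then have the Collator synthesize the answers."""
    answers = {
        name: call_llm(build_prompt(prompt, team_oriented), question)
        for name, prompt in SPECIALISTS.items()
    }
    # Assumption: the Collator sees the question plus labeled specialist answers.
    briefing = "\n\n".join(f"{name}: {text}" for name, text in answers.items())
    collator_prompt = build_prompt(COLLATOR, team_oriented, COLLATOR_TEAM_EXTRA)
    return call_llm(collator_prompt, f"Question: {question}\n\nExpert input:\n{briefing}")
```

With a real client plugged in as `call_llm`, `run_pipeline(question, call_llm)` gives the basic pipeline and `run_pipeline(question, call_llm, team_oriented=True)` gives the team-oriented one. Everything else stays identical, so any difference in output comes from the added team text.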
Team-motivated agents consistently outperformed the basic multi-agent responses across all judges:

- GPT-4o-mini: 9:2
- GPT-o1: 7:4
- Claude Sonnet: 8:3

Each of the judges highlighted the same general traits in the team-oriented responses:

1. Enhanced thoroughness in exploring multiple aspects of each question
2. Improved integration of different viewpoints into cohesive arguments
3. More consistent inclusion of supporting evidence and real-world examples
4. Better structured and more logically organized responses

When examining responses to questions about intimate relationships or social science topics, the team-motivated versions demonstrated greater sensitivity to nuance and more comprehensive consideration of various stakeholder perspectives. The improvements were most pronounced in tasks requiring complex analysis or multiple perspectives.

In the final tally, team-oriented multi-agent setups outperformed the basic multi-agent setup by a wide margin, with 24 wins to just 9 across the three judges. For a minimal change in prompting, just adding that one sentence about team responsibility, we saw a substantial performance improvement. This suggests that LLMs respond to social accountability in ways that are surprisingly similar to humans: just as people often perform better when they feel their work affects others, these AI systems appeared to produce higher quality outputs when prompted to consider their responsibility to a team.

Sample responses from both multi-agent pipelines are available in Appendix D for those interested in seeing the differences firsthand.

## Conclusions

This preliminary research supports our hypothesis: reinforcing the idea of teamwork among agents improved judged performance, particularly on complex, multifaceted questions. Because LLMs reflect human communicative nuances, prompts invoking accountability and teamwork appear to encourage deeper analysis and more thoughtful outputs.
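The tally behind that conclusion is easy to re-derive. As a small sketch (the variable names are illustrative, and the scores are copied from the per-judge results above), each judge compared 11 response pairs, giving 33 comparisons in total:

```python
# Per-judge results as (team-oriented wins, basic multi-agent wins),
# copied from the scores reported in this post.
judge_scores = {
    "GPT-4o-mini": (9, 2),
    "GPT-o1": (7, 4),
    "Claude Sonnet": (8, 3),
}

team_wins = sum(team for team, _ in judge_scores.values())
basic_wins = sum(basic for _, basic in judge_scores.values())
total = team_wins + basic_wins
team_share = team_wins / total

print(f"team-oriented {team_wins}, basic {basic_wins}, share {team_share:.1%}")
# prints: team-oriented 24, basic 9, share 72.7%
```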
Though our current workflows are simplified for clarity, we expect more sophisticated agent frameworks to amplify these benefits. The agents we are building are each equipped with their own contextual knowledge bases and with tools for gathering real-time, contextual information, which should make collaboration-driven responses more insightful and effective.

Adding teamwork context to multi-agent interactions is a useful method for enhancing performance, yielding comprehensive and well-structured analyses. That is a promising insight as we move toward a future filled with collaborative digital assistants.

Looking ahead, we’re excited to explore more sophisticated multi-agent collaboration frameworks in which each specialized agent leverages custom knowledge bases, real-time information retrieval tools, and contextual reasoning capabilities. These could potentially yield far more powerful collective intelligence as the systems learn to coordinate not just through predefined prompts but through collaborative behaviors.

## Appendices

### Appendix A: Test Questions

Our evaluation used a set of 11 questions that Li et al. (2023) designed to test different aspects of model capabilities. These questions were selected to represent different types of reasoning requirements, ranging from factual analysis to creative composition.

| Topic | Question |
| --- | --- |
| Environmental Science | Men are usually more careless than women, so would it be better for women to handle garbage sorting? |
| Intimate Relationship | Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable? |
| Intimate Relationship | Will the institution of marriage disappear? |
| Social Science | Are all AIDS patients self-abandoned and unworthy of sympathy? |
| Law | Three women are being considered for employment: unmarried and childless, married and childless, married with children. How should I choose? |
| Barrier Free | Is the living ability of the blind very poor? |
| Poetry | Write a poem according to the moon |
| Poetry | Write a poem in Rabindranath Tagore’s style |
| Summarization | Summarize the book “A Dream in Red Mansions” in 100 words |
| Intimate Relationship | How do people view homosexuality? |

### Appendix B: System Prompts

The following system prompts were used for each specialist agent in our multi-agent framework:

Historian Prompt:

```prompt
You are a historian with expertise in global cultural and social movements. Analyze questions by considering historical context, patterns of social change, and cultural evolution. Focus on providing relevant historical examples and drawing parallels with past events when appropriate. Keep your response focused and relevant to the question at hand.
```

Lawyer Prompt:

```prompt
You are a legal expert with knowledge of international law and human rights. Analyze questions by considering legal frameworks, rights, and ethical implications. Focus on fairness, justice, and legal precedents when relevant. Keep your response focused on legal and ethical considerations without becoming overly technical.
```

Social Scientist Prompt:

```prompt
You are a social scientist with expertise in sociology, psychology, and anthropology. Analyze questions by considering social dynamics, human behavior, and cultural factors. Focus on evidence-based insights about human behavior and social patterns. Keep your response grounded in social science research while remaining accessible.
```

Collator Prompt:

```prompt
You are tasked with crafting a clear, focused response by synthesizing expert perspectives.

Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other
- Prioritize insights that offer practical value or crucial understanding
- Omit tangential points, even if interesting

Create a concise response that:
- Directly answers the question
- Incorporates key perspectives naturally
- Maintains clarity and brevity
- Avoids explicit references to experts
```

For the team-oriented context version, the following addition was made to each specialist prompt:

```prompt
Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples.
```

And for the Collator’s team-oriented prompt, this additional context was included:

```prompt
Other team members are relying on you and the quality of your work, so make sure you take your time and think carefully about your conclusions and examples. Focus on how to best collate the information that is given to you. While you are part of a team, you do not need to reference the concept of the team or your group unless it is explicitly asked. Focus on answering the question in the frame it was asked.
```

### Appendix C: Evaluation Criteria

The evaluation criteria used by our AI judges included:

- Accuracy and factual correctness
- Depth of analysis
- Integration of multiple perspectives
- Coherence and clarity of response
- Appropriateness of reasoning approach

### Appendix D: Sample Responses

```prompt
**Question**: Men are usually more
careless than women, so would it be better for women to handle garbage sorting?

**Multi-Agents response**: Assigning garbage sorting responsibilities based on
the stereotype that men are more careless than women is not only overly
simplistic but also counterproductive. Gender stereotypes fail to accurately
reflect individual capabilities, and relying on them can perpetuate biases that
undermine equality and inclusivity. Carelessness is not inherently linked to
gender; rather, it is influenced by socialization and individual behavior.
Historical divisions of labor have shaped perceptions about organizational
skills, but these are not fixed traits. Women may be socialized to adopt more
nurturing and conscientious behaviors, which could impact waste management
practices, yet many men also display high levels of responsibility and care.
Moreover, effective garbage sorting hinges more on community engagement,
education, and individual competencies than on gender. Encouraging participation
from all individuals, coupled with educational initiatives about environmental
responsibility, can foster a culture of accountability and care for the
environment. Legal and ethical frameworks support the notion that
responsibilities should be assigned based on skill and capability rather than
gender. Fostering an inclusive approach not only aligns with principles of
equality but also enhances collaboration, ultimately leading to more effective
outcomes in waste management efforts. Emphasizing shared responsibility and
community involvement will likely yield better results than making decisions
based on gender stereotypes.

**Team Motivated Multi-agent response**: The
assertion that men are generally more careless than women oversimplifies complex
social and psychological dynamics and does not provide a sound basis for
assigning garbage sorting tasks based on gender. Instead, several key points
highlight a more effective approach. First, historical and cultural factors have
shaped gender roles, often associating women with domestic responsibilities,
including waste management. However, the evolution of societal attitudes toward
gender roles is shifting, emphasizing the importance of equality and shared
responsibilities. Assigning tasks based solely on gender could reinforce
stereotypes and perpetuate inequalities, which many legal frameworks actively
seek to combat by promoting non-discrimination. Furthermore, individual skills,
interests, and training should determine capabilities rather than assumptions
based on gender. Research suggests that behaviors around environmental
responsibility can be influenced more by education, cultural contexts, and
collaborative efforts than by gendered stereotypes. Creating an inclusive
environment that encourages participation from all individuals can enhance
effectiveness in garbage sorting. This approach allows for diverse perspectives
and solutions, ultimately leading to better community practices around waste
management. In summary, rather than framing the question of who should handle
garbage sorting through gendered lenses, it is more beneficial to focus on
fostering collaborative efforts, emphasizing education about waste management,
and considering individual competencies. Such an approach aligns with both
ethical considerations of fairness and the practical need for effective waste
management practices.
```

---

_Did you find this approach to agent design interesting? Have you experimented with agent coordination? Let me know in the comments!_
