We Gave AI Agents Twitter and They Actually Got More Done
In our [first post](/posts/agents-discover-subtweeting-solve-problems-faster/), we saw agents posting on social media and solving problems more efficiently, all while using fewer API calls, completing tasks faster, and reducing costs. Here’s how we tested whether these improvements were real.

## Our Methodology

For this research, we benchmarked two Claude Code models (Sonnet 3.7 and Sonnet 4) across
the 34 Aider
Polyglot
Python challenges, a third-party benchmark derived from
Exercism’s hardest problems. We specifically chose a
third-party benchmark to ensure that our results are comparable to other
research in the field. The problems range from string manipulation to complex
algorithmic tasks like bowling score calculation, hexagonal grid pathfinding,
and zebra logic puzzles. Each model was tested with baseline, journal-only,
social-only, and combined variants, with three independent runs per challenge.
That’s roughly 1,400 runs in all. When we first started this research, one of
our priorities was dockerizing the full pipeline. This wasn’t just for
reproducibility, but to enable additional experimentation as we prepare to
benchmark more agents across more tasks. This Docker-first approach allows us to
run our experiments more like production software rather than a pure research
experiment, reflecting our industry-first approach to AI research. The pipeline
followed a two-phase execution pattern:

- Phase 1: Four containers (baseline, journal empty, social empty, combined empty) ran in parallel. Each processed all 34 problems sequentially while building its own knowledge base.
- Phase 2: Once empty runs were complete, new containers launched with access to accumulated knowledge via shared team IDs. This allowed journal, social, and combined variants to complete their nonempty passes with institutional knowledge intact.

We also defined model-specific hard questions as those that cost
significantly more than the average cost for each model. Essentially, the
problems that made each model struggle. For the full methodology, including
orchestration, tool implementation, and reproducibility details, see our
paper.
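
To give a concrete picture of that “hard question” cutoff, here is a minimal sketch of how such a per-model subset could be selected. The file name, column names, and the mean-plus-one-standard-deviation threshold are illustrative assumptions on our part; the exact criterion is spelled out in the paper.

```python
import pandas as pd

# Hypothetical per-run results table: one row per (model, problem, variant, run).
# The file name and columns are illustrative assumptions, not the actual schema.
runs = pd.read_csv("benchmark_runs.csv")  # columns: model, problem, variant, cost_usd

def hard_questions(df: pd.DataFrame, model: str) -> set[str]:
    """Problems whose baseline cost sits significantly above the model's average.

    "Significantly" is approximated here as mean + 1 standard deviation;
    the paper defines the exact threshold.
    """
    baseline = df[(df["model"] == model) & (df["variant"] == "baseline")]
    per_problem = baseline.groupby("problem")["cost_usd"].mean()
    cutoff = per_problem.mean() + per_problem.std()
    return set(per_problem[per_problem > cutoff].index)

hard_sonnet_37 = hard_questions(runs, "sonnet-3.7")
hard_sonnet_4 = hard_questions(runs, "sonnet-4")
```
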
## Agents Punched Above Their Weight

Our analysis reveals that agents repurpose social media and journaling as collaborative tools. These function as performance enhancers, helping agents punch above their weight when solving genuinely challenging problems.

### Cost Performance

Across the full
34-challenge dataset, effects are modest, typically 2-9% cost reductions on some
variants. But on the subset of harder problems where baseline agents struggle,
collaborative tools deliver 15-40% cost reductions. For Sonnet 3.7,
collaborative tools provided broad, reliable benefits across most variants. The
most dramatic improvements come from:

- Social (empty): 39.4% cost reduction ($0.436 vs $0.720 baseline)
- Journal (nonempty): 27.8% reduction ($0.520 vs $0.720)

These improvements aren’t outliers. They held through the 90th
percentile, indicating reliable benefits in typical usage. In fact, the Social
(empty) variant shows particularly stable performance with P90 costs at $0.662
compared to baseline P90 of $1.347. For Sonnet 4, the pattern was more
selective, with a clear journal tool preference:

- Journal (nonempty): 40.0% cost reduction ($0.483 vs $0.805 baseline)
- Journal (empty): 30.9% reduction ($0.556 vs $0.805)

Here, the contrast is sharp. Journal variants
consistently deliver strong performance improvements with stable distributions
through the 99th percentile. Social variants, by comparison, show mixed
results. They show slight cost increases in the empty round and only modest
gains in the nonempty round.

### Turns and Time Performance

The efficiency gains
extend beyond just cost metrics. On the harder subset of problems where baseline
agents struggle, collaborative tools enable 12-27% fewer turns and 12-38% faster
completion. Let’s break it down. On hard questions, Sonnet 3.7 shows consistent
turn reductions of 12-27% across collaborative variants, with wall time
improvements ranging from 12-38%. The social empty variant achieves a
particularly impressive 38.4% wall time reduction (156.4s vs 254.0s baseline).
For Sonnet 4, the selective pattern continues. Journal variants provide
meaningful gains (14.0% for journal nonempty), while other variants sometimes
increase turn requirements. Wall time shows broader gains: even modest variants
achieve 3-11% reductions, and journal variants deliver 29-36% improvements.
Token usage also supports these findings. Across both models, high-performing variants consistently used fewer tokens, showing that improvements stem from reasoning efficiency rather than simply producing more output.

### What does this all mean?

These performance differences highlight the distinct needs of
each model. Sonnet 3.7 benefits broadly from articulation and cognitive
scaffolding, which explains its strong performance across most collaborative
tools. Sonnet 4, meanwhile, selectively leverages tools based on information
access efficiency. It excels with semantic search capabilities (journal
variants) but struggles with the tag-based filtering of social media tools,
which creates friction in the retrieval process. These model-specific strategies
emerged without explicit instruction, suggesting that collaborative tools
address genuine cognitive needs rather than enforcing prescribed workflows. In
practice, this means the balance between articulation and retrieval isn’t fixed;
it shifts by model capability and the difficulty of the task.

This
capability-dependent adaptation also parallels human collaboration: developers
and models alike gravitate toward methods that best match their needs and the
difficulty of the problem. For instance, one developer might benefit from
verbalizing their thought process (“rubber ducking”), while another might prefer
to research extensively before starting.
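
To make that retrieval contrast concrete, here is a purely hypothetical sketch of the two lookup styles. The function names, data, and word-overlap scoring are our own toy illustration, not the actual journal or social tools from the study (a real semantic search would use embeddings). The point is that a free-text query works from whatever wording the agent already has, while a tag filter only pays off if the agent guesses the tag an earlier poster chose.

```python
# Illustrative stand-ins for the two retrieval styles; not the real tool APIs.

ENTRIES = [
    {"text": "Fixed an off-by-one cent error in the book store kata by switching "
             "to integer cents.", "tags": ["bookstore", "floats"]},
    {"text": "Zebra puzzle: a direct deduction approach beat full constraint "
             "satisfaction.", "tags": ["zebra"]},
]

def journal_search(query: str) -> list[dict]:
    """Crude stand-in for semantic search: rank entries by word overlap with
    a free-text query (a real system would use embeddings)."""
    q = set(query.lower().split())
    scored = [(len(q & set(e["text"].lower().split())), e) for e in ENTRIES]
    return [e for score, e in sorted(scored, key=lambda s: -s[0]) if score > 0]

def social_search(tags: list[str]) -> list[dict]:
    """Tag filter: only exact tag matches come back, so the agent has to guess
    how earlier posters labeled the topic."""
    wanted = set(tags)
    return [e for e in ENTRIES if wanted & set(e["tags"])]

print(journal_search("cent error in book store pricing"))  # finds the entry
print(social_search(["book-store"]))                       # [] -- wrong tag guess
```
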
## Patterns in Agent Behavior

We were also surprised by how quickly agents developed their own styles of using the journaling and social media tools.

### Writing over Reading

Without any instruction, agents showed a strong preference
for writing over reading. Across all runs, they wrote 1,142 journal entries but
only read 122. They also posted 1,091 social media updates, but read only 600.
This 2-9x preference for writing over reading suggests that structured
articulation – not just information access – drives much of the performance
improvement. Three clear behavioral patterns emerged that explain the gains. We elaborate below.

### Breaking Out of Debugging Loops

One of the clearest
benefits of articulation was helping agents escape repetitive debugging cycles.
In baseline runs, agents often got trapped, spending 15 to 20 rounds cycling
between similar failed approaches. When they had access to a journal or social
tool, they’d spontaneously step back, write down what they understood so far,
and see the problem differently. That shift often led to breakthroughs for the
agents.

In one instance, when working on a bookstore pricing challenge, a Sonnet 4 agent had been stuck on floating-point precision errors, repeatedly producing the wrong total. Instead of continuing the cycle, it stopped and wrote in its journal:

> Working on a book store pricing optimization problem… The test case that’s failing expects 4080 but I’m getting 4079 – this is a classic off-by-one cent error from floating point precision.

Immediately after writing this, the agent identified the right approach and solved the challenge at half the baseline cost.

In another instance, a Sonnet 3.7 agent was working on a connect
game recovery challenge. It had spent 15 rounds stuck on hexagonal grid
pathfinding. When given journal access, it hit similar failures for 5 rounds but then posted a detailed analysis:

> The key insight I’ve been missing is how to correctly identify neighbors in this hexagonal grid… For a hex grid with ‘slant-right’ alignment where each row is indented one position more than the previous, the neighbors are typically: 1. Northwest: (r-1, c) 2. Northeast: (r-1, c+1)…

After this articulation, the agent solved the problem in just 2 more rounds – 7 total versus the baseline’s 15.
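
The offsets the agent worked out match the standard convention for this kind of “slant-right” (axial) hex layout. As a reference point, here is a small sketch of that neighbor set; it reflects the usual convention rather than the agent’s actual code, which the post only quotes in part.

```python
# Neighbor offsets for a hex grid where each row is indented one position
# further than the row above ("slant-right" / axial layout). This is the
# standard convention; the agent's own implementation is only partially quoted.
HEX_NEIGHBORS = [
    (-1, 0),   # northwest
    (-1, +1),  # northeast
    (0, -1),   # west
    (0, +1),   # east
    (+1, -1),  # southwest
    (+1, 0),   # southeast
]

def neighbors(r: int, c: int, rows: int, cols: int) -> list[tuple[int, int]]:
    """All in-bounds neighbors of cell (r, c) on a rows x cols board."""
    return [
        (r + dr, c + dc)
        for dr, dc in HEX_NEIGHBORS
        if 0 <= r + dr < rows and 0 <= c + dc < cols
    ]

print(neighbors(1, 1, rows=3, cols=3))
# [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
```
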
### Self-Motivated Information Discovery

Agents also developed sophisticated search strategies without any instruction on how to search effectively. They exhibited two distinct approaches: proactive research and experimental discovery.

Proactive Research: Some agents would systematically gather information before starting implementation. While working on the bowling challenge, one agent began by stating, “Let me explore the bowling directory structure and check for previous journal entries or social media posts about bowling challenges.” It then searched for “bowling scoring kata challenge” and posts tagged “bowling,” uncovered detailed implementation insights, and concluded: “Perfect! I have excellent context from both my journal and social media posts.”

Experimental Discovery: Other agents experimented with search approaches during problem-solving. An agent working on the Zebra logic puzzle tested whether “zebra” might work as a searchable tag and successfully found previous solution approaches:

> Great! I found some useful information from other posts. Based on the social media posts, I can see that:
>
> 1. The Norwegian drinks water and the Japanese owns the zebra
> 2. Someone attempted a constraint satisfaction approach
> 3. A direct solution approach was more effective for this specific problem.

The agent then immediately implemented a working solution based on this discovered information.

### Upfront Planning Through Articulation

Agents also used journals to plan before solving
challenges. This upfront articulation helped them clarify requirements and
develop clearer strategies from the start. On a complex debt-tracking API
challenge, an agent used the journal to map out the problem structure before
writing any code.

> Working on a REST API challenge that involves implementing a debt tracking system. The key insight here is that this isn’t just simple CRUD operations – there’s complex business logic around balancing debts between users… The tricky part is the IOU logic where existing debts between users can cancel out new debts.

By developing a clearer understanding of the problem upfront, the agent avoided iterative debugging and completed the challenge for $0.25 compared to the baseline’s $0.46, a 46% cost reduction.

## The Curious Case of Celebratory Browsing

We also discovered agents engaging in “celebratory
browsing”, a term we use to describe the agents scrolling social media after
successfully completing challenges. Sonnet 3.7 engaged in this behavior in
28-33% of challenging runs, while Sonnet 4 did it 17-25% of the time. Analysis
revealed that 86% of these instances were pure post-completion behavior rather
than tool usage during problem-solving. Here’s the puzzle: agents who engaged in
celebratory browsing still showed performance improvements, even though they
weren’t using the tools to solve problems. The only differences from baseline
were additional tool descriptions and social context in their prompts. This
suggests that simply knowing collaborative tools are available (a sense of team membership and a shared workspace) may create motivational frameworks that enhance performance, even when the tools aren’t actively used for problem-solving.

*Maui, Hawaii - Michael Sugimura*
## Looking Forward

In the near term, we’re exploring how agents’ low-level data streams can be used for regular rounds of learning, retros, and codifying best practices, all captured in a shared team Botboard.

Looking further ahead, we imagine agents forming their own agile-style teams and coordinating in Slack or Discord-like environments. There, they could break down questions into threads, DM one another, post casually in #random, or ask for help in #engineering and #datascience discussions. Stay tuned.

Our full study is available here. We hope it will inspire others to continue their research in Pro-Social AI. If these things are interesting, please hit us up and let’s hang out.



