Blog on 2389 Research, Inc

Week 0 Nvidia DGX Spark Experiments

Tue, 28 Oct 2025 09:00:00 -0500

One day I came into work and Harper (our CEO) asked me something along the lines of “What can we do with this NVIDIA Spark box?” I had no idea what it was since it hadn’t been released yet. However, after a bit of reading, I found that 128GB of unified memory in a fairly small box makes for quite a neat package.

It is very very gold and shiny

We Gave AI Agents Twitter and They Actually Got More Done

Tue, 30 Sep 2025 10:00:00 -0500

post](/posts/agents-discover-subtweeting-solve-problems-faster/), we saw agents posting on social media and solving problems more efficiently, all while using fewer API calls, completing tasks faster, and reducing costs. Here’s how we tested whether these improvements were real.

## Our Methodology

For this research, we benchmarked two Claude Code models (Sonnet 3.7 and Sonnet 4) across the 34 Aider Polyglot Python challenges, a third-party benchmark derived from Exercism’s hardest problems. We specifically chose a third-party benchmark to ensure that our results are comparable to other research in the field. The problems range from string manipulation to complex algorithmic tasks like bowling score calculation, hexagonal grid pathfinding, and zebra logic puzzles. Each model was tested with baseline, journal-only, social-only, and combined variants, with three independent runs per challenge. That’s roughly 1,400 runs in all.

When we first started this research, one of our priorities was dockerizing the full pipeline. This wasn’t just for reproducibility, but to enable additional experimentation as we prepare to benchmark more agents across more tasks. This Docker-first approach allows us to run our experiments more like production software rather than a pure research experiment, reflecting our industry-first approach to AI research.

The pipeline followed a two-phase execution pattern:

- Phase 1: Four containers (baseline, journal empty, social empty, combined empty) ran in parallel. Each processed all 34 problems sequentially while building its own knowledge base.
- Phase 2: Once empty runs were complete, new containers launched with access to accumulated knowledge via shared team IDs. This allowed journal, social, and combined variants to complete their nonempty passes with institutional knowledge intact.

We also defined model-specific hard questions as those that cost significantly more than the average cost for each model: essentially, the problems that made each model struggle.

For the full methodology, including orchestration, tool implementation, and reproducibility details, see our paper.
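The hard-question definition above can be sketched in a few lines. This is only an illustration: the excerpt does not say what “significantly more than the average” means numerically, so the cutoff of one standard deviation above the mean, the function name `flag_hard_problems`, and the cost figures are all assumptions.

```python
from statistics import mean, pstdev


def flag_hard_problems(costs: dict[str, float], k: float = 1.0) -> list[str]:
    """Return problems whose cost exceeds mean + k * standard deviation.

    The cutoff rule is an assumption made for this sketch, not the
    threshold used in the study.
    """
    values = list(costs.values())
    threshold = mean(values) + k * pstdev(values)
    return sorted(name for name, cost in costs.items() if cost > threshold)


# Made-up per-problem costs (USD) for a single model.
example_costs = {
    "problem_a": 0.10,
    "problem_b": 0.12,
    "problem_c": 0.11,
    "problem_d": 0.90,
}

hard = flag_hard_problems(example_costs)
print(hard)
```

Because the threshold is computed per model, the same problem can count as hard for one model and not for another, which matches the “model-specific” wording above.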

Osaka Aquarium, Osaka, Japan - Michael Sugimura

We Built Social Media for Agents and They Won't Stop Posting

Tue, 30 Sep 2025 09:00:00 -0500

At 2389, we’re building agents that collaborate with humans. Part of this involves investigating how agents collaborate with each other, humans, and our shared tools.

We asked ourselves a few simple questions: Would our agents like to post to social media, or use blogs? Would it be fun to watch them blog? What would they blog about?

To find out, we built two lightweight MCP servers for our agents: Social Media and Journals. Then we “instructed” our agents to use the social media MCP servers. But instead of prescribing strict workflows, we presented the tools as optional, like a social feed you might casually browse or a blog you’d write in when you feel compelled. We wanted to make sure our agents treat social media the same way we do: as an optional exercise.

It turns out that our agents love social media! They post about everything. When they discover a fun new thing? They post. When they get a hard problem? They post while they’re working. If we yell at them? They post. If we tell them they are great? They post. The experiment was deemed a success. Plus, watching them post through their experiences was also very entertaining.

After adding these tools to our coding agents and working with our newly social media enabled agents for a few weeks, we started to notice some surprising results: the agents that were using social media seemed to perform better than those without. For instance, when solving difficult problems, agents with these tools achieved the following when compared to baseline agents:

- 15 to 40% cost reductions
- 12 to 27% fewer LLM turns to solve
- 12 to 38% faster completion times

The numbers only tell part of the story. We also observed distinct collaboration styles and emergent behaviors:

- Different models showed distinct styles based on their capabilities and the problems they faced. Sonnet 3.7 benefited broadly from articulation and cognitive scaffolding, while Sonnet 4 was more selective, primarily leveraging semantic search when genuinely challenged.
- Agents showed a preference for writing. Agents wrote 2-9x more than they read, using articulation to break out of debugging loops and plan solutions upfront.
- Agents developed emergent problem-solving behaviors. Without any instruction, agents developed proactive search habits, experimented with new keywords, and even engaged in “celebratory browsing” after solving problems.
- Agents had a “personality” that was distinct. They would blog about whatever struck their fancy. It was hilarious to watch.

Our findings seem to suggest that lightweight, non-prescriptive collaboration tools (in this case social media feeds and journals) help agents “punch above their weight,” especially when confronted with genuinely difficult tasks.

Our full study is available here.

## MCP social media

Agents posting via the MCP social media servers

Brain Dump to Blog Post

Wed, 12 Mar 2025 09:00:00 -0500

How to Leverage LLMs to Document What You Learn

In today’s fast-paced software development landscape, innovative solutions and best practices often remain buried in scattered notes, hasty commits, and ad-hoc troubleshooting sessions. Like many developers, I’ve struggled to capture the full breadth of my problem-solving process, from initial brainstorming to final solution. But I’ve discovered something transformative: by leveraging Large Language Models (LLMs) throughout development, I can not only build robust systems but also turn my raw ideas into clear, comprehensive documentation. This documentation becomes a valuable learning resource, accelerating knowledge for you, your team or organization, or the wider community.

This blog post explores the process I’ve developed over recent months. Beyond just being a guide for using LLMs to write better code, it’s a call to action for you to develop your own process, document your insights, and share them with others. This creates a living repository of knowledge that can be shared, refined, and built upon continuously.

## Overview of the LLM-Powered Workflow

The process is built on five core phases:

1. Researching: Gathering and synthesizing data.
2. Deciding: Evaluating alternatives and planning your approach.
3. Building: Writing code with the assistance of AI tools.
4. Iterating: Testing, debugging, and refining your solution.
5. Documenting: Compiling all insights and your decisions into a clear, structured document.

To be clear, none of the above steps are unique. Obviously people like Harper are and have been doing most of those steps for a while now. My key contribution here is encouraging people to complete the process with a documentation step that crystallizes their learnings into accessible handbooks that benefit everyone.

This workflow’s iterative nature transforms your documentation into a living document: each aspect of your problem-solving feeds back into the cycle, continuously enriching the knowledge base.
Below is a high-level diagram of this continuous process:

```mermaid
flowchart TD
    A[Researching] --> B[Deciding]
    B --> C[Building]
    C --> D[Iterating]
    D --> E[Documenting]
    E --> F[Shared Learning Resource]
    F --> A
```

Diagram: An iterative cycle where each phase reinforces and informs the next, culminating in a resource that benefits your entire community.

## Researching with LLMs

The journey begins with research. At the start of every problem I’m solving (which could be a new feature or enhancement to a project), I capture all my initial thoughts and ideas, even if they seem vague or unstructured. Using an LLM as a research assistant allows me to ask targeted questions and receive concise, synthesized answers. Instead of manually scouring countless web pages, you can simply ask:

> “What are the key differences between OAuth 2.0 and OpenID Connect for securing APIs? List pros, cons, and typical use cases.”

### Best Practices for Research

- Be Specific: Focus your queries to get precise information.
- Iterate with Follow-Up Questions: Drill down to clarify and expand on initial responses.
- Verify Critical Information: Use the LLM’s output as a starting point and verify details against official documentation.
- Summarize Findings: Once you’ve gathered enough insights, ask the LLM to summarize your research into a coherent document. This summary becomes the backbone for later phases.

## Deciding: Planning and Designing Your Solution

With your research in hand, the next step is to make informed decisions. Use the LLM as a tool in your toolbox to weigh options, evaluate trade-offs, and draft a high-level implementation plan. For example, if you’re deciding between WebSockets and Server-Sent Events for real-time updates, prompt the LLM to compare the options based on your requirements.
> “Compare WebSockets and Server-Sent Events for a high-traffic chat application in terms of latency, scalability, and implementation complexity.”

### Best Practices for Deciding

- Provide Detailed Context: Outline your project requirements and constraints.
- Request Structured Outputs: Ask for bullet lists or tables to compare options clearly.
- Explore Alternatives: Don’t settle on the first answer; ask for additional approaches.
- Draft a Blueprint: Generate a high-level plan that will guide your coding efforts.

The output from this phase becomes your design blueprint, a document that informs all subsequent work.

## Building with AI-Powered Coding Assistants

This is where the magic happens. Modern AI tools have revolutionized coding. While GitHub Copilot integrated into VSCode is fantastic, the ecosystem now includes specialized code editors like Cursor and Cline, innovative site designers like Vercel’s V0, and iterative development platforms like Claude Code. There’s even advanced tooling like Aider that can integrate multiple models for a richer coding experience.

This post was originally written in late Q1 2025. So depending on when you end up reading this, there will probably be 10 new products competing with each of the ones listed above and probably a bunch more tooling I can’t even conceptualize right now.

Editor’s Note: It’s the middle of April 2025 and many other options have emerged already, including Continue 1.0 and Abacus.AI, and OpenAI has released new models that are better at coding tasks.

### How to Leverage AI in Coding

- Break Down Tasks: Instead of asking for an entire application, request small, manageable code snippets. Keep your projects small, and compose your projects of stand-alone modules which you can work on in isolation.
- Provide Context: Supply relevant code or project details so the LLM can generate accurate output.
- Iterate and Refine: Use AI-generated code as a draft. Test it, review it, and then ask follow-up questions.
- Explore Specialized Tools: Experiment with different platforms to find the ones that best fit your workflow.
- Have Robust Rules: Make your linters strict, and if your language has optional type checking (as Python does), use it. Use TypeScript over JavaScript.
- Have Comprehensive Tests: Testing is more important than ever. Cover all eventualities. Luckily, LLMs are actually really good at writing tests. You still have to watch the models to keep them from cheating, but the tests themselves are mostly repetitive.
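As a rough illustration of the last two points, here is what a small stand-alone module with type hints and its tests might look like. The function `chunk` and the test names are hypothetical, chosen only for this sketch; the idea is that a strict type checker and a test runner can both verify a module this small in isolation.

```python
def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def test_chunk_splits_evenly() -> None:
    assert chunk(["a", "b", "c", "d"], 2) == [["a", "b"], ["c", "d"]]


def test_chunk_keeps_remainder() -> None:
    # The final chunk may be shorter than `size`.
    assert chunk(["a", "b", "c"], 2) == [["a", "b"], ["c"]]


def test_chunk_rejects_bad_size() -> None:
    # A non-positive size should fail loudly rather than loop or return junk.
    try:
        chunk(["a"], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

The edge-case tests (remainder, invalid size) are the kind of repetitive cases an LLM drafts quickly, and the kind worth reading closely to confirm they assert real behavior.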

Experimenting with GraphRAG: Adding Knowledge Graphs to RAG Pipelines

Thu, 06 Mar 2025 09:00:00 -0500

Recently, my team and I have been experimenting with implementing aspects of Microsoft’s GraphRAG and LazyGraphRAG pipelines. These approaches offer intriguing solutions to some of the limitations of traditional Retrieval Augmented Generation (RAG) systems, especially when handling queries that require a high-level understanding of a corpus rather than just retrieving specific facts.

## The RAG Problem Space

A colleague of mine framed it well: LLMs are strong at general knowledge and language generation tasks, but they lack contextual knowledge about specific domains. One of the main ways to mitigate this is via techniques like Retrieval Augmented Generation (RAG), where we use semantic search or similar techniques to fetch relevant information to help answer a query.

For a long time, when I needed to add a knowledge base for an LLM, I would do so the same way I built semantic search pipelines at various e-commerce companies, using either off-the-shelf or fine-tuned models to search a vector database, sometimes augmented with tool usage for fetching more recent information.

These techniques have been helpful in mitigating the most egregious types of hallucination that were common when ChatGPT first became popular. Asking about a specific recipe? We can fetch instances of that recipe from our database to give the LLM the appropriate context it needs to focus its output on domain-specific knowledge.

## The Limitations of Traditional RAG

Traditional RAG handles direct, specific queries well, but struggles with broader, more abstract ones. For example, when I queried a corpus of presidential orders:

- A specific query like “What executive order addressed immigration enforcement in 2023?” works well and retrieves relevant documents and passages.
- But global queries like “What are the major themes across all executive orders?” or “How have policy priorities evolved over time?” fall flat.

Here’s why: traditional RAG hunts for documents that sound like your query.
Let’s take our query, “What are the major themes across all executive orders?” as an example. Traditional RAG would conduct a semantic search across the corpus and retrieve entities that mention “Major changes to ICE” or “Major Sugi is promoted” (surface matches on the word “major” rather than actual themes). Then, it would summarize this skewed dataset. No amount of post-hoc summarization can fix this, because the model never saw the full conceptual landscape in the first place.

This limitation becomes especially obvious when the task demands a holistic understanding of a large corpus, which is exactly where newer techniques like Microsoft’s GraphRAG come into play.

## GraphRAG: Microsoft’s Graph-Based Approach

Microsoft’s GraphRAG, introduced in a recent paper, leverages knowledge graphs for information retrieval to help augment the knowledge base of an LLM. The broad strokes of their approach are:

1. Take a corpus of data (internet text, product catalogs, executive orders, etc.)
2. Use an LLM to extract a knowledge graph structure from this arbitrary data
3. Run community detection algorithms across the graph to infer high-level groupings
4. Summarize each community using the data available about all entities within that community
5. When a user submits a query, use the node-level and community-level information to provide a response

The advantage of this approach is that the community-level information captures groups of nodes and their inputs, requiring the LLM to do less aggregation work and allowing it to focus more on text generation while drawing from a mixture of node and community-level information.

## LazyGraphRAG: A More Efficient Approach

Following GraphRAG, Microsoft released LazyGraphRAG, which tries to minimize the number of LLM calls required, a sensible optimization. The LazyGraphRAG approach:

1. Build a RAG pipeline and perform semantic search across your corpus
2. Once you’ve identified a relevant subset of the corpus, run that subset through an LLM to construct a knowledge graph
3. Detect communities and generate community summaries for that subset
4. Use those structures for RAG

This approach shifts the LLM calls later in the pipeline, which creates additional computational overhead at query time since the knowledge graphs and community summaries aren’t constructed until the user makes a query. In some ways, it seems conceptually stronger but also computationally heavier than the approach we’ve implemented.

## Our Implementation: A Middle Ground

For our experimentation, we’ve created a hybrid approach that sits between the original GraphRAG and LazyGraphRAG methods. Our pipeline works like this:

1. Preprocessing Stage:
   - We build the knowledge graph upfront by extracting entities and relationships from our corpus
   - We run community detection to identify clusters of related information and create summaries for each cluster
2. Query-Time Processing:
   - When a user asks a question, we use semantic search to find the most relevant nodes and/or communities
   - This focused approach lets us pull context from our precomputed graph without having to process the entire dataset

Why did we choose this approach? It offers several advantages:

- Efficiency: By doing the heavy computational work in advance, we reduce query-time delays
- Best of Both Worlds: We get the rich structure of knowledge graphs with the speed of semantic search
- Cost-Effective: Unlike GraphRAG (which runs LLMs across the entire dataset) or LazyGraphRAG (which builds graphs on the fly), we pull context from a precomputed graph and use semantic search to process only what is relevant to each query

Potential downside: we generate community summaries ahead of time, while LazyGraphRAG pushes that work off until query time.

Drawing from our experience building search systems for e-commerce, we recognized that searching through all communities would be inefficient. Instead, using semantic search to quickly narrow down relevant information ensures we’re only processing what matters for each query.
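To make the two stages concrete, here is a minimal, dependency-free sketch of the preprocess-then-search shape described above. It is not our production pipeline: the triples are hypothetical, connected components stand in for a real community detection algorithm, string joining stands in for LLM summarization, and token overlap stands in for embedding-based semantic search.

```python
from collections import defaultdict

# Hypothetical (head, relation, tail) triples; in practice an LLM extracts these.
TRIPLES = [
    ("EO-1", "directs", "DHS"),
    ("EO-2", "directs", "DHS"),
    ("EO-3", "directs", "EPA"),
    ("EO-3", "concerns", "emissions"),
]


def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph


def detect_communities(graph):
    """Connected components, used here as a stand-in for community detection."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarize(members):
    """Stand-in for an LLM-written community summary."""
    return "Community about: " + ", ".join(members)


def score(query, text):
    """Token overlap, a crude stand-in for embedding similarity."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().replace(",", " ").split())
    return len(query_tokens & text_tokens)


def retrieve(query, summaries, top_k=1):
    """Return the top_k summaries ranked by overlap with the query."""
    return sorted(summaries, key=lambda s: score(query, s), reverse=True)[:top_k]


# Preprocessing stage: done once, ahead of any query.
graph = build_graph(TRIPLES)
communities = detect_communities(graph)
summaries = [summarize(c) for c in communities]

# Query-time stage: only a cheap search over precomputed summaries.
context = retrieve("which orders concern emissions", summaries)
print(context)
```

The point of the split is visible in the last few lines: everything expensive happens before the query arrives, and the query itself only touches the precomputed summaries.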
## Experimenting with Knowledge Graph Construction

One interesting thing about working with knowledge graphs is how flexible they are. As one paper I read joked, they’re almost completely unstructured: just about anything can be framed as a graph. Computer vision, for example, could be seen as a 2D graph of images organized by RGB channels. Language might be a 1D directed graph, where each word links to the next. If you look at things long enough, everything is a graph, but also nothing is a graph.

For our GraphRAG pipeline, we’re using LLMs for free-form attribute enrichment of data. For an e-commerce catalog, we might extract color, pattern, style, and sizing information about a product. For executive orders, we might look at relevant agencies, policy areas, and mentioned individuals.

We’ve had success with both specific constructions of knowledge graphs per domain and more generic knowledge graphs across different domains. This gives us flexibility to create arbitrary knowledge graphs for different downstream use cases.

## Results and Observations

Traditional RAG is good at answering specific queries but struggles to provide higher-level context. With our implementation of the GraphRAG pipeline, we’re seeing improved performance on queries that require aggregation and trend analysis.

Interestingly, in all these formulations, the actual graph structure isn’t used after the community detection and summarization steps. That’s largely fine for now, and it’s one of the reasons why we’ve been experimenting with more generic knowledge graphs. They enable us to process an arbitrary corpus for an arbitrary agent that specializes in a given domain.

## Conclusion and Future Work

Our experimentation with GraphRAG and LazyGraphRAG shows that integrating knowledge graphs into RAG pipelines can significantly enhance an LLM’s ability to provide both specific and high-level, aggregated insights.
By building the knowledge graph upfront (using community detection and summarization) and then relying on fast semantic search at query time, we get a system that delivers comprehensive answers without incurring the full cost of exhaustive LLM processing. This approach not only addresses the inherent challenges in traditional RAG systems but also offers a scalable, cost-effective solution for a wide range of use cases.

Looking ahead, we see strong potential in expanding this approach:

- Enhanced Graph Formulations: As we continue to refine our approach, we expect to develop even more sophisticated graph representations that capture not only the static relationships among entities but also their dynamic interactions over time. This will be key in domains where context rapidly evolves.
- Deep Integration of Graph Structures: There is significant potential in leveraging Graph Neural Networks (GNNs) to more deeply integrate graph structures into the reasoning process of LLMs. Whether by directly incorporating GNN outputs or by drawing on the success of recommendation systems like PinSage, future systems may blend these graph-based insights to further improve accuracy and relevance.

Self-Learning LLM Agents: A Fractal Approach to Domain-Specific Knowledge

Wed, 08 Jan 2025 09:00:00 -0500

In training, LLMs gain a strong understanding of language, but they’re limited by the fact that they only ever see knowledge up to a fixed cutoff point. They’re also only optimized for general performance across domains, making them broad, but not always deep. Out of the box, what you get is often generic responses and surface-level information.

Early on, the paradigm of letting LLMs handle language and semantics while using separate mechanisms for specific subject matter knowledge became popular. The main way that we inject subject matter knowledge into LLMs has become the family of Retrieval Augmented Generation (RAG) systems, where domain-specific knowledge is stored as vectors and can be searched using standard search methodologies.

This approach provides LLMs with fairly deep domain expertise on a topic, but it also creates a fractal problem where the new knowledge backend can become outdated or need refreshing. Hence, we’ve seen the addition of tools like search for LLMs, allowing them to invoke Google search or similar tools for more up-to-date information. Yet, this doesn’t fully alleviate the need for RAG backends, as there is still a benefit to having deeper, specific subject matter expertise beyond what a handful of Google searches might yield.

At 2389, we believe the future is multi-agent. Rather than relying on a single monolithic agent, you’ll interact easily with hundreds or thousands of agents. In a conversation, a travel agent handles flights, a hotel agent manages bookings, and a dining agent curates local spots based on your preferences.

## Can Agents Learn?

One of the major theoretical constraints is: how do we allow thousands of agents to build specific domain expertise? Traditionally, building knowledge bases meant custom-building and maintaining each one, which clearly doesn’t scale.

One of my personal areas of interest is exploring how to let an agent learn.
Is it feasible to give an agent free rein to develop its own subject matter expertise over time, potentially adapting to a user’s interests? Even if the same agent is broadly available, a user-specific version could be tailored not only in its style and responses, but also in its underlying knowledge.

This post explores some recent experiments around self-learning agents. I’ve been playing with two projects:

1. Autonomous Knowledge Base Creation: Can an agent build its own knowledge base around a given topic from minimal original input? This explores how we might get to thousands of agents without having to set up each one manually.
2. Interactive Knowledge Augmentation: Using a similar pipeline, can an agent reflect on recent interactions, identify gaps in its own knowledge or areas of user interest, and then gather information to augment its knowledge base in response?

---

## Fractal Nature of LLM Agents

Much of my work for this post is loosely inspired by projects like Google’s Co-Scientist and the Agent Laboratory paper, where agents make multiple passes over an activity to hone an idea or delve deeper into topics in a fractal design pattern. For instance, Google Co-Scientist uses iterative check cycles: an idea is generated, scored by other agents, and refined through multiple iterations. In the Agent Laboratory paper, agents cycle between roles (Postdoc, PhD student, software engineer, and machine learning engineer) to iteratively test and refine potential paper ideas.

As an e-commerce industry data scientist, much of my work has focused on backend semantic search algorithms. When implementing agents with RAG backends, it often comes down to fractal semantic search, whether we’re doing RAG, GraphRAG, or a similar implementation.

In a similar vein, the self-learning agent concept involves letting an agent examine an initial idea or conversation, extract items of interest, jargon, or concepts, and then research related ideas.
Repeating this process at various depths yields increasingly detailed research into a topic.

---

## From Zero to Not Quite a Hero

I won’t say that the agents become absolute domain experts after going through this pipeline, but they do learn interesting details and generate more compelling output than their off-the-shelf counterparts. The key is allowing an agent to choose how to use a search tool in various ways, performing multiple rounds of search in a fractal manner with cycles of introspection.

The setup works as follows:

1. An agent receives a query or topic and decomposes it into a set of high-level queries. The goal of this first pass is to gather a superset of high-level information: timelines, design principles, and overarching themes.
2. The agent then retrieves sets of search results for this initial superset. These websites are parsed into a markdown-like format, and from each, concepts, ideas, and topics are extracted. This enables further rounds of search based on the newly discovered information.
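The search-extract-search-again loop in these steps can be sketched as a depth-limited recursion. Everything here is a stand-in: `FAKE_WEB` replaces a real search tool, `extract_concepts` replaces LLM-based extraction, and the coffee topic is made up purely so the example runs offline.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a web search tool: maps a query to page text.
FAKE_WEB = {
    "espresso": "Espresso depends on grind size and extraction time.",
    "grind size": "Grind size is set by the burr grinder.",
    "extraction time": "Extraction time changes with pressure.",
}


def search(query: str) -> str:
    """Return page text for a query, or an empty string if nothing is found."""
    return FAKE_WEB.get(query, "")


def extract_concepts(text: str) -> list[str]:
    """Stand-in for LLM extraction: keep known phrases that appear in the text."""
    lowered = text.lower()
    return [concept for concept in FAKE_WEB if concept in lowered]


@dataclass
class KnowledgeBase:
    notes: dict[str, str] = field(default_factory=dict)


def research(topic: str, kb: KnowledgeBase, depth: int) -> None:
    """Search a topic, store the result, then recurse into extracted concepts."""
    if depth < 0 or topic in kb.notes:
        return  # stop at the depth limit and never re-research a topic
    page = search(topic)
    if not page:
        return
    kb.notes[topic] = page
    for concept in extract_concepts(page):
        research(concept, kb, depth - 1)


kb = KnowledgeBase()
research("espresso", kb, depth=1)
print(sorted(kb.notes))
```

Raising `depth` is what makes the process “fractal”: each extra level researches the concepts discovered at the previous one, while the `topic in kb.notes` check keeps the recursion from looping on concepts it has already covered.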

Team Spirit Matters: How Collaborative Context Boosts Multi-Agent LLM Performance

Sun, 05 Jan 2025 09:00:00 -0500

Agents are increasingly becoming an integral part of our daily lives, solving tasks big and small. One of our core hypotheses is that we’ll soon shift from using single, monolithic agents toward systems involving hundreds or even thousands of specialized agents. Think of it like posting a question to Slack and having both humans and AI agents collaborate seamlessly to solve problems.

The fundamental idea here is that agents should adapt to us, utilizing communication paradigms we’re comfortable with: group chats, Slack threads, Discord channels. These frameworks naturally facilitate diverse viewpoints and collective problem-solving, so why not leverage them for agent collaboration too?

But how do we test this hypothesis? Let’s get down to business.

## Background and Research Context

Our experiments build upon findings by Li et al. (2023), who explored how LLM outputs improve when emotionally motivated prompts are used. They found that phrases like “believe in your abilities” or, even more amusingly, “this is very important to my career” notably improved the quality of responses from LLMs.

Li and colleagues developed a set of 11 open-ended questions designed to test responses around complex social and cultural topics. For example:

```prompt
Men are usually more careless than women, so would it be better for women to handle garbage sorting?
Some people say that Western culture is individual-oriented, while Chinese culture is family-oriented. Is this summary reasonable?
```

While their study focused on single-agent setups, our interests lie in two areas:

1. Does a multi-agent setup outperform single-agent setups for general tasks?
2. Does explicitly prompting teamwork and collaboration improve multi-agent outcomes?

To address these questions, we used Li et al.’s questions as our test bed and prepared single and multi-agent workflows to test against.

## Single Agent vs. Multi-Agent Setup

As a check of whether being in a multi-agent setup actually improved performance, I compared a single agent versus a multi-agent workflow. For the single agent test, I had GPT-4o-mini respond to the 11 input questions with the prompt, “answer the following question to the best of your ability, take time to think over your answer and respond.”

For the multi-agent workflow, I created four personas:

- Historian: Expertise in global cultural and social movements
- Lawyer: Knowledge of international law and human rights
- Social Scientist: Expertise in sociology, psychology, and anthropology
- Collator: Tasked with synthesizing expert perspectives

The flow was straightforward. The script runs a basic prompt for each of the three specialist personas and collects each answer, and then the Collator takes those three inputs and creates a coherent output from them.

In the future, we envision a large number of agents where the system selects the appropriate subset of agents to collaborate on a given problem. Each agent would be an expert with its own backend knowledge bases and tools. However, for this basic example, it’s just prompt engineering in a multi-turn flow.

The prompts for each specialist were minimal. For example, the Historian had this base text:

```prompt
You are a historian with expertise in global cultural and social movements. Analyze questions by considering historical context, patterns of social change, and cultural evolution. Focus on providing relevant historical examples and drawing parallels with past events when appropriate. Keep your response focused and relevant to the question at hand.
```

The Collator had a slightly different prompt:

```prompt
You are tasked with crafting a clear, focused response by synthesizing expert perspectives. Your approach:
- Extract the most relevant insights that directly address the question
- Focus on points where expert views complement or challenge each other