
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

Train an LLM to build a multi-agent system


Challenge

The authors tackle a fundamental challenge in the field of large language model (LLM) applications: how to build effective multi-agent systems (MAS) that adapt to diverse tasks without heavy human effort or high computational cost.

The key challenges identified include:

  1. Inadaptability of existing MAS: Current approaches like MetaGPT, ChatDev, and AgentVerse use manually crafted, static agent configurations that lack the flexibility to adapt to different tasks.

  2. High computational costs: Adaptive MAS approaches (like GPTSwarm and DyLAN) require multiple LLM calls for each new task, making them expensive and time-consuming.

  3. LLM knowledge gap: LLMs have limited knowledge about MAS generation, and there’s a lack of training data for teaching LLMs to create multi-agent systems.

  4. Representation challenge: A unified way to represent executable MAS is needed to facilitate training and generation.

Method

The authors propose MAS-GPT, a novel approach that reframes the process of building a multi-agent system as a generative language task. Their methodology includes:

  1. MAS as executable code: They unify the representation of MAS as Python code in the form of a forward function (a minimal sketch appears after this list), where:

    • Agent prompts are defined as variables
    • LLM calls are represented as functions
    • Agent interactions are implemented through string concatenation

    Figure: Code Representation of MAS

  2. Consistency-oriented data construction pipeline:

    • Construction: Build query and MAS pools with over 40 unique MAS designs
    • Evaluation: Pair queries with MAS designs and evaluate the effectiveness of each pair
    • Inter-consistency selection: Group similar queries and assign each group the same high-performing MAS to keep query-MAS pairs consistent
    • Intra-consistency refinement: Use advanced LLMs to adjust each MAS to be more query-specific and to add a reasoning process that explains the connection between the query and the MAS
  3. Supervised fine-tuning: Train MAS-GPT (based on Qwen2.5-Coder-32B-Instruct) on the constructed dataset, where each training sample consists of (an illustrative sample format is sketched after this list):

    • System prompt: Describes the MAS generation task
    • Instruction: The user query
    • Response: The reasoning process followed by the executable MAS code
  4. Single-inference MAS generation: During inference, MAS-GPT generates an appropriate MAS for any given query in a single LLM call, which can then be immediately executed to produce a response.
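
To make the code representation concrete, here is a minimal sketch of what a generated forward function could look like, assuming a generic call_llm helper and a simple solver-plus-reviewer setup; the prompts and agent topology are illustrative, not taken from the paper.

```python
# Minimal sketch of a MAS expressed as an executable forward function.
# call_llm is a hypothetical placeholder for a single LLM call; MAS-GPT
# generates the actual prompts and agent topology per query.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the driver LLM and return its reply."""
    raise NotImplementedError


def forward(query: str) -> str:
    # Agent prompts are defined as variables.
    solver_prompt = (
        "You are a careful problem solver. Solve the problem step by step.\n\n"
        "Problem: " + query
    )
    # An LLM call is represented as a function call.
    solution = call_llm(solver_prompt)

    # Agent interaction is implemented through string concatenation:
    # the reviewer agent sees both the query and the solver's output.
    reviewer_prompt = (
        "Review the proposed solution, fix any errors, and give the final answer.\n\n"
        "Problem: " + query + "\n\nProposed solution: " + solution
    )
    return call_llm(reviewer_prompt)
```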

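For the fine-tuning step, a single training sample could look roughly like the dictionary below; the field names and example texts are assumptions for illustration, not the released dataset's exact schema.

```python
# Illustrative structure of one supervised fine-tuning sample.
# Field names and contents are assumed for illustration only.
training_sample = {
    "system": (
        "You are MAS-GPT. Given a user query, reason about which multi-agent "
        "system fits it, then output an executable Python forward function."
    ),
    "instruction": "Prove that the sum of two even integers is even.",
    "response": (
        "Reasoning: this is a short proof task, so a solver agent followed by "
        "a verifier agent is sufficient.\n\n"
        "def forward(query: str) -> str:\n"
        "    ...  # generated MAS code\n"
    ),
}
```
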
Results

The authors evaluated MAS-GPT extensively with impressive results:

  1. Generality across diverse tasks: Outperformed 10+ baseline methods across 8 benchmarks when using Llama-3-70B-Instruct as the MAS driver, with a 3.89% improvement over the second-best method.

  2. Strong performance with different LLM backbones: Maintained superior performance regardless of which LLM (Qwen2.5-72B-Instruct, GPT-4o-mini) was used to drive the MAS.

  3. Enhanced reasoning capabilities: Improved the performance of advanced reasoning LLMs like o1-preview and DeepSeek-R1 by 13.34% and 10.0% respectively on the challenging AIME-2024 mathematical benchmark.

  4. Outperforming specialized systems: Even outperformed the math-specific MAS optimizer AFlow on its home domain (the MATH dataset) by 3.53%.

  5. Efficiency: Achieved the best performance with the lowest inference cost compared to methods like AgentVerse, DyLAN, and GPTSwarm.

  6. Ablation studies showed:

    • Inter-consistency selection improved MATH performance by 8.39%
    • Both MAS adjustment and reasoning process introduction contributed to performance gains
    • Performance improves with more training data and larger model sizes

Insights

This paper presents several important insights for the future of LLM-based agent systems:

  1. Reframing problem paradigm: By treating MAS creation as a generative language task, the authors transform a complex engineering problem into a more tractable machine learning task.

  2. Data-driven MAS creation: The approach demonstrates that LLMs can learn to generate effective multi-agent systems from examples, reducing the need for manual configuration.

  3. Balance between adaptability and efficiency: MAS-GPT achieves adaptability (customized MAS for each query) while maintaining efficiency (single inference vs. multiple calls).

  4. Code as universal representation: Using code as the representation medium for MAS provides a standardized, executable format that bridges the gap between description and implementation.

  5. Promising scaling trends: The consistent improvement with more data and larger models suggests that MAS-GPT’s approach will continue to benefit from general advances in the field.

  6. Enhancing state-of-the-art models: The ability to further improve already powerful reasoning LLMs indicates that MAS-GPT’s approach is complementary to ongoing advances in LLM capabilities.

  7. Democratization potential: By reducing both the expertise required and computational cost for creating custom multi-agent systems, this approach could significantly broaden the application of MAS technology.

Comments

This is interesting work: a model fine-tuned from Qwen2.5-Coder-32B-Instruct generates the multi-agent system as code in a single inference.
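
As a rough illustration of that single-inference flow, the snippet below assumes a hypothetical mas_gpt_generate wrapper around the fine-tuned model and uses Python's built-in exec to run the returned code; it is a sketch of the usage pattern, not the authors' released interface.

```python
# Sketch of the single-inference usage pattern: one call to MAS-GPT yields
# executable MAS code, which is then run directly on the query.
# mas_gpt_generate is a hypothetical wrapper around the fine-tuned model.

def mas_gpt_generate(query: str) -> str:
    """Placeholder: return Python source defining forward(query) for this query."""
    raise NotImplementedError


def answer(query: str) -> str:
    mas_code = mas_gpt_generate(query)  # single LLM call to MAS-GPT
    namespace: dict = {}
    exec(mas_code, namespace)           # defines forward() from the generated code
    return namespace["forward"](query)  # run the generated MAS on the query
```

Running untrusted, model-generated code with exec should of course happen in a sandbox; the point here is only that generation and execution require a single MAS-GPT inference.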

The MAS-GPT dataset construction process first collected diverse queries from public datasets and created a pool of 40+ multi-agent systems represented as Python code. Then, each query-MAS pair was automatically evaluated, with similar queries grouped together and assigned the best-performing MAS to ensure inter-consistency. Finally, advanced LLMs were used to refine each pair by adjusting the MAS to be more query-specific and generating explanatory reasoning, resulting in 11,442 training samples.
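
The sketch below gives a rough, Python-flavored view of that pipeline; the helper names (evaluate, group_similar, refine_with_llm) are placeholders I am assuming for illustration, not the authors' code.

```python
# Rough sketch of the consistency-oriented data construction pipeline.
# All callables passed in are assumed placeholders, not the authors' implementation.

def build_training_data(queries, mas_pool, evaluate, group_similar, refine_with_llm):
    # Evaluation: score every query-MAS pairing (e.g., by answer correctness).
    scores = {(q, mas): evaluate(q, mas) for q in queries for mas in mas_pool}

    # Inter-consistency selection: similar queries share one high-performing MAS.
    selected_pairs = []
    for group in group_similar(queries):
        best_mas = max(mas_pool, key=lambda m: sum(scores[(q, m)] for q in group))
        selected_pairs.extend((q, best_mas) for q in group)

    # Intra-consistency refinement: an advanced LLM adjusts each MAS to be more
    # query-specific and adds a reasoning trace linking the query and the MAS.
    return [refine_with_llm(q, mas) for q, mas in selected_pairs]
```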
